CN112949620B - Scene classification method and device based on artificial intelligence and electronic equipment - Google Patents


Info

Publication number
CN112949620B
CN112949620B
Authority
CN
China
Prior art keywords
scene
classification
image sample
feature
classification model
Prior art date
Legal status
Active
Application number
CN202110534639.8A
Other languages
Chinese (zh)
Other versions
CN112949620A (en)
Inventor
郭卉 (Guo Hui)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110534639.8A priority Critical patent/CN112949620B/en
Publication of CN112949620A publication Critical patent/CN112949620A/en
Application granted granted Critical
Publication of CN112949620B publication Critical patent/CN112949620B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 - Distances to prototypes
    • G06F18/24137 - Distances to cluster centroïds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The application provides an artificial-intelligence-based scene classification method, a scene classification apparatus, an electronic device, and a computer-readable storage medium. The method includes the following steps: for image samples of a first scene category, acquiring a second scene category of each image sample, where the second scene category of an image sample is a sub-scene category of its first scene category; jointly training a first scene classification model, a second scene classification model, and a third scene classification model based on the first scene categories and the second scene categories, where the first scene classification model is used to identify first scene categories, the second scene classification model is used to identify second scene categories, and the third scene classification model is obtained by combining the first scene classification model and the second scene classification model; and performing scene classification on an image to be classified through the trained third scene classification model to obtain the first scene category corresponding to the image to be classified. With the method and apparatus of the application, fine-grained scene recognition improves the accuracy of scene classification.

Description

Scene classification method and device based on artificial intelligence and electronic equipment
Technical Field
The present application relates to artificial intelligence technologies, and in particular, to a scene classification method and apparatus based on artificial intelligence, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) refers to theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
A typical task of image understanding is to identify the scene of an image. For example, for a video, the scenes in which the storyline takes place need to be identified, and tags of the video are determined by understanding those scenes, so that efficient video recommendation can be performed. As another example, for photos taken with a mobile phone, the scene of each photo needs to be identified, and the tags of the photos are determined by understanding their scenes, thereby improving the efficiency of sorting and storing the photos.
However, the scene recognition granularity of the related art cannot meet the high-accuracy scene recognition requirements of practical applications.
Disclosure of Invention
The embodiment of the application provides a scene classification method and device based on artificial intelligence, an electronic device and a computer-readable storage medium, which can improve the accuracy of scene recognition on an image.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a scene classification method based on artificial intelligence, which comprises the following steps:
for a plurality of image samples of a first scene class, acquiring at least one second scene class of each image sample;
wherein the at least one second scene class of each of the image samples is a sub-scene class of the first scene class of the image sample;
jointly training a first scene classification model, a second scene classification model and a third scene classification model based on a first scene class of the plurality of image samples and a second scene class of the plurality of image samples;
wherein the first scene classification model is used for identifying different first scene classes, the second scene classification model is used for identifying different second scene classes, and the third scene classification model is obtained based on the combination of the first scene classification model and the second scene classification model;
and carrying out scene classification processing on the image to be classified through the trained third scene classification model to obtain a first scene class corresponding to the image to be classified in different first scene classes.
The embodiment of the application provides a scene classification device based on artificial intelligence, which includes:
the acquisition module is used for acquiring at least one second scene category of each image sample aiming at a plurality of image samples of a first scene category;
wherein the at least one second scene class of each of the image samples is a sub-scene class of the first scene class of the image sample;
a training module, configured to jointly train a first scene classification model, a second scene classification model, and a third scene classification model based on a first scene classification of the plurality of image samples and a second scene classification of the plurality of image samples;
wherein the first scene classification model is used for identifying different first scene classes, the second scene classification model is used for identifying different second scene classes, and the third scene classification model is obtained based on the combination of the first scene classification model and the second scene classification model;
and the application module is used for carrying out scene classification processing on the image to be classified through the trained third scene classification model to obtain a first scene class corresponding to the image to be classified in different first scene classes.
In the foregoing solution, the obtaining module is further configured to: acquiring a first classification characteristic of each image sample through the first scene classification model; performing the following for each of the first scene categories: acquiring image samples belonging to the first scene category from the plurality of image samples to serve as target image samples, and performing clustering processing on the plurality of target image samples based on first classification characteristics of the plurality of target image samples to obtain at least one cluster in one-to-one correspondence with at least one second scene category; determining a second scene category for each of the image samples based on the at least one cluster.
In the foregoing solution, the obtaining module is further configured to: performing the following for each of the image samples: extracting a first convolution feature of the image sample, and performing pooling processing on the first convolution feature of the image sample to obtain a first pooling feature of the image sample; performing residual iteration processing on the first pooled features for multiple times to obtain a residual iteration processing result of the image sample; performing maximum pooling on the residual iterative processing result of the image sample to obtain a maximum pooling processing result of the image sample; and performing first embedding processing on the maximum pooling processing result of the image sample to obtain a first classification characteristic of the image sample.
In the foregoing solution, the obtaining module is further configured to: compose the plurality of target image samples into a target image sample set; randomly select N target image samples from the target image sample set, take the first classification features corresponding to the N target image samples as the initial centroids of a plurality of clusters, and remove the N target image samples from the target image sample set, where N is the number of second scene classes corresponding to the first scene class and N is an integer greater than or equal to 2; initialize the number of iterations of the clustering processing to M and establish an empty set for each cluster, where M is an integer greater than or equal to 2; perform the following processing during each iteration of the clustering processing: update the set of each cluster, perform centroid generation processing based on the updating result to obtain a new centroid of each cluster, and, when the new centroid differs from the initial centroid, add the target image sample corresponding to the initial centroid back to the target image sample set and update the initial centroid based on the new centroid; and determine the set of each cluster obtained after M iterations as the clustering result, or determine the set of each cluster obtained after m iterations as the clustering result, where the centroids of the clusters obtained after m iterations are the same as those obtained after m-1 iterations, m is smaller than M, m is an integer variable, and the value of m is greater than or equal to 2 and smaller than or equal to M.
In the foregoing solution, the obtaining module is further configured to: performing the following for each of the target image samples in the set of target image samples: determining a similarity between a first classification feature of the target image sample and an initial centroid of each of the clusters; determining an initial centroid corresponding to the maximum similarity as belonging to the same cluster as the target image sample, and transferring the target image sample to a set of clusters corresponding to the maximum similarity initial centroid, wherein the maximum similarity initial centroid is the initial centroid corresponding to the maximum similarity; and averaging the first classification characteristics of each target image sample in each cluster set to obtain a new centroid of each cluster.
In the foregoing solution, the obtaining module is further configured to: performing the following for each of the clusters: averaging the first classification features of each target image sample in each cluster to obtain the centroid of each cluster; performing the following for each of the image samples of the plurality of image samples: determining a feature distance between a first classification feature of the image sample and a centroid of each of the clusters, determining a cluster of centroids corresponding to a feature distance less than a feature distance threshold as a cluster associated with the image sample, and determining a second scene category corresponding to the cluster as a second scene category of the image sample.
In the foregoing solution, the obtaining module is further configured to: before determining the cluster whose centroid corresponds to a feature distance smaller than the feature distance threshold as the cluster associated with the image sample, perform the following processing for each of the first scene categories: query a database for the labeling accuracy of the first scene category; obtain a first scene category threshold positively correlated with the labeling accuracy; determine the proportion, among the plurality of image samples, of target image samples belonging to the first scene category, and use the proportion as the weight of the first scene category threshold; and perform weighted summation processing on the plurality of first scene category thresholds based on the weight of each first scene category threshold to obtain the feature distance threshold.
In the foregoing solution, the obtaining module is further configured to: for each target image sample, determine the candidate center distance between the first classification feature of the target image sample and the centroid of each cluster, and determine the smallest candidate center distance as the center distance of the target image sample; obtain an ascending sort position positively correlated with the labeling accuracy, and sort the center distances of the target image samples in ascending order; and obtain the center distance located at the ascending sort position in the sorted result and determine it as the first scene category threshold.
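As an illustrative aid only, the following Python sketch computes such a per-class threshold; the function name, the NumPy implementation, and the mapping from labeling accuracy to a sort position are assumptions for illustration and are not taken from the embodiment.

```python
import numpy as np

def class_threshold(features, centroids, labeling_accuracy):
    """Per-class threshold sketch (hypothetical helper).

    features:          (S, D) first classification features of the target image samples of one first scene class
    centroids:         (K, D) centroids of the clusters mined for that class
    labeling_accuracy: assumed to lie in (0, 1], looked up from a database
    """
    # Center distance of each target image sample = smallest distance to any cluster centroid.
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)  # (S, K)
    center_dist = dists.min(axis=1)

    # Ascending sort position positively correlated with the labeling accuracy
    # (assumed mapping: higher accuracy -> later position -> larger threshold).
    pos = int(round(labeling_accuracy * (len(center_dist) - 1)))
    return float(np.sort(center_dist)[pos])
```

The design intent reflected in this sketch is that a more accurately labeled first scene category tolerates a looser (larger) per-class threshold.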
In the above scheme, the first scene classification model includes a feature network, a first feature processing network, and a first classification network corresponding to the first scene category; the obtaining module is further configured to: before the first classification feature of each image sample is obtained through the first scene classification model, feature extraction processing is carried out on the image samples through the feature network, and a maximum pooling processing result of the image samples is obtained; performing first embedding processing on the maximum pooling processing result of the image sample through the first feature processing network to obtain a first classification feature of the image sample; mapping, by the first classification network, the first classification feature to a first prediction probability of a pre-labeled first scene class for the image sample; determining a first classification loss for the image sample based on the first prediction probability and a pre-labeled first scene class of the image sample; updating parameters of the first scene classification model according to a first classification loss of the image sample.
In the foregoing solution, the training module is further configured to: performing the following for each of the image samples: carrying out scene classification processing on the image sample through the first scene classification model, the second scene classification model and the third scene classification model to obtain a scene classification result of the image sample; updating parameters of the first scene classification model, the second scene classification model, and the third scene classification model based on the scene classification result, the pre-labeled first scene class of the image sample, and the pre-labeled second scene class of the image sample.
In the above solution, the first scene classification model shares a feature network of the first scene classification model with the second scene classification model and the third scene classification model respectively, the first scene classification model shares a first feature processing network of the first scene classification model with the third scene classification model, and the second scene classification model shares a second feature processing network of the second scene classification model with the third scene classification model; the training module is further configured to: performing feature extraction processing on the image sample through the feature network to obtain a maximum pooling processing result of the image sample; performing first embedding processing on the maximum pooling processing result of the image sample through a first feature processing network of the first scene classification model to obtain a first classification feature of the image sample; mapping, by the first classification network, the first classification feature to a first prediction probability of a pre-labeled first scene class for the image sample; performing second embedding processing on the maximum pooling processing result of the image sample through the second feature processing network to obtain a second classification feature of the image sample, and mapping the second classification feature to a second prediction probability of at least one pre-labeled second scene category of the image sample through a second classification network of the second scene classification model; splicing the first classification characteristic and the second classification characteristic, and mapping a splicing processing result to a third prediction probability of the image sample, which is used for pre-marking the first scene category, through a third classification network of the third scene classification model; and combining the first prediction probability, the second prediction probability and the third prediction probability into a scene classification result of the image sample.
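As an illustrative aid only, the following PyTorch sketch shows one way the shared-feature joint model described above could be organized; the layer sizes, the use of the torchvision ResNet-101 backbone (whose final global pooling is average pooling rather than the max pooling described above), and all names are assumptions rather than the exact structure of the embodiment.

```python
import torch
import torch.nn as nn
import torchvision

class JointSceneClassifier(nn.Module):
    """Sketch of the shared-feature joint model; sizes and names are assumptions."""

    def __init__(self, num_first_classes, num_second_classes, emb_dim=1024):
        super().__init__()
        backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
        # Shared feature network (torchvision ends with average pooling,
        # whereas the embodiment describes max pooling).
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.embedding1 = nn.Linear(2048, emb_dim)            # first feature processing network
        self.embedding2 = nn.Linear(2048, emb_dim)            # second feature processing network
        self.fc1 = nn.Linear(emb_dim, num_first_classes)      # first classification network
        self.fc2 = nn.Linear(emb_dim, num_second_classes)     # second classification network
        self.fc3 = nn.Linear(2 * emb_dim, num_first_classes)  # third classification network (joint features)

    def forward(self, x):
        pooled = self.features(x).flatten(1)        # pooled backbone output, (B, 2048)
        f1 = self.embedding1(pooled)                # first classification feature
        f2 = self.embedding2(pooled)                # second classification feature
        p1 = self.fc1(f1)                           # logits over first scene classes
        p2 = self.fc2(f2)                           # logits over second scene classes
        p3 = self.fc3(torch.cat([f1, f2], dim=1))   # logits from the concatenated (joint) features
        return p1, p2, p3
```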
In the foregoing solution, the training module is further configured to: determining a first classification loss based on the first prediction probability and a pre-labeled first scene class of the image sample; determining a third classification loss based on the third prediction probability and a pre-labeled first scene class of the image sample; determining a second classification loss based on at least one of the second prediction probabilities and at least one pre-labeled second scene class of the image sample; performing fusion processing on the first classification loss, the second classification loss and the third classification loss to obtain a joint loss; updating parameters of the first scene classification model, the second scene classification model, and the third scene classification model according to the joint loss.
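As an illustrative aid only, the following sketch shows one possible fusion of the three classification losses; treating the second scene categories as a multi-label target and weighting the losses equally are assumptions, since the embodiment only states that the three losses are fused into a joint loss.

```python
import torch.nn.functional as F

def joint_loss(p1, p2, p3, first_label, second_labels, weights=(1.0, 1.0, 1.0)):
    """Sketch of the joint loss as a weighted fusion of the three classification losses.

    first_label:   (B,) pre-labeled first scene class indices
    second_labels: (B, num_second_classes) multi-hot pre-labeled second scene classes (assumption)
    weights:       equal weighting is an assumption
    """
    loss1 = F.cross_entropy(p1, first_label)                               # first classification loss
    loss2 = F.binary_cross_entropy_with_logits(p2, second_labels.float())  # second classification loss
    loss3 = F.cross_entropy(p3, first_label)                               # third classification loss
    return weights[0] * loss1 + weights[1] * loss2 + weights[2] * loss3
```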
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the scene classification method based on artificial intelligence provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the application provides a computer-readable storage medium, which stores executable instructions and is used for realizing the scene classification method based on artificial intelligence provided by the embodiment of the application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
Scene labels (second scene categories) representing fine-grained scenes are mined from image samples labeled with first scene categories, so training can be performed on the second scene categories and the first scene categories simultaneously, which is equivalent to learning the feature expression of the first scene categories and of the fine-grained scenes at the same time. Because the third scene classification model is obtained by combining the first scene classification model and the second scene classification model, the trained third scene classification model has a joint feature expression of the first scene categories and the fine-grained scene categories (second scene categories), which effectively improves scene recognition accuracy.
Drawings
FIG. 1 is a logic diagram of a scene classification method in the related art;
FIG. 2 is a schematic diagram of an architecture of an artificial intelligence based scene classification system provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 4A is a flowchart illustrating an artificial intelligence based scene classification method according to an embodiment of the present application;
FIG. 4B is a flowchart illustrating step 101 of an artificial intelligence based scene classification method according to an embodiment of the present application;
FIG. 4C is a flowchart illustrating step 102 of an artificial intelligence based scene classification method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an architecture of a scene classification method based on artificial intelligence provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a residual error network of an artificial intelligence-based scene classification method provided in an embodiment of the present application;
FIG. 7 is a diagram illustrating a second scene class of an artificial intelligence based scene classification method according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating a scene classification method based on artificial intelligence according to an embodiment of the present disclosure.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second", and "third" are used only to distinguish similar objects and do not denote a particular order. It should be understood that "first", "second", and "third" may be interchanged in a specific order or sequence where permissible, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions used in the embodiments of the present application are explained as follows.
1) Image recognition: a technology for classifying an image at a specific level. Generally, image recognition considers only the class of an object, regardless of its specific instance, and gives the class to which the object belongs; for example, images are classified as person, dog, cat, bird, and so on. A model trained on the large-scale general object recognition source data set ImageNet can recognize which of 1000 classes an object belongs to.
2) Multi-label recognition task of images: identifying whether an image has multiple attribute tags; for example, when an image has several attribute tags, the multi-label recognition task is used to determine which attribute tags the image has.
3) Noise recognition: training of the image recognition task is performed on noisy samples, including samples with incorrect class labels and samples with inaccurate class labels, e.g., the image does not correspond exactly to its class label, the concepts of two class labels partially overlap, or the image has the attributes of two class labels but carries only one class label.
4) ImageNet: a large-scale general object recognition source data set.
5) ImageNet pre-training model: a deep learning network model trained on the large-scale general object recognition source data set ImageNet; the trained deep learning network model is an ImageNet pre-training model.
In the related art, a primary task of video understanding is to identify the scenes in which the storyline of a video takes place. Referring to fig. 1, fig. 1 is a logical schematic diagram of a scene classification method in the related art. FIG. 1 shows the extraction of local regions (salient regions) at a certain scale, where a local region is a salient region of the background of an image, since not every region of the image is useful for scene recognition. For an image X, the potential object density of each position in the scene is calculated according to the distribution of potential object boxes; using a sliding window, the object density within a window region of the image is calculated; and a sliding-window response is computed by combining the image X and the potential object density, so that the region with the highest potential object density is extracted as the local region. Feature learning is then performed based on the multi-scale local regions, and scene recognition is performed based on the feature learning result.
When scene recognition is performed based on multi-scale local-region feature learning in the related art, the applicant found the following technical problems: 1. the scene classification model in the related art is a two-stage model, and both the training process and the inference process must first complete an object detection and localization task before completing the scene recognition task; 2. when training the model for the object detection and localization task in the related art, all objects that may appear in a scene need to be labeled, which is time-consuming and labor-intensive; 3. not all scenes contain detection targets; for example, seasides, forests, and similar scenes have no common detection target.
The applicant also found that recognizing the high-level semantics involved in scene recognition is more difficult than recognizing a general object, because scene categories are semantically ambiguous. For example, image samples of the park category may include more precise scenes such as lake views, forest paths, children's rides, park fitness areas, park lawns, and park woods; a combination of these elements constitutes the park category, but when only one of them appears, for example when only a children's ride appears in the image, the scene of the image is not necessarily a park and may instead belong to the commercial park category. Because the fine-grained scene categories under each scene category are not labeled, recognition under this semantic ambiguity is difficult. When implementing the embodiments of the present application, the applicant therefore found that how to mine the implicit fine-grained scene categories of each scene category, and to perform scene recognition by means of the mined fine-grained scene categories, is a technical problem to be urgently solved.
In view of the foregoing technical problems, embodiments of the present application provide a scene classification method and apparatus based on artificial intelligence, an electronic device, and a computer-readable storage medium, which can mine a fine-grained scene category in a first scene category, and train joint expression capability of features of the fine-grained scene category and features of the first scene category, so as to effectively improve scene recognition accuracy.
The scene classification method provided by the embodiment of the application can be implemented by various electronic devices, for example, can be implemented by a terminal or a server alone, or can be implemented by the terminal and the server in a cooperation manner.
Referring to fig. 2, fig. 2 is a schematic diagram of an architecture of an artificial intelligence based scene classification system provided in an embodiment of the present application, a terminal 400 is connected to a server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two.
In some embodiments, the functionality of the artificial intelligence based scene classification system is implemented on the basis of the server 200, during the use of the terminal 400 by the user, the terminal 400 collects image samples and transmits them to the server 200, such that the server 200 performs a training of a first scene classification model (for identifying a different first scene class), a second scene classification model (for identifying a different second scene class) and a third scene classification model based on the first scene class and the second scene class, integrates the trained third scene classification model in the server 200, in response to the terminal 400 receiving the image shot by the user, the terminal 400 sends the image to the server 200, and the server 200 determines a scene classification result of the image through the third scene classification model and sends the scene classification result to the terminal 400, so that the terminal 400 directly presents the scene classification result.
In some embodiments, when the scene classification system is applied to a video recommendation scene, the terminal 400 receives a video to be uploaded, the terminal 400 sends the video to the server 200, the server 200 determines a scene classification result of a video frame in the video as a scene classification result of the video through the third scene classification model, and performs a recommendation operation on the video based on the scene classification result to send the recommended video to the terminal 400, where the terminal uploading the video and the terminal presenting the scene classification result may be the same or different.
In some embodiments, when the scene classification system is applied to an image capturing scene, the terminal 400 receives an image captured by a user, the terminal 400 sends the captured image to the server 200, and the server 200 determines a scene classification result of the image through the third scene classification model and sends the scene classification result to the terminal 400, so that the terminal 400 directly presents the scene classification result and stores the captured image according to the corresponding scene classification result.
In other embodiments, when the scene classification method provided in this embodiment is implemented by the terminal alone, in the various application scenarios described above, the terminal may run the third scene classification model to determine a scene classification result of the image or the video, and directly present the scene classification result.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal 400 may be a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, a smart television, a smart car device, and the like, and the terminal 400 may be provided with a client, for example, but not limited to, a video client, a browser client, an information flow client, an image capturing client, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited.
Next, a structure of an electronic device for implementing an artificial intelligence based scene classification method provided in an embodiment of the present application is described, and as described above, the electronic device provided in an embodiment of the present application may be the server 200 or the terminal 400 in fig. 2. Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device provided in the embodiment of the present application, and the electronic device is taken as a server 200 for example. The server 200 shown in fig. 3 includes: at least one processor 210, memory 250, at least one network interface 220. The various components in server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 240 in fig. 3.
The processor 210 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor, any conventional processor, or the like.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 250 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks; a network communication module 252 for communicating to other electronic devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), among others.
In some embodiments, the artificial intelligence based scene classification apparatus provided in the embodiments of the present application may be implemented in software, and fig. 3 illustrates an artificial intelligence based scene classification apparatus 255 stored in a memory 250, which may be software in the form of programs and plug-ins, and includes the following software modules: an acquisition module 2551, a training module 2552 and an application module 2553, which are logical and thus can be arbitrarily combined or further split according to the implemented functions, which will be described below.
The artificial intelligence based scene classification method provided by the embodiment of the present application will be described in conjunction with an exemplary application and implementation of the server 200 provided by the embodiment of the present application.
Referring to fig. 5, fig. 5 is an architecture diagram of the artificial-intelligence-based scene classification method provided in an embodiment of the present application. The architecture includes a basic model (hereinafter referred to as the first scene classification model), a prototype class mining network (corresponding to prototype learning), a second scene classification model based on prototype classes, and a third scene classification model based on joint features. A prototype class (hereinafter referred to as a second scene class) is a sub-scene of the first scene class; for example, when the first scene class is the park class, the second scene classes are a lake, an amusement area, a footpath, and so on. The prototype class mining network is a network for acquiring the second scene labeling class of an image, and the second scene classification model has the capability of identifying different second scene classes. The joint features are obtained by concatenating the second classification features output by the second scene classification model with the first classification features output by the first scene classification model. The third scene classification model has the capability of identifying different first scene classes, and because its features are derived from the other two scene classification models, it can learn the feature expression of the first scene classes and of the second scene classes at the same time.
Referring to fig. 4A, fig. 4A is a schematic flowchart of the artificial-intelligence-based scene classification method provided in an embodiment of the present application, which will be described with reference to steps 101 to 103 shown in fig. 4A.
In step 101, for a plurality of image samples of a first scene class, at least one second scene class of each image sample is acquired.
As an example, the at least one second scene category of each image sample is a sub-scene category of the first scene category of that image sample; that is, the second scene category is a sub-scene of the first scene category. For example, when the first scene category is the park category, the second scene categories are a lake, a playground, a walkway, and the like within a park; when the first scene category is the mall category, the second scene categories are an escalator, a checkout counter, a restaurant, and the like within a mall. Step 101 mines the second scene classes of the image samples from a plurality of image samples pre-labeled with first scene classes. For example, given 100 image samples, of which 30 are pre-labeled with the park first scene class and 70 with the mall first scene class, the second scene class of each of the 100 image samples can be obtained through step 101, e.g., whether each image sample can be labeled as a lake, a playground, and so on; each image sample may yield one second scene class or several (at least two) second scene classes.
As an example, in the artificial-intelligence-based scene classification method provided in this embodiment, the second scene class of each image sample is mined in an unsupervised clustering manner, so a second scene class does not have a specific real class name; that is, it does not matter whether the second scene class of an image sample is a lake or the sea, and it can be abstractly replaced with a marker symbol. As long as the corresponding image samples are mapped to the corresponding marker symbol when the second scene class is identified in the training stage, so as to obtain the corresponding loss, it can be ensured that, in the actual learning process, the second scene classification model and the third scene classification model learn the actual feature expression of a given second scene class. For example, 30 target image samples of the first scene class A are clustered to obtain two clusters, namely cluster C corresponding to second scene class C and cluster D corresponding to second scene class D. Although the user does not know whether second scene class C or D is a lake or an amusement area, it can be determined that the target image samples in the set corresponding to cluster C all contain objects of second scene class C; therefore, training with the target image samples identified as second scene class C can sufficiently learn the feature expression of second scene class C, thereby implementing unsupervised second scene class mining.
In some embodiments, referring to fig. 4B, fig. 4B is a flowchart illustrating step 101 of the artificial-intelligence-based scene classification method provided in an embodiment of the present application. In step 101, the acquisition of at least one second scene category for each of a plurality of image samples of a first scene category may be implemented through steps 1011 to 1013 of fig. 4B.
In step 1011, a first classification feature of each of the plurality of image samples is obtained through the first scene classification model.
In some embodiments, obtaining the first classification feature of each of the plurality of image samples in step 1011 may be implemented as follows. The following processing is performed for each image sample: extract a first convolution feature of the image sample, and perform pooling processing on the first convolution feature to obtain a first pooled feature of the image sample; perform residual iteration processing on the first pooled feature multiple times to obtain a residual iteration processing result of the image sample; perform maximum pooling on the residual iteration processing result to obtain a maximum pooling result of the image sample; and perform first embedding processing on the maximum pooling result to obtain the first classification feature of the image sample.
As an example, after an image sample is input into the convolutional neural network, feature extraction is performed as shown in Table 1. The first convolution feature of the image sample is first extracted through the convolutional layer Conv1 in Table 1, and the first convolution feature is pooled through the convolutional layer Conv2 in Table 1 to obtain the first pooled feature of the image sample; the pooling may be maximum pooling. The first pooled feature is then subjected to multiple residual iterations through the convolutional layers Conv2-Conv5 in Table 1 to obtain the residual iteration result of the image sample, which is the feature extraction result in Table 1. The residual iteration result is max-pooled through the pooling layer in Table 2 to obtain the maximum pooling result of the image sample. The first feature processing network of the first scene classification model is implemented as a first feature processing layer, and the first classification feature is obtained by performing fully connected processing on the maximum pooling result through the first feature processing layer (Embedding1) in Table 2.
Table 1: convolutional layer structure of ResNet-101 (the table contents are provided as an image in the original publication)
Table 2: structure of the layers of the first scene classification model other than the convolutional layers (the table contents are provided as an image in the original publication)
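As an illustrative aid only, the following PyTorch sketch traces the feature path just described (Conv1, pooling, the residual stages Conv2-Conv5, global max pooling, and Embedding1) using the torchvision ResNet-101 layers; the 1024-dimensional embedding size follows the description below, while the function name and the use of a freshly initialized linear layer standing in for the trained Embedding1 are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torchvision

def extract_first_classification_feature(images, emb_dim=1024):
    """Sketch of the feature path: Conv1 -> pooling -> Conv2-Conv5 -> global max pooling -> Embedding1."""
    r = torchvision.models.resnet101(weights="IMAGENET1K_V1")
    x = r.conv1(images)                             # first convolution feature
    x = r.maxpool(r.relu(r.bn1(x)))                 # first pooled feature
    x = r.layer4(r.layer3(r.layer2(r.layer1(x))))   # residual iterations (the Conv2-Conv5 stages)
    x = torch.amax(x, dim=(2, 3))                   # maximum pooling result, (B, 2048)
    embedding1 = nn.Linear(2048, emb_dim)           # first feature processing layer; freshly initialized here,
                                                    # whereas in practice the trained Embedding1 weights are used
    return embedding1(x)                            # first classification feature
```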
In some embodiments, the first scene classification model includes a feature network, a first feature processing network, and a first classification network corresponding to a first scene category; before the first classification feature of each image sample in the plurality of image samples is obtained through the first scene classification model, feature extraction processing is carried out on the image samples through a feature network, and the maximum pooling processing result of the image samples is obtained; performing first embedding processing on the maximum pooling processing result of the image sample through a first feature processing network to obtain a first classification feature of the image sample; mapping the first classification feature to a first prediction probability of a pre-labeled first scene class of the image sample through a first classification network; determining a first classification loss of the image sample based on the first prediction probability and a pre-labeled first scene class of the image sample; parameters of the first scene classification model are updated according to the first classification loss of the image sample.
As an example, the feature network may be implemented as the layer structure in Table 1 plus the pooling layer Pool in Table 2, the first feature processing network may be implemented as the first feature processing layer Embedding1 in Table 2, and the first classification network may be implemented as the first classification layer FC1 in Table 2. For the process of performing feature extraction on the image sample through the feature network to obtain the maximum pooling result, and the process of performing the first embedding processing on the maximum pooling result through the first feature processing network to obtain the first classification feature, refer to step 1011. The first prediction probability that the image sample belongs to each first scene class is obtained by performing fully connected processing on the first classification feature through the first classification layer (FC1) in Table 2, where the number of first scene classes is P and P is a positive integer. The first prediction probability of the pre-labeled first scene class of the image sample is substituted into the cross-entropy loss function to obtain the first classification loss, and the parameters of the first scene classification model are updated in the reverse direction; during this backward update, only the parameters of the first classification layer and the first feature processing layer are updated. Training of the first scene classification model is completed through forward propagation and backward updating based on gradient descent. The first feature processing layer is a fully connected layer that outputs 1024-dimensional first classification features; the convolutional layers Conv1-Conv5 may adopt the parameters of ResNet-101 pre-trained on the ImageNet data set, and the newly added layers (such as Embedding1 and FC1) are initialized from a Gaussian distribution with a variance of 0.01 and a mean of 0. Different network structures and different pre-trained model weights may be used as the first scene classification model. The first scene classification model obtained through this training process has the ability to recognize P first scene categories.
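As an illustrative aid only, the following sketch shows a training step consistent with the description above: the newly added layers are initialized from a Gaussian with mean 0 and variance 0.01, and only the first feature processing layer and the first classification layer are updated. The optimizer, learning rate, and function names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_first_model(feature_net, embedding1, fc1, loader, epochs=1, lr=0.01):
    """Sketch of the first-model training described above; hyperparameters are assumptions."""
    # Newly added layers are initialized from a Gaussian with mean 0 and variance 0.01 (std = 0.1).
    for layer in (embedding1, fc1):
        nn.init.normal_(layer.weight, mean=0.0, std=0.1)
        nn.init.zeros_(layer.bias)

    # Only the first feature processing layer and the first classification layer are updated.
    optimizer = torch.optim.SGD(list(embedding1.parameters()) + list(fc1.parameters()), lr=lr)
    feature_net.eval()

    for _ in range(epochs):
        for images, first_label in loader:
            with torch.no_grad():
                pooled = feature_net(images).flatten(1)   # maximum pooling result of the image samples
            logits = fc1(embedding1(pooled))              # first prediction probabilities (as logits)
            loss = F.cross_entropy(logits, first_label)   # first classification loss (cross entropy)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```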
In step 1012, the following is performed for each first scene category: the method comprises the steps of obtaining image samples belonging to a first scene category from a plurality of image samples to serve as target image samples, and carrying out clustering processing on the plurality of target image samples based on first classification characteristics of the plurality of target image samples to obtain at least one cluster corresponding to at least one second scene category one to one.
Continuing the above example, there are 100 image samples, 30 of which are pre-labeled with the park first scene category (hereinafter referred to as first scene category A) and 70 of which are pre-labeled with the mall first scene category (hereinafter referred to as first scene category B). For first scene category A, step 1012 obtains, from the 100 image samples, the image samples pre-labeled with first scene category A as target image samples; then, based on the first classification features of the 30 target image samples pre-labeled with first scene category A, the 30 target image samples are clustered to obtain at least one cluster in one-to-one correspondence with at least one second scene category. The clustering for first scene category A is completed based on this process, so as to obtain at least one second scene category under first scene category A.
In some embodiments, in step 1012, clustering the plurality of target image samples based on their first classification features to obtain at least one cluster in one-to-one correspondence with at least one second scene category may be implemented by the following technical solution: compose the plurality of target image samples into a target image sample set; randomly select N target image samples from the target image sample set, take the first classification features corresponding to the N target image samples as the initial centroids of a plurality of clusters, and remove the N target image samples from the target image sample set, where N is the number of second scene classes corresponding to the first scene class and N is an integer greater than or equal to 2; initialize the number of iterations of the clustering processing to M and establish an empty set for each cluster, where M is an integer greater than or equal to 2; perform the following processing during each iteration of the clustering processing: update the set of each cluster, perform centroid generation based on the updating result to obtain a new centroid of each cluster, and, when the new centroid differs from the initial centroid, add the target image sample corresponding to the initial centroid back to the target image sample set and update the initial centroid based on the new centroid; and determine the set of each cluster obtained after M iterations as the clustering result, or determine the set of each cluster obtained after m iterations as the clustering result, where the centroids of the clusters obtained after m iterations are the same as those obtained after m-1 iterations, m is smaller than M, m is an integer variable, and the value of m is greater than or equal to 2 and smaller than or equal to M.
In some embodiments, the updating process is performed on the set of each cluster, and the centroid generation process is performed based on the result of the updating process to obtain a new centroid of each cluster, which may be implemented by the following technical solutions: performing the following for each target image sample of the set of target image samples: determining a similarity between the first classification feature of the target image sample and the initial centroid of each cluster; determining the initial centroid corresponding to the maximum similarity as belonging to the same cluster as the target image sample, and transferring the target image sample to a set of clusters corresponding to the initial centroid of the maximum similarity, wherein the initial centroid of the maximum similarity is the initial centroid corresponding to the maximum similarity; and averaging the first classification characteristics of each target image sample in each cluster set to obtain a new centroid of each cluster.
Continuing the above example, for first scene category A there are 30 target image samples and the preset number of second scene classes is 2, so the clustering aims to divide the 30 target image samples into two clusters. Each cluster has a corresponding set containing the target image samples of that cluster. First, the first classification features of 2 randomly selected target image samples are taken as the initial centroids of the two clusters. For each of the remaining 28 target image samples, the similarity between the sample and the 2 initial centroids is calculated, for example using the L2 distance; for instance, if the first classification feature of target image sample E is closer to initial centroid a, target image sample E is assigned to the set of the cluster corresponding to initial centroid a. After the assignment has been performed for all 28 target image samples, a new centroid is recalculated for each cluster. If the new centroids of the two clusters are the same as before, or the similarity between each new centroid and its initial centroid is greater than the similarity threshold, each set is directly determined as the clustering result. Otherwise, the initial centroids are updated with the new centroids, and the assignment is performed again on the target image samples other than those corresponding to the centroids; for example, after the original initial centroid a is replaced by the new centroid, the target image sample corresponding to the original initial centroid a belongs to the samples other than those corresponding to the centroids and must participate in the assignment again. This continues until the new centroids of the two clusters are unchanged, or the similarity between the new centroids and the initial centroids is greater than the similarity threshold, or the specified number of iterations is reached.
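As an illustrative aid only, the following NumPy sketch shows a standard k-means-style clustering of the kind described above (L2 distance, random initial centroids, early stop when the centroids no longer change); it simplifies the bookkeeping of the embodiment, which also returns the samples used as initial centroids to the sample set, and all names are assumptions.

```python
import numpy as np

def mine_second_scene_clusters(features, n_clusters, max_iter=100, seed=0):
    """K-means-style clustering sketch over first classification features of one first scene class."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), n_clusters, replace=False)]  # initial centroids

    for _ in range(max_iter):                                   # at most M iterations
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        assign = d.argmin(axis=1)                               # move each sample to its nearest centroid
        new_centroids = np.stack([
            features[assign == k].mean(axis=0) if np.any(assign == k) else centroids[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centroids, centroids):               # centroids unchanged -> stop early
            break
        centroids = new_centroids

    return assign, centroids   # one cluster per second scene category
```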
In step 1013, a second scene category for each image sample is determined based on the at least one cluster.
In some embodiments, the determining the second scene category of each image sample based on at least one cluster may be implemented by the following technical solutions: the following processing is performed for each cluster: averaging the first classification features of each target image sample in each cluster to obtain the centroid of each cluster; performing the following for each of a plurality of image samples: determining a feature distance between a first classification feature of the image sample and the centroid of each cluster, determining a cluster of centroids corresponding to the feature distance smaller than a feature distance threshold as a cluster associated with the image sample, and determining a second scene category corresponding to the cluster as a second scene category of the image sample.
Taking advantage of the above example, step 1012 may obtain at least one cluster of the first scene categories a corresponding to the second scene categories one to one, and at least one cluster of the first scene categories B corresponding to the second scene categories one to one, and if the number of the second scene categories in each first scene category is 2, step 1012 obtains 4 clusters in total, and each cluster corresponds to 4 second scene categories. At least one cluster in step 1013 refers to a cluster obtained for all first scene classes, and for 100 image samples, a second scene class corresponding to each image sample needs to be determined, so that a centroid of the cluster corresponding to each second scene class is first determined, for example, for a second scene class C, there is a set C of corresponding clusters, where the set C includes 10 target image samples, and since the first classification features of 100 image samples have been obtained in advance and the target image samples are from the image samples, the first classification features of 10 target image samples are directly averaged to obtain a centroid of the cluster corresponding to the second scene class C, where the centroid can represent a feature expression of the second scene class C, and for a certain image sample F, a feature distance between the image sample F and 4 centroids of 4 second scene classes is determined, the feature distance may be measured by an L2 distance, or by a cosine distance, and the second scene class corresponding to the centroid where the feature distance is smaller than the feature distance threshold is determined as the second scene class of the image sample F.
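As an illustrative sketch, determining the second scene categories of the image samples from the cluster centroids and a feature distance threshold may look roughly as follows; the names are assumed for illustration, and an image sample may end up associated with zero, one, or several clusters:

```python
import numpy as np

def assign_second_scene_categories(sample_features, centroids, distance_threshold):
    """For every image sample, collect the second scene categories whose cluster
    centroid lies within the feature distance threshold (L2 distance)."""
    labels = []
    for feat in sample_features:
        dists = np.linalg.norm(centroids - feat, axis=1)
        labels.append(np.nonzero(dists < distance_threshold)[0].tolist())
    return labels  # list of second-scene-category indices per image sample
```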
In some embodiments, prior to determining a cluster of centroids corresponding to feature distances less than a feature distance threshold as the cluster associated with the image sample, for each first scene class, performing the following: querying a database for a tagging accuracy of a first scene category; acquiring a first scene category threshold which is positively correlated with the marking accuracy; determining the number proportion of target image samples belonging to a first scene category in a plurality of image samples, and taking the number proportion as the weight of a first scene category threshold; and carrying out weighted summation processing on the plurality of first scene category thresholds based on the weight of each first scene category threshold to obtain a characteristic distance threshold.
Taking the above example as a support, the feature distance is evaluated by the feature distance threshold, and if the feature distance is smaller than the feature distance threshold, the second scene category corresponding to the centroid corresponding to the feature distance is considered as the second scene category of the image sample, so the feature distance threshold is very important for the accuracy of mining the second scene category, and since the feature distance threshold is threshold data applicable to all the image samples and all the second scene categories, the labeling accuracy of all the first scene categories needs to be considered comprehensively to determine the feature distance threshold, and since the higher the labeling accuracy is, the more accurate the set of clusters representing the second scene categories obtained by clustering corresponding to the first scene categories is, it is seen that, if the number of image samples of the first scene category with high labeling accuracy is greater, the higher the feature distance threshold may be, and if the number of image samples of the first scene category with low labeling accuracy is greater, the feature distance threshold needs to be lowered appropriately to prevent errors in associating the second scene class to the image sample.
Taking the above example in mind, the following processing is performed for each first scene class, for example, the following processing is performed for the first scene class a: querying a database for a labeling accuracy of the first scene class a, e.g. there are 3 image samples out of 30 image samples labeled as first scene class a that do not actually belong to the first scene class a, obtaining a first scene class threshold 90% positively correlated with a labeling accuracy, determining a number fraction of target image samples belonging to the first scene class in a plurality of image samples, the number fraction of image samples labeled as first scene class a being 30%, and using the number fraction as a weight of the first scene class threshold, i.e. 0.3 as a weight of the labeling accuracy 90%, by which a labeling accuracy of 80% for the first scene class B, a first scene class threshold 80% positively correlated with a labeling accuracy of 80%, and a number fraction of image samples labeled as first scene class B being 70% are also obtained, based on the weight of each first scene class threshold, and performing weighted summation processing on the plurality of first scene category thresholds to obtain a characteristic distance threshold, and thus performing weighted summation on the marking accuracy of 80% and the marking accuracy of 90% to obtain the characteristic distance threshold of 0.83.
In some embodiments, the obtaining of the first scene category threshold positively correlated to the marking accuracy may be implemented by the following technical solutions: for each target image sample, determining a candidate center distance between a first classification feature of the target image sample and the centroid of each cluster, and determining the minimum candidate center distance as the center distance of the target image sample; acquiring ascending sequencing positions positively correlated with the marking accuracy, and performing ascending sequencing on the center distances of the multiple target image samples; and acquiring the center distance corresponding to the ascending sorting position in the ascending sorting result, and determining the center distance as a first scene category threshold.
Continuing with the above embodiment, with respect to the first scene category a, determining a first scene category threshold through specific steps, where the determined first scene category threshold is positively correlated with the marking accuracy, there are 30 target image samples marked as the first scene category a, obtaining clusters of two second scene categories under the first scene category a after the clustering processing of step 1012, determining, with respect to the target image sample E, a candidate center distance between the first classification feature of the target image sample and the centroid of each of the clusters of the two second scene categories, for example, determining the candidate center distance between the target image sample E and the centroid of a certain second scene category to be 0.7, determining the candidate center distance between the target image sample E and the centroid of another second scene category to be 0.8, and determining the smallest candidate center distance 0.7 as the center distance of the target image sample E, the above-described processing is performed on the 30 target image samples, thereby obtaining the center distances of the 30 target image samples, obtaining the ascending sort positions that are positively correlated with the marking accuracy, for example, the marking accuracy is 80%, the ascending sort positions that are positively correlated with the marking accuracy are the 24 th bit among the 30 target image samples, and ascending sorting the center distances of the plurality of target image samples, ascending sorting the center distances of the 30 target image samples from small to large, obtaining the center distance corresponding to the 24 th bit in the ascending sort result, and determining the center distance of the 24 th bit as the first scene category threshold.
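As an illustrative sketch, the per-category threshold (taken from the ascending sort of center distances at the position positively correlated with the labeling accuracy) and the overall feature distance threshold (the weighted sum using each category's share of the image samples) may be computed roughly as follows; the helper names and the exact index rounding are assumptions of this sketch:

```python
import numpy as np

def class_threshold(target_features, class_centroids, labeling_accuracy):
    """Per first-scene-category threshold: the center distance at the ascending
    sort position positively correlated with the labeling accuracy."""
    # Center distance of each target sample = distance to its nearest cluster centroid.
    dists = np.linalg.norm(target_features[:, None, :] - class_centroids[None, :, :], axis=2)
    center_dists = dists.min(axis=1)
    # e.g. 80% accuracy and 30 samples -> the 24th smallest center distance.
    pos = max(int(np.ceil(labeling_accuracy * len(center_dists))) - 1, 0)
    return float(np.sort(center_dists)[pos])

def feature_distance_threshold(class_thresholds, class_sample_counts):
    """Weighted sum of the per-category thresholds, each weighted by that
    category's share of all image samples."""
    weights = np.asarray(class_sample_counts, dtype=float) / np.sum(class_sample_counts)
    return float(np.dot(weights, class_thresholds))
```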
In step 102, a first scene classification model, a second scene classification model and a third scene classification model are jointly trained based on a first scene class of a plurality of image samples and a second scene class of a plurality of image samples.
As an example, the first scene classification model is used to identify different first scene classes, the second scene classification model is used to identify different second scene classes, and the third scene classification model is obtained based on the combination of the first scene classification model and the second scene classification model.
In some embodiments, referring to fig. 4C, fig. 4C is a flowchart illustrating step 102 of the artificial intelligence based scene classification method provided in the embodiment of the present application. In step 102, the first scene classification model, the second scene classification model, and the third scene classification model are jointly trained based on the first scene category of the plurality of image samples and the second scene category of the plurality of image samples; this may be implemented by performing, for each image sample, step 1021 and step 1022 shown in fig. 4C.
In step 1021, a scene classification process is performed on the image sample through the first scene classification model, the second scene classification model, and the third scene classification model, so as to obtain a scene classification result of the image sample.
In some embodiments, the first scene classification model shares a feature network of the first scene classification model with the second scene classification model and the third scene classification model, respectively, the first scene classification model shares a first feature processing network of the first scene classification model with the third scene classification model, and the second scene classification model shares a second feature processing network of the second scene classification model with the third scene classification model; in step 1021, performing scene classification processing on the image sample through the first scene classification model, the second scene classification model and the third scene classification model to obtain a scene classification result of the image sample, which can be implemented by the following technical scheme: performing feature extraction processing on the image sample through a feature network to obtain a maximum pooling processing result of the image sample; performing first embedding processing on the maximum pooling processing result of the image sample through a first feature processing network of a first scene classification model to obtain a first classification feature of the image sample; mapping the first classification feature to a first prediction probability of a pre-labeled first scene class of the image sample through a first classification network; performing second embedding processing on the maximum pooling processing result of the image sample through a second feature processing network to obtain a second classification feature of the image sample, and mapping the second classification feature into a second prediction probability of at least one pre-marked second scene category of the image sample through a second classification network of a second scene classification model; splicing the first classification characteristic and the second classification characteristic, and mapping a splicing processing result to a third prediction probability of the pre-marked first scene category of the image sample through a third classification network of a third scene classification model; and combining the first prediction probability, the second prediction probability and the third prediction probability into a scene classification result of the image sample.
As an example, referring to fig. 5, the first scene classification model shares its feature network (a convolutional neural network) with the second scene classification model and the third scene classification model, respectively, the first scene classification model shares its first feature processing network (Embedding 1) with the third scene classification model, and the second scene classification model shares its second feature processing network (Embedding 2) with the third scene classification model. After the image sample is input into the convolutional neural network, feature extraction processing is performed, which may refer to Table 1 and Table 2: the first convolution feature of the image sample is extracted through the convolution layer Conv1 in Table 1, the first convolution feature is pooled through the convolution layer Conv2 in Table 1 to obtain the first pooled feature of the image sample (the pooling may be maximum pooling), and the first pooled feature is subjected to multiple rounds of residual iterative processing through the convolution layers Conv2-Conv5 in Table 1 to obtain the residual iterative processing result of the image sample, which is the feature extraction result in Table 1. For the first scene classification model, the residual iterative processing result is subjected to maximum pooling through the pooling layer in Table 2 to obtain the maximum pooling result of the image sample, the maximum pooling result is subjected to first embedding processing through the first feature processing layer (Embedding 1) in Table 2 (the first embedding processing is in fact full-connection processing) to obtain the first classification feature, the first classification network of the first scene classification model is implemented as a first classification layer, and the first classification feature is subjected to full-connection processing through the first classification layer (FC 1) in Table 2 and mapped to the first prediction probability that the image sample belongs to each first scene category, including the first prediction probability of the pre-labeled first scene category (corresponding to the first classification loss); the number of first scene categories is P, and P is a positive integer. For the second scene classification model, the residual iterative processing result is similarly subjected to maximum pooling through the pooling layer in Table 3 to obtain the maximum pooling result of the image sample, the maximum pooling result is subjected to second embedding processing through the second feature processing layer (Embedding 2) in Table 3 (the second embedding processing is in fact full-connection processing) to obtain the second classification feature, the second classification network of the second scene classification model is implemented as a second classification layer, and the second classification feature is subjected to full-connection processing through the second classification layer (FC 2) in Table 3 and mapped to the second prediction probability that the image sample belongs to each second scene category, including the second prediction probability of the pre-labeled second scene categories (corresponding to the second classification loss); the number of second scene categories is K × P, K is a positive integer, and K is the number of second scene categories under each first scene category.
TABLE 3 Structure of each layer, other than the convolutional layers, in the second scene classification model
As an example, for the third scene classification model, the third classification network is implemented as a third classification layer FC3: the first classification feature and the second classification feature are subjected to splicing processing, and full-connection processing is then performed on the splicing processing result through the third classification layer (FC 3) in Table 4, so as to obtain the third prediction probability that the image sample belongs to each first scene category, including the third prediction probability of the pre-labeled first scene category (corresponding to the third classification loss); the number of first scene categories is P.
Table 4 Structure of each layer of the third classification network in the third scene classification model
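As an illustrative sketch of the arrangement in fig. 5, the shared feature network, the two feature processing networks (Embedding1/Embedding2) and the three classification layers (FC1/FC2/FC3) could be wired together in PyTorch roughly as follows; the torchvision ResNet-101 backbone and its default global pooling are assumptions of this sketch (the embodiment describes maximum pooling), and all names are illustrative:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class JointSceneClassifier(nn.Module):
    def __init__(self, num_first_classes: int, num_second_classes: int, embed_dim: int = 1024):
        super().__init__()
        backbone = models.resnet101(weights=None)
        # Shared convolutional feature network (Conv1-Conv5 plus global pooling).
        self.feature_net = nn.Sequential(*list(backbone.children())[:-1])
        feat_dim = backbone.fc.in_features
        self.embedding1 = nn.Linear(feat_dim, embed_dim)        # first feature processing network
        self.embedding2 = nn.Linear(feat_dim, embed_dim)        # second feature processing network
        self.fc1 = nn.Linear(embed_dim, num_first_classes)      # first classification network (P outputs)
        self.fc2 = nn.Linear(embed_dim, num_second_classes)     # second classification network (K*P outputs)
        self.fc3 = nn.Linear(embed_dim * 2, num_first_classes)  # third classification network

    def forward(self, images):
        pooled = self.feature_net(images).flatten(1)     # pooled backbone features
        e1 = self.embedding1(pooled)                     # first classification feature
        e2 = self.embedding2(pooled)                     # second classification feature
        logits1 = self.fc1(e1)                           # first prediction (first scene classes)
        logits2 = self.fc2(e2)                           # second prediction (second scene classes)
        logits3 = self.fc3(torch.cat([e1, e2], dim=1))   # third prediction from the joint feature
        return logits1, logits2, logits3
```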
In step 1022, parameters of the first scene classification model, the second scene classification model and the third scene classification model are updated based on the scene classification results of the plurality of image samples, the pre-labeled first scene class of the image samples and the pre-labeled second scene class of the image samples.
In some embodiments, updating parameters of the first scene classification model, the second scene classification model, and the third scene classification model based on the scene classification results of the plurality of image samples, the pre-labeled first scene classification of the image samples, and the pre-labeled second scene classification of the image samples in step 1022 may be implemented by: determining a first classification loss based on the first prediction probability and a pre-labeled first scene class of the image sample; determining a third classification loss based on the third prediction probability and the pre-labeled first scene class of the image sample; determining a second classification loss based on the at least one second prediction probability and the at least one pre-labeled second scene class of the image sample; performing fusion processing on the first classification loss, the second classification loss and the third classification loss to obtain combined loss; and updating parameters of the first scene classification model, the second scene classification model and the third scene classification model according to the joint loss.
As an example, for the first scene classification model, the first classification loss is a cross-entropy loss function, see equation (1):

E = -\sum_{i} \sum_{c=1}^{nclass} y_{i,c} \log\left(p_{i,c}\right)   (1)

where the input is an image sample with a pre-labeled first scene category, y_{i,c} indicates whether the pre-labeled first scene category of image sample i is category c, p_{i,c} is the first prediction probability predicted by the FC1 layer for category c, E is the first classification loss, nclass characterizes the n first scene classes, and i is the identification of an image sample.
As an example, at least one binary cross-entropy loss function is taken as the second classification loss corresponding to the output of FC2; the binary cross-entropy loss function for a certain second scene class is given in equation (2):

L_i = -\left[ y_i \log\left(p_i\right) + \left(1 - y_i\right)\log\left(1 - p_i\right) \right]   (2)

where L_i is the loss of an image sample i for a certain second scene class, p_i is the second prediction probability output by FC2 corresponding to the pre-labeled second scene class, and y_i is the real label of the image sample: y_i is 1 if the real label of the image sample is the pre-labeled second scene class, and 0 if it is not.
As an example, for an image sample F that has 2 second scene categories (e.g., lake and footpath), the binary cross-entropy losses of the two second scene categories are obtained respectively and then summed to obtain the second classification loss.
As an example, the third classification loss of the third scene classification model is similar to the first classification loss, and the overall loss of the joint training is the fusion (for example, the sum) of the first classification loss, the second classification loss, and the third classification loss.
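As an illustrative sketch, the joint loss described above (cross-entropy for the first and third predictions, plus a multi-label binary cross-entropy summed over the K × P second scene categories) could be computed roughly as follows, assuming the logits from the model sketch above and a multi-hot vector of pre-labeled second scene categories:

```python
import torch
import torch.nn.functional as F

def joint_loss(logits1, logits2, logits3, first_labels, second_label_vector):
    """Joint training loss: cross-entropy on the pre-labeled first scene class
    for the first and third predictions, plus binary cross-entropy summed over
    the second scene classes for the second prediction."""
    loss1 = F.cross_entropy(logits1, first_labels)      # first classification loss
    loss3 = F.cross_entropy(logits3, first_labels)      # third classification loss
    loss2 = F.binary_cross_entropy_with_logits(         # second classification loss
        logits2, second_label_vector.float(), reduction="sum"
    ) / logits2.shape[0]
    return loss1 + loss2 + loss3
```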
In some embodiments, when updating the parameters, the parameters of the first scene classification model, the second scene classification model, and the third scene classification model may be updated at the same time, or only the parameters of the second feature processing network of the second scene classification model, the parameters of the second classification network, and the parameters of the third classification network of the third scene classification model may be updated, so that the training efficiency is effectively improved.
In step 103, performing scene classification processing on the image to be classified through the trained third scene classification model to obtain a first scene class corresponding to the image to be classified in different first scene classes.
As an example, when the trained third scene classification model is used to perform scene classification processing on an image to be classified, in practice, the splicing processing result of the first classification feature and the second classification feature is subjected to full-connection processing through the third classification network of the third scene classification model and mapped to the third prediction probability that the image to be classified belongs to each first scene category; the convolutional neural network (feature network), the first feature processing network, the second feature processing network, and the third classification network in the framework shown in fig. 5 are retained in the application stage.
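As an illustrative sketch of the application stage, and assuming the joint model sketched earlier, inference would only use the third prediction and return the first scene category with the highest probability:

```python
import torch

@torch.no_grad()
def classify_scene(model, image_tensor):
    """Application stage: only the feature network, the two embedding branches
    and FC3 are used; the first scene class with the highest third prediction
    probability is returned."""
    model.eval()
    _, _, logits3 = model(image_tensor.unsqueeze(0))
    probs = torch.softmax(logits3, dim=1)
    return probs.argmax(dim=1).item()
```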
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
In some embodiments, the scene classification method based on artificial intelligence provided by the embodiments of the present application is applied to a video recommendation application scene, a terminal uploads a video to be published to a server, the server performs scene classification processing on a key video frame of the video to obtain a scene classification result of the key video frame, the scene classification result of the key video frame is used as an attribute tag of the video to be published, and the video is published and recommended to a user whose user portrait matches the attribute tag.
Referring to fig. 5, the architecture diagram includes a basic model (hereinafter referred to as a first scene classification model), a prototype class mining network, a second scene classification model based on prototype classes, and a third scene classification model based on joint features, wherein the prototype class (hereinafter referred to as a second scene class) is a sub-scene of the first scene class, for example, the first scene class is a park class, and the second scene class is a lake, a play field, a sidewalk, etc., the prototype class mining network is a network for acquiring a second scene labeling class of an image, the second scene classification model has a capability of identifying a different second scene class, the joint features are obtained by splicing a second classification feature output by the second scene classification model with a first classification feature output by the first scene classification model, and the third scene classification model has a capability of identifying a different first scene class, since the features of the third scene classification model are derived from the other two scene classification models, the feature expression of the first scene class and the feature expression of the second scene class can be learned simultaneously.
In some embodiments, the image F is input into the convolutional neural network and then subjected to a feature extraction process, the feature extraction process may refer to table 1, the feature extraction result of table 1 is subjected to a maximum pooling process by a pooling layer in table 2 to obtain a maximum pooling result of the image F, the maximum pooling result is subjected to a full join process by a first feature processing layer Embedding1 in table 2 to obtain first classification features, the first classification features are subjected to a full join process by a first classification layer FC1 in table 2 to obtain a first prediction probability that the image F belongs to each first scene class, the number of the first scene classes is P, and P is a positive integer, the process is a forward propagation process of the first scene classification model, the first prediction probability of the pre-labeled first scene class of the image F is substituted into a cross entropy loss function to update parameters of the first scene classification model in a backward direction, and during reverse update, only parameters of the first classification layer and the first feature processing layer can be updated, and the training of the first scene classification model is completed through forward propagation and reverse update based on a gradient descent method, wherein the first feature processing layer is a full-link layer and outputs 1024-dimensional first classification features, the convolutional layer Conv1-Conv5 adopts parameters of ResNet101 pre-trained on an ImageNet data set, and newly added layers (such as Embedding1 and FC 1) are initialized by adopting Gaussian distribution with the variance of 0.01 and the mean value of 0. Different network structures, different pre-trained model weights may be used as the first scene classification model.
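As an illustrative sketch of this first training stage, and assuming the model class sketched earlier, the Gaussian initialization of the newly added layers and the gradient-descent updates restricted to Embedding1 and FC1 could look roughly as follows (restricting the update to these two layers is one of the options described above; loading of the pre-trained backbone weights is omitted, and the data loader and hyperparameters are assumptions of this sketch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_new_layer(layer: nn.Linear):
    """Newly added layers (e.g. Embedding1, FC1) initialized from a zero-mean
    Gaussian with variance 0.01, as described above."""
    nn.init.normal_(layer.weight, mean=0.0, std=0.01 ** 0.5)
    nn.init.zeros_(layer.bias)

def train_first_stage(model, loader, epochs=1, lr=0.01):
    """First-stage training: forward propagation, cross-entropy on the
    pre-labeled first scene class, and updates of Embedding1/FC1 only."""
    init_new_layer(model.embedding1)
    init_new_layer(model.fc1)
    params = list(model.embedding1.parameters()) + list(model.fc1.parameters())
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(epochs):
        for images, first_labels in loader:
            logits1, _, _ = model(images)
            loss = F.cross_entropy(logits1, first_labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```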
Referring to fig. 6, fig. 6 is a schematic diagram of a residual network of the artificial intelligence based scene classification method provided in the embodiment of the present application. The structure of the residual module in Table 1 refers to the residual network in fig. 6: the input of the residual module is 256-dimensional, the input passes through convolution processing with convolution kernels of three different sizes, the convolution processing result and the input are added, and the sum serves as the input of the next residual module, where relu denotes the activation function and indicates where activation processing is performed. The residual module is composed of three convolution layers, a fusion operator and an activation function: the output of the convolution layers and the input of the residual module are fused, for example added through an addition operator, to obtain a fusion processing result; activation processing is performed on the fusion processing result through the relu activation function; and the convolution layers then perform multi-size convolution processing on the activation processing result, for example three layers of convolution processing. Training becomes increasingly difficult as the network depth increases, mainly because gradient vanishing or gradient explosion is easily caused by multi-layer back propagation of error signals during network training based on stochastic gradient descent; the residual network shown in fig. 6 alleviates the training difficulty caused by network depth, so that the network performance (the accuracy and precision of completing tasks) is high.
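As an illustrative sketch, a residual module of the kind shown in fig. 6 (three convolutions, additive fusion with the input, and a relu activation) could be written roughly as follows; the channel sizes follow the 256-dimensional example above, and batch normalization is omitted for brevity:

```python
import torch.nn as nn
import torch.nn.functional as F

class BottleneckResidualBlock(nn.Module):
    """Residual module: three convolutions of different kernel sizes
    (1x1 -> 3x3 -> 1x1), additive fusion with the block input, then ReLU."""
    def __init__(self, channels: int = 256, bottleneck: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False)
        self.conv2 = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False)
        self.conv3 = nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.relu(self.conv2(out))
        out = self.conv3(out)
        return F.relu(out + x)   # fuse with the input, then activate
```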
In some embodiments, after the training of the first scene classification model is completed, second scene categories are mined from the learned first feature processing layer. Mining the second scene categories means performing prototype extraction on each first scene category to obtain finer-grained sub-scenes within the scene of each first scene category, so as to extract finer-grained scene features for the first scene categories. Each first scene category predefines a number K of second scene categories (e.g., K = 10), and when the number of first scene categories is P, K × P second scene categories are obtained in total. The specific processing is as follows: forward computation is performed on each image of the plurality of images by the first scene classification model to obtain the first classification feature (the output of Embedding 1) of each image, and the following processing is performed for a certain first scene category, taking the park category as an example: the first classification features of the images labeled as the park category are clustered, for example by K-means clustering, to obtain K clusters Si with K corresponding cluster centers; the center distances of all images labeled as the park category are calculated, for example, for the image F, the distances between the first classification feature of the image F and the K cluster centers are calculated, and the closest distance Di among the K distances is taken as the center distance of the image F; the center distances Di of all the images are arranged in ascending order, the distance at the x% position of the ascending order is taken as the threshold thri of the park category, and the data (thri, Pi) are stored, where Pi is the number of images labeled as the park category. For example, if x% is 80% and there are 10 samples, the threshold thri is the 8th value in ascending order. Here x% is the proportion of images regarded as close to their cluster centers, and x% may be a preset value: if the labels of all images labeled as the park category are correct, that is, there is no case where an image labeled as a park is actually a mall, x% may be preset to 90% or more, which means that 90% of the data can be correctly assigned to the cluster centers of the second scene categories under the park category, and the threshold characterizes the images that can be correctly clustered under the park category. Usually x% is not preset to 100%, so that, for example, images lying at the boundary between two clusters are excluded; the preset proportion of difficult samples is generally not more than 10%. x% may also be adjusted as required: if a certain category has a 40% labeling error rate, x% is preset to 50%, which means that the center distance of the image at the 50% position is used as the category threshold of that category. All first scene categories are processed in this way, finally obtaining K × P cluster centers and the corresponding (thri, Pi) data of the P first scene categories, and an image partition threshold Thr is determined by weighted summation: the thresholds thri are summed, with the weight of each thri being the proportion of the Pi images of that category in the total number of images. For example, if the images of the park category account for 1/5 of the total number of images and the category threshold of the park category is the center distance 0.6 corresponding to the 80% position, while the images of the mall category account for 4/5 of the total number of images and the category threshold of the mall category is the center distance 0.8 corresponding to the 90% position, then 0.6 multiplied by 1/5 gives 0.12, 0.8 multiplied by 4/5 gives 0.64, and the finally obtained image partition threshold is 0.76. Finally, for all the images, the cluster centers to which each image belongs are determined; for example, for the image F, the distances from the image F to the K × P cluster centers are calculated, the cluster centers whose distances are smaller than the image partition threshold are recorded as the cluster centers of the image F (assuming there are 3 of them), and the 3 corresponding second scene categories are saved for the image F, thereby obtaining the second scene categories of all the images, where the weights of the 3 second scene categories are the same.
In some embodiments, the number of second scene categories under each first scene category is designated as K. The number of second scene categories may also be adjusted according to the complexity of the first scene categories, for example by determining a number of second scene categories positively correlated with the data amount of the first scene category: if first scene category A has 1000 samples and first scene category B has 2000 samples, the number of second scene categories of B is 2 times the number of second scene categories of A.
The method comprises the steps of mining prototype classes (second scene classes) in a first scene class through first classification features, mapping an image of a mixed complex fine-grained scene class to more specific sub-scenes, and enabling common features of a plurality of different second scene classes in the first scene class to be learned conveniently and respectively.
In some embodiments, referring to fig. 7, fig. 7 is a schematic diagram of a second scene category of the artificial intelligence based scene classification method provided in the embodiment of the present application, from which a fine-grained second scene category may be mined, for example, a second scene category with higher accuracy under the first scene category (park), where the second scene category includes: lake landscape, lane, child facilities, flower bed, etc.
In some embodiments, after the training of the first scene classification model is completed and the second scene categories of each image are mined, the first scene classification model, the second scene classification model and the third scene classification model are jointly trained. During the joint training, the forward propagation process of the first scene classification model is similar to the above process; for the second scene classification model, the maximum pooling result of the image sample output by the first scene classification model is subjected to full-connection processing through the second feature processing layer (Embedding 2) in Table 3 to obtain the second classification feature, and the second classification feature is then subjected to full-connection processing through the second classification layer (FC 2) in Table 3 to obtain the second prediction probability that the image F belongs to each second scene category, where the number of second scene categories is K × P, K is a positive integer, and K is the number of second scene categories under each first scene category; the above process is the forward propagation process of the second scene classification model.
In some embodiments, for the third scene classification model, the first classification feature and the second classification feature are subjected to stitching processing, and then the stitching processing result is subjected to full-link processing by using a third classification layer (FC 3) in table 4, so as to obtain a third prediction probability that the image F belongs to each first scene category, where the above process is a forward propagation process of the third scene classification model. And the second classification features are combined into the first classification features in an embedded mode, so that the first classification features of the coarse-grained first scene categories and the second classification features which are more common jointly drive a scene classification task.
In some embodiments, the loss of each scene classification model is determined based on a first prediction probability obtained by forward propagation of the first scene classification model during the joint training, a second prediction probability obtained by forward propagation of the second scene classification model, and a third prediction probability obtained by forward propagation of the third scene classification model, and for the first scene classification model, the first classification loss is a cross entropy loss function, see formula (3):
E = -\sum_{i} \sum_{c=1}^{nclass} y_{i,c} \log\left(p_{i,c}\right)   (3)

where the input is an image with a pre-labeled first scene category, y_{i,c} indicates whether the pre-labeled first scene category of image i is category c, p_{i,c} is the first prediction probability predicted by the FC1 layer for category c, E is the first classification loss, nclass characterizes the n first scene classes, and i is the identification of the image.
For the second scene classification model, since a plurality of second scene classes have already been obtained for each image during prototype mining, an image may have a plurality of second scene classes; each second scene class is regarded as one label in a multi-label classification, and binary cross-entropy loss functions are adopted as the second classification loss corresponding to the output of FC2. The binary cross-entropy loss function for a certain second scene class is given in formula (4):

L_i = -\left[ y_i \log\left(p_i\right) + \left(1 - y_i\right)\log\left(1 - p_i\right) \right]   (4)

where L_i is the loss of image i for a certain second scene class, p_i is the second prediction probability output by FC2 corresponding to the pre-labeled second scene class, and y_i is the real label of the image: y_i is 1 if the real label of the image is the pre-labeled second scene class, and 0 if it is not.
In some embodiments, for the K × P second scene categories, the K × P binary cross-entropy loss functions are summed to obtain the second classification loss; for an image F having 2 second scene categories (e.g., lake and footpath), the binary cross-entropy losses of the two second scene categories are obtained respectively and then summed to obtain the second classification loss.
For the third scene classification model, the third scene classification model is actually a full-feature classification model, the full feature refers to a third classification feature taking into account the first classification feature of the first scene classification model and the second classification feature of the second scene classification model, and FC3 is used for mapping the third classification feature to P different first scene classes, so that the third classification loss of the third scene classification model is similar to the first classification loss, and the overall loss of the joint training is the sum of the first classification loss, the second classification loss and the third classification loss, thereby ensuring that the learned third scene classification model learns the feature expression of both the fine-grained prototype feature (the second classification feature) and the scene classification feature (the first classification feature).
When parameters are updated in the joint training stage, the parameters of Embedding1, Embedding2, FC1, FC2 and FC3 are updated simultaneously, that is, the first scene classification, the second scene classification and the third scene classification are learned simultaneously, where the third scene classification model serves as the target classifier to be learned, the spliced outputs of Embedding1 and Embedding2 serve as the third classification feature, and the output of the third scene classification model is the P first scene classes to be learned.
In the joint training stage, the parameters of the first scene classification model trained in the previous stage are loaded, and the model parameters are updated by a gradient descent algorithm. Compared with the first stage, which only needs to train the first scene classification model, this stage additionally needs to learn network layers such as Embedding2, FC2 and FC3, where the output of Embedding1 (1 × 1024) and the output of Embedding2 (1 × 1024) are spliced to obtain the third classification feature (1 × 2048), which is used as the input of FC3; the output of Embedding1 represents the first classification feature required for identifying the first scene category, the output of Embedding2 represents the specific prototype features (namely, the second scene categories that can be differentiated according to the prototype features), and combining the two realizes the joint of the two types of features.
When the method is applied, only the output result of FC3 is taken as the predicted probability of each first scene category, and the first scene category corresponding to the maximum predicted probability is obtained as the scene classification result.
In some embodiments, the scene classification model may be subjected to noise training, and the scene classification model obtained through the noise training is loaded on a cloud server to provide a scene recognition service, referring to fig. 8, where fig. 8 is a processing flow diagram of the scene classification method based on artificial intelligence provided in the embodiments of the present application, a terminal a receives an image input by a user and uploads the image to the server, the server performs scene classification on the image input by the user by using a third scene classification model provided in the embodiments of the present application, and outputs a scene classification result to a terminal B for corresponding display, the terminal B is a terminal different from the terminal a, for example, the terminal a uploads the image to the server for publication, and the terminal B receives the image sent by the server and the classification result of the corresponding image.
In some embodiments, by mining an original category prototype (a second scene category) and establishing feature learning of the second scene category, a trained third scene classification model can more easily capture common features of the second scene category, and support to overall scene recognition, and more sufficient category information is obtained in model learning, and the artificial intelligence-based scene classification method provided by the embodiment of the present application has the following advantages: 1) unsupervised second scene category mining is carried out through clustering processing, and a learning task for identifying the second scene category can be carried out under the condition that the second scene category is not required to be labeled manually; 2) gradually optimizing feature expression of the second scene category through dynamic self-supervision learning of the second scene category; 3) the second classification features of the second scene category are combined with the first classification features of the first scene category, so that the recognition effect of the third scene classification model for the first scene category is improved; 4) the joint training framework is obtained with limited added complexity (only Embedding2, FC2, and FC3 are added compared to the first scene classification model).
Compared with the scheme in the related art, the scene classification method based on artificial intelligence provided by the embodiment of the application does not need to explicitly learn a detection model, so that the generalization capability is improved while the model inference complexity is not increased, the cost of additional labeling caused by labeling a detection data set in advance for training a detector is avoided, unsupervised multi-instance modeling of the second scene category is performed on the first scene category in the training stage, the feature expression of the second scene category is enhanced by the model, the multi-instance analysis of the first scene category is realized by the second scene category, and each image can be mapped to the accurate category of the second scene category, so that the noise label of the image and the second scene category can be distinguished, the feature commonality extraction is facilitated, and by combining the basic feature of the first classification feature and the multi-instance feature of the second classification feature, implicit multi-instance feature and basic feature combination is achieved, and therefore the representations of multiple scene styles are obtained.
Continuing with the exemplary structure of the artificial intelligence based scene classification device 255 provided by the embodiments of the present application implemented as software modules, in some embodiments, as shown in fig. 3, the software modules stored in the artificial intelligence based scene classification device 255 of the memory 250 may include: an obtaining module 2551, configured to obtain, for a plurality of image samples of a first scene class, at least one second scene class for each image sample; wherein the at least one second scene category of each image sample is a sub-scene category of the first scene category of the image sample; a training module 2552, configured to jointly train a first scene classification model, a second scene classification model, and a third scene classification model based on a first scene class of the plurality of image samples and a second scene class of the plurality of image samples; the system comprises a first scene classification model, a second scene classification model, a third scene classification model and a third scene classification model, wherein the first scene classification model is used for identifying different first scene classes, the second scene classification model is used for identifying different second scene classes, and the third scene classification model is obtained based on the combination of the first scene classification model and the second scene classification model; the application module 2553 is configured to perform scene classification processing on the image to be classified through the trained third scene classification model, so as to obtain a first scene class corresponding to the image to be classified in different first scene classes.
In some embodiments, the obtaining module 2551 is further configured to: acquiring a first classification characteristic of each image sample through a first scene classification model; for each first scene category, performing the following: acquiring image samples belonging to a first scene category from the plurality of image samples as target image samples, and performing clustering processing on the plurality of target image samples based on first classification characteristics of the plurality of target image samples to obtain at least one cluster corresponding to at least one second scene category one to one; a second scene category for each image sample is determined based on the at least one cluster.
In some embodiments, the obtaining module 2551 is further configured to: the following processing is performed for each image sample: extracting a first convolution characteristic of the image sample, and performing pooling processing on the first convolution characteristic of the image sample to obtain a first pooling characteristic of the image sample; carrying out residual iteration processing on the first pooled features for multiple times to obtain a residual iteration processing result of the image sample; performing maximum pooling on the residual iterative processing result of the image sample to obtain a maximum pooling result of the image sample; and carrying out first embedding processing on the maximum pooling processing result of the image sample to obtain a first classification characteristic of the image sample.
In some embodiments, the obtaining module 2551 is further configured to: forming a plurality of target image samples into a target image sample set; randomly selecting N target image samples from the target image sample set, taking the first classification features corresponding to the N target image samples as the initial centroids of N clusters, and removing the N target image samples from the target image sample set, where N is the number of second scene categories corresponding to the first scene category, and N is an integer greater than or equal to 2; initializing the number of iterations of the clustering processing to M, and establishing an empty set corresponding to each cluster, where M is an integer greater than or equal to 2; performing the following processing during each iteration of the clustering processing: updating the set of each cluster, performing centroid generation processing based on the updating processing result to obtain a new centroid of each cluster, adding the target image sample corresponding to the initial centroid back to the target image sample set when the new centroid differs from the initial centroid, and updating the initial centroid based on the new centroid; determining the set of each cluster obtained after M iterations as the clustering result, or determining the set of each cluster obtained after m iterations as the clustering result, where the centroids of the clusters obtained after the m-th iteration are the same as those obtained after the (m-1)-th iteration, m is smaller than M, m is an integer variable, and the value of m is greater than or equal to 2 and less than or equal to M.
In some embodiments, the obtaining module 2551 is further configured to: performing the following for each target image sample of the set of target image samples: determining a similarity between the first classification feature of the target image sample and the initial centroid of each cluster; determining the initial centroid corresponding to the maximum similarity as belonging to the same cluster as the target image sample, and transferring the target image sample to a set of clusters corresponding to the initial centroid of the maximum similarity, wherein the initial centroid of the maximum similarity is the initial centroid corresponding to the maximum similarity; and averaging the first classification characteristics of each target image sample in each cluster set to obtain a new centroid of each cluster.
In some embodiments, the obtaining module 2551 is further configured to: performing the following for each cluster of the plurality of clusters: averaging the first classification features of each target image sample in each cluster to obtain the centroid of each cluster; performing the following for each of a plurality of target image samples: determining a feature distance between a first classification feature of the target image sample and the centroid of each cluster, determining a cluster of centroids corresponding to the feature distance smaller than a feature distance threshold as a cluster associated with the target image sample, and determining a second scene category corresponding to the cluster as a second scene category of the target image sample.
In some embodiments, the obtaining module 2551 is further configured to: querying a database for the labeling accuracy of each first scene category before determining clusters of centroids corresponding to feature distances less than a feature distance threshold as clusters associated with the target image sample; for each first scene category, acquiring a first scene category threshold value positively correlated with the marking accuracy; determining the number ratio of the target image samples in the plurality of image samples, and taking the number ratio as the weight of the first scene category threshold; and carrying out weighted summation processing on the plurality of first scene category thresholds based on the weight of each first scene category threshold to obtain a characteristic distance threshold.
In some embodiments, the obtaining module 2551 is further configured to: for each target image sample, determining a candidate center distance between a first classification feature of the target image sample and the centroid of each cluster, and determining the minimum candidate center distance as the center distance of the target image sample; acquiring ascending sequencing positions positively correlated with the marking accuracy, and performing ascending sequencing on the center distances of the multiple target image samples; and acquiring the center distance corresponding to the ascending sorting position in the ascending sorting result, and determining the center distance as a first scene category threshold.
In some embodiments, the first scene classification model includes a feature network, a first feature processing network, and a first classification network corresponding to a first scene category; an obtaining module 2551, further configured to: before the first classification feature of each image sample is obtained through the first scene classification model, feature extraction processing is carried out on the image samples through a feature network, and the maximum pooling processing result of the image samples is obtained; performing first embedding processing on the maximum pooling processing result of the image sample through a first feature processing network to obtain a first classification feature of the image sample; mapping the first classification feature to a first prediction probability of a pre-labeled first scene class of the image sample through a first classification network; determining a first classification loss of the image sample based on the first prediction probability and a pre-labeled first scene class of the image sample; parameters of the first scene classification model are updated according to the first classification loss of the image sample.
In some embodiments, training module 2552 is further configured to: the following processing is performed for each image sample: carrying out scene classification processing on the image sample through the first scene classification model, the second scene classification model and the third scene classification model to obtain a scene classification result of the image sample; and updating parameters of the first scene classification model, the second scene classification model and the third scene classification model based on the scene classification result, the pre-labeled first scene classification of the image sample and the pre-labeled second scene classification of the image sample.
In some embodiments, the first scene classification model shares a feature network of the first scene classification model with the second scene classification model and the third scene classification model, respectively, the first scene classification model shares a first feature processing network of the first scene classification model with the third scene classification model, and the second scene classification model shares a second feature processing network of the second scene classification model with the third scene classification model; a training module 2552, further configured to: performing feature extraction processing on the image sample through a feature network to obtain a maximum pooling processing result of the image sample; performing first embedding processing on the maximum pooling processing result of the image sample through a first feature processing network of a first scene classification model to obtain a first classification feature of the image sample; mapping the first classification feature to a first prediction probability of a pre-labeled first scene class of the image sample through a first classification network; performing second embedding processing on the maximum pooling processing result of the image sample through a second feature processing network to obtain a second classification feature of the image sample, and mapping the second classification feature into a second prediction probability of at least one pre-marked second scene category of the image sample through a second classification network of a second scene classification model; splicing the first classification characteristic and the second classification characteristic, and mapping a splicing processing result to a third prediction probability of the pre-marked first scene category of the image sample through a third classification network of a third scene classification model; and combining the first prediction probability, the second prediction probability and the third prediction probability into a scene classification result of the image sample.
In some embodiments, training module 2552 is further configured to: determining a first classification loss based on the first prediction probability and a pre-labeled first scene class of the image sample; determining a third classification loss based on the third prediction probability and the pre-labeled first scene class of the image sample; determining a second classification loss based on the at least one second prediction probability and the at least one pre-labeled second scene class of the image sample; performing fusion processing on the first classification loss, the second classification loss and the third classification loss to obtain combined loss; and updating parameters of the first scene classification model, the second scene classification model and the third scene classification model according to the joint loss.
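One plausible reading of the fusion processing is a weighted sum of the three classification losses, as in the sketch below; the loss weights and the multi-label (binary cross-entropy) treatment of the second scene categories are assumptions rather than requirements of this application.

```python
# Hedged sketch of the joint loss; weights w1..w3 are illustrative.
import torch
import torch.nn.functional as F

def joint_loss(p1, p2, p3, first_labels, second_multi_hot,
               w1: float = 1.0, w2: float = 1.0, w3: float = 1.0):
    # first_labels: class indices of the pre-labeled first scene category
    # second_multi_hot: float multi-hot vector over pre-labeled second scene categories
    loss1 = F.cross_entropy(p1, first_labels)                         # first classification loss
    loss2 = F.binary_cross_entropy_with_logits(p2, second_multi_hot)  # second classification loss
    loss3 = F.cross_entropy(p3, first_labels)                         # third classification loss
    return w1 * loss1 + w2 * loss2 + w3 * loss3                       # joint loss
```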
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the electronic device executes the artificial intelligence based scene classification method described in the embodiment of the present application.
The embodiment of the present application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to perform the artificial intelligence based scene classification method provided by the embodiments of the present application.
In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or a CD-ROM; or may be any device including one of the above memories or any combination thereof.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
As an example, executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or distributed across multiple sites and interconnected by a communication network.
In summary, scene labels (second scene categories) representing fine-grained scenes are mined from image samples that carry only first scene category labels, so training can proceed on the first and second scene categories simultaneously, which is equivalent to learning feature expressions of the first scene categories and of the fine-grained scenes at the same time. Because the third scene classification model is obtained by combining the first scene classification model and the second scene classification model, the trained third scene classification model carries combined feature expressions of both the first scene categories and the fine-grained scene categories (second scene categories), thereby effectively improving scene recognition accuracy.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (14)

1. A scene classification method based on artificial intelligence is characterized by comprising the following steps:
for a plurality of image samples of a first scene class, acquiring at least one second scene class of each image sample;
wherein the at least one second scene class of each of the image samples is a sub-scene class of the first scene class of the image sample;
performing the following for each of the image samples:
acquiring a first classification characteristic of the image sample and a first prediction probability of a pre-marked first scene category through a first scene classification model, and acquiring a second classification characteristic of the image sample and a second prediction probability of the pre-marked second scene category through a second scene classification model;
mapping a splicing processing result of the first classification feature and the second classification feature to a third prediction probability of a pre-marked first scene category of the image sample through a third classification network of a third scene classification model;
determining a joint loss based on the first, second, and third prediction probabilities of the plurality of image samples, the pre-labeled first and second scene classes of the plurality of image samples, and updating parameters of the first, second, and third scene classification models based on the joint loss;
wherein the first scene classification model is used for identifying different first scene classes, the second scene classification model is used for identifying different second scene classes, and the third scene classification model is obtained based on the combination of the first scene classification model and the second scene classification model;
and carrying out scene classification processing on the image to be classified through the trained third scene classification model to obtain a first scene class corresponding to the image to be classified in different first scene classes.
2. The method of claim 1, wherein the obtaining, for the plurality of image samples of the first scene class, at least one second scene class for each of the image samples comprises:
acquiring a first classification characteristic of each image sample through the first scene classification model;
performing the following for each of the first scene categories:
acquiring image samples belonging to the first scene category from the plurality of image samples to serve as target image samples, and performing clustering processing on the plurality of target image samples based on first classification characteristics of the plurality of target image samples to obtain at least one cluster in one-to-one correspondence with at least one second scene category;
determining a second scene category for each of the image samples based on the at least one cluster.
3. The method of claim 2, wherein said obtaining a first classification feature for each of said image samples comprises:
performing the following for each of the image samples:
extracting a first convolution feature of the image sample, and performing pooling processing on the first convolution feature of the image sample to obtain a first pooling feature of the image sample;
performing residual iteration processing on the first pooled features for multiple times to obtain a residual iteration processing result of the image sample;
performing maximum pooling on the residual iterative processing result of the image sample to obtain a maximum pooling processing result of the image sample;
and performing first embedding processing on the maximum pooling processing result of the image sample to obtain a first classification characteristic of the image sample.
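As a non-limiting illustration of claim 3, the sketch below strings together an initial convolution and pooling, a stack of residual blocks (the residual iteration processing), global max pooling, and the first embedding; the channel widths, block count, and kernel sizes are arbitrary assumptions.

```python
# Simplified sketch of the feature pipeline of claim 3.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class FeaturePipeline(nn.Module):
    def __init__(self, embed_dim: int = 512, channels: int = 64, num_blocks: int = 4):
        super().__init__()
        self.stem = nn.Sequential(                       # first convolution feature ...
            nn.Conv2d(3, channels, 7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),        # ... and first pooling feature
        )
        self.res_blocks = nn.Sequential(                 # residual iteration processing
            *[ResidualBlock(channels) for _ in range(num_blocks)]
        )
        self.global_max = nn.AdaptiveMaxPool2d(1)        # maximum pooling processing result
        self.embed = nn.Linear(channels, embed_dim)      # first embedding processing

    def forward(self, images: torch.Tensor):
        x = self.stem(images)
        x = self.res_blocks(x)
        x = self.global_max(x).flatten(1)
        return self.embed(x)                             # first classification feature
```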
4. The method according to claim 2, wherein the clustering the plurality of target image samples based on the first classification features of the plurality of target image samples to obtain at least one cluster corresponding to at least one of the second scene categories one by one comprises:
forming a plurality of the target image samples into a target image sample set;
randomly selecting N target image samples from the set of target image samples, taking first classification features corresponding to the N target image samples as initial centroids of a plurality of clusters, and removing the N target image samples from the set of target image samples, wherein N is the number of second scene classes corresponding to the first scene class, and N is an integer greater than or equal to 2;
initializing the iteration number of clustering processing to be M, and establishing a null set corresponding to each cluster, wherein M is an integer greater than or equal to 2;
performing the following processing during each iteration of the clustering processing: updating each set of clusters, executing centroid generation processing based on the updating processing result to obtain a new centroid of each cluster, adding a target image sample corresponding to the initial centroid to the target image sample set again when the new centroid is different from the initial centroid, and updating the initial centroid based on the new centroid;
determining each cluster set obtained after M iterations as the clustering result; or, when the centroids of the clusters obtained after m iterations are the same as the centroids of the clusters obtained after m-1 iterations, determining each cluster set obtained after m iterations as the clustering result,
wherein m is an integer variable and the value of m is greater than or equal to 2 and less than or equal to M.
5. The method of claim 4, wherein the updating each set of clusters and performing a centroid generation process based on the updating process result to obtain a new centroid for each cluster comprises:
performing the following for each of the target image samples:
determining a similarity between a first classification feature of the target image sample and an initial centroid of each of the clusters;
determining an initial centroid corresponding to the maximum similarity as belonging to the same cluster as the target image sample, and transferring the target image sample to a set of clusters corresponding to the maximum similarity initial centroid, wherein the maximum similarity initial centroid is the initial centroid corresponding to the maximum similarity;
and averaging the first classification characteristics of each target image sample in each cluster set to obtain a new centroid of each cluster.
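Claims 4 and 5 describe a k-means-style procedure over the first classification features; the NumPy sketch below is a simplified, non-authoritative illustration in which cosine similarity is assumed as the similarity measure and the loop stops early once the centroids stop changing or after M iterations.

```python
# Hedged sketch of the clustering of claims 4-5 (cosine similarity assumed).
import numpy as np

def cluster_first_features(features: np.ndarray, n_clusters: int, max_iter: int):
    rng = np.random.default_rng(0)
    # N randomly selected first classification features serve as initial centroids.
    centroids = features[rng.choice(len(features), n_clusters, replace=False)]
    assignments = np.zeros(len(features), dtype=int)
    for _ in range(max_iter):
        # Similarity between each first classification feature and each centroid.
        norm_f = features / np.linalg.norm(features, axis=1, keepdims=True)
        norm_c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        assignments = (norm_f @ norm_c.T).argmax(axis=1)   # most similar centroid wins
        # New centroid: average of the first classification features in each cluster set.
        new_centroids = np.stack([
            features[assignments == k].mean(axis=0) if np.any(assignments == k)
            else centroids[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centroids, centroids):           # centroids unchanged: stop early
            break
        centroids = new_centroids
    return assignments, centroids
```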
6. The method of claim 4, wherein said determining a second scene category for each of said image samples based on said at least one cluster comprises:
performing the following for each of the clusters: averaging the first classification features of each target image sample in each cluster to obtain the centroid of each cluster;
performing the following for each of the image samples of the plurality of image samples:
determining a feature distance between a first classification feature of the image sample and a centroid of each of the clusters, determining a cluster of centroids corresponding to a feature distance less than a feature distance threshold as a cluster associated with the image sample, and determining a second scene category corresponding to the cluster as a second scene category of the image sample.
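A minimal sketch of claim 6, assuming Euclidean distance as the feature distance, is given below.

```python
# Hedged sketch of claim 6: clusters whose centroid is closer than the
# feature distance threshold determine the sample's second scene categories.
import numpy as np

def second_scene_categories(feature: np.ndarray, centroids: np.ndarray,
                            distance_threshold: float) -> list:
    distances = np.linalg.norm(centroids - feature, axis=1)  # feature-to-centroid distances
    return np.where(distances < distance_threshold)[0].tolist()
```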
7. The method of claim 6, wherein prior to determining a cluster of centroids corresponding to feature distances less than a feature distance threshold as the cluster associated with the image sample, the method further comprises:
performing the following for each of the first scene categories:
querying a database for a tagging accuracy of the first scene category;
obtaining a first scene category threshold positively correlated with the marking accuracy; determining a number proportion of the target image samples belonging to the first scene category among the plurality of image samples, and using the number proportion as a weight of the first scene category threshold;
and carrying out weighted summation processing on the plurality of first scene category thresholds based on the weight of each first scene category threshold to obtain the characteristic distance threshold.
8. The method of claim 7, wherein obtaining a first scene category threshold positively correlated with the marking accuracy comprises:
for each target image sample, determining a candidate center distance of a first classification feature of the target image sample from the centroid of each cluster, and determining the smallest candidate center distance as the center distance of the target image sample;
acquiring an ascending sorting position positively correlated with the marking accuracy, and sorting the center distances of the target image samples in ascending order;
and acquiring a center distance corresponding to the ascending sorting position in the ascending sorting result, and determining the center distance as the first scene category threshold.
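Claims 7 and 8 can be illustrated with the sketch below, in which each first scene category contributes a threshold read off its ascending-sorted center distances at a position that grows with the category's marking accuracy, and the final feature distance threshold is the number-proportion-weighted sum of these per-category thresholds; the exact accuracy-to-position mapping is an assumption.

```python
# Hedged sketch of claims 7-8; the accuracy-to-position mapping is illustrative.
import numpy as np

def category_threshold(center_distances: np.ndarray, marking_accuracy: float) -> float:
    # Ascending sorting position positively correlated with the marking accuracy (in [0, 1]).
    position = int(marking_accuracy * (len(center_distances) - 1))
    return float(np.sort(center_distances)[position])

def feature_distance_threshold(per_category: list) -> float:
    # per_category: (center_distances, marking_accuracy, num_target_samples) per first scene category
    total = sum(n for _, _, n in per_category)
    return sum(
        (n / total) * category_threshold(d, acc)   # weight = number proportion of the category
        for d, acc, n in per_category
    )
```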
9. The method of claim 2, wherein the first scene classification model comprises a feature network, a first feature processing network, and a first classification network corresponding to the first scene class;
before the obtaining, by the first scene classification model, the first classification feature of each image sample, the method further includes:
performing feature extraction processing on the image sample through the feature network to obtain a maximum pooling processing result of the image sample;
performing first embedding processing on the maximum pooling processing result of the image sample through the first feature processing network to obtain a first classification feature of the image sample;
mapping, by the first classification network, the first classification feature to a first prediction probability of a pre-labeled first scene class for the image sample;
determining a first classification loss for the image sample based on the first prediction probability and a pre-labeled first scene class of the image sample;
updating parameters of the first scene classification model according to a first classification loss of the image sample.
10. The method of claim 1, wherein
the first scene classification model shares a feature network of the first scene classification model with the second scene classification model and the third scene classification model respectively, the first scene classification model shares a first feature processing network of the first scene classification model with the third scene classification model, and the second scene classification model shares a second feature processing network of the second scene classification model with the third scene classification model;
the obtaining, by a first scene classification model, a first classification feature of the image sample and a first prediction probability of a pre-labeled first scene class includes:
performing feature extraction processing on the image sample through the feature network to obtain a maximum pooling processing result of the image sample;
performing first embedding processing on the maximum pooling processing result of the image sample through a first feature processing network of the first scene classification model to obtain a first classification feature of the image sample;
mapping, by the first classification network, the first classification feature to a first prediction probability of a pre-labeled first scene class for the image sample;
the obtaining, by the second scene classification model, a second classification feature of the image sample and a second prediction probability of a pre-labeled second scene class includes:
and performing second embedding processing on the maximum pooling processing result of the image sample through the second feature processing network to obtain a second classification feature of the image sample, and mapping the second classification feature to a second prediction probability of at least one pre-labeled second scene category of the image sample through a second classification network of the second scene classification model.
11. The method of claim 1, wherein determining the joint loss based on the first, second, and third prediction probabilities for the plurality of image samples, the pre-labeled first scene class and the pre-labeled second scene class for the plurality of image samples comprises:
determining a first classification loss based on the first prediction probability and a pre-labeled first scene class of the image sample;
determining a third classification loss based on the third prediction probability and a pre-labeled first scene class of the image sample;
determining a second classification loss based on at least one of the second prediction probabilities and at least one pre-labeled second scene class of the image sample;
and performing fusion processing on the first classification loss, the second classification loss and the third classification loss to obtain the joint loss.
12. A scene classification device based on artificial intelligence is characterized by comprising:
the acquisition module is used for acquiring at least one second scene category of each image sample aiming at a plurality of image samples of a first scene category;
wherein the at least one second scene class of each of the image samples is a sub-scene class of the first scene class of the image sample;
a training module to perform the following for each of the image samples:
acquiring a first classification characteristic of the image sample and a first prediction probability of a pre-marked first scene category through a first scene classification model, and acquiring a second classification characteristic of the image sample and a second prediction probability of the pre-marked second scene category through a second scene classification model;
mapping a splicing processing result of the first classification feature and the second classification feature to a third prediction probability of a pre-marked first scene category of the image sample through a third classification network of a third scene classification model;
determining a joint loss based on the first, second, and third prediction probabilities of the plurality of image samples, the pre-labeled first and second scene classes of the plurality of image samples, and updating parameters of the first, second, and third scene classification models based on the joint loss;
wherein the first scene classification model is used for identifying different first scene classes, the second scene classification model is used for identifying different second scene classes, and the third scene classification model is obtained based on the combination of the first scene classification model and the second scene classification model;
and the application module is used for carrying out scene classification processing on the image to be classified through the trained third scene classification model to obtain a first scene class corresponding to the image to be classified in different first scene classes.
13. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the artificial intelligence based scene classification method of any one of claims 1 to 11 when executing executable instructions stored in the memory.
14. A computer-readable storage medium storing executable instructions for implementing the artificial intelligence based scene classification method of any one of claims 1 to 11 when executed by a processor.
CN202110534639.8A 2021-05-17 2021-05-17 Scene classification method and device based on artificial intelligence and electronic equipment Active CN112949620B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110534639.8A CN112949620B (en) 2021-05-17 2021-05-17 Scene classification method and device based on artificial intelligence and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110534639.8A CN112949620B (en) 2021-05-17 2021-05-17 Scene classification method and device based on artificial intelligence and electronic equipment

Publications (2)

Publication Number Publication Date
CN112949620A CN112949620A (en) 2021-06-11
CN112949620B CN112949620B (en) 2021-07-30

Family

ID=76233900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110534639.8A Active CN112949620B (en) 2021-05-17 2021-05-17 Scene classification method and device based on artificial intelligence and electronic equipment

Country Status (1)

Country Link
CN (1) CN112949620B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762382B (en) * 2021-09-07 2024-03-08 百果园技术(新加坡)有限公司 Model training and scene recognition method, device, equipment and medium
CN116883765B (en) * 2023-09-07 2024-01-09 腾讯科技(深圳)有限公司 Image classification method, device, electronic equipment and storage medium
CN117765480A (en) * 2024-02-20 2024-03-26 天科院环境科技发展(天津)有限公司 Method and system for early warning migration of wild animals along road

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560698A (en) * 2020-12-18 2021-03-26 北京百度网讯科技有限公司 Image processing method, apparatus, device and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654105A (en) * 2015-05-27 2016-06-08 宇龙计算机通信科技(深圳)有限公司 Photograph classification method and apparatus thereof
CN110298405A (en) * 2019-07-03 2019-10-01 北京字节跳动网络技术有限公司 Classification recognition methods and device, storage medium and terminal
CN111694973B (en) * 2020-06-09 2023-10-13 阿波罗智能技术(北京)有限公司 Model training method and device for automatic driving scene and electronic equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560698A (en) * 2020-12-18 2021-03-26 北京百度网讯科技有限公司 Image processing method, apparatus, device and medium

Also Published As

Publication number Publication date
CN112949620A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN112949620B (en) Scene classification method and device based on artificial intelligence and electronic equipment
CN109978893B (en) Training method, device, equipment and storage medium of image semantic segmentation network
CN107851191B (en) Context-based priors for object detection in images
CN112734775B (en) Image labeling, image semantic segmentation and model training methods and devices
CN112699855B (en) Image scene recognition method and device based on artificial intelligence and electronic equipment
US20180114071A1 (en) Method for analysing media content
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
CN112990378B (en) Scene recognition method and device based on artificial intelligence and electronic equipment
KR20160091786A (en) Method and apparatus for managing user
CN111708913B (en) Label generation method and device and computer readable storage medium
CN111597374B (en) Image classification method and device and electronic equipment
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN114298122A (en) Data classification method, device, equipment, storage medium and computer program product
CN113569895A (en) Image processing model training method, processing method, device, equipment and medium
CN112183464A (en) Video pedestrian identification method based on deep neural network and graph convolution network
Nikolopoulos et al. Evidence-driven image interpretation by combining implicit and explicit knowledge in a bayesian network
CN113610770A (en) License plate recognition method, device and equipment
CN111652320B (en) Sample classification method and device, electronic equipment and storage medium
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
CN113762326A (en) Data identification method, device and equipment and readable storage medium
CN113033507A (en) Scene recognition method and device, computer equipment and storage medium
CN112507912A (en) Method and device for identifying illegal picture
CN113822130A (en) Model training method, scene recognition method, computing device, and medium
Parham Animal detection for photographic censusing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40046484

Country of ref document: HK