CN113705293A - Image scene recognition method, device, equipment and readable storage medium - Google Patents

Image scene recognition method, device, equipment and readable storage medium

Info

Publication number
CN113705293A
Authority
CN
China
Prior art keywords
image
subgraph
target image
scene
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110218951.6A
Other languages
Chinese (zh)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110218951.6A
Publication of CN113705293A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Abstract

The application discloses an image scene recognition method, device and equipment and a readable storage medium, relating to the field of machine learning. The method comprises the following steps: acquiring a target image; extracting global features of the target image; extracting at least two region subgraphs from the target image based on the global features and an image recognition classification library; and recognizing a scene category label corresponding to the target image based on the global features and the subgraph features of the region subgraphs. In the scene recognition process for the target image, after the global features of the target image are extracted, the region subgraphs are extracted from the target image with an attention mechanism based on the global features, and scene recognition is then performed on the target image based on the subgraph features and the global features. The reference content for scene recognition therefore includes not only single entities in the target image but also the image regions of the target image that relate to the scene category labels in the image recognition classification library, which improves the scene recognition accuracy for the target image.

Description

Image scene recognition method, device, equipment and readable storage medium
Technical Field
The embodiment of the application relates to the field of machine learning, in particular to an image scene identification method, device, equipment and a readable storage medium.
Background
Scene recognition refers to recognizing the scene depicted in multimedia content, and may be applied to video scene recognition or image scene recognition. When applied to video scene recognition, recognition is performed on the image frames within a video; the scene features typically lie in the background environment of the recognized image.
In the related art, when scene recognition is implemented, object detection is usually performed first, and an object detection result is used as preliminary information, so that scene recognition is performed on an image based on the object detection result to obtain a scene recognition result, and a scene tag of the image is obtained.
However, not all scenes contain detectable targets. For example, the key objects in scenes such as seasides and forests are dispersed and cannot be accurately identified, which greatly affects the accuracy of scene recognition and leads to a high failure rate of scene recognition.
Disclosure of Invention
The embodiment of the application provides an image scene identification method, an image scene identification device, image scene identification equipment and a readable storage medium, and the identification accuracy of an image scene can be improved. The technical scheme is as follows:
in one aspect, a method for identifying an image scene is provided, where the method includes:
acquiring a target image, wherein the target image is an image to be identified in an image scene;
extracting global features of the target image, wherein the global features are obtained by extracting features of the whole target image;
extracting at least two region subgraphs from the target image based on the global features and an image recognition classification library, wherein the image recognition classification library comprises scene class labels for labeling images;
and identifying and obtaining a scene category label corresponding to the target image based on the global feature and the sub-image feature of the regional sub-image.
In another aspect, an apparatus for recognizing an image scene is provided, the apparatus comprising:
an acquisition module, configured to acquire a target image, where the target image is an image whose image scene is to be recognized;
the extraction module is used for extracting the global features of the target image, wherein the global features are obtained by extracting the features of the whole target image;
the extraction module is further used for extracting at least two region subgraphs from the target image based on the global features and an image recognition classification library, wherein the image recognition classification library comprises scene category labels used for labeling images;
and the identification module is used for identifying and obtaining a scene category label corresponding to the target image based on the global feature and the sub-image feature of the regional sub-image.
In another aspect, a computer device is provided, which includes a processor and a memory, where at least one program is stored in the memory, and the at least one program is loaded and executed by the processor to implement the method for recognizing an image scene as described in any of the embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, which is loaded and executed by a processor to implement the method for recognizing an image scene as described in any of the embodiments of the present application.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to make the computer device execute the image scene identification method in any of the above embodiments.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
In the scene recognition process for the target image, after the global features of the target image are extracted, the region subgraphs are extracted from the target image with an attention mechanism based on the global features, and scene recognition is then performed on the target image based on the subgraph features and the global features. The reference content for scene recognition therefore includes not only single entities in the target image but also the image regions of the target image that relate to the scene category labels in the image recognition classification library, which improves the scene recognition accuracy for the target image.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method for identifying an image scene provided by an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a scene category label identification process provided based on the embodiment shown in FIG. 2;
FIG. 4 is a flow chart of a method for identifying an image scene provided by another exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a feature point and candidate subgraph mapping relation in a target image provided based on the embodiment shown in FIG. 4;
FIG. 6 is a flow chart of a method for identifying an image scene provided by another exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a loss value calculation process of a scene recognition model provided based on the embodiment shown in FIG. 6;
FIG. 8 is a block diagram illustrating an apparatus for recognizing an image scene according to an exemplary embodiment of the present application;
fig. 9 is a block diagram of an apparatus for recognizing an image scene according to another exemplary embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, a brief description is given of terms referred to in the embodiments of the present application:
artificial Intelligence (AI): the method is a theory, method, technology and application system for simulating, extending and expanding human intelligence by using a digital computer or a machine controlled by the digital computer, sensing the environment, acquiring knowledge and obtaining the best result by using the knowledge. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiments of the present application mainly involve computer vision technology, in which scene recognition is performed on image content. Schematically, after a target image is input into a scene recognition model, a scene recognition result of the target image is output, for example, the scene of the target image is recognized as a library scene.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behaviour so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from teaching.
Computer Vision technology (Computer Vision, CV): computer vision is a science that studies how to make a machine "see"; it uses cameras and computers instead of human eyes to perform machine vision tasks such as identification, tracking and measurement on a target, and performs further image processing so that the processed image becomes more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques and attempts to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also includes common biometric technologies such as face recognition and fingerprint recognition.
Scene recognition: recognition performed on the background of an image or video. It differs from entity recognition, which identifies the category of an object appearing in the image or video (for example, identifying that an entity in the image is a person, an animal, and so on); scene recognition is mainly performed on the background features of the image or video, for example, identifying that the scene corresponding to the current image is a restaurant. In the related art, scene recognition is implemented based on multi-scale salient-region feature learning: first, object detection is performed on the image to obtain an object detection result, and one or more parts of the image that contain objects are obtained from the detection result; second, scene key information is extracted from those object parts for recognition, so that the scene information corresponding to the image is obtained.
Scene recognition belongs to high-level semantic recognition and is generally more difficult than entity-category recognition, because scene features often lie in the background environment of the image, whereas a conventional image recognition pre-trained model concentrates on extracting features of specific entities or parts; scene recognition therefore easily over-fits to the foreground of the target scene, that is, the scene recognition model memorizes the situations of some scenes. Moreover, not all scenes have detection targets: scenes such as seasides and forests usually do not contain specific detectable entities, so scene recognition is difficult.
The application provides an end-to-end scene recognition method based on image local positioning and robust feature combination. It realizes scene recognition by performing self-supervised attention feature extraction in a high-dimensional image space, performing supervised learning on multiple attention blocks, and finally recognizing with robust combined features. The method carries out multi-task supervised learning on local positioning and recognition as well as global recognition, optimizes the global features through multi-feature combination, and also optimizes the local features through local feature classification and positioning, thereby avoiding the problem in conventional scene recognition that local blocks adversely affect classification because neither the positioning of local blocks nor their feature extraction is learned with supervision.
Fig. 1 is a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application, as shown in fig. 1, the implementation environment includes a terminal 110 and a server 120, where the terminal 110 and the server 120 are connected through a communication network 130.
The terminal 110 is configured to transmit multimedia content requiring scene recognition to the server 120, wherein the multimedia content includes image content or video content. The terminal 110 transmits multimedia contents to the server 120 through the communication network 130 and instructs the server 120 to perform scene recognition on the multimedia contents.
In the above-mentioned content, the multimedia content is exemplified as being transmitted from the terminal 110 to the server 120, and in some embodiments, the multimedia content is a content stored in the server 120 itself or a content transmitted from another server to the server 120.
The server 120 includes a scene recognition model 121, where the scene recognition model 121 includes the following features: 1. mining related characteristics of local blocks of the image through an attention mechanism; 2. extracting scene key information for recognition through fusion of local attention and global features; 3. the model realizes end-to-end scene recognition by inputting pictures, self-supervision attention learning and extraction and combining recognition results. After receiving the multimedia content transmitted by the terminal 110, the server 120 inputs the multimedia content into the scene recognition model 121.
When the multimedia content is image content, the image content is directly input into the scene recognition model 121. When the multimedia content is video content, the video content is first decoded and image frames are obtained from it, and the image frames are input into the scene recognition model 121 for scene recognition; alternatively, after the video content is decoded, the decoded video stream is input into the scene recognition model 121 for frame-by-frame scene recognition, which is not limited in this embodiment of the present application.
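As an illustration of the decoding step above, the following is a minimal sketch of obtaining image frames from video content before feeding them to the scene recognition model. It assumes OpenCV is used for decoding and that frames are sampled at a fixed interval; the actual frame-selection policy (key frames, specified positions, or frame-by-frame processing) is an implementation choice.

```python
# A minimal sketch, assuming OpenCV decoding and fixed-interval frame sampling;
# the frame-selection policy is an assumption, not prescribed by the method.
import cv2

def extract_frames_for_recognition(video_path: str, every_n_frames: int = 30):
    """Decode a video and return sampled frames to be used as model inputs."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            # OpenCV decodes frames as BGR; convert to RGB before the recognition model.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        index += 1
    capture.release()
    return frames
```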
In some embodiments, after obtaining the scene recognition result through the scene recognition model 121, the server 120 feeds the scene recognition result back to the terminal 110 through the communication network 130, and after receiving the scene recognition result, the terminal 110 displays the scene recognition result, or performs processing such as classification on the multimedia content based on the scene recognition result, which is not limited in the embodiment of the present application.
It is to be noted that the method for identifying an image scene provided in the embodiment of the present application may be implemented by a terminal, may also be implemented by a server, and may also be implemented by cooperation of the terminal and the server. That is, in the above embodiments, the scene recognition model 121 is located in the server 120 for example, in some embodiments, the scene recognition model 121 may also be configured in the terminal 110, or a part of the network in the scene recognition model 121 may also be configured in the terminal 110, and the other part of the network is configured in the server 120.
The terminal comprises at least one of terminals such as a smart phone, a tablet computer, a portable laptop, a desktop computer, an intelligent sound box, intelligent wearable equipment and intelligent face recognition equipment, the server can be a physical server or a cloud server for providing cloud computing service, and the server can be realized as one server or a server cluster or distributed system formed by a plurality of servers. When the terminal and the server cooperatively implement the scheme provided by the embodiment of the present application, the terminal and the server may be directly or indirectly connected in a wired or wireless communication manner, which is not limited in the embodiment of the present application.
With reference to the above description, an application scenario of the embodiment of the present application is described.
First, the scene identifies an application scene of the function.
In some embodiments, an application program is installed in the terminal, and a scene recognition function is provided in the application program. When the user uses the application program, a target image or a target video is uploaded in the function interface of the scene recognition function, and the terminal sends the target image or target video to the server for scene recognition; after the server performs scene recognition on the target image or target video, the recognition result is fed back, and the terminal displays the scene recognition result.
Second, application scenarios of content classification.
Illustratively, taking a video application program as an example, n video contents are stored in a server, where n is a positive integer, and after the n video contents are subjected to scene recognition through a scene recognition model, the n video contents are classified according to scene recognition results respectively corresponding to the n video contents obtained through recognition, so that options respectively corresponding to the n video contents are displayed in a video content selection interface of a terminal in a classified manner, and a user can view the video contents of a specified scene type in a classified manner. Illustratively, after a user selects a library scene in the video application program, candidate video contents corresponding to the library scene are displayed in the terminal interface, so that the user selects and views a target video content in the candidate video contents.
It should be noted that, in the above examples of application scenarios, the scene recognition function and the content classification function are taken as examples for description; the image scene recognition method provided in the present application may also be applied to other application scenarios that require scene recognition analysis of images, which is not limited in the embodiments of the present application.
Based on the above, a method for identifying an image scene provided in an embodiment of the present application is described, and fig. 2 is a flowchart of a method for identifying an image scene provided in an exemplary embodiment of the present application, which is described by taking an application of the method in a server as an example, as shown in fig. 2, the method includes:
step 201, a target image is obtained, and the target image is an image of an image scene to be identified.
In some embodiments, the manner in which the target image is acquired includes at least one of the following.
Firstly, the server receives video content uploaded by the terminal, where the video content includes at least one of a short video and a regular video. A short video is a video whose duration, when uploaded to a short-video platform, is less than a required duration, or a video played on a short-video playing platform; a regular video is a video without restrictions on format or duration. After receiving the video content, the server needs to perform scene recognition on it, so the video content is first decoded to obtain decoded video frames, and a specified frame (such as a key frame or an image frame at a specified position) is obtained from the decoded video frames as the target image for scene recognition.
Secondly, video content is stored in the server and the stored video content needs to be classified by scene based on a video scene classification function, so the server acquires and decodes the stored video and performs scene recognition on the specified video frames one by one.
Thirdly, the server receives the image content uploaded by the terminal, and the image content is the content which is specified by the user and needs scene recognition, so the image content is used as a target image for scene recognition.
In the above example, the target image is an image that needs scene recognition in an actual application. In some embodiments, the target image may also be implemented as a sample image annotated with a sample category label, that is, an image on which image scene recognition is performed and which is then used to train a scene recognition model based on the sample category label. The image scene recognition of the target image is realized through the scene recognition model: the target image serving as a sample is annotated with a sample category label, and after the scene recognition model performs scene recognition on the target image to obtain a scene category label, the scene recognition model is trained (namely, the model parameters of the scene recognition model are adjusted) based on the scene category label and the sample category label.
In some embodiments, when the target image is implemented as a sample image, the target image is a randomly selected image in the sample image library, or an image selected in turn in the sample image library.
Step 202, extracting global features of the target image.
The global feature is a feature obtained by extracting a feature of the entire target image.
In some embodiments, the scene recognition model includes a feature extraction network, and the global features of the target image are extracted by inputting the target image into the feature extraction network as a whole.
In some embodiments, a ResNet101 network is used to extract the global features of the target image; the ResNet101 network is implemented as a 5-layer convolutional neural network structure, as illustrated in Table 1 below.
Table 1
(Table 1, which lists the layer-by-layer structure of the 5-layer ResNet101 feature extraction network, is reproduced as an image in the original publication.)
In the case of "7 × 7, 64", 7 × 7 represents the convolution size, 64 represents the number of channels, that is, in correspondence to "1 × 1, 128", 1 × 1 represents the convolution size, and 128 represents the number of channels of convolution.
That is, the global feature of the target image can be extracted by inputting the target image into the feature extraction network shown in table one.
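As an illustration of this step, the following is a minimal sketch of global feature extraction, assuming a standard torchvision ResNet101 backbone stands in for the feature extraction network of Table 1 (whose exact layer configuration is only available as an image in the original publication); the input size and pooling are illustrative assumptions.

```python
# A minimal sketch, assuming a stock torchvision ResNet101 (torchvision >= 0.13)
# stands in for the feature extraction network of Table 1.
import torch
import torchvision.models as models

backbone = models.resnet101(weights=None)
# Keep the convolutional stages and drop the average-pooling / fully connected head,
# so that both a spatial feature map and a pooled global feature are available.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 288, 480)            # a target image batch (N, C, H, W)
feature_map = feature_extractor(image)          # (1, 2048, 9, 15) spatial feature map
global_feature = feature_map.mean(dim=(2, 3))   # (1, 2048) global feature after pooling
```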
And step 203, extracting at least two regional subgraphs from the target image based on the global features and the image recognition classification library.
The image recognition classification library includes scene category labels used to label images. The scene category labels are set by developers, or are stored in the image recognition classification library after the sample images are manually labeled and before the sample images are used to train the scene recognition model.
In some embodiments, firstly, feature points in the global features are identified based on the global features and the image identification classification library to obtain candidate point scores corresponding to the feature points, the confidence of candidate subgraphs in the target image is determined based on the candidate point scores, and a mapping relation exists between the candidate subgraphs and the feature points. At least two regional subgraphs are determined from the candidate subgraphs of the target image based on the confidence.
That is, in the process of extracting the region subgraphs, feature points corresponding to the global features are first identified under different identification dimensions, and this identification is based on the association between the feature points and the scene category labels in the image recognition classification library. Because the identification of the global features is carried out under different identification dimensions, the feature points are mapped to corresponding candidate subgraphs in the original target image, and the candidate subgraphs obtained by mapping feature points of different dimensions differ in size. At least two region subgraphs are then determined based on the recognition of the feature points and the mapping relation between the feature points and the target image.
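To make the mapping relation concrete, the following is an illustrative sketch of mapping a feature point at a given identification dimension (scale) back to a candidate subgraph box in the original target image. The stride and the per-dimension box sizes are hypothetical values, not taken from this application.

```python
# Illustrative sketch only: each feature-map cell maps back to a box in the target
# image, and each identification dimension (scale) uses a different box size.
# The stride and box_sizes below are hypothetical.
def feature_point_to_candidate_box(row, col, scale_index, stride=32,
                                   box_sizes=((96, 96), (128, 192), (192, 128),
                                              (160, 160), (224, 224), (256, 160))):
    """Map feature-map coordinates (row, col) at a given scale to an image-space box."""
    center_x = (col + 0.5) * stride
    center_y = (row + 0.5) * stride
    width, height = box_sizes[scale_index]
    return (center_x - width / 2, center_y - height / 2,
            center_x + width / 2, center_y + height / 2)  # (x1, y1, x2, y2)
```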
And step 204, identifying and obtaining a scene category label corresponding to the target image based on the global feature and the sub-image feature of the regional sub-image.
Firstly, extracting the characteristics of the regional subgraph to obtain subgraph characteristics, and then identifying based on the global characteristics and the subgraph characteristics to obtain a scene category label.
When feature extraction is performed on the region subgraphs, the feature extraction network described in Table 1 is used for the extraction; alternatively, the feature extraction network shown in Table 1 is a first feature extraction network and the subgraph features of the region subgraphs are extracted by a second feature extraction network, where the network structures of the first and second feature extraction networks are the same or different, and the network parameters of the first and second feature extraction networks are the same or different.
In some embodiments, the global features and the sub-graph features are combined, and the combined features are identified to obtain a scene category label.
The merging modes of the global features and the subgraph features include one-by-one splicing, integral splicing, and the like. One-by-one splicing means that the subgraph features of the at least two region subgraphs are each spliced with the global features to obtain basic fusion features, and the at least two basic fusion features are then spliced to obtain the fusion feature. Integral splicing means that the subgraph features of the at least two region subgraphs and the global features are spliced as a whole to obtain the fusion feature.
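The following sketch contrasts the two merging modes just described; the 2048-dimensional feature size and the number of region subgraphs are illustrative assumptions.

```python
# A minimal sketch of the two merging (splicing) modes; shapes are illustrative.
import torch

global_feat = torch.randn(1, 2048)                          # global feature
subgraph_feats = [torch.randn(1, 2048) for _ in range(4)]   # K = 4 region subgraphs

# One-by-one splicing: each subgraph feature is spliced with the global feature,
# and the resulting basic fusion features are spliced together.
basic_fusions = [torch.cat([sub, global_feat], dim=1) for sub in subgraph_feats]
fusion_one_by_one = torch.cat(basic_fusions, dim=1)          # (1, K * 4096)

# Integral splicing: the global feature and all subgraph features are spliced as a whole.
fusion_integral = torch.cat([global_feat] + subgraph_feats, dim=1)  # (1, (1 + K) * 2048)
```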
In the embodiment of the present application, an integral splicing manner is taken as an example for explanation. Referring to fig. 3, schematically, a process for identifying a scene category label according to an exemplary embodiment of the present application is shown. As shown in fig. 3, in the process, a target image 300 is obtained first, and a global feature 301 is extracted through a convolutional neural network 310; the attention position 330 is obtained through the attention extraction module 320 based on the global feature 301 of the target image 300, so that at least two region sub-graphs 340 are extracted from the target image 300 (4 region sub-graphs are extracted as shown in fig. 3), sub-graph features 341 of the region sub-graphs 340 are extracted through the convolutional neural network 310 (4 sub-graph features corresponding to the 4 region sub-graphs are extracted as shown in fig. 3), a fusion feature 350 is obtained after the global feature 301 and the sub-graph features 341 are combined, and scene recognition is performed based on the fusion feature 350, so that a scene category label 360 is obtained.
In some embodiments, the attention extraction module 320 is trained by obtaining the attention loss value 370 after the attention position 330 is obtained by the attention extraction module 320.
In summary, in the image scene recognition method provided in this embodiment of the present application, after the global features of the target image are extracted during scene recognition, region subgraphs are extracted from the target image with an attention mechanism based on the global features, and scene recognition is then performed on the target image based on the subgraph features and the global features. The reference content for scene recognition therefore includes not only single entities in the target image but also the image regions of the target image that relate to the scene category labels in the image recognition classification library, which improves the scene recognition accuracy for the target image.
In some embodiments, the region subgraph is extracted from the target image by an attention mechanism. Fig. 4 is a flowchart of an image scene recognition method according to another exemplary embodiment of the present application, which is described by taking the method as an example for being applied to a server, and as shown in fig. 4, the method includes:
step 401, a target image is obtained, and the target image is an image to be identified in an image scene.
The manner of acquiring the target image is described in step 201, and is not described herein again.
Step 402, extracting global features of the target image.
The global feature is a feature obtained by extracting a feature of the entire target image.
In some embodiments, the scene recognition model includes a feature extraction network, and the global features of the target image are extracted by inputting the target image into the feature extraction network as a whole.
And 403, identifying the feature points in the global features based on the global features and the image identification classification library to obtain candidate point scores corresponding to the feature points.
In some embodiments, global feature classification is performed based on the image recognition classification library, so that candidate point scores of feature points are obtained based on the global feature classification result.
A feature classification network is determined based on the feature extraction network shown in Table 1, and global feature classification is performed based on the feature classification network; the structure of the feature classification network is shown in Table 2.
Table 2
(Table 2, which lists the structure of the feature classification network, is reproduced as an image in the original publication.)
After global feature classification is performed through the feature classification network shown in Table 2, the region subgraphs are extracted from the target image based on an attention mechanism.
In some embodiments, the candidate point scores of the feature points in the global features are determined by an attention-mechanism-based part extraction network. Schematically, refer to Table 3 below, which shows the network structure of the part extraction network.
Table 3
(Table 3, which lists the structure of the part extraction network, is reproduced as an image in the original publication.)
The part extraction network comprises a Down layer and a Propost layer; the layer name of the Down layer is denoted Down1_y, and the layer name of the Propost layer is denoted Propost2_y.
The output of the Propost2_y layer is a matrix of size 6 × 9 × 15, where 6 denotes the number of channels (the 128 channels of the preceding layer are compressed into 6 channels by the Propost2_y layer) and 9 × 15 denotes the spatial length and width after convolution. Each feature point in the 9 × 15 space represents the attention intensity at its spatial coordinate, and the feature point can be mapped to a region (that is, a candidate subgraph) in the target image. The 6 × 9 × 15 attention-intensity matrix output by the Propost2_y layer is reshaped into 6 × 9 × 15 = 810 candidate point scores corresponding to the feature points.
It should be noted that Table 3 illustrates the number of channels as 6; in some embodiments, the number of channels may be larger or smaller and may be chosen based on the developer's experimental experience, which is not limited in the embodiments of the present application.
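The following is a minimal sketch of such a part extraction head, assuming a 2048-channel backbone feature map of spatial size 9 × 15; the intermediate channel count and exact layer composition are assumptions, with only the final 6 attention channels and the 810 candidate point scores taken from the description above.

```python
# Illustrative sketch of the attention-based part extraction head; intermediate layer
# details are assumptions, and only the 6-channel 9 x 15 output (810 scores) follows
# the description above.
import torch
import torch.nn as nn

class PartExtractionHead(nn.Module):
    def __init__(self, in_channels=2048, mid_channels=128, num_scales=6):
        super().__init__()
        self.down = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)   # "Down1_y"
        self.proposal = nn.Conv2d(mid_channels, num_scales, kernel_size=1)           # "Propost2_y"

    def forward(self, feature_map):                   # (N, 2048, 9, 15)
        x = torch.relu(self.down(feature_map))        # (N, 128, 9, 15)
        attention = self.proposal(x)                  # (N, 6, 9, 15) attention intensities
        # Reshape to one score per feature point: 6 * 9 * 15 = 810 candidate point scores.
        return attention.flatten(start_dim=1)         # (N, 810)

scores = PartExtractionHead()(torch.randn(1, 2048, 9, 15))
print(scores.shape)   # torch.Size([1, 810])
```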
Step 404, determining confidence of the candidate subgraph in the target image based on the candidate point scores.
Since the global feature is a feature image obtained by up-sampling or down-sampling the target image, feature points in a space obtained by convolving the global feature through 6 channels can be mapped to the target image to obtain corresponding candidate subgraphs.
In some embodiments, after normalizing the candidate point scores corresponding to the feature points after the channel convolution, obtaining the confidence degrees of the candidate subgraphs corresponding to the feature points; or, taking the candidate point score corresponding to the feature point after the channel convolution as the confidence of the corresponding candidate subgraph.
Step 405, at least two regional subgraphs are determined from the candidate subgraphs of the target image based on the confidence.
In some embodiments, the candidate subgraphs of the target image are sorted based on the confidence; a designated subgraph is determined from the sorted candidate subgraphs, the designated subgraph being a candidate subgraph that meets a designated requirement; a required subgraph is determined from the candidate subgraphs based on the overlap relation between the designated subgraph and the candidate subgraphs, the required subgraph being a subgraph whose overlap relation with the designated subgraph meets the overlap-relation requirement; and the at least two region subgraphs are determined based on the required subgraph and the designated subgraph.
In some embodiments, from the candidate subgraphs in the sequence, the candidate subgraph with the highest confidence value is determined as the designated subgraph.
When the required subgraph is determined based on the overlap relation, the overlap rate (Intersection over Union, IoU) between the designated subgraph and each candidate subgraph is determined; a candidate subgraph is retained and determined as a required subgraph in response to its overlap rate with the designated subgraph being less than an overlap rate threshold, and a candidate subgraph is discarded in response to its overlap rate with the designated subgraph reaching the overlap rate threshold.
Schematically, the feature points respectively correspond to candidate subgraphs in the target image, and after the candidate subgraphs are determined, at least two region subgraphs are obtained through a hard non-maximum suppression algorithm (Hard Non-Maximum Suppression, Hard NMS). Hard NMS is executed as follows: the candidate subgraphs are sorted from high to low by the confidence the model assigns to them, the candidate subgraph with the maximum confidence is retained, and all candidate subgraphs whose IoU with that maximum-confidence subgraph is higher than a threshold are deleted. Illustratively, there are 4 candidate subgraphs: (box1, 0.8), (box2, 0.9), (box3, 0.7), (box4, 0.5), where 0.8, 0.9, 0.7 and 0.5 represent the confidence of each candidate subgraph for a certain scene category label. Sorted from high to low confidence, the four candidate subgraphs are: box2 > box1 > box3 > box4. The candidate box2 with the highest confidence is retained, the IoU between each of the remaining three boxes and box2 is calculated, and a box is deleted if its IoU is greater than the preset threshold. Assuming the preset threshold is 0.5: IoU(box1, box2) = 0.1 < 0.5, retain; IoU(box3, box2) = 0.7 > 0.5, delete; IoU(box4, box2) = 0.2 < 0.5, retain. Thus box1, box2 and box4 are retained, and the above sorting, deletion and retention process is then repeated for the other scene category labels.
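The following is a minimal sketch of the Hard NMS procedure described above; boxes are given as (x1, y1, x2, y2) coordinates and the IoU threshold corresponds to the preset value in the example.

```python
# A minimal Hard NMS sketch: keep the highest-confidence candidate subgraph, delete
# every remaining candidate whose IoU with it exceeds the threshold, and repeat.
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def hard_nms(boxes, confidences, iou_threshold=0.5):
    order = sorted(range(len(boxes)), key=lambda i: confidences[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept   # indices of the retained candidate subgraphs (the region subgraphs)
```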
Schematically, referring to fig. 5, the target image 500 is convolved by a first convolutional layer with 6 channels, pooled by a first pooling layer, convolved by a second convolutional layer, pooled by a second pooling layer, passed through a fully connected layer, and then classified by a Softmax layer. The feature points can be mapped to obtain the candidate subgraphs 510 in the target image 500.
And 406, identifying and obtaining a scene category label corresponding to the target image based on the global feature and the sub-image feature of the regional sub-image.
Firstly, extracting sub-graph features of at least two regional sub-graphs, and then identifying based on the global features and the sub-graph features to obtain a scene category label.
In some embodiments, the global feature and the sub-image feature are combined to obtain a fusion feature, so that scene recognition is performed on the fusion feature to obtain a scene category label corresponding to the target image.
Schematically, the pooled results are first concatenated directly to obtain a (1 + K) × 2048 feature vector, where K is the number of region subgraphs, and a fully connected layer is then used to predict the probability of belonging to each of the N categories. The input of the fully connected layer is 1 × (1 + K) × 2048 and its output is 1 × N; this layer computes, from all (global + local) features of the target image, the probability of belonging to each scene category label, and finally one classification result is obtained.
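The following is a minimal sketch of this classification step; the values of K and N are placeholders.

```python
# Illustrative sketch of the fused-feature classification head described above;
# K (number of region subgraphs) and N (number of scene category labels) are placeholders.
import torch
import torch.nn as nn

K, N = 4, 100
classifier = nn.Linear((1 + K) * 2048, N)        # fully connected layer: 1 x (1+K) x 2048 -> 1 x N

pooled = torch.randn(1, (1 + K) * 2048)          # concatenated global + subgraph features
logits = classifier(pooled)                      # (1, N)
scene_probs = torch.softmax(logits, dim=1)       # probability for each scene category label
predicted_label = scene_probs.argmax(dim=1)      # the recognized scene category label
```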
In summary, in the image scene recognition method provided in this embodiment of the present application, after the global features of the target image are extracted during scene recognition, region subgraphs are extracted from the target image with an attention mechanism based on the global features, and scene recognition is then performed on the target image based on the subgraph features and the global features. The reference content for scene recognition therefore includes not only single entities in the target image but also the image regions of the target image that relate to the scene category labels in the image recognition classification library, which improves the scene recognition accuracy for the target image.
According to the method provided by the embodiment, the regional subgraph is determined from the target image based on the mapping relation between the feature point and the candidate subgraph, so that the scene category label is identified based on the combination of the subgraph feature and the global feature of the regional subgraph, and the identification accuracy of the scene category label is further improved.
In some embodiments, the target image is a sample image labeled with a sample class label, that is, the scene recognition model is trained by the target image. Fig. 6 is a flowchart of an image scene recognition method according to another exemplary embodiment of the present application, which is described by taking the method as an example for being applied to a server, and as shown in fig. 6, the method includes:
step 601, acquiring a target image, wherein the target image is an image to be identified in an image scene.
The manner of acquiring the target image is described in step 201, and is not described herein again.
Step 602, extracting global features of the target image.
The global feature is a feature obtained by extracting a feature of the entire target image.
In some embodiments, the scene recognition model includes a feature extraction network, and the global features of the target image are extracted by inputting the target image into the feature extraction network as a whole.
Step 603, extracting at least two regional subgraphs from the target image based on the global features and the image recognition classification library.
The image recognition classification library includes scene category labels used to label images. The scene category labels are set by developers, or are stored in the image recognition classification library after the sample images are manually labeled and before the sample images are used to train the scene recognition model.
The extraction manner of the region sub-map is described in detail in the above steps 403 to 406, and is not described herein again.
And step 604, obtaining image prediction results corresponding to the global features and the sub-image features of the regional sub-images through scene recognition model recognition.
In this embodiment, the description is given by taking as an example that the image prediction result includes a global prediction result, an attention prediction result, a positioning prediction result, and a subgraph prediction result.
The global prediction result is a prediction result obtained by scene recognition based on global features of the target image; the attention prediction result is a prediction result obtained based on the global feature and the sub-graph feature of the region sub-graph determined based on the attention mechanism; the positioning prediction result refers to the positioning accuracy of the predicted regional subgraph; the subgraph prediction result is a prediction result obtained by carrying out scene recognition on the regional subgraph.
Namely, global feature prediction is carried out on the global features through the scene recognition model, and a global prediction result is obtained; carrying out attention classification prediction on the fusion characteristics of the sub-image characteristics and the global characteristics through a scene recognition model to obtain a scene category label corresponding to the target image as an attention prediction result; carrying out positioning accuracy prediction on the subgraph characteristics of at least two regional subgraphs through a scene recognition model to obtain a positioning prediction result; and carrying out classification prediction on the sub-image characteristics of the at least two region sub-images through the scene recognition model to obtain a sub-image prediction result.
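The following sketch illustrates the four prediction branches just described, assuming each branch is a single fully connected layer over 2048-dimensional features; the head structures, K and N are assumptions for illustration.

```python
# A minimal sketch of the four prediction branches; all head structures are assumed
# to be single fully connected layers, and D, K, N are illustrative.
import torch
import torch.nn as nn

K, N, D = 4, 100, 2048
global_head = nn.Linear(D, N)              # global prediction result
attention_head = nn.Linear((1 + K) * D, N) # attention prediction on the fused feature
locate_head = nn.Linear(D, N)              # positioning prediction for each region subgraph
part_head = nn.Linear(D, N)                # subgraph (local) classification prediction

global_feat = torch.randn(1, D)
subgraph_feats = torch.randn(K, D)
fused = torch.cat([global_feat, subgraph_feats.reshape(1, K * D)], dim=1)

global_pred = global_head(global_feat)     # (1, N)
attention_pred = attention_head(fused)     # (1, N)
locate_pred = locate_head(subgraph_feats)  # (K, N)
part_pred = part_head(subgraph_feats)      # (K, N)
```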
Step 605, obtaining a loss value of the image prediction result based on the image prediction result and the sample class label.
In some embodiments, a first loss value is derived based on the global prediction result and the sample class label; obtaining a second loss value based on the attention prediction result and the sample class label; and obtaining a third loss value based on the positioning prediction result and the sample class label, and obtaining a fourth loss value based on the sub-image prediction result and the sample class label, so that the loss value of the image prediction result is obtained based on the first loss value, the second loss value, the third loss value and the fourth loss value.
Wherein, the calculation mode of the first loss value is as follows: the first loss value is calculated using the following formula one, which is a cross entropy loss function of the classification. Wherein the input is a target image labeled with a sample class label.
Formula 1 is as follows:
L = -Σ_i y_i × log(ŷ_i)
where y is the sample class label annotated on the target image, ŷ is the scene category label obtained by global prediction, and L is the first loss value.
The second loss value is calculated in the same manner as the first loss value, except that the scene category label ŷ obtained by global prediction in the formula is replaced by the scene category label corresponding to the attention prediction result.
The third loss value is calculated as follows: after the region subgraphs are extracted through the attention mechanism, a positioning accuracy prediction network is designed to compute an attention positioning prediction over the final N classes, so that each attention output result has perception capability with respect to the classes. After the at least two region subgraphs are input into the feature extraction network, the output obtained through the pooling layer and the positioning accuracy prediction network is of size K × N, that is, for each of the K region subgraphs the probability of belonging to each of the N classes is predicted; K positioning loss values are then calculated to obtain the positioning accuracy loss of the region subgraphs (namely, the third loss value). The third loss value is an average of the K positioning loss values, or a sum of the K positioning loss values.
The fourth loss value is calculated as follows: after the region subgraphs are extracted through the attention mechanism, a subgraph prediction network is designed to predict the probability that a region subgraph belongs to each of the N classes, so that the network has recognition capability for the local features of the region subgraphs. After the at least two region subgraphs are input into the feature extraction network, the output obtained through the pooling layer and the subgraph prediction network is of size K × N, that is, for each of the K region subgraphs the probability of belonging to each of the N classes is predicted; K classification loss values are then calculated to obtain the subgraph recognition loss (namely, the fourth loss value). The fourth loss value is the sum of the K classification loss values.
In some embodiments, the loss value of the image prediction result is a weighted sum of the first, second, third and fourth loss values described above. Schematically, refer to Formula 2 below.
Formula 2: Loss = a × Loss_cr + b × Loss_locate + c × Loss_part + d × Loss_all
where Loss represents the loss value of the image prediction result, Loss_cr represents the first loss value, Loss_all represents the second loss value, Loss_locate represents the third loss value, Loss_part represents the fourth loss value, a is the first weight (of the first loss value), d is the second weight (of the second loss value), b is the third weight (of the third loss value), and c is the fourth weight (of the fourth loss value).
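The following is a sketch of the combined loss of Formula 2, assuming cross-entropy is used for every branch and that the K region subgraphs share the image-level sample label; the weights a, b, c, d are hyperparameters, and the prediction tensors follow the shapes of the branch sketch above.

```python
# Sketch of Formula 2, assuming cross-entropy for every branch; a, b, c, d are
# hyperparameters and the K region subgraphs share the image-level sample label.
import torch
import torch.nn.functional as F

def total_loss(global_pred, attention_pred, locate_pred, part_pred, target,
               a=1.0, b=1.0, c=1.0, d=1.0):
    """target: class-index tensor of shape (1,) holding the sample category label."""
    loss_cr = F.cross_entropy(global_pred, target)                 # first loss value
    loss_all = F.cross_entropy(attention_pred, target)             # second loss value
    k = locate_pred.shape[0]
    subgraph_target = target.expand(k)                             # one label per region subgraph
    loss_locate = F.cross_entropy(locate_pred, subgraph_target)    # third loss value (mean over K)
    loss_part = F.cross_entropy(part_pred, subgraph_target,
                                reduction="sum")                   # fourth loss value (sum over K)
    return a * loss_cr + b * loss_locate + c * loss_part + d * loss_all
```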
And 606, performing parameter adjustment on the scene recognition model based on the loss value of the image prediction result.
In some embodiments, the model is converged by means of multiple iterations.
Schematically, a stochastic gradient descent (SGD) method is used to solve the convolution parameters and bias parameters of the scene recognition model, and all parameters of the scene recognition model are set to a learnable state. In each iteration, m sample images are extracted, all sample images are computed in the forward direction, the loss value is calculated and back-propagated through the scene recognition model (namely, the convolutional neural network model), the gradients are computed, and the parameters of the scene recognition model are updated; this process is iterated for multiple rounds.
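As a sketch of this iterative procedure, the following training loop assumes a PyTorch model, a data loader over the sample image library, and a loss function combining the losses of Formula 2; the learning rate and momentum are illustrative.

```python
# A minimal SGD training-loop sketch; model, data_loader and loss_fn stand in for the
# scene recognition model, the sample image library and the combined loss of Formula 2.
import torch

def train(model, data_loader, loss_fn, epochs=10, lr=0.01):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()                                # all parameters are set to a learnable state
    for _ in range(epochs):                      # iterate the process for multiple rounds
        for images, labels in data_loader:       # m sample images per iteration
            predictions = model(images)          # forward computation on the sample images
            loss = loss_fn(predictions, labels)  # loss value of the image prediction result
            optimizer.zero_grad()
            loss.backward()                      # back-propagate the loss value, compute gradients
            optimizer.step()                     # update the parameters of the scene recognition model
    return model
```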
It should be noted that the neural network model shown in the above embodiments is only an illustrative example, and the specific structure of the model and the specific parameter setting are not limited in the embodiments of the present application.
Referring to fig. 7, which is a schematic diagram illustrating a loss value calculation process of a scene recognition model according to an exemplary embodiment of the present application, as shown in fig. 7, first, a sample image 700 marked with a sample class label 710 is subjected to feature extraction through a feature extraction network 701 to obtain global features, so that global feature classification is performed through a global feature classification network 702 to obtain a global prediction result 703, and a first loss value is obtained based on the global prediction result 703 and the sample class label 710.
At least two region subgraphs 705 are obtained from the sample image 700 based on the global features through the attention-mechanism-based part extraction network 704; the subgraph features of the region subgraphs 705 are extracted through the feature extraction network 701 and combined with the global features to obtain a fusion feature 706; the fusion feature 706 is predicted through the global attention classification prediction network 707 to obtain an attention prediction result 708, and a second loss value is obtained based on the attention prediction result 708 and the sample class label 710.
After the sub-graph features are extracted, a positioning prediction result 711 is obtained through prediction of a positioning accuracy prediction network 709, and a third loss value is obtained based on the positioning prediction result 711 and the sample class label 710.
After the subgraph features are extracted, a subgraph prediction result 713 is obtained through the subgraph prediction network 712 in a prediction mode, and a fourth loss value is obtained on the basis of the subgraph prediction result 713 and the sample class label 710.
Accordingly, a loss value of the image prediction result is obtained based on the first loss value, the second loss value, the third loss value and the fourth loss value, and the scene recognition model is trained based on the loss value.
In summary, in the image scene recognition method provided in this embodiment of the present application, after the global features of the target image are extracted during scene recognition, region subgraphs are extracted from the target image with an attention mechanism based on the global features, and scene recognition is then performed on the target image based on the subgraph features and the global features. The reference content for scene recognition therefore includes not only single entities in the target image but also the image regions of the target image that relate to the scene category labels in the image recognition classification library, which improves the scene recognition accuracy for the target image.
Local part features that are easily ignored globally are extracted by the model through attention mechanism training and combined with the global features for recognition and classification, which improves the model's ability to describe the features of different scenes and optimizes the scene recognition effect.
The whole process is learned end to end, which avoids the problems that a staged model is difficult to optimize and that feature learning in different stages cannot be shared.
Fig. 8 is a block diagram of an apparatus for recognizing an image scene according to an exemplary embodiment of the present application, where the apparatus includes:
an obtaining module 810, configured to obtain a target image, where the target image is an image whose image scene is to be identified;
an extracting module 820, configured to extract a global feature of the target image, where the global feature is obtained by performing feature extraction on the whole target image;
the extracting module 820 is further configured to extract at least two region subgraphs from the target image based on the global feature and an image recognition classification library, where the image recognition classification library includes a scene category label for labeling an image;
and the identifying module 830 is configured to identify and obtain a scene category label corresponding to the target image based on the global feature and the subgraph features of the region subgraphs.
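As a hedged illustration of how these modules could be wired together at inference time, the sketch below assumes three callables (a backbone returning a global feature, an attention-based part extractor returning region subgraphs, and a fusion classifier returning scene-label scores); the names, interfaces, and tensor shapes are assumptions for illustration rather than a required structure of the embodiments.

```python
import torch

def recognize_image_scene(image, feature_extractor, part_extractor, fusion_classifier):
    """Inference sketch mirroring the obtaining / extraction / identification modules.

    Assumed interfaces: `feature_extractor(x)` returns a (batch, feature_dim)
    feature, `part_extractor` returns a list of cropped region subgraphs, and
    `fusion_classifier` scores the merged features against the scene category
    labels of the image recognition classification library.
    """
    global_feature = feature_extractor(image)                  # extraction module
    region_subgraphs = part_extractor(image, global_feature)   # at least two subgraphs
    subgraph_features = torch.stack(
        [feature_extractor(sub) for sub in region_subgraphs], dim=1)
    scores = fusion_classifier(global_feature, subgraph_features)  # identification module
    return scores.argmax(dim=1)  # index of the predicted scene category label
```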
In some embodiments, the identifying module 830 is further configured to identify feature points in the global features based on the global features and the image recognition classification library, so as to obtain candidate point scores corresponding to the feature points;
as shown in fig. 9, the extraction module 820 includes:
a determining unit 821, configured to determine a confidence of a candidate subgraph in the target image based on the candidate point score, where a mapping relationship exists between the candidate subgraph and the feature point;
the determining unit 821 is further configured to determine the at least two regional subgraphs from the candidate subgraphs of the target image based on the confidence.
In some embodiments, the extraction module 820 further comprises:
a sorting unit 822 for sorting the candidate subgraphs of the target image based on the confidence;
the determining unit 821 is further configured to determine a designated subgraph from the candidate subgraphs arranged in sequence, where the designated subgraph is a subgraph meeting a designated requirement in the candidate subgraphs;
the determining unit 821 is further configured to determine a required subgraph from the candidate subgraphs based on an overlapping relationship between the designated subgraph and the candidate subgraph, where the required subgraph is a subgraph whose overlapping relationship with the designated subgraph meets the requirement of the overlapping relationship;
the determining unit 821 is further configured to determine the at least two regional subgraphs based on the required subgraph and the specified subgraph.
In some embodiments, the determining unit 821 is further configured to determine, as the designated subgraph, the candidate subgraph with the highest confidence value from the candidate subgraphs arranged in sequence.
In some embodiments, the determining unit 821 is further configured to determine an overlap ratio between the designated sub-graph and the candidate sub-graph;
the determining unit 821 is further configured to retain the candidate subgraph and determine the candidate subgraph as the required subgraph in response to an overlap ratio between the designated subgraph and the candidate subgraph reaching an overlap ratio threshold; and to discard the candidate subgraph in response to the overlap ratio between the designated subgraph and the candidate subgraph being less than the overlap ratio threshold.
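A minimal sketch of this confidence ranking and overlap-based selection follows. It assumes the candidate subgraphs are axis-aligned boxes and that the overlap ratio is intersection-over-union; both are illustrative assumptions, since the embodiments above do not fix the subgraph representation or the overlap measure.

```python
def select_region_subgraphs(candidates, confidences, overlap_threshold=0.5, num_subgraphs=2):
    """Select region subgraphs: rank by confidence, take the highest-confidence
    candidate as the designated subgraph, then keep candidates whose overlap
    ratio with it reaches the threshold (the required subgraphs).

    Boxes are assumed to be (x1, y1, x2, y2) tuples and the overlap ratio is
    assumed to be IoU; `num_subgraphs` is the number of subgraphs to return
    (at least two in the embodiments above).
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1]) +
                 (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    ranked = sorted(zip(candidates, confidences), key=lambda c: c[1], reverse=True)
    designated = ranked[0][0]
    selected = [designated]
    for box, _ in ranked[1:]:
        if len(selected) == num_subgraphs:
            break
        if iou(designated, box) >= overlap_threshold:  # overlap requirement met
            selected.append(box)
    return selected
```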
In some embodiments, the extracting module 820 is further configured to extract the sub-graph features of the at least two region sub-graphs;
the device further comprises:
a merging module 840, configured to merge the sub-graph features and the global features to obtain a fused feature;
the identifying module 830 is further configured to perform scene identification on the fusion feature to obtain a scene category label corresponding to the target image.
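The merging and recognition step could be realized, for example, by concatenating the subgraph features with the global feature and applying a fully connected classification layer; this is an assumption for illustration, as the embodiments only require that the features be merged into a fusion feature on which scene recognition is performed.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Merge subgraph features with the global feature and classify the scene.

    Concatenation plus a single linear layer is an illustrative assumption;
    any merging operation producing a fusion feature would fit the description above.
    """
    def __init__(self, feature_dim, num_subgraphs, num_scene_labels):
        super().__init__()
        fused_dim = feature_dim * (num_subgraphs + 1)
        self.classifier = nn.Linear(fused_dim, num_scene_labels)

    def forward(self, global_feature, subgraph_features):
        # global_feature: (batch, feature_dim)
        # subgraph_features: (batch, num_subgraphs, feature_dim)
        fused = torch.cat([global_feature, subgraph_features.flatten(1)], dim=1)
        return self.classifier(fused)  # scores over the scene category labels
```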
In some embodiments, the apparatus is provided with a scene recognition model, and the target image is a sample image labeled with a sample class label;
the identifying module 830 is further configured to identify, by the scene recognition model, an image prediction result corresponding to the global feature and the subgraph features of the region subgraphs; and to obtain a loss value of the image prediction result based on the image prediction result and the sample class label;
the device further comprises:
an adjusting module 850, configured to perform parameter adjustment on the scene recognition model based on the loss value of the image prediction result.
In some embodiments, the identifying module 830 is further configured to perform global feature prediction on the global feature through the scene identification model to obtain a global prediction result;
the identifying module 830 is further configured to perform, through the scene recognition model, attention classification prediction on the fusion feature obtained by merging the subgraph features and the global feature, and obtain a scene category label corresponding to the target image as an attention prediction result;
the recognition module 830 is further configured to perform positioning accuracy prediction on the sub-image features of the at least two region sub-images through the scene recognition model to obtain a positioning prediction result;
the identifying module 830 is further configured to perform classification prediction on the sub-graph features of the at least two region sub-graphs through the scene identification model to obtain a sub-graph prediction result.
In some embodiments, the identifying module 830 is further configured to obtain a first loss value based on the global prediction result and the sample class label; obtaining a second loss value based on the attention prediction result and the sample class label; obtaining a third loss value based on the positioning prediction result and the sample class label; obtaining a fourth loss value based on the subgraph prediction result and the sample class label; obtaining a loss value of the image prediction result based on the first loss value, the second loss value, the third loss value, and the fourth loss value.
In summary, according to the image scene recognition device provided in the embodiment of the present application, after the global feature of the target image is extracted in the scene recognition process of the target image, the regional subgraph is extracted from the target image by using an attention mechanism based on the global feature, so that the target image is subjected to scene recognition based on the subgraph feature and the global feature, that is, the reference content of the scene recognition not only includes a single entity in the target image, but also includes an image region in the target image, which is related to each scene category label in the image recognition classification library, so that the scene recognition accuracy of the target image is improved.
It should be noted that: the image scene recognition apparatus provided in the foregoing embodiment is only illustrated by the division of the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the image scene recognition device provided in the above embodiment has the same concept as the image scene recognition method embodiment, and the specific implementation process thereof is described in the method embodiment and is not described herein again.
Fig. 10 shows a schematic structural diagram of a computer device provided in an exemplary embodiment of the present application, and the computer device may be implemented as the server 120 shown in fig. 1 above. Specifically:
the computer apparatus 1000 includes a Central Processing Unit (CPU) 1001, a system Memory 1004 including a Random Access Memory (RAM) 1002 and a Read Only Memory (ROM) 1003, and a system bus 1005 connecting the system Memory 1004 and the Central Processing Unit 1001. The computer device 1000 also includes a mass storage device 1006 for storing an operating system 1013, application programs 1014, and other program modules 1015.
The mass storage device 1006 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1006 and its associated computer-readable media provide non-volatile storage for the computer device 1000. That is, the mass storage device 1006 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read Only Memory (CD-ROM) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1004 and mass storage device 1006 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1000 may also operate by being connected to a remote computer on a network through a network such as the Internet. That is, the computer device 1000 may be connected to the network 1012 through the network interface unit 1011 connected to the system bus 1005, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 1011.
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.
Embodiments of the present application further provide a computer device, which includes a processor and a memory, where at least one instruction, at least one program, a code set, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the method for recognizing an image scene provided by the foregoing method embodiments.
Embodiments of the present application further provide a computer-readable storage medium, where at least one instruction, at least one program, a code set, or a set of instructions is stored on the computer-readable storage medium, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the method for identifying an image scene provided by the foregoing method embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to make the computer device execute the image scene identification method in any of the above embodiments.
Optionally, the computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM). The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (14)

1. A method for identifying an image scene, the method comprising:
acquiring a target image, wherein the target image is an image whose image scene is to be identified;
extracting global features of the target image, wherein the global features are obtained by extracting features of the whole target image;
extracting at least two region subgraphs from the target image based on the global features and an image recognition classification library, wherein the image recognition classification library comprises scene class labels for labeling images;
and identifying and obtaining a scene category label corresponding to the target image based on the global features and subgraph features of the region subgraphs.
2. The method of claim 1, wherein said extracting at least two region subgraphs from the target image based on the global features and an image recognition classification library comprises:
identifying feature points in the global features based on the global features and the image identification classification library to obtain candidate point scores corresponding to the feature points;
determining a confidence level of a candidate subgraph in the target image based on the candidate point scores, wherein a mapping relation exists between the candidate subgraph and the feature points;
determining the at least two region subgraphs from the candidate subgraphs of the target image based on the confidence.
3. The method of claim 2, wherein the determining the at least two region subgraphs from the candidate subgraphs of the target image based on the confidence comprises:
ranking the candidate subgraphs of the target image based on the confidence;
determining a designated subgraph from the candidate subgraphs which are arranged in sequence, wherein the designated subgraph is the subgraph which meets the designated requirement in the candidate subgraphs;
determining a required subgraph from the candidate subgraphs based on the overlapping relation between the designated subgraph and the candidate subgraph, wherein the required subgraph is the subgraph of which the overlapping relation with the designated subgraph meets the requirement of the overlapping relation;
determining the at least two region subgraphs based on the required subgraphs and the specified subgraphs.
4. The method of claim 3, wherein determining a designated subgraph from the candidate subgraphs in the sequence comprises:
and determining, from the candidate subgraphs arranged in sequence, the candidate subgraph with the highest confidence value as the designated subgraph.
5. The method of claim 3, wherein determining a required subgraph from the candidate subgraphs based on the overlapping relationship between the specified subgraph and the candidate subgraphs comprises:
determining an overlap ratio between the designated subgraph and the candidate subgraph;
in response to an overlap ratio between the designated subgraph and the candidate subgraph reaching an overlap ratio threshold, retaining the candidate subgraph and determining the candidate subgraph as the required subgraph;
discarding the candidate subgraph in response to the overlap ratio between the designated subgraph and the candidate subgraph being less than the overlap ratio threshold.
6. The method of any one of claims 1 to 5, wherein the identifying and obtaining the scene category label corresponding to the target image based on the global features and the subgraph features of the region subgraphs comprises:
extracting the subgraph features of the at least two region subgraphs;
merging the subgraph features and the global features to obtain a fusion feature;
and performing scene recognition on the fusion feature to obtain the scene category label corresponding to the target image.
7. The method according to any one of claims 1 to 5, wherein the method is applied to a scene recognition model, and the target image is a sample image labeled with a sample category label;
the method further comprises the following steps:
identifying and obtaining, through the scene recognition model, an image prediction result corresponding to the global features and the subgraph features of the region subgraphs;
obtaining a loss value of the image prediction result based on the image prediction result and the sample class label;
and performing parameter adjustment on the scene recognition model based on the loss value of the image prediction result.
8. The method of claim 7, wherein the identifying and obtaining, through the scene recognition model, the image prediction result corresponding to the global features and the subgraph features of the region subgraphs comprises:
performing global feature prediction on the global features through the scene recognition model to obtain a global prediction result;
performing, through the scene recognition model, attention classification prediction on the fusion feature obtained by merging the subgraph features and the global features, to obtain a scene category label corresponding to the target image as an attention prediction result;
performing positioning accuracy prediction on the subgraph features of the at least two region subgraphs through the scene recognition model to obtain a positioning prediction result;
and performing classification prediction on the subgraph features of the at least two region subgraphs through the scene recognition model to obtain a subgraph prediction result.
9. The method of claim 8, wherein deriving the loss value for the image predictor based on the image predictor and the sample class label comprises:
obtaining a first loss value based on the global prediction result and the sample class label;
obtaining a second loss value based on the attention prediction result and the sample class label;
obtaining a third loss value based on the positioning prediction result and the sample class label;
obtaining a fourth loss value based on the subgraph prediction result and the sample class label;
obtaining a loss value of the image prediction result based on the first loss value, the second loss value, the third loss value, and the fourth loss value.
10. An apparatus for recognizing an image scene, the apparatus comprising:
an obtaining module, configured to obtain a target image, wherein the target image is an image whose image scene is to be identified;
an extraction module, configured to extract global features of the target image, wherein the global features are obtained by performing feature extraction on the whole target image;
the extraction module is further configured to extract at least two region subgraphs from the target image based on the global features and an image recognition classification library, wherein the image recognition classification library comprises scene category labels used for labeling images;
and an identification module, configured to identify and obtain a scene category label corresponding to the target image based on the global features and subgraph features of the region subgraphs.
11. The apparatus according to claim 10, wherein the identification module is further configured to identify feature points in the global features based on the global features and the image recognition classification library, so as to obtain candidate point scores corresponding to the feature points;
the extraction module comprises:
a determining unit, configured to determine a confidence of a candidate subgraph in the target image based on the candidate point score, where a mapping relationship exists between the candidate subgraph and the feature point;
the determining unit is further configured to determine the at least two region subgraphs from the candidate subgraphs of the target image based on the confidence.
12. The apparatus of claim 11, wherein the extraction module further comprises:
the sorting unit is used for sorting the candidate subgraphs of the target image based on the confidence coefficient;
the determining unit is further configured to determine a designated subgraph from the candidate subgraphs arranged in sequence, where the designated subgraph is a subgraph meeting a designated requirement in the candidate subgraphs;
the determining unit is further configured to determine a required subgraph from the candidate subgraphs based on an overlapping relationship between the specified subgraph and the candidate subgraph, where the required subgraph is a subgraph whose overlapping relationship with the specified subgraph meets the requirement of the overlapping relationship;
the determining unit is further configured to determine the at least two region subgraphs based on the required subgraph and the specified subgraph.
13. A computer device, characterized in that the computer device comprises a processor and a memory, wherein at least one program is stored in the memory, and the at least one program is loaded and executed by the processor to implement the method for recognizing an image scene according to any one of claims 1 to 9.
14. A computer-readable storage medium, in which at least one program is stored, which is loaded and executed by a processor to implement the method for recognizing an image scene according to any one of claims 1 to 9.
CN202110218951.6A 2021-02-26 2021-02-26 Image scene recognition method, device, equipment and readable storage medium Pending CN113705293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110218951.6A CN113705293A (en) 2021-02-26 2021-02-26 Image scene recognition method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110218951.6A CN113705293A (en) 2021-02-26 2021-02-26 Image scene recognition method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113705293A true CN113705293A (en) 2021-11-26

Family

ID=78647729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110218951.6A Pending CN113705293A (en) 2021-02-26 2021-02-26 Image scene recognition method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113705293A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114360007A (en) * 2021-12-22 2022-04-15 浙江大华技术股份有限公司 Face recognition model training method, face recognition device, face recognition equipment and medium
CN114004963A (en) * 2021-12-31 2022-02-01 深圳比特微电子科技有限公司 Target class identification method and device and readable storage medium
CN114004963B (en) * 2021-12-31 2022-03-29 深圳比特微电子科技有限公司 Target class identification method and device and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination