CN111597921A - Scene recognition method and device, computer equipment and storage medium


Info

Publication number
CN111597921A
Authority
CN
China
Prior art keywords
image
scene
training
target
object group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010348239.3A
Other languages
Chinese (zh)
Other versions
CN111597921B (en)
Inventor
周立广
岑俊
林天麟
徐扬生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Artificial Intelligence and Robotics
Chinese University of Hong Kong CUHK
Original Assignee
Shenzhen Institute of Artificial Intelligence and Robotics
Chinese University of Hong Kong CUHK
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Artificial Intelligence and Robotics, Chinese University of Hong Kong CUHK filed Critical Shenzhen Institute of Artificial Intelligence and Robotics
Priority to CN202010348239.3A priority Critical patent/CN111597921B/en
Publication of CN111597921A publication Critical patent/CN111597921A/en
Application granted granted Critical
Publication of CN111597921B publication Critical patent/CN111597921B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a scene recognition method, a scene recognition device, computer equipment and a storage medium. The method comprises the following steps: acquiring an image to be identified; identifying the image to be identified to obtain a plurality of target objects in the image to be identified; generating an object group matrix according to the target object, and extracting object group characteristics according to the object group matrix; calling a scene recognition model to perform feature extraction on the image to be recognized to obtain image features; and carrying out classification processing according to the object group characteristics and the image characteristics to obtain a scene type corresponding to the image to be identified. By adopting the method, the accuracy of scene recognition can be effectively improved.

Description

Scene recognition method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a scene recognition method and apparatus, a computer device, and a storage medium.
Background
With the development of computer technology, scene recognition technology is widely applied in many fields. Scene recognition technology, also called scene classification technology, is one of the important topics in computer vision and is used for recognizing the scene in an image. For example, scene recognition technology can be applied in the field of robotics so that a robot can recognize its current scene, and accurate scene recognition helps the robot make more intelligent decisions.
Conventional scene recognition methods generally determine the scene type of an image from the relationship between a single object and a scene: when a particular object appears in the image, the corresponding scene type is inferred. However, when the same object exists in different scenes, the scene type of the image cannot be accurately recognized from that object alone, which reduces the accuracy of scene recognition.
Disclosure of Invention
In view of the above, it is necessary to provide a scene recognition method, apparatus, computer device and storage medium capable of improving scene recognition accuracy.
A method of scene recognition, the method comprising:
acquiring an image to be identified;
identifying the image to be identified to obtain a plurality of target objects in the image to be identified;
generating an object group matrix according to the target object, and extracting object group characteristics according to the object group matrix;
calling a scene recognition model to perform feature extraction on the image to be recognized to obtain image features;
and carrying out classification processing according to the object group characteristics and the image characteristics to obtain a scene type corresponding to the image to be identified.
In one embodiment, the generating an object group matrix from the target object includes:
acquiring a preset standard scene object;
traversing the target object according to the standard scene object to obtain an object vector corresponding to the target object;
and generating an object group matrix according to the object vector.
In one embodiment, the generating an object group matrix from the target object includes:
combining the target objects to obtain a plurality of object groups;
acquiring a standard specificity matrix, wherein the standard specificity matrix comprises a plurality of standard specificities;
screening a plurality of standard specificity degrees according to the object group to obtain target specificity degrees corresponding to the object group;
and generating the object group matrix according to the target specificity and the standard specificity matrix.
In one embodiment, the method further includes a step of generating the standard specificity matrix, and the step of generating the standard specificity matrix includes:
acquiring a training image set, wherein the training image set comprises training images corresponding to various scene types;
identifying training objects in the training images, and generating a plurality of training object groups according to the training objects;
counting the number of object groups corresponding to the training object group according to the scene type, and generating a joint probability matrix corresponding to the scene type according to the number of the object groups;
calculating training specificity corresponding to the training object group according to the joint probability matrix;
and when the training specificity accords with a preset condition, recording the training specificity as a standard specificity, and generating the standard specificity matrix according to the standard specificity.
In one embodiment, after the counting the number of object groups corresponding to the training object group according to the scene type, the method further includes:
comparing the number of the object groups with a preset threshold value;
and when the number of the object groups is smaller than the preset threshold value, recording the training object group corresponding to the number of the object groups as a noise object group.
In one embodiment, the identifying the image to be identified to obtain a plurality of target objects in the image to be identified includes:
acquiring a trained target detection model;
and inputting the image to be recognized into the target detection model to obtain a plurality of target image areas corresponding to the image to be recognized and target objects corresponding to the target image areas.
In one embodiment, the identifying the image to be identified to obtain a plurality of target objects in the image to be identified includes:
calling a semantic segmentation model;
inputting the image to be recognized into the semantic segmentation model to obtain a semantic segmentation image corresponding to the image to be recognized, wherein the pixel value of the semantic segmentation image corresponds to the category of the pixel point;
and determining the target object in the image to be identified according to the pixel points with the same belonged categories.
A scene recognition apparatus, the apparatus comprising:
the object identification module is used for acquiring an image to be identified; identifying the image to be identified to obtain a plurality of target objects in the image to be identified;
the characteristic extraction module is used for generating an object group matrix according to the target object and extracting object group characteristics according to the object group matrix; calling a scene recognition model to perform feature extraction on the image to be recognized to obtain image features;
and the scene identification module is used for carrying out classification processing according to the object group characteristics and the image characteristics to obtain a scene type corresponding to the image to be identified.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the above-described scene recognition method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned scene recognition method.
According to the scene recognition method and apparatus, the computer device and the storage medium, the acquired image to be recognized is recognized to obtain a plurality of target objects in the image, an object group matrix is generated according to the target objects, and the corresponding object group features are extracted according to the object group matrix; the object group features can reflect the relationship between the object groups in the image to be recognized and the scene type. A scene recognition model is called to perform feature extraction on the image to be recognized, and classification is performed according to the image features and the object group features to obtain the scene type corresponding to the image to be recognized. By fully integrating the object group features and the image features of the image, the accuracy of scene recognition is effectively improved.
Drawings
FIG. 1 is a diagram of an exemplary scenario recognition application environment;
FIG. 2 is a flow diagram illustrating a method for scene recognition in one embodiment;
FIG. 3 is a flowchart illustrating the steps of generating an object group matrix from target objects in one embodiment;
FIG. 4 is a flowchart illustrating the steps of generating an object group matrix from target objects in another embodiment;
FIG. 5 is a flowchart illustrating the steps of generating a standard specificity matrix in one embodiment;
FIG. 6 is a block diagram showing the structure of a scene recognition apparatus according to an embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The scene identification method can be applied to a terminal and a server. In one embodiment, the scene recognition method provided by the present application can be applied to the application environment as shown in fig. 1. Wherein the terminal 102 and the server 104 communicate via a network. The terminal 102 may obtain the image to be recognized by itself, or may obtain the image to be recognized from the server 104 through the network. The terminal 102 identifies the image to be identified to obtain a plurality of target objects in the image to be identified. The terminal 102 generates an object group matrix according to the target object, and extracts corresponding object group characteristics according to the object group matrix. The terminal 102 calls the scene recognition model to perform feature extraction on the image to be recognized, so as to obtain image features. And the terminal 102 performs classification processing according to the object group characteristics and the image characteristics to obtain a scene type corresponding to the image to be identified. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, portable wearable devices, and robots. The robot refers to a machine device that automatically performs work, and the robot may include, but is not limited to, an industrial robot, a service robot, a special robot, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
In an embodiment, as shown in fig. 2, a scene recognition method is provided, which is described by taking the method as an example applied to the terminal 102 in fig. 1, and includes the following steps:
step 202, acquiring an image to be identified.
The image to be recognized refers to an image to be subjected to scene recognition. Scene recognition is one of the image classification tasks, which is used to recognize scenes appearing in images and classify the images based on scene dimensions. The recognition result of the scene recognition may be a scene type corresponding to the image scene, or a scene name corresponding to the scene type. An image scene may reflect the environment to which the image corresponds from the scene dimension. The image scene may be an indoor scene or an outdoor scene. The image scenes may correspond to different scene types in the same environment. For example, the image scene may correspond to different scene types in a residential environment, and specifically may include scenes of a living room, a bathroom, a balcony, a kitchen, a dining room, or a bedroom. The image scenes may also be different scene types in different environments. For example, the image scene may specifically include a scene of a coffee shop, a restaurant, a bar, a conference room, a library, a waiting hall, or the like. The image to be recognized may specifically include at least one of a living room image, a kitchen image, a bedroom image, a coffee shop image, a bar image, a conference room image, a library image, or the like, corresponding to the scene.
The terminal can acquire one or more images to be identified, where a plurality of images means two or more images. The terminal can acquire the image to be identified in various ways. Specifically, the terminal may acquire image data of the current scene through a corresponding acquisition device. The acquisition device may specifically be a camera, a still camera, a video camera, or the like corresponding to the terminal. Through the acquisition device, the terminal can acquire image data of the current scene and can also acquire video data of the current scene.
When the terminal collects video data, the terminal can analyze the video data and acquire one or more frames of video images in the video data as images to be identified. The terminal can perform scene recognition by acquiring image data of a current corresponding scene as an image to be recognized, so that the terminal can conveniently know current environment information from scene dimensions. In one embodiment, when the scene recognition method is applied to a server, the terminal may upload the acquired image data to the server, and the server may record the received image data uploaded by the terminal as an image to be recognized and perform scene recognition on the image to be recognized.
The terminal can also crawl images from the network, and the images crawled from the network are used as images to be identified for scene identification. The terminal can also acquire images sent by other computer equipment, and the acquired images are recorded as images to be identified. Wherein the other computer devices may include other terminals or servers.
And 204, identifying the image to be identified to obtain a plurality of target objects in the image to be identified.
The target object refers to an object that needs to be recognized from an image to be recognized. Based on the difference of the scenes corresponding to the different images to be recognized, the target objects corresponding to the different images to be recognized may be different. The scene corresponding to the image to be recognized usually includes a plurality of objects, and the target object may specifically be an object in the image to be recognized. For example, in a bedroom scene, a plurality of objects such as a bed, a bedside table, a pillow, a wardrobe, a ceiling lamp and the like are usually included in the bedroom, and the target objects corresponding to the image to be recognized based on the bedroom scene may be at least two of the plurality of objects.
The terminal can perform image recognition on the image to be recognized through the image recognition model to obtain a plurality of target objects in the image to be recognized. Wherein, a plurality means two or more. The terminal can perform image recognition on the image to be recognized through at least one of a plurality of image recognition modes. For example, the terminal may perform image recognition on the image to be recognized through at least one of target detection, semantic segmentation, and the like. The image recognition model may specifically be at least one of a target detection model or a semantic segmentation model, corresponding to the image recognition mode.
Specifically, the terminal can obtain a target detection model, and perform target detection on the image to be recognized through the target detection model to obtain a target object corresponding to the image to be recognized. The target detection model may be pre-established according to a target detection algorithm and obtained after training. The terminal can also obtain a semantic segmentation model obtained after training, perform semantic segmentation on the image to be recognized through the semantic segmentation model, and determine the target object according to the semantic segmentation result of each pixel point in the image to be recognized. The semantic segmentation model may be built according to a semantic segmentation algorithm. The terminal can also call the target detection model and the semantic segmentation model to identify the image to be identified, and determines a plurality of target objects in the image to be identified by integrating the identification result of the target detection model and the identification result of the semantic segmentation model, so that the identification accuracy of the target objects in the image to be identified is effectively improved.
And step 206, generating an object group matrix according to the target object, and extracting object group characteristics according to the object group matrix.
The target object and the scene are related, and the scene type corresponding to the image can be identified according to the relationship between the target object and the scene. For example, when the target objects include a bathtub, the scene corresponding to the image may be determined to be a bathroom according to the relationship between that object and the scene. However, a target object may be a common object shared by a plurality of scenes, and the scene of the image cannot be accurately recognized from such a common object. For example, when the target object recognized from the image is a table, it cannot be determined from the table alone whether the scene corresponding to the image is a household dining room or a restaurant. Therefore, the terminal can generate object groups from the target objects, and the scene type corresponding to the image to be recognized can be reflected more accurately through the object groups.
The terminal can generate an object group matrix according to the target objects, and the scene type corresponding to the image to be identified can be identified more accurately through the object group matrix. Specifically, the terminal may combine the plurality of identified target objects to obtain a plurality of object groups. Each object group may include a plurality of target objects, and the number of target objects in an object group may be one of 2, 3, or 4, etc. The relation between the target object and the scene can be more accurately expressed through the object group. For example, the object group may include two target objects, and when the two target objects in the object group are a bathtub and a washstand, respectively, it may be determined that the scene corresponding to the object group is a bathroom.
The terminal can generate an object group matrix according to the generated plurality of object groups, and extract corresponding object group characteristics according to the object group matrix. In one embodiment, the terminal may pass the object group matrix through a feature extraction network to obtain the object group features output by the feature extraction network. Wherein the feature extraction network may be a fully connected network. The object group matrix is generated according to the target object, and the relation between the target object and the scene can be more accurately expressed through the object group characteristics corresponding to the object group matrix, so that the accuracy of scene identification is effectively improved.
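For illustration, a minimal sketch of such a feature extraction network is given below (Python/PyTorch; the 80-object vocabulary, the layer sizes and the 512-dimensional output follow the examples in this description but are otherwise assumptions, not a prescribed implementation):

```python
import torch
import torch.nn as nn

class ObjectGroupFeatureNet(nn.Module):
    """Fully connected network that maps an object group matrix to an object group feature."""
    def __init__(self, num_objects=80, feature_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                   # 80 x 80 matrix -> 6400-dim vector
            nn.Linear(num_objects * num_objects, 1024),
            nn.ReLU(),
            nn.Linear(1024, feature_dim),                   # 512-dim object group feature
            nn.ReLU(),
        )

    def forward(self, object_group_matrix):                 # shape: (batch, 80, 80), float
        return self.net(object_group_matrix)
```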
And 208, calling the scene recognition model to perform feature extraction on the image to be recognized to obtain image features.
The scene recognition model is a model for performing scene recognition on the image, and the scene recognition model can be a network model established based on a neural network and obtained after being trained by a large amount of training data. Specifically, the scene recognition model may be a network model established based on Place-CNN (Convolutional Neural Networks), and the network architecture of the scene recognition model may be one of a plurality of network architectures corresponding to the Convolutional Neural network. For example, the scene recognition model may be established based on ResNet (Residual Network). After the scene recognition model is established, the scene recognition model can be trained through a large amount of training data to obtain the scene recognition model which can be used for scene recognition.
After the scene recognition model is trained, the scene recognition model can be configured in the terminal, so that the terminal can call the scene recognition model to perform scene recognition on the image to be recognized. The terminal can call the scene recognition model, the image to be recognized is input into the scene recognition model, and the terminal can extract the features of the image to be recognized through the scene recognition model to obtain the image features corresponding to the image to be recognized. In one embodiment, after the terminal performs feature extraction on the image to be recognized through the scene recognition model to obtain the image features corresponding to the image to be recognized, the terminal can also perform scene recognition directly according to the image features based on the scene recognition model to obtain the scene type corresponding to the image to be recognized output by the scene recognition model.
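For illustration, one possible way to obtain image features from a ResNet-style backbone is sketched below (Python/PyTorch; the ResNet-18 depth and the pretrained weights are assumptions, since this embodiment only specifies a ResNet-based Place-CNN trained on scene data):

```python
import torch
import torch.nn as nn
from torchvision import models

# Backbone for the scene recognition model; ResNet-18 and ImageNet weights are
# illustrative stand-ins for a Place-CNN trained on scene images.
backbone = models.resnet18(pretrained=True)
backbone.fc = nn.Identity()           # drop the classification head, keep 512-dim features

def extract_image_features(image_batch):
    """image_batch: (batch, 3, H, W) tensor, already normalized."""
    backbone.eval()
    with torch.no_grad():
        return backbone(image_batch)  # (batch, 512) image features
```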
And step 210, performing classification processing according to the object group characteristics and the image characteristics to obtain a scene type corresponding to the image to be identified.
The terminal can classify the image to be recognized according to the object group characteristics and the image characteristics to obtain the scene type corresponding to the image to be recognized. The relation between the target object and the scene can be more accurately expressed through the object group characteristics, so that the accuracy of scene identification is improved. And moreover, the image features are supplemented according to the object group features, and classification processing is carried out according to the object group features and the image features, so that the accuracy of classification results is further improved, and the accuracy of scene recognition is improved.
Specifically, the terminal may combine the object group features and the image features to obtain scene features corresponding to the image to be recognized, and the terminal may perform classification processing according to the scene features corresponding to the image to be recognized to obtain a scene type corresponding to the image to be recognized. For example, the terminal may obtain 512-dimensional object group features and image features through feature extraction, and the terminal may combine the 512-dimensional object group features and the 512-dimensional image features to obtain 1024-dimensional scene features. The terminal can perform classification processing according to the scene features of 1024 dimensions to obtain the scene type corresponding to the image to be identified.
The terminal can classify the scene features in various classification ways. For example, the terminal may obtain a preset classifier, and perform classification processing on the scene features through the classifier to obtain the scene type corresponding to the image to be identified. The classifier may specifically include, but is not limited to, an SVM (Support Vector Machine), a DT (Decision Tree), or an NBM (Naive Bayesian Model). The terminal can also input the scene features into a classification network, and the classification network classifies the scene features to obtain the scene type corresponding to the image to be identified. The classification network may be a fully connected network, and the fully connected network may have multiple layers. For example, following the feature dimensions above, a fully connected network used for classification may have three layers whose sizes are 1024, 512, and the number of scene types.
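For illustration, a minimal sketch of a classification head over the concatenated features is given below (Python/PyTorch; the 512/1024 dimensions follow the example above, while the exact layer arrangement is an assumption):

```python
import torch
import torch.nn as nn

class SceneClassifier(nn.Module):
    """Classification head over the concatenated object group and image features."""
    def __init__(self, num_scene_types, feature_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feature_dim * 2, 512),   # 1024 -> 512
            nn.ReLU(),
            nn.Linear(512, num_scene_types),   # 512 -> number of scene types
        )

    def forward(self, object_group_feat, image_feat):
        # Concatenate the 512-dim object group feature and 512-dim image feature.
        scene_feat = torch.cat([object_group_feat, image_feat], dim=1)  # (batch, 1024)
        return self.head(scene_feat)            # per-scene-type scores
```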
In one embodiment, after the terminal performs classification processing according to the object group features and the image features, the obtained classification result may be a degree of relationship corresponding to each of the plurality of scene types. The scene type is set according to the actual application requirement, the terminal can acquire a training image corresponding to the scene type to be recognized, and training data is generated according to the training image for training. The training data includes training images and scene labeling information corresponding to the training images. The relationship degree refers to the possibility that the scene corresponding to the image to be recognized belongs to each scene type, and the relationship degree can be represented by means of probability, percentage, percentile score and the like. The terminal can directly take the plurality of scene types and the degree of relationship corresponding to the plurality of scene types as the scene recognition result. The terminal can also determine the scene type corresponding to the image to be identified according to the respective corresponding relationship degree of the plurality of scene types. For example, the terminal may determine the scene type with the largest degree of relationship as the scene type corresponding to the image to be recognized.
In this embodiment, a plurality of target objects in the image to be recognized are obtained by recognizing the acquired image to be recognized, an object group matrix is generated according to the target objects, and the corresponding object group features are extracted according to the object group matrix. Compared with the traditional method of recognizing scenes according to the relationship between a single target object and the scene, the relationship between the target objects and the scene can be expressed more accurately through the object group features. The scene recognition model is called to perform feature extraction on the image to be recognized to obtain the image features, and classification is performed according to the object group features and the image features; the image features are thus supplemented by the more accurate object group features, and by integrating the two, the scene type corresponding to the image to be recognized is obtained and the accuracy of scene recognition is effectively improved.
In an embodiment, the step of identifying the image to be identified to obtain a plurality of target objects in the image to be identified includes: acquiring a trained target detection model; and inputting the image to be recognized into the target detection model to obtain a plurality of target image areas corresponding to the image to be recognized and target objects corresponding to the target image areas.
The terminal can obtain a target detection model, which may be a trained network model used for performing target detection on the image to be recognized and outputting a detection result for the image. The target detection model can be a network model established based on a deep neural network, and may specifically be established according to one of various target detection algorithms such as R-CNN (Region-CNN), Fast R-CNN, Faster R-CNN, YOLO (You Only Look Once) or SSD (Single Shot MultiBox Detector).
The terminal can acquire training data to train the established network model, so as to obtain a target detection model capable of performing target detection on the image to be recognized. In one embodiment, the target detection model may also be trained by another terminal or server and configured on the terminal after training is completed. The training data includes a plurality of training images that do or do not contain objects, and object labeling information corresponding to each training image. The training data determines the detection capability of the target detection model; when trained on different training data, the target objects that the trained model can detect, and their number, may differ. For example, the terminal may obtain the COCO (Common Objects in Context) dataset and train the target detection model on it. The COCO dataset includes a training set, a validation set, and a test set.
The terminal can call the target detection model, input the image to be recognized into it, and perform target detection on the image to be recognized through the target detection model. The terminal can then obtain the target detection result output by the target detection model. The target detection result may include the target objects detected from the image to be recognized and the target image region in which each target object lies. The target image region refers to the region of the image to be recognized framed by a bounding box around a target object. The target image regions corresponding to different target objects may overlap each other. It can be understood that the target image region is usually a rectangular box, but may also be a regular shape such as a circle or triangle, or an irregular shape.
In this embodiment, the terminal may obtain the trained target detection model, and the image to be recognized is input to the target detection model, so as to detect the target object in the image to be recognized through the target detection model, thereby effectively ensuring the accuracy of the recognized target object, and facilitating the accurate recognition of the scene type corresponding to the image to be recognized according to the target object.
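For illustration, a minimal sketch of this detection step is given below (Python/PyTorch; the use of torchvision's COCO-trained Faster R-CNN and the score threshold are assumptions rather than part of this embodiment):

```python
import torch
from torchvision import models

# One possible off-the-shelf detector (Faster R-CNN trained on COCO); used here only
# as an illustrative stand-in for the trained target detection model.
detector = models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

def detect_target_objects(image_tensor, score_threshold=0.5):
    """image_tensor: (3, H, W) float tensor with values in [0, 1]."""
    with torch.no_grad():
        output = detector([image_tensor])[0]     # dict with 'boxes', 'labels', 'scores'
    keep = output["scores"] >= score_threshold
    # Return the detected object labels and their bounding-box target image regions.
    return output["labels"][keep].tolist(), output["boxes"][keep].tolist()
```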
In an embodiment, the step of identifying the image to be identified to obtain a plurality of target objects in the image to be identified includes: calling a semantic segmentation model; inputting an image to be recognized into a semantic segmentation model to obtain a semantic segmentation image corresponding to the image to be recognized, wherein pixel values of the semantic segmentation image correspond to the categories of pixel points; and determining the target object in the image to be identified according to the pixel points with the same category.
The semantic segmentation model is a network model used for performing semantic segmentation processing on an image to be recognized. Semantic segmentation is an image classification mode on a pixel level, and pixels belonging to the same class in an image are classified into one class. Similar to the target detection model in the above embodiment, the semantic segmentation model may be configured at the terminal after being trained in advance.
The terminal can call the semantic segmentation model, input the image to be recognized into it, and perform semantic segmentation on the image to be recognized through the semantic segmentation model. The semantic segmentation model may employ a neural network model, such as an FCN (Fully Convolutional Network) model. The terminal can perform semantic classification on each pixel point in the image to be recognized through the semantic segmentation model to obtain the semantic segmentation image output after segmentation. The pixel points of the semantic segmentation image correspond to the pixel points of the image to be identified. The pixel value of each pixel point in the semantic segmentation image can be used to represent the category of that pixel point, and pixel points belonging to the same category have the same pixel value. The terminal can determine the target objects corresponding to the pixel points according to the pixel points with the same category.
In this embodiment, the semantic segmentation is performed on the image to be recognized through the called semantic segmentation model, so as to obtain a semantic segmentation image corresponding to the image to be recognized. The corresponding target object is determined by semantically segmenting the pixel points with the same category in the image, so that the target object in the image to be recognized is accurately recognized, and the scene type corresponding to the image to be recognized is accurately recognized according to the target object.
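For illustration, a minimal sketch of this segmentation step is given below (Python/PyTorch; the DeepLabV3 model is an assumed stand-in for any trained semantic segmentation model such as an FCN):

```python
import torch
from torchvision import models

# Illustrative trained semantic segmentation model.
segmenter = models.segmentation.deeplabv3_resnet50(pretrained=True)
segmenter.eval()

def objects_from_segmentation(image_batch):
    """image_batch: (1, 3, H, W) normalized tensor."""
    with torch.no_grad():
        logits = segmenter(image_batch)["out"]   # (1, num_classes, H, W)
    seg_map = logits.argmax(dim=1)               # per-pixel class id (the "pixel value")
    # Pixels sharing the same class id are grouped into one target object category.
    class_ids = torch.unique(seg_map).tolist()
    return seg_map, class_ids
```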
In one embodiment, as shown in fig. 3, the step of generating the object group matrix according to the target object includes:
step 302, acquiring a preset standard scene object.
And 304, traversing the target object according to the standard scene object to obtain an object vector corresponding to the target object.
Step 306, generating an object group matrix according to the object vector.
The standard scene objects include all objects in the recognized scene type. The standard scene object is determined and set when the image recognition model is trained in advance. As described in the foregoing embodiment with respect to recognizing an image to be recognized, the terminal may need to acquire training data to train the image recognition model, so as to train the capability of the image recognition model to recognize the target object. For example, if the training data only includes training images of wardrobes, the image recognition model can only recognize wardrobes in the images to be recognized after being trained by the training data, but cannot recognize other objects, i.e., the recognition capability is limited by the training data.
When the image recognition model is trained, the objects that the image recognition model can recognize can be determined according to its recognition capability. The terminal may record the objects that can be recognized as standard scene objects. Taking the training data of the target detection model as an example, the COCO dataset is a dataset for image recognition training, and a target detection model trained on the COCO dataset can recognize 80 object categories. The terminal may record these 80 object categories as standard scene objects. In one embodiment, the image recognition model may also be trained on a larger or smaller dataset, resulting in image recognition models with different recognition capabilities. Correspondingly, the number and content of the standard scene objects obtained from different training data may also differ.
The terminal can obtain the standard scene object, and traverse the target object according to the standard scene object to obtain the object vector corresponding to the target object. Specifically, the terminal may compare the plurality of standard scene objects with the plurality of target objects one by one, and generate an object vector corresponding to the target object according to a comparison result. Wherein, the object vector includes "0" and "1", and "0" and "1" can be used to indicate whether the plurality of target objects include the standard scene object.
The terminal can obtain a plurality of standard scene objects, and the standard scene objects are arranged into a standard scene object sequence according to a preset sequencing mode. The terminal reads the standard scene objects one by one according to the standard scene object sequence, compares the read standard scene objects with the plurality of target objects, and judges whether the plurality of target objects comprise the standard scene objects. When the plurality of target objects include the standard scene object, the comparison is successful, and the terminal may mark the standard scene object as "1", and continue to read the next standard scene object, and compare the standard scene object with the plurality of target objects until the comparison of all the standard scene objects is completed. When the plurality of target objects do not include the standard scene object, indicating that the comparison fails, the terminal may mark the standard scene object as "0", and continue to read the next standard scene object, and compare the standard scene object with the plurality of target objects until the comparison of all the standard scene objects is completed.
When the target object is traversed according to the standard scene object, the terminal can obtain the object vectors including "0" and "1" corresponding to the target object. The length of the object vector corresponds to the number of standard scene objects. The object vector may specifically be a row vector or a column vector. The terminal may transpose the object vector, thereby generating an object group matrix corresponding to the target object.
For example, when the number of standard scene objects is 80, the terminal may obtain an object vector with a length of 80 by traversing the standard scene objects against the target objects. The terminal may then compute the product of the object vector and its transpose to obtain an 80 × 80 object group matrix. It can be understood that, when the number of standard scene objects is 150, the length of the object vector corresponding to the target objects is 150, and the terminal may generate a 150 × 150 object group matrix from the object vector. By generating the object group matrix from the object vector in this way, the terminal combines the plurality of target objects into a plurality of object groups each containing two target objects.
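For illustration, a minimal sketch of building the object vector and object group matrix is given below (Python/NumPy; the object names are hypothetical examples):

```python
import numpy as np

def object_group_matrix(target_objects, standard_scene_objects):
    """Traverse the standard scene objects and mark which ones were detected."""
    detected = set(target_objects)
    # Binary object vector: 1 if the standard scene object appears among the targets.
    v = np.array([1 if obj in detected else 0 for obj in standard_scene_objects])
    # The product of the vector with its transpose gives an N x N object group matrix,
    # where entry (i, j) = 1 means the pair (object i, object j) is present in the image.
    return np.outer(v, v)

# Example with an assumed 80-entry standard object list (e.g. the COCO categories):
# matrix = object_group_matrix(["bed", "pillow"], standard_scene_objects)  # 80 x 80
```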
In this embodiment, a preset standard scene object is obtained, the target object is traversed according to the standard scene object to obtain an object vector corresponding to the target object, an object group matrix can be generated according to the object vector, the object group matrix can represent an object group formed by the target object, and the relationship between the target object and the scene can be more accurately expressed according to the object group characteristics extracted from the object group matrix, so that the accuracy of scene identification is effectively improved.
In one embodiment, as shown in fig. 4, the step of generating the object group matrix according to the target object includes:
step 402, combining the target objects to obtain a plurality of object groups.
Step 404, a standard specificity matrix is obtained, wherein the standard specificity matrix comprises a plurality of standard specificities.
And 406, screening the plurality of standard specificity degrees according to the object group to obtain the target specificity degree corresponding to the object group.
And step 408, generating an object group matrix according to the target specificity and standard specificity matrixes.
After recognizing the target objects from the image to be recognized, the terminal may combine the plurality of recognized target objects to obtain a plurality of object groups, each including at least two target objects. When an object group contains too many target objects, the scene may be constrained unnecessarily; thus, the number of target objects in an object group may be limited to four or fewer. The terminal can permute and combine the plurality of target objects, pairing each target object with the others (including with itself) to form object groups. Taking an object group that includes two target objects as an example, when the terminal recognizes 6 target objects from the image to be recognized, the terminal may generate 36 object groups from the 6 target objects.
The terminal may obtain a standard specificity matrix, which may be preset. The standard specificity matrix comprises a plurality of standard specificities. A standard specificity, which corresponds to the standard scene objects in the above embodiment, refers to the specificity of a standard object group generated by combining standard scene objects. The specificity represents how specific an object group is to a scene. The greater the specificity, the more exclusively the corresponding object group appears in one specific scene type; the smaller the specificity, the more commonly the corresponding object group appears across multiple scene types. For example, when the number of standard scene objects that can be identified by the terminal is 80, 6400 standard object groups can be generated from the 80 standard scene objects, and the terminal can calculate the standard specificity of each standard object group and generate a standard specificity matrix with dimensions of 80 × 80.
The terminal can read the standard specificity in the standard specificity matrix and screen the plurality of standard specificities according to the object group generated by the target object. The terminal can obtain a standard object group corresponding to the standard specificity, and search a target object group corresponding to the object group from the plurality of standard object groups, so as to obtain the target specificity corresponding to the object group. The terminal can generate an object group matrix according to the target specificity and standard specificity matrixes. Specifically, the terminal may reserve the target specificity in the standard specificity matrix, and replace the standard specificity in the standard specificity matrix except for the target specificity with 0, thereby generating the object group matrix. The dimensions of the object group matrix are consistent with the dimensions of the standard specificity matrix.
For example, when the dimensions of the standard specificity matrix are 80 × 80, the terminal may generate an 80 × 80 object group matrix according to the standard specificity matrix and the target specificities. It will be appreciated that the dimensions of the generated object group matrix may differ when the number of standard scene objects differs. For example, when the number of standard scene objects is 150, the terminal may generate an object group matrix with dimensions of 150 × 150.
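For illustration, a minimal sketch of masking the standard specificity matrix with the observed object pairs is given below (Python/NumPy; function and variable names are illustrative):

```python
import numpy as np

def specificity_object_group_matrix(object_vector, standard_specificity):
    """
    object_vector:        binary vector over the standard scene objects (see earlier sketch).
    standard_specificity: N x N matrix of standard specificities for all object pairs.
    Keeps only the target specificities (pairs actually present in the image) and
    replaces every other entry with 0, as described above.
    """
    pair_mask = np.outer(object_vector, object_vector)   # 1 where both objects appear
    return standard_specificity * pair_mask               # same N x N dimensions
```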
In this embodiment, a plurality of object groups are obtained by combining the target objects, a plurality of standard specificity degrees in the standard specificity degree matrix are screened according to the object groups to obtain target specificity degrees corresponding to the plurality of object groups, an object group matrix can be generated according to the target specificity degrees and the standard specificity degree matrix, and an object group matrix generated according to the target specificity degrees can represent the specificity degrees of the object groups formed by the target objects. The object group characteristics extracted according to the object group matrix can more accurately express the relation between the target object and the scene, so that the accuracy of scene identification is effectively improved.
It can be understood that, in an embodiment, after the terminal identifies and obtains a plurality of target objects, the terminal may traverse the target objects according to the standard scene objects, generate an object group matrix according to object vectors generated by the target objects, and extract object group features according to the object group matrix. The terminal can classify according to the object group characteristics and the image characteristics corresponding to the object vectors to obtain the scene type corresponding to the image to be identified.
In an embodiment, the terminal may also generate an object group matrix according to the target specificity and the standard specificity matrix, and extract the object group feature according to the object group matrix. The terminal can perform classification processing according to the object group characteristics and the image characteristics corresponding to the target specificity and standard specificity matrixes to obtain the scene type corresponding to the image to be identified.
In one embodiment, the terminal may extract object group features both from the object group matrix generated according to the object vector and from the object group matrix generated according to the target specificities and the standard specificity matrix. For convenience of description, the object group features extracted from the object group matrix generated from the object vector may be referred to as object group vector features, and the object group features extracted from the object group matrix generated from the target specificities and the standard specificity matrix may be referred to as specificity features. The terminal can combine the object group vector features and the specificity features, and perform classification processing according to the combined object group features and the image features to obtain the scene type corresponding to the image to be identified. By combining the two types of object group features, the combined features express the relationship between the object groups formed by the target objects in the image to be recognized and the scene more accurately, further improving the accuracy of scene recognition.
In one embodiment, the scene recognition method further includes a step of generating a standard specificity matrix. As shown in fig. 5, the generating step of the standard specificity matrix includes:
step 502, a training image set is obtained, wherein the training image set comprises training images corresponding to a plurality of scene types.
Step 504, identifying the training objects in the training images, and generating a plurality of training object groups according to the training objects.
Step 506, counting the number of object groups corresponding to the training object group according to the scene type, and generating a joint probability matrix corresponding to the scene type according to the number of the object groups.
And step 508, calculating the training specificity corresponding to the training object group according to the joint probability matrix.
And 510, when the training specificity accords with a preset condition, recording the training specificity as a standard specificity, and generating a standard specificity matrix according to the standard specificity.
The terminal can obtain a training image set comprising training images corresponding to the multiple scene types used for training. According to the actual application requirements, the scene types that need to be recognized are determined, and training images corresponding to those scene types can then be acquired and used as training data. The training image set may be used to generate the standard specificity matrix. Specifically, the terminal may take the training data used for training the image recognition model as the training image set, recognize the training objects in the training images, and generate a plurality of training object groups from the training objects. The way in which the terminal identifies the training objects in the training images and generates the training object groups is similar to the way in which the target objects are identified and the object groups are generated in the above embodiments, and the details are therefore not repeated here.
The terminal can count the number of object groups corresponding to the training object group according to the scene type, and generate a joint probability matrix corresponding to the scene type according to the number of the object groups. Specifically, the terminal may determine the respective scene types of the training images according to the scene labeling information corresponding to the training images. The terminal may count the number of object groups corresponding to each of the plurality of training object groups according to the scene type, and the number of object groups may indicate the number of times that the corresponding training object group appears in the scene type. The terminal can generate a joint probability matrix corresponding to the scene type according to the number of object groups corresponding to the training object groups, and each scene type corresponds to the joint probability matrix. The joint probability matrix includes probabilities that multiple training object sets may appear given an image corresponding to one scene type.
In one embodiment, the conditional probability that a training object appears in a scene type can be expressed as:

p(o_i | c_j) = N_{o_i} / N_{o_total},  i ∈ [0, N_objs], j ∈ [0, N_scenes]

where p(o_i | c_j) represents the probability that training object o_i appears in scene type c_j, N_{o_i} represents the number of occurrences of training object o_i, N_{o_total} represents the total number of training objects contained in the training images, N_objs represents the total number of training objects, and N_scenes represents the total number of scene types in the training image set.
Based on the conditional probabilities of individual training objects, the terminal may determine the conditional probability of the corresponding training object group. Taking a training object group comprising two training objects o_h and o_i as an example, the conditional probability that the training object group appears in scene type c_j can be expressed as:

p(o_h, o_i | c_j) = p(o_h | c_j) · p(o_i | c_j),  h, i ∈ [0, N_objs], j ∈ [0, N_scenes]
the terminal can generate a joint probability matrix corresponding to the scene type according to the conditional probability corresponding to each training object group. The multiple scene types each correspond to a joint probability matrix. Scene type cjCorresponding joint probability matrix p (o, o | c)j) Can be expressed as:
Figure BDA0002471008960000161
where n and m are numbers between 0 and the total number of training subjects.
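For illustration, a minimal sketch of building the joint probability matrix for one scene type from object occurrence counts is given below (Python/NumPy; variable names are illustrative):

```python
import numpy as np

def joint_probability_matrix(object_counts, total_objects):
    """
    object_counts: length-N array of occurrences of each training object in the
                   training images of one scene type c_j.
    total_objects: total number of training objects observed in those images.
    Returns the N x N joint probability matrix with entries
    p(o_h, o_i | c_j) = p(o_h | c_j) * p(o_i | c_j).
    """
    p_obj = np.asarray(object_counts, dtype=float) / total_objects   # p(o_i | c_j)
    return np.outer(p_obj, p_obj)                                    # all object pairs
```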
In one embodiment, the terminal may compare the number of object groups corresponding to a training object group with a preset threshold, where the preset threshold is a noise threshold set according to actual application requirements. When the number of object groups is smaller than the preset threshold, the corresponding training object group appears rarely in a single scene type and only exists in images of individual scene types. According to the comparison result, the terminal may record a training object group whose number of object groups is smaller than the preset threshold as a noise object group. Since a noise object group may distort the relationship between the training objects and the scene type, the terminal may ignore the probability of the noise object group appearing in the scene type and record the corresponding conditional probability as 0 in the joint probability matrix of that scene type.
For example, the training image set may contain a training image of a toilet scene that includes a television; since the probability of a television appearing in a toilet is low, an object group containing the television cannot accurately express a relationship with the toilet scene type. The terminal may therefore ignore the number of times the training object group containing the television appears and mark its conditional probability as 0.
In this embodiment, training object groups whose counts are smaller than the preset threshold are treated as noise object groups, which removes their interference with scene recognition, reduces the computational load on the terminal, and effectively improves the accuracy of scene recognition.
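A minimal sketch of the noise-filtering step, continuing the assumed annotation layout above; the threshold value is a hypothetical setting.

```python
# Minimal sketch: object groups seen fewer than `threshold` times in a scene
# are treated as noise and their conditional probability is set to 0.
import numpy as np
from itertools import combinations

def suppress_noise_groups(annotations, joint, threshold=5):
    pair_counts = np.zeros_like(joint)
    for scene_id, objects in annotations:
        for h, i in combinations(set(objects), 2):
            pair_counts[scene_id, h, i] += 1
            pair_counts[scene_id, i, h] += 1
    filtered = joint.copy()
    filtered[pair_counts < threshold] = 0.0            # noise object groups get probability 0
    return filtered
```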
After obtaining the joint probability matrix corresponding to each of the plurality of scene types, the terminal may calculate the training specificity corresponding to each of the plurality of training object groups according to the joint probability matrices. Specifically, using Bayes' rule and the law of total probability, the terminal may calculate from the conditional probabilities in the joint probability matrices the probability of each scene type given a training object group. Given a training object group, the probability of a scene type may be expressed as:
p(cj|oh,oi) = p(oh,oi|cj) p(cj) / Σk p(oh,oi|ck) p(ck),  k∈[0,Nscenes]
the terminal can calculate the posterior probability corresponding to each scene type according to the joint probability matrix corresponding to each scene type. The specificity may be used to represent a specific degree of the set of objects compared to the scene type. Specifically, the terminal may represent the training specificity of the training object group according to the standard deviation corresponding to the training object group. When the standard deviation is large, the probability fluctuation indicating the scene type corresponding to the training object group is large, and the training object group is unique to the scene type having a large probability. When the standard deviation is small, the probability fluctuation representing the scene type corresponding to the training object group is small, and the difference of the probability of the training object group appearing in each scene type is small. The terminal can obtain a plurality of posterior probabilities corresponding to the training object group from the joint probability matrix corresponding to the scene types, and calculate a standard deviation according to the posterior probabilities to obtain the training specificity corresponding to the training object group.
In one embodiment, the training specificity dis(oh,oi) corresponding to a training object group containing training objects oh and oi can be expressed as:
dis(oh,oi) = sqrt( (1/Nscenes) Σj ( p(cj|oh,oi) − μ )² ),  where μ = (1/Nscenes) Σj p(cj|oh,oi), j∈[0,Nscenes]
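A minimal sketch of the Bayes step and the specificity calculation described above, assuming a uniform scene prior p(cj) and taking the specificity to be the population standard deviation of the posteriors:

```python
# Minimal sketch: posteriors p(cj|oh,oi) via Bayes' rule and total probability,
# then specificity as the standard deviation across scene types.
import numpy as np

def training_specificity(joint):
    """joint: array of shape (Nscenes, Nobjs, Nobjs) holding p(oh,oi|cj)."""
    num_scenes = joint.shape[0]
    prior = np.full(num_scenes, 1.0 / num_scenes)            # p(cj), assumed uniform
    weighted = joint * prior[:, None, None]                  # p(oh,oi|cj) p(cj)
    evidence = weighted.sum(axis=0, keepdims=True)           # total probability over scene types
    posterior = weighted / np.maximum(evidence, 1e-12)       # p(cj|oh,oi)
    return posterior.std(axis=0)                             # dis(oh,oi), shape (Nobjs, Nobjs)
```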
when the training specificity meets the preset condition, the terminal can record the training specificity as the standard specificity. The preset condition may be a training end condition preset by the user according to actual requirements. When the training specificity meets the training end condition, the terminal can record the training object as a standard scene object to obtain the standard specificity corresponding to the standard object group. The training end condition may be at least one of a plurality of conditions. For example, the training end condition may be that the training is determined to be ended when the variation fluctuation of the training specification degree is smaller than a preset value. The training end condition may be that training is ended when all the training images in the training end training image set are used.
The terminal can generate a standard specificity matrix according to the standard object groups and the standard specificity corresponding to each standard object group. The dimensions of the standard specificity matrix correspond to the number of standard scene objects. Taking the example that a standard object group includes two standard scene objects, when the number of standard scene objects is 80, the dimension of the standard specificity matrix is 80 × 80. When the number of standard scene objects is 150, the dimension of the standard specificity matrix is 150 × 150. In one embodiment, the standard specificity matrix may be expressed as:
| dis(o0,o0)  dis(o0,o1)  …  dis(o0,on) |
| dis(o1,o0)  dis(o1,o1)  …  dis(o1,on) |
|     …           …       …      …     |
| dis(on,o0)  dis(on,o1)  …  dis(on,on) |
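A minimal sketch of the training-end condition described above, in which the specificity matrix is recomputed as more training images are consumed; the batching, tolerance, and callable names are assumptions for illustration.

```python
# Minimal sketch: recompute the specificity matrix batch by batch and record it
# as the standard specificity matrix once its fluctuation falls below a preset value.
import numpy as np

def fit_standard_specificity(image_batches, build_joint, compute_specificity, tol=1e-4):
    seen, previous, current = [], None, None
    for batch in image_batches:                       # annotated training images, assumed non-empty
        seen.extend(batch)
        current = compute_specificity(build_joint(seen))
        if previous is not None and np.abs(current - previous).max() < tol:
            break                                     # fluctuation below preset value: stop training
        previous = current
    return current                                    # standard specificity matrix, e.g. 80 x 80
```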
in this embodiment, training objects in a training image are identified, the number of object groups corresponding to a training object group is counted according to a scene type, and a joint probability matrix corresponding to each of a plurality of scene types is generated according to the number of object groups. And calculating the training specificity corresponding to the training object group through the joint probability matrix, and recording the training specificity as the standard specificity when the training specificity meets the preset condition to generate a standard specificity matrix. The relation between the target object and the scene can be more accurately expressed according to the standard specificity matrix, so that the accuracy of scene identification is effectively improved.
It should be understood that although the steps in the flowcharts of fig. 2-5 are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 2-5 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided a scene recognition apparatus including: an object identification module 602, a feature extraction module 604, and a scene identification module 606, wherein:
an object recognition module 602, configured to obtain an image to be recognized; and identifying the image to be identified to obtain a plurality of target objects in the image to be identified.
A feature extraction module 604, configured to generate an object group matrix according to the target object, and extract object group features according to the object group matrix; and calling a scene recognition model to perform feature extraction on the image to be recognized to obtain image features.
And the scene identification module 606 is configured to perform classification processing according to the object group characteristics and the image characteristics to obtain a scene type corresponding to the image to be identified.
In an embodiment, the feature extraction module 604 is further configured to obtain a preset standard scene object; traversing the target object according to the standard scene object to obtain an object vector corresponding to the target object; and generating an object group matrix according to the object vector.
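A minimal sketch of this object-vector embodiment, assuming a binary object vector and its outer product as the object group matrix; the encoding details are illustrative assumptions only.

```python
# Minimal sketch: traverse the target objects against the preset standard scene
# objects to form a binary object vector, then take its outer product.
import numpy as np

def object_group_matrix_from_vector(target_objects, standard_scene_objects):
    present = set(target_objects)
    vector = np.array([1.0 if obj in present else 0.0
                       for obj in standard_scene_objects])    # object vector
    return np.outer(vector, vector)                            # object group matrix
```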
In an embodiment, the feature extraction module 604 is further configured to combine the target objects to obtain a plurality of object groups; acquiring a standard specificity matrix, wherein the standard specificity matrix comprises a plurality of standard specificities; screening a plurality of standard specificity degrees according to the object group to obtain target specificity degrees corresponding to the object group; and generating an object group matrix according to the target specificity and the standard specificity matrix.
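A minimal sketch of this specificity-based embodiment, assuming the object group matrix keeps the standard specificity entries of the detected object pairs and is zero elsewhere:

```python
# Minimal sketch: screen the standard specificity matrix with the detected
# object pairs to obtain the target specificities and the object group matrix.
import numpy as np
from itertools import combinations

def object_group_matrix_from_specificity(target_object_ids, standard_specificity):
    group_matrix = np.zeros_like(standard_specificity)
    for h, i in combinations(set(target_object_ids), 2):
        group_matrix[h, i] = standard_specificity[h, i]        # target specificity of the pair
        group_matrix[i, h] = standard_specificity[i, h]
    return group_matrix                                         # e.g. 80 x 80, zero elsewhere
```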
In one embodiment, the scene recognition apparatus further includes a matrix generation module, configured to acquire a training image set, where the training image set includes training images corresponding to multiple scene types; identifying training objects in the training images, and generating a plurality of training object groups according to the training objects; counting the number of object groups corresponding to the training object groups according to the scene types, and generating a joint probability matrix corresponding to the scene types according to the number of the object groups; calculating training specificity corresponding to the training object group according to the joint probability matrix; and when the training specificity accords with the preset condition, recording the training specificity as the standard specificity, and generating a standard specificity matrix according to the standard specificity.
In one embodiment, the matrix generation module is further configured to compare the number of object groups with a preset threshold; and when the number of the object groups is smaller than a preset threshold value, recording the training object group corresponding to the number of the object groups as a noise object group.
In one embodiment, the object recognition module 602 is further configured to obtain a trained target detection model; and inputting the image to be recognized into the target detection model to obtain a plurality of target image areas corresponding to the image to be recognized and target objects corresponding to the target image areas.
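A minimal sketch of this detection-based embodiment, using a generic pre-trained torchvision detector as a stand-in for the trained target detection model described in the patent; the score threshold is a hypothetical setting.

```python
# Minimal sketch: detect target image areas and target objects with a generic
# pre-trained detector (stand-in only, not the patent's trained model).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

def detect_target_objects(pil_image, score_threshold=0.5):
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
    with torch.no_grad():
        output = model([to_tensor(pil_image)])[0]              # dict with boxes, labels, scores
    keep = output["scores"] >= score_threshold
    return output["boxes"][keep], output["labels"][keep]       # target image areas and target objects
```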
In one embodiment, the object recognition module 602 is further configured to invoke a semantic segmentation model; inputting an image to be recognized into a semantic segmentation model to obtain a semantic segmentation image corresponding to the image to be recognized, wherein pixel values of the semantic segmentation image correspond to the categories of pixel points; and determining the target object in the image to be identified according to the pixel points with the same category.
For the specific definition of the scene recognition apparatus, reference may be made to the above definition of the scene recognition method, which is not repeated here. The modules in the scene recognition apparatus may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded, in hardware form, in or independent of a processor in the computer device, or stored, in software form, in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 7. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a scene recognition method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the scene recognition method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned scene recognition method embodiment.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of scene recognition, the method comprising:
acquiring an image to be identified;
identifying the image to be identified to obtain a plurality of target objects in the image to be identified;
generating an object group matrix according to the target object, and extracting object group characteristics according to the object group matrix;
calling a scene recognition model to perform feature extraction on the image to be recognized to obtain image features;
and carrying out classification processing according to the object group characteristics and the image characteristics to obtain a scene type corresponding to the image to be identified.
2. The method of claim 1, wherein generating an object group matrix from the target objects comprises:
acquiring a preset standard scene object;
traversing the target object according to the standard scene object to obtain an object vector corresponding to the target object;
and generating an object group matrix according to the object vector.
3. The method of claim 1, wherein generating an object group matrix from the target objects comprises:
combining the target objects to obtain a plurality of object groups;
acquiring a standard specificity matrix, wherein the standard specificity matrix comprises a plurality of standard specificities;
screening a plurality of standard specificity degrees according to the object group to obtain target specificity degrees corresponding to the object group;
and generating the object group matrix according to the target specificity and the standard specificity matrix.
4. The method of claim 3, further comprising the step of generating the standard specificity matrix, the step of generating the standard specificity matrix comprising:
acquiring a training image set, wherein the training image set comprises training images corresponding to various scene types;
identifying training objects in the training images, and generating a plurality of training object groups according to the training objects;
counting the number of object groups corresponding to the training object group according to the scene type, and generating a joint probability matrix corresponding to the scene type according to the number of the object groups;
calculating training specificity corresponding to the training object group according to the joint probability matrix;
and when the training specificity accords with a preset condition, recording the training specificity as a standard specificity, and generating the standard specificity matrix according to the standard specificity.
5. The method according to claim 4, wherein after counting the number of object groups corresponding to the training object group according to the scene type, the method further comprises:
comparing the number of the object groups with a preset threshold value;
and when the number of the object groups is smaller than the preset threshold value, recording the training object group corresponding to the number of the object groups as a noise object group.
6. The method according to claim 1, wherein the recognizing the image to be recognized to obtain a plurality of target objects in the image to be recognized comprises:
acquiring a trained target detection model;
and inputting the image to be recognized into the target detection model to obtain a plurality of target image areas corresponding to the image to be recognized and target objects corresponding to the target image areas.
7. The method according to claim 1, wherein the recognizing the image to be recognized to obtain a plurality of target objects in the image to be recognized comprises:
calling a semantic segmentation model;
inputting the image to be recognized into the semantic segmentation model to obtain a semantic segmentation image corresponding to the image to be recognized, wherein the pixel value of the semantic segmentation image corresponds to the category of the pixel point;
and determining the target object in the image to be identified according to the pixel points with the same belonged categories.
8. A scene recognition apparatus, characterized in that the apparatus comprises:
the object identification module is used for acquiring an image to be identified; identifying the image to be identified to obtain a plurality of target objects in the image to be identified;
the characteristic extraction module is used for generating an object group matrix according to the target object and extracting object group characteristics according to the object group matrix; calling a scene recognition model to perform feature extraction on the image to be recognized to obtain image features;
and the scene identification module is used for carrying out classification processing according to the object group characteristics and the image characteristics to obtain a scene type corresponding to the image to be identified.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010348239.3A 2020-04-28 2020-04-28 Scene recognition method, device, computer equipment and storage medium Active CN111597921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010348239.3A CN111597921B (en) 2020-04-28 2020-04-28 Scene recognition method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111597921A true CN111597921A (en) 2020-08-28
CN111597921B CN111597921B (en) 2024-06-18

Family

ID=72187704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010348239.3A Active CN111597921B (en) 2020-04-28 2020-04-28 Scene recognition method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111597921B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8369622B1 (en) * 2009-10-29 2013-02-05 Hsu Shin-Yi Multi-figure system for object feature extraction tracking and recognition
US20190266487A1 (en) * 2016-07-14 2019-08-29 Google Llc Classifying images using machine learning models
CN106547880A (en) * 2016-10-26 2017-03-29 重庆邮电大学 A kind of various dimensions geographic scenes recognition methodss of fusion geographic area knowledge
CN108764208A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Image processing method and device, storage medium, electronic equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420696A (en) * 2021-07-01 2021-09-21 四川邮电职业技术学院 Odor generation control method and system and computer readable storage medium
WO2023029665A1 (en) * 2021-09-02 2023-03-09 上海哔哩哔哩科技有限公司 Image scene recognition method and apparatus
CN116774195A (en) * 2023-08-22 2023-09-19 国网天津市电力公司滨海供电分公司 Excitation judgment and parameter self-adjustment method and system for multi-sensor combined calibration
CN116774195B (en) * 2023-08-22 2023-12-08 国网天津市电力公司滨海供电分公司 Excitation judgment and parameter self-adjustment method and system for multi-sensor combined calibration

Also Published As

Publication number Publication date
CN111597921B (en) 2024-06-18

Similar Documents

Publication Publication Date Title
CN111680551B (en) Method, device, computer equipment and storage medium for monitoring livestock quantity
CN111738244B (en) Image detection method, image detection device, computer equipment and storage medium
CN108154171B (en) Figure identification method and device and electronic equipment
US20190325342A1 (en) Embedding multimodal content in a common non-euclidean geometric space
US10353948B2 (en) Content based image retrieval
CN111597921B (en) Scene recognition method, device, computer equipment and storage medium
CA3066029A1 (en) Image feature acquisition
US8897560B2 (en) Determining the estimated clutter of digital images
CN110245714B (en) Image recognition method and device and electronic equipment
CN110414550B (en) Training method, device and system of face recognition model and computer readable medium
CN111738120B (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN111368672A (en) Construction method and device for genetic disease facial recognition model
TW202121233A (en) Image processing method, processor, electronic device, and storage medium
US20150371520A1 (en) Vision based system for detecting distress behavior
CN115311730B (en) Face key point detection method and system and electronic equipment
WO2020107847A1 (en) Bone point-based fall detection method and fall detection device therefor
WO2022156317A1 (en) Video frame processing method and apparatus, electronic device, and storage medium
CN113705596A (en) Image recognition method and device, computer equipment and storage medium
CN111340213B (en) Neural network training method, electronic device, and storage medium
Wahyono et al. A new computational intelligence for face emotional detection in ubiquitous
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
US8731291B2 (en) Estimating the clutter of digital images
KR101961462B1 (en) Object recognition method and the device thereof
CN112149570B (en) Multi-person living body detection method, device, electronic equipment and storage medium
JP7239002B2 (en) OBJECT NUMBER ESTIMATING DEVICE, CONTROL METHOD, AND PROGRAM

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant