WO2022179281A1 - Environment identification method and apparatus - Google Patents

Environment identification method and apparatus

Info

Publication number: WO2022179281A1
Application number: PCT/CN2021/140833
Authority: WO (WIPO, PCT)
Prior art keywords: image, azimuth angle, recognition result, scene, feature
Other languages: French (fr), Chinese (zh)
Inventors: 彭璐, 赵安, 黄维
Original assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/08: Learning methods
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: using classification, e.g. of video objects
    • G06V 10/82: using neural networks
    • G06V 20/00: Scenes; scene-specific elements

Definitions

  • the present application relates to the technical field of image processing, and in particular, to a scene recognition method and device.
  • the scene recognition technology based on mobile terminals such as mobile phones is an important basic perception capability, which can serve a variety of services such as smart travel, direct service, intention decision-making, smart noise reduction of headphones, search, and recommendation.
  • Existing scene recognition technologies mainly include: image-based scene recognition methods, sensor (eg, WiFi, Bluetooth, location sensors, etc.) positioning-based scene recognition methods, or signal fingerprint comparison-based scene recognition methods.
  • because the image-based scene recognition method is affected by factors such as the camera's viewing angle range, the shooting angle, and object occlusion, its robustness in complex scenes is greatly challenged, and it is difficult to realize real-time scene recognition without user perception.
  • the front or rear camera is used to collect images for recognition. Because the viewing angle range is small and there are few effective features, or the shooting angle is arbitrary, or there is a large amount of object feature information in the same image so that key features are drowned out by noise information, misjudgment is likely. For example, when a ceiling feature appears in the image, the scene is recognized as indoors, but it may actually be a subway or an airplane. Therefore, the technical problem to be solved by this application is how to improve the recognition accuracy of the image-based scene recognition method.
  • a scene recognition method and device which combine multiple images and the azimuth angle corresponding to each image to recognize the scene. Since more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problem of the limited viewing angle range and shooting angle of single-camera scene recognition is solved, and the recognition is more accurate.
  • an embodiment of the present application provides a scene recognition method, the method including: a terminal device collects images of the same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle at which each camera collects an image is the azimuth angle corresponding to that image, and the azimuth angle is the angle between the direction vector and the gravity unit vector when each camera collects the image; the terminal device recognizes the same scene according to the images and the azimuth angles corresponding to the images, and obtains a scene recognition result.
  • multiple cameras are used to shoot images of multiple azimuth angles of the same scene, and the scene is recognized by combining the multiple images with the azimuth angle corresponding to each image. Since more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problems of the limited viewing angle range and shooting angle of single-camera scene recognition can be solved, and the recognition is more accurate.
  • the terminal device recognizes the same scene according to the image and the azimuth angle corresponding to the image, and obtains a scene recognition result, including: the terminal device extracts the azimuth angle feature corresponding to the image from the azimuth angle corresponding to the image; a scene recognition model is used to recognize the same scene based on the image and the azimuth angle feature corresponding to the image, and a scene recognition result is obtained, wherein the scene recognition model is a neural network model.
  • the scene recognition model includes multiple pairs of the first feature extraction layer and the first layer model, and each pair of the first feature extraction layer and the first layer model is used to process an image of one azimuth angle and the azimuth angle corresponding to the image of the one azimuth angle to obtain a first recognition result; wherein the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result.
  • the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain a feature vector; the first layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle;
  • the scene recognition model further includes a second layer model, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the azimuth angle corresponding to the first recognition result.
  • the scene recognition method of the embodiment of the present application adopts a two-layer scene recognition model, collects images from multiple angles, combines the azimuth angles of the images, and performs scene recognition by using a competition mechanism; it considers the results of both local features and overall features, improves the accuracy of scene recognition without user perception, and reduces misjudgment.
  • the first feature extraction layer is configured to extract the feature of the image at one azimuth angle to obtain a plurality of feature vectors;
  • the first layer model is used to calculate the first weight corresponding to each feature vector in the plurality of feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle;
  • the first layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
  • the second layer model is used to calculate the second weight of the first recognition result according to the azimuth angle corresponding to the first recognition result; the second layer model is configured to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
  • the second layer model presets a third weight corresponding to each of the first recognition results, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the third weight corresponding to the first recognition result.
  • the second layer model is configured to determine the fourth weight corresponding to each of the first recognition results according to the azimuth angle and a preset rule; wherein the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first recognition result;
  • the second-layer model is configured to obtain the scene recognition result according to the first recognition result and the fourth weight corresponding to the first recognition result.
  • the method further includes: the terminal device acquires the accelerations on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera captures the image, and obtains the direction vector of each camera when it collects the image; wherein the three-dimensional rectangular coordinate system corresponding to each camera when it collects the image takes that camera as the origin, the z direction is the direction along the camera, x and y are each directions perpendicular to the z direction, and the plane in which x and y lie is perpendicular to the z direction; the azimuth angle is calculated according to the direction vector and the gravity unit vector.
  • before using the scene recognition model to recognize the same scene based on the image and the azimuth angle feature corresponding to the image, the method further includes: the terminal device preprocesses the image; wherein the preprocessing includes one or a combination of the following processes: converting the image format, converting the image channels, unifying the image size, and normalizing the image. Converting the image format refers to converting a color image into a black and white image; converting the image channels refers to converting the image to the red, green, and blue RGB channels; unifying the image size refers to adjusting multiple images to the same length and width; and image normalization refers to normalizing the pixel values of the image.
  • an embodiment of the present application provides a scene recognition apparatus, the apparatus including: an image acquisition module, configured to collect images of the same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle at which each camera collects an image is the azimuth angle corresponding to that image, and the azimuth angle is the angle between the direction vector and the gravity unit vector when each camera collects the image;
  • a scene recognition module, configured to recognize the same scene according to the image and the azimuth angle corresponding to the image, to obtain a scene recognition result.
  • the scene recognition device of the embodiment of the present application uses multiple cameras to capture images of multiple azimuth angles of the same scene and recognizes the scene by combining the multiple images with the azimuth angle corresponding to each image. Since more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problems of the limited viewing angle range and shooting angle of single-camera scene recognition can be solved, and the recognition is more accurate.
  • the scene recognition module includes: an azimuth angle feature extraction module, configured to extract the azimuth angle feature corresponding to the image from the azimuth angle corresponding to the image; and a scene recognition model, configured to recognize the same scene based on the image and the azimuth angle feature corresponding to the image to obtain a scene recognition result, wherein the scene recognition model is a neural network model.
  • the scene recognition model includes multiple pairs of the first feature extraction layer and the first layer model, and each pair of the first feature extraction layer and the first layer model is used to process an image of one azimuth angle and the azimuth angle corresponding to the image of the one azimuth angle to obtain a first recognition result; wherein the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result.
  • the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain a feature vector; the first layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle;
  • the scene recognition model further includes a second layer model, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the azimuth angle corresponding to the first recognition result.
  • the scene recognition device of the embodiment of the present application adopts a two-layer scene recognition model, collects multi-angle images, combines the azimuth angles of the images, and performs scene recognition by using a competition mechanism; it considers the results of both local features and overall features, improves the accuracy of scene recognition without user perception, and reduces misjudgment.
  • the first feature extraction layer is configured to extract the feature of the image at one azimuth angle to obtain a plurality of feature vectors
  • the first layer model is used to calculate the first weight corresponding to each feature vector in the plurality of feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle;
  • the first layer model is configured to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
  • the second layer model is used to calculate the second weight of the first recognition result according to the azimuth angle corresponding to the first recognition result; the second layer model is configured to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
  • the second layer model presets a third weight corresponding to each of the first recognition results, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the third weight corresponding to the first recognition result.
  • the second layer model is configured to determine the fourth weight corresponding to each of the first recognition results according to the azimuth angle and a preset rule; wherein the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first recognition result;
  • the second-layer model is configured to obtain the scene recognition result according to the first recognition result and the fourth weight corresponding to the first recognition result.
  • the device further includes: an azimuth angle acquisition module, configured to acquire the accelerations on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera collects the image, to obtain the direction vector of each camera when it collects the image; wherein the three-dimensional rectangular coordinate system corresponding to each camera when it collects the image takes that camera as the origin, the z direction is the direction along the camera, x and y are each directions perpendicular to the z direction, and the plane in which x and y lie is perpendicular to the z direction; the azimuth angle is calculated according to the direction vector and the gravity unit vector.
  • the apparatus further includes: an image preprocessing module, configured to preprocess the image; wherein the preprocessing includes one or a combination of the following processes: converting the image format, converting the image channels, unifying the image size, and normalizing the image. Converting the image format refers to converting a color image into a black and white image; converting the image channels refers to converting the image to the red, green, and blue RGB channels; unifying the image size refers to adjusting the length and width of multiple images to be the same; and image normalization refers to normalizing the pixel values of the images.
  • embodiments of the present application provide a terminal device, where the terminal device can execute the above first aspect or one or more of the scene recognition methods in multiple possible implementation manners of the first aspect.
  • embodiments of the present application provide a computer program product, comprising computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code runs in an electronic device, the processor in the electronic device executes the scene recognition method of the first aspect or of one or more of the multiple possible implementation manners of the first aspect.
  • embodiments of the present application provide a non-volatile computer-readable storage medium on which computer program instructions are stored, wherein when the computer program instructions are executed by a processor, the scene recognition method of the above first aspect or of one or more of the multiple possible implementation manners of the first aspect is implemented.
  • FIG. 1a and 1b respectively show schematic diagrams of application scenarios according to an embodiment of the present application.
  • Fig. 2a shows a schematic diagram of scene recognition performed by a neural network model according to an embodiment of the present application.
  • FIG. 2b shows a flowchart of a scene recognition method according to an embodiment of the present application.
  • FIG. 3 shows a schematic diagram of an application scenario according to an embodiment of the present application.
  • FIG. 4a and FIG. 4b respectively show schematic diagrams of an azimuth angle determination manner according to an embodiment of the present application.
  • FIG. 5 is a block diagram showing the structure of a neural network model according to an embodiment of the present application.
  • FIG. 6 shows a schematic structural diagram of a first layer model according to an embodiment of the present application.
  • FIG. 7 shows a flowchart of a scene recognition method according to an embodiment of the present application.
  • FIG. 8 shows a flowchart of the method of step S701 according to an embodiment of the present application.
  • FIG. 9 shows a block diagram of a scene recognition apparatus according to an embodiment of the present application.
  • FIG. 10 shows a schematic structural diagram of a terminal device according to an embodiment of the present application.
  • Azimuth angle: the angle between the direction vector of the camera and the gravity unit vector.
  • Direction vector of the camera: establish a three-dimensional rectangular coordinate system with the camera as the origin, where the z direction is the direction along the camera, x and y are each directions perpendicular to the z direction, and the plane in which x and y lie is perpendicular to the z direction; the vector composed of the accelerations measured by the gravity sensor in the x, y and z directions is the direction vector of the camera.
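  • For illustration only, a minimal sketch of this azimuth computation, assuming the direction vector is given by the gravity-sensor accelerations in the camera coordinate system and that the gravity vector is available in the same coordinate system; the function name and example values are hypothetical:

```python
import numpy as np

def camera_azimuth(direction_vector, gravity_vector):
    """Angle (degrees) between the camera direction vector and the gravity unit vector."""
    v = np.asarray(direction_vector, dtype=float)
    g = np.asarray(gravity_vector, dtype=float)
    g_unit = g / np.linalg.norm(g)              # gravity unit vector
    cos_theta = np.dot(v, g_unit) / np.linalg.norm(v)
    cos_theta = np.clip(cos_theta, -1.0, 1.0)   # guard against rounding error
    return float(np.degrees(np.arccos(cos_theta)))

# Hypothetical accelerations measured along the camera's x, y, z axes:
theta = camera_azimuth(direction_vector=[0.1, 0.2, 9.7], gravity_vector=[0.0, 0.0, 9.81])
```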
  • in the above scenarios, existing image-based scene recognition methods have the following problems: 1. The image is collected by only a single camera for recognition, so the viewing angle range is small, few effective features are observed, and the scene recognition recall rate is low. 2. Similar objects or features are prone to misjudgment; for example, when a ceiling feature appears in the image, the scene is recognized as indoors, but it may actually be a subway or an airplane. 3. There is a large amount of object feature information in the same image, and the noise information may drown out the main features, resulting in misjudgment of the main features.
  • the present application provides a scene recognition method, which uses multiple cameras to shoot images of multiple azimuth angles of the same scene and obtains the azimuth angle at which each of the multiple cameras shoots (collects) its image; the azimuth angle at which each camera shoots (collects) an image is the azimuth angle corresponding to that image, and the scene is recognized by combining the multiple images with the azimuth angle corresponding to each image. Since more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problem of the limited viewing angle range and shooting angle of single-camera scene recognition is solved, and the recognition is more accurate.
  • the multiple cameras of the embodiments of the present application may be set on the terminal device.
  • for example, the multiple cameras may be the front camera and the rear camera set on a mobile phone, multiple cameras set at different orientations of a vehicle body, or multiple cameras set at different orientations of a drone, and so on. It should be noted that the above application scenarios are only some examples of the present application, and the present application is not limited thereto.
  • the mobile phone can be provided with a front camera and a rear camera. Through the front camera and the rear camera of the mobile phone, images from two angles can be collected, and the azimuth angles of the front camera and the rear camera can also be obtained, so that the scene can be recognized more accurately according to the images from the two angles combined with the corresponding azimuth angles.
  • multiple cameras can be set on the body of an autonomous vehicle, and the multiple cameras can be set at different positions of the body, such as the top of the vehicle; the direction of each camera can also be adjusted individually, and a controller can also be set on the self-driving car and connected to the multiple cameras.
  • sensors can also be set on the self-driving car, such as GPS, radar, accelerometer, gyroscope, etc., and all the sensors and cameras are connected to the controller. The controller can collect images of different angles through the multiple cameras and can also obtain the azimuth angle of each camera; according to the multiple images and the azimuth angle corresponding to each image, the scene can be recognized more accurately.
  • FIG. 1 a and FIG. 1 b are only examples of application scenarios provided by the present application, and the present application is not limited thereto.
  • the present application can also be applied to a scene in which a drone collects images for scene recognition.
  • the scene recognition method provided by the embodiments of the present application can be applied to terminal devices.
  • the terminal devices of the present application can be smart phones, netbooks, tablet computers, notebook computers, wearable electronic devices (such as smart bracelets, smart watches, etc. ), TVs, virtual reality devices, speakers, e-ink, and more.
  • FIG. 10 shows a schematic structural diagram of a terminal device according to an embodiment of the present application. Taking the terminal device as a mobile phone as an example, FIG. 10 shows a schematic structural diagram of the mobile phone 200 , for details, please refer to the specific description below.
  • scene recognition is performed based on the front and rear cameras and acceleration sensors of the mobile phone.
  • the front and rear cameras simultaneously collect images of different azimuth angles
  • the acceleration sensor of the mobile phone is used to extract the angle between the current direction of the mobile phone camera and the direction of gravity, and this angle can be used as the azimuth angle of the camera.
  • the scene recognition method provided by this embodiment of the present application may be implemented by using a neural network model (scene recognition model).
  • the neural network model used in this embodiment of the present application may include: multiple pairs of the first feature extraction layer and the first layer model, each pair of the first feature extraction layer and the first layer model is used to The image and the azimuth angle corresponding to the image of one azimuth angle are processed to obtain the first recognition result, and the azimuth angle corresponding to the image of one azimuth angle is the azimuth angle corresponding to the first recognition result.
  • the first feature extraction layer is used to extract the features of the image to obtain a feature vector (feature map); the first layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle.
  • the neural network model may further include a second layer model, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the azimuth angle corresponding to the first recognition result.
  • the example shown in Figure 2a includes two pairs of the first feature extraction layer and the first layer model.
  • the application scenario shown in Figure 2a can be applied to a dual-camera (front and rear dual-camera) scenario.
  • Image 1 and image 2 can be images acquired by cameras at different angles; for example, image 1 is acquired by the front camera of the mobile phone, and image 2 is acquired by the rear camera of the mobile phone.
  • the number of pairs of the first feature extraction layer and the first layer model included in the neural network model can be configured according to the number of camera angles set in a specific application scenario; the number of pairs of the first feature extraction layer and the first layer model can be greater than or equal to the number of angles.
  • the first feature extraction layer can be implemented using a convolutional neural network (CNN, Convolutional Neural Network); the features of the input image are extracted through the CNN to obtain a feature map. For example, convolutional neural network models such as the VGG model (Visual Geometry Group Network), the Inception model, MobileNet, ResNet, DenseNet, or Transformer can be used as the first feature extraction layer to extract feature maps, and a customized convolutional neural network structure can also be used as the first feature extraction layer.
  • Both the first-layer model and the second-layer model can be implemented based on the attention mechanism, and the second-layer model can also be implemented by presetting weights for each azimuth angle or by presetting a weight mapping function according to the azimuth angle, which is not limited in this application.
  • the scene recognition method provided by the embodiment of the present application uses a two-layer scene recognition model based on a competition mechanism (attention mechanism): the first layer model can assign different weights to the convolution results (the feature vectors obtained by the CNN performing convolution operations on an input image) according to the azimuth angle, activate the neurons that identify the local features of different scenes, extract the key features of the image at that azimuth angle, perform a first scene classification, and obtain a first recognition result; the second layer model weights the scene classification results according to the azimuth angle in different scenes, and obtains the classification result of the multiple images from different perspectives by weighted summation, yielding the final scene recognition result.
  • Using a two-layer competition mechanism-based scene recognition model combined with azimuth information can identify key features and filter irrelevant information, effectively reducing the probability of misrecognition.
  • for example, aircraft and high-speed rail cannot be distinguished from an upward perspective (the ceiling features are similar and difficult to distinguish), but can be distinguished from a side view (the round window of an airplane and the square window of a high-speed train are easy to distinguish); the scene recognition method of the embodiment of the present application therefore helps to reduce misjudgments.
  • the azimuth angle feature extraction (dotted line box) shown in FIG. 2a can be realized by a neural network; that is to say, the neural network model provided in this embodiment of the present application may further include a second feature extraction layer, and the second feature extraction layer can also be implemented by a convolutional neural network model.
  • the azimuth angle feature extraction (dotted line frame) shown in FIG. 2a can also be obtained by calculating the azimuth angle according to an existing function, which is not limited in this application.
  • the sensor shown in Figure 2a can be an accelerometer, a gyroscope, etc.
  • the posture of the terminal device can be obtained from the motion data of the terminal device collected by the sensor; the direction of the camera can be determined according to the posture of the terminal device and the orientation of the camera on the device, and the azimuth angle of the camera can then be determined from the camera direction and the direction of gravity.
  • Fig. 2b shows a flowchart of a scene recognition method according to an embodiment of the present application. The following describes the flowchart of the image processing method of the present application in detail with reference to Figs. 2a and 2b.
  • Multiple cameras with different orientations set on the terminal device collect images in the same scene, and the terminal device obtains images from different viewing angles (azimuth angles) of the same scene.
  • the front camera and the rear camera of the mobile phone are used to capture images in the same scene at the same time, and images from different perspectives of the same scene are obtained.
  • the image captured by the front camera may be an image captured by a single camera, or may be an image obtained by combining images captured by multiple cameras; likewise, the image captured by the rear camera may be an image captured by a single camera, or a composite of images captured by multiple cameras. In a scenario where images captured by multiple cameras are combined into one image, the azimuth angle of this image is the same as the azimuth angle of a single camera.
  • the images captured by the cameras in the embodiments of the present application may be black and white images, RGB (Red, Green, Blue) color images, or RGB-D (RGB-Depth) depth images (D refers to depth information), or It can be an infrared image, which is not limited in this application.
  • FIG. 3 shows a schematic diagram of an application scenario according to an embodiment of the present application.
  • the user looks at the mobile phone on the subway, and the mobile phone is inclined at a certain angle to the subway floor; therefore, the front camera of the mobile phone can capture a picture of the top of the subway car, and the rear camera can capture a picture of the subway floor.
  • the terminal device can collect the azimuth angles of cameras from multiple viewing angles while collecting images.
  • the azimuth angle of the camera may refer to the angle between the direction vector of the camera and the unit vector of gravity.
  • the direction vector of the camera may be acquired by a sensor, such as a gravity sensor, an accelerometer, a gyroscope, etc., which is not limited in this application.
  • the direction vector of the front camera can be obtained by a gravity sensor.
  • FIG. 4a and FIG. 4b respectively show schematic diagrams of an azimuth angle determination manner according to an embodiment of the present application.
  • a three-dimensional rectangular coordinate system can be established with the front camera as the origin, where the z direction is the direction along the front camera and is perpendicular to the plane of the mobile phone, and x and y are respectively parallel to the border of the mobile phone , and the direction perpendicular to the z direction.
  • suppose the azimuth angle of the front camera is θ; then the azimuth angle of the front camera can be calculated by formula (2):
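  • The body of formula (2) is not reproduced in this text; one form consistent with the definition of the azimuth angle as the angle between the camera direction vector and the gravity unit vector (this reconstruction is an assumption, not the original notation) is: θ = arccos( (v · ĝ) / ‖v‖ ), where v = (ax, ay, az) is the direction vector obtained from the accelerations measured by the gravity sensor and ĝ is the gravity unit vector.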
  • the terminal device may separately preprocess the images collected by each camera, wherein the preprocessing may include one or more processing methods among image format conversion, image size unification, image normalization, and image channel conversion.
  • the image format conversion may refer to converting a color image into a black and white image.
  • Image channel conversion may refer to converting a color image to three RGB channels, and the three channels are Red, Green, and Blue in turn.
  • Unifying the image size may refer to unifying the length and width of the images collected by each camera, for example, the length of the unified image is 800 pixels and the width is 600 pixels.
  • p_R, p_G, and p_B represent the pixel values before normalization, respectively
  • p_R_Normalized, p_G_Normalized, and p_B_Normalized represent the pixel values after normalization, respectively.
  • the collected images may be processed through one or more of the above preprocessing methods.
  • the preprocessing process may include:
  • Step 1: Image format conversion: convert the collected color image into a black and white image.
  • the relevant grayscale formula method or the average method can be used to calculate the value of the pixel after conversion according to the value of the pixel before conversion;
  • Step 2: Unify the image size, for example, unify the size of the images to a length of 800 and a width of 600;
  • the preprocessing process may include:
  • Step1 Image channel conversion. For color images, it can be uniformly converted into RGB three channels, that is, the channels are Red, Green, and Blue in turn; if the images before preprocessing are all in RGB format, the process of image channel conversion can be omitted.
  • Step 2: Unify the image size, for example, unify the size of the images to a length of 800 and a width of 600;
  • Step3 Image normalization.
  • the normalization processing method is:
  • the preprocessing manner of the terminal device for the images collected by the cameras at each angle may be the same or different, which is not limited in the present application.
  • the format of the images can be unified, which is beneficial to the subsequent process of feature extraction and scene recognition.
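  • For illustration, a minimal sketch of such a preprocessing pipeline, assuming OpenCV-style images (NumPy arrays) and a target size of 800 by 600 as in the example above; dividing by 255 for normalization is an assumed convention, since the normalization formula is not reproduced in this text:

```python
import cv2
import numpy as np

def preprocess(img, to_gray=False, size=(800, 600)):
    """Preprocess one captured image: format/channel conversion, size unification, normalization."""
    if to_gray:
        # Image format conversion: color -> black and white (grayscale formula method)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    else:
        # Image channel conversion: unify the channel order to R, G, B
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # Unify the image size, e.g. length 800, width 600
    img = cv2.resize(img, size)
    # Image normalization: scale pixel values (assumed convention: divide by 255)
    return img.astype(np.float32) / 255.0
```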
  • Part 4: Perform feature extraction on the azimuth angle collected in Part 2 to obtain the azimuth angle feature.
  • one or more of the following processing methods can be performed on the azimuth angle: numerical normalization, discretization, one-hot encoding (one-bit effective encoding), trigonometric function transformation, and so on. This application provides a variety of different feature extraction methods; a few examples follow.
  • Example 1 the terminal device can discretize the collected azimuth, and then perform one-hot encoding (one-bit effective encoding). Discretization is to map individuals into a limited space without changing the relative size of the data.
  • the azimuth angle can be discretized into four intervals, [0°, 45°), [45°, 90°), [90°, 135°), and [135°, 180°]; the code of the interval to which the azimuth angle is mapped is 1 and the codes of the other intervals are 0, so the feature vectors corresponding to the four intervals are [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1], respectively. For example, if the azimuth angle θ is 30°, it is mapped to the interval [0°, 45°), and the corresponding azimuth angle feature is [1, 0, 0, 0].
  • Example 2: the terminal device may directly perform a trigonometric function transformation on the azimuth angle, and the value obtained after the transformation is normalized to the [0, 1] interval and used as the azimuth angle feature; the trigonometric function transformation can be, for example, sin θ, cos θ, tan θ, etc.
  • Example 3: the [0, 1] range of the normalized trigonometric function value is discretized into four intervals, [0, 0.25), [0.25, 0.5), [0.5, 0.75), and [0.75, 1.0]; the terminal device can first perform the trigonometric function transformation on the azimuth angle and normalize it, and then determine the azimuth angle feature according to the interval to which the normalized trigonometric function value is mapped.
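  • Purely for illustration, a minimal sketch of the three example feature extractions described above (one-hot interval encoding, normalized trigonometric transform, and discretized trigonometric transform); the interval boundaries follow the examples in the text, while the function names and the particular normalization are assumptions:

```python
import math

def onehot_azimuth(theta_deg):
    """Example 1: map the azimuth angle to one of four intervals and one-hot encode it."""
    bounds = [45, 90, 135, 180]  # [0,45), [45,90), [90,135), [135,180]
    idx = next(i for i, b in enumerate(bounds) if theta_deg < b or b == 180)
    return [1 if i == idx else 0 for i in range(4)]

def trig_feature(theta_deg):
    """Example 2: trigonometric transform normalized to [0, 1] (here (sin + 1) / 2, an assumed choice)."""
    return (math.sin(math.radians(theta_deg)) + 1.0) / 2.0

def discretized_trig_feature(theta_deg):
    """Example 3: one-hot encode the normalized trig value over [0,.25), [.25,.5), [.5,.75), [.75,1]."""
    v = trig_feature(theta_deg)
    idx = min(int(v / 0.25), 3)
    return [1 if i == idx else 0 for i in range(4)]

# e.g. theta = 30 degrees -> onehot_azimuth(30) == [1, 0, 0, 0], matching the text's example
```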
  • the scene recognition is performed through a neural network model, and the process of scene recognition is described in conjunction with the framework of the neural network model shown in FIG. 2a.
  • the terminal device uses the images preprocessed in Part 3 as the input data of the multiple first feature extraction layers, and uses the azimuth angle features obtained in Part 4 as the input data of the multiple first layer models; the image and the azimuth angle received by each pair of first feature extraction layer and first layer model are associated, that is, the image captured by a camera and the azimuth angle feature corresponding to that same camera are used as the input data of one pair of the first feature extraction layer and the first layer model.
  • the front camera of the mobile phone captures an image and preprocesses the image to obtain image 1.
  • the azimuth angle of the front camera is ⁇ 1, and the azimuth angle feature of ⁇ 1 is C1.
  • the neural network model includes feature extraction layer 1 and the first layer model 1.
  • the output of the feature extraction layer 1 is the input of the first layer model 1.
  • the neural network model can also include the feature extraction layer 2 and the first layer model 2.
  • the feature extraction layer The output of 2 is the input of the first layer model 2.
  • the terminal device can use image 1 as the input of the feature extraction layer 1 and C1 as the input of the first layer model 1; alternatively, the terminal device can use image 1 as the input of the feature extraction layer 2 and C1 as the input of the first layer model 2.
  • the serial numbers of image 1, image 2, feature extraction layer 1, feature extraction layer 2, first-layer model 1, and first-layer model 2 in this application are not intended to limit the order or correspondence; they are only numbers set to distinguish different modules and are not to be construed as limitations on this application.
  • the first feature extraction layer extracts the features of the image to obtain a feature map (feature vector), and inputs the feature map (feature vector) into the first layer model.
  • the first recognition result can be obtained by recognizing (classifying) the scene of the image.
  • the terminal device can also use the azimuth angle feature as the input data of the second-layer model, and the second-layer model can further identify (classify) the scene according to the first recognition result output by the first-layer model and the corresponding azimuth angle feature, and obtain Scene recognition results.
  • FIG. 5 is a block diagram showing the structure of a neural network model according to an embodiment of the present application.
  • FIG. 6 shows a schematic structural diagram of a first layer model according to an embodiment of the present application.
  • the terminal device of the present application performs image acquisition at J angles (azimuth angles), that is, cameras at J angles are used to collect images from the J angles.
  • Image 1 is used as the input data of the CNN on the upper side, and the CNN on the upper side performs feature extraction on the input image 1 to obtain a feature vector y i , where i is a positive integer from 1 to n.
  • image 2 is used as the input data of the CNN on the lower side, and the CNN on the lower side performs feature extraction on the input image 2 to obtain a feature vector xi .
  • the terminal device takes the feature vector yi and the azimuth angle feature C1 as the input data of the first layer model on the upper side.
  • the first layer model can calculate the first recognition result Z j according to the feature vector y i and the azimuth angle feature C1, where j is a positive integer from 1 to J, and J represents the number of angles (azimuth angles) at which images are collected; J is equal to 2 in this example, and j is equal to 1 in the example of FIG. 6.
  • the first layer model can be implemented based on the attention mechanism, and specifically, it can include the activation function (tanh), the softmax function, and the weighted average process shown in FIG. 6 . Among them, tanh is one of the activation functions, and the present application is not limited to this, and other activation functions can also be used, such as: sigmoid, ReLU, LReLU, ELU function, and so on.
  • in formula (3), C j represents the azimuth angle feature corresponding to the image at this angle, y i is a feature vector output by the CNN, W i and b i are the weight and bias value of the activation function, respectively, [C j , y i ] represents the vector obtained by splicing the feature vector and the azimuth angle feature, and the remaining parameter represents the parameters of the softmax function. According to the m i calculated by the tanh function, it can be determined whether to activate the corresponding neuron, and the features corresponding to the activated tanh units are extracted as the basis for classification; the softmax function normalizes the m i to obtain the weight of each feature vector. Therefore, the azimuth angle feature can affect the calculated weight s i of the feature vector y i .
  • the scene recognition model provided by the present application can identify key features according to the azimuth angle, filter irrelevant information, reduce noise, improve recognition accuracy, and realize scene recognition without user perception.
  • the weight si calculated according to formula (3) can represent the weight value assigned to different features extracted from the image, and the feature vector and the corresponding weight are weighted and summed to obtain the first recognition result Z 1 of the first layer model.
  • the terminal device uses the feature vector x i (i is a positive integer from 1 to n', where n' and n may or may not be equal, which is not limited in this application) and the azimuth angle feature C 2 as the input data of the first layer model on the lower side, and the first recognition result Z 2 can be calculated in the same manner as for the first layer model on the upper side.
  • the neural network model of the embodiment of the present application extracts feature maps through CNN, and assigns different weights to different convolution results of images in combination with azimuth angles.
  • the first-layer model identifies key features and filters irrelevant information through a competitive mechanism, which can effectively reduce the probability of misrecognition.
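  • Since the body of formula (3) is not reproduced in this text, the following is only a minimal sketch of a first layer model consistent with the description above (tanh scoring of each spliced pair [C j , y i ], softmax normalization to weights s i , and a weighted sum of the feature vectors); the parameter shapes, function names, and use of NumPy are assumptions, not the original implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def first_layer_model(feature_vectors, azimuth_feature, W, b, u):
    """First layer: weight the CNN feature vectors y_i by azimuth-dependent attention.

    feature_vectors: array of shape (n, d)   -- y_1..y_n from the CNN
    azimuth_feature: array of shape (c,)     -- C_j for this camera angle
    W: (h, c + d), b: (h,), u: (h,)          -- assumed parameter shapes
    """
    feature_vectors = np.asarray(feature_vectors, dtype=float)
    scores = []
    for y_i in feature_vectors:
        m_i = np.tanh(W @ np.concatenate([azimuth_feature, y_i]) + b)  # activation on [C_j, y_i]
        scores.append(u @ m_i)                                          # softmax parameters u
    s = softmax(np.array(scores))                       # weights s_i of each feature vector
    z_j = (s[:, None] * feature_vectors).sum(axis=0)    # weighted sum -> first recognition result Z_j
    return z_j, s
```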
  • the structure of the second layer model is shown in Figure 5; f 1 and f 2 represent the activation function tanh, where the number of tanh functions in the second layer model is related to the number of angles at which images are collected: the number of tanh functions can be equal to or greater than the number of angles at which images are acquired.
  • the input of the tanh function includes the output result Z j of the first layer model and the azimuth angle feature C j
  • the second layer model also includes the softmax function
  • the softmax function calculates the weight S j of the output result Z j of the first layer model according to the calculation result of the tanh function; finally, the second layer model calculates the final scene recognition result Z according to the calculated weight S j and the output result Z j of the first layer model.
  • the specific calculation method is shown in the following formula (4):
  • in formula (4), [C j , Z j ] represents the vector obtained by splicing the first recognition result Z j of an image and the corresponding azimuth angle feature C j , W j and b j represent the weight and bias value of the tanh function, respectively, and the remaining parameter represents the parameters of the softmax function. According to the M j calculated by the tanh function, it can be determined whether to activate the corresponding neuron, and the feature corresponding to the activated tanh unit (the first recognition result) is extracted as the basis for classification; the softmax function normalizes M j to obtain the weight of each first recognition result. Therefore, the azimuth angle feature can affect the calculated weight S j of the first recognition result Z j .
  • the scene recognition model provided by the present application can extract the features of key angles according to the azimuth angle, filter irrelevant angles, improve the recognition accuracy, and realize scene recognition without user perception.
  • the weight S j calculated according to formula (4) can represent the weight value assigned to different first recognition results, and the first recognition result and the corresponding weight are weighted and summed to obtain the scene recognition result Z of the second layer model.
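  • Similarly, the body of formula (4) is not reproduced in this text; the following is only a minimal sketch of a second layer model following the same pattern at the level of first recognition results (tanh over the spliced [C j , Z j ], softmax weights S j , weighted sum to the scene recognition result Z); shapes and names are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def second_layer_model(first_results, azimuth_features, W, b, u):
    """Second layer: weight the J first recognition results Z_j according to their azimuth angles.

    first_results:    array of shape (J, d)  -- Z_1..Z_J from the first layer models
    azimuth_features: array of shape (J, c)  -- C_1..C_J
    W: (h, c + d), b: (h,), u: (h,)          -- assumed parameter shapes
    """
    first_results = np.asarray(first_results, dtype=float)
    scores = []
    for z_j, c_j in zip(first_results, azimuth_features):
        M_j = np.tanh(W @ np.concatenate([c_j, z_j]) + b)   # activation on [C_j, Z_j]
        scores.append(u @ M_j)
    S = softmax(np.array(scores))                  # weights S_j of each first recognition result
    Z = (S[:, None] * first_results).sum(axis=0)   # weighted sum -> final scene recognition result Z
    return Z, S
```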
  • W i , b i , W j , b j and the softmax parameters are model parameters; the parameter values can be obtained by training the neural network model shown in FIG. 5 using sample data. The neural network model of the present application can be trained by using the training methods in the related art, which will not be repeated here.
  • the second-layer model may also be implemented by presetting weights for each azimuth angle, by voting, or by presetting a weight mapping function according to the azimuth angle. That is to say, in another embodiment of the present application, the neural network model includes multiple pairs of feature extraction layers and first-layer models as well as a second-layer model, and the second-layer model is realized by presetting weights for each azimuth angle, by voting, or by a preset weight mapping function according to the azimuth angle.
  • the second-layer model can preset a weight for each azimuth angle; for the azimuth angle corresponding to each first recognition result Z j , the preset corresponding weight is S j , and the second-layer model performs a weighted calculation on each first recognition result Z j and its corresponding weight S j to obtain the scene recognition result.
  • the second-layer model may also preset a weight mapping function according to the azimuth angle, that is, different azimuth angles correspond to different preset weight groups, and each preset weight group may include a weight corresponding to each first recognition result Z j .
  • the terminal device collects images from two angles through the front camera and the rear camera, and recognizes the images from the two angles to obtain two first recognition results Z 1 and Z 2 , whose corresponding weights are S 1 and S 2 , respectively.
  • the preset weight mapping function according to the azimuth angle can be as follows:
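  • The concrete mapping is not reproduced in this text; purely as an illustration of such a preset rule, a hypothetical weight-group mapping for the two-camera case could look like the following sketch, where the interval boundaries and weight values are assumptions, not values from the original:

```python
def preset_weight_group(azimuth_deg):
    """Hypothetical mapping from the front-camera azimuth angle to a weight group (S_1, S_2)."""
    if azimuth_deg < 45:
        return 0.7, 0.3      # illustrative values only
    elif azimuth_deg < 135:
        return 0.5, 0.5
    else:
        return 0.3, 0.7

def fuse(Z1, Z2, azimuth_deg):
    """Weighted sum of the two first recognition results using the selected weight group."""
    S1, S2 = preset_weight_group(azimuth_deg)
    return [S1 * a + S2 * b for a, b in zip(Z1, Z2)]
```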
  • the terminal device may output the final recognition result according to the scene recognition result and a preset strategy, wherein the preset strategy may include filtering the scene recognition result according to a confidence threshold and then outputting the final recognition result, or combining multiple categories into one category and outputting the final recognition result, and so on.
  • Example 2 Assuming that the scene recognition result includes 100 categories, the terminal device can combine several categories into a larger category output, for example, combine cars and buses into the car category as the final recognition result output.
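  • A minimal sketch of these two output strategies (confidence-threshold filtering and merging several categories into a larger category); the threshold value and the category grouping are assumptions used only for illustration:

```python
CATEGORY_MERGE = {"car": "car", "bus": "car"}   # hypothetical grouping: cars and buses -> "car"

def final_result(scene_scores, threshold=0.6):
    """scene_scores: dict mapping category name -> confidence from the scene recognition result."""
    category, confidence = max(scene_scores.items(), key=lambda kv: kv[1])
    if confidence < threshold:                       # filter out low-confidence results
        return None
    return CATEGORY_MERGE.get(category, category)    # merge into a larger category if configured
```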
  • FIG. 7 shows a flowchart of a scene recognition method according to an embodiment of the present application.
  • the scene recognition method may include the following steps:
  • Step S700: the terminal device collects images of the same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle when each camera collects the image is the azimuth angle corresponding to the image, and the azimuth angle is the angle between the direction vector and the gravity unit vector when each camera collects the image.
  • Step S701 the terminal device recognizes the same scene according to the image and the azimuth angle corresponding to the image, and obtains a scene recognition result.
  • multiple cameras are used to shoot images of multiple azimuth angles of the same scene, and the scene is recognized by combining the multiple images with the azimuth angle corresponding to each image. Since more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problems of the limited viewing angle range and shooting angle of single-camera scene recognition can be solved, and the recognition is more accurate.
  • the images captured by the cameras in the embodiments of the present application may be black and white images, RGB (Red, Green, Blue) color images, or RGB-D (RGB-Depth) depth images (D refers to depth information), or It can be an infrared image, which is not limited in this application.
  • an image of an azimuth angle may be an image captured by one camera, or may be an image obtained by combining images captured by multiple cameras.
  • a mobile phone may include multiple rear cameras, and the image captured by the rear camera may be an image obtained by combining images captured by the multiple cameras.
  • the method may further include:
  • the terminal device preprocesses the image; wherein the preprocessing includes one or a combination of the following processes: converting the image format, converting the image channels, unifying the image size, and normalizing the image. Converting the image format refers to converting a color image into a black and white image; converting the image channels refers to converting the image to the red, green, and blue RGB channels; unifying the image size refers to adjusting multiple images to the same length and width; and image normalization refers to normalizing the pixel values of the image.
  • Each azimuth angle can adopt the same preprocessing method, or can adopt different preprocessing methods according to the images collected at each azimuth angle, which is not limited in this application.
  • the terminal device obtains the acceleration on the coordinate axis of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera collects the image, and can obtain the direction vector when each camera collects the image;
  • the corresponding three-dimensional rectangular coordinate system takes each camera as the origin, the z direction is the direction along the camera, and x and y are the directions perpendicular to the z direction respectively;
  • the azimuth angle is calculated according to the direction vector and the gravity unit vector.
  • the specific process please refer to the content of Part 2 above, which will not be repeated.
  • the terminal device may preset a weight corresponding to each azimuth angle, and weight the image recognition result of each azimuth angle according to the weight of each azimuth angle to obtain the final scene recognition result.
  • the image and the azimuth corresponding to the image can also be input into the trained neural network model to identify the same scene, and the scene identification result can be obtained.
  • the present application does not limit the specific manner of scene recognition.
  • FIG. 8 shows a flowchart of the method of step S701 according to an embodiment of the present application.
  • the terminal device recognizes the same scene according to the image and the azimuth angle corresponding to the image, and obtains a scene recognition result, which may include:
  • Step S7010 the terminal device extracts the azimuth angle feature corresponding to the image from the azimuth angle corresponding to the image;
  • Step S7011 using a scene recognition model to recognize the same scene based on the image and the azimuth feature corresponding to the image, to obtain a scene recognition result, wherein the scene recognition model is a neural network model.
  • step S7010 for the specific process, reference may be made to the introduction in Part 4 above, and details are not repeated here.
  • the image and the azimuth angle feature corresponding to the image can be input into the scene recognition model, and the scene recognition model can be used to identify the image based on the image and the azimuth angle feature corresponding to the image. The same scene, get the scene recognition result.
  • the scene recognition model can be implemented through a variety of different neural network structures.
  • the scene recognition model includes multiple pairs of first feature extraction layers and first layer models; each pair of the first feature extraction layer and the first layer model is used to process an image of one azimuth angle and the azimuth angle corresponding to the image of the one azimuth angle to obtain a first recognition result.
  • An example of the first feature extraction layer may be image feature extraction layer 1 and feature extraction layer 2 as shown in FIG. 2a.
  • the number of pairs of the first feature extraction layer and the first layer model included in the neural network model can be configured according to the number of camera angles set in specific application scenarios; the number of pairs of the first feature extraction layer and the first layer model can be greater than or equal to the number of angles.
  • the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result
  • the first feature extraction layer is used to extract the feature of the image of the one azimuth angle to obtain the feature vector
  • the The first layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle.
  • the azimuth angle feature 1 corresponding to image 1, extracted from the azimuth angle corresponding to image 1, can be used as the input of the first layer model 1, and the azimuth angle feature 2 corresponding to image 2, extracted from the azimuth angle corresponding to image 2, can be used as the input of the first layer model 2;
  • the feature extraction layer 1 can extract the feature vector 1 of the image 1 and output it to the first layer model 1, and the feature extraction layer 2 can extract the feature vector of the image 2 2 and output to the first layer model 2.
  • the first layer model 1 can combine the feature vector 1 and the azimuth feature 1 to perform scene recognition to obtain the first recognition result 1
  • the first layer model 2 can combine the feature vector 2 and the azimuth feature 2 to perform scene recognition to obtain the first recognition result 2.
  • the scene recognition model further includes a second layer model, and the first layer model 1 and the first layer model 2 can respectively output the first recognition result 1 and the first recognition result 2 to the second layer model.
  • the azimuth feature 1 corresponding to image 1, extracted from the azimuth angle corresponding to image 1, can be used as an input of the second layer model, and the azimuth feature 1 corresponds to the first recognition result 1;
  • the azimuth feature 2 corresponding to image 2, extracted from the azimuth angle corresponding to image 2, can also be used as an input of the second layer model, and the azimuth feature 2 corresponds to the first recognition result 2.
  • the second layer model is used to obtain the scene recognition result according to the first recognition result and the azimuth angle corresponding to the first recognition result.
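  • for illustration, a hedged PyTorch-style sketch of this two-layer structure follows; the feature extractor, layer sizes, and the tanh-plus-softmax scoring are assumptions and do not reproduce the exact architecture of the present application.

```python
import torch
import torch.nn as nn

class FirstLayerModel(nn.Module):
    """Combines one image's feature vectors with its azimuth feature to
    produce a first recognition result (all sizes are illustrative)."""
    def __init__(self, feat_dim=128, az_dim=8, num_classes=10):
        super().__init__()
        self.score = nn.Linear(feat_dim + az_dim, 1)      # per-feature-vector weight
        self.classify = nn.Linear(feat_dim, num_classes)  # produces the first recognition result

    def forward(self, feats, az):            # feats: (N, feat_dim), az: (az_dim,)
        az = az.expand(feats.size(0), -1)
        scores = torch.tanh(self.score(torch.cat([feats, az], dim=1)))
        w = torch.softmax(scores, dim=0)     # first weights over the N feature vectors
        pooled = (w * feats).sum(dim=0)      # weighted combination of the feature vectors
        return self.classify(pooled)         # first recognition result Z_j

class SceneRecognitionModel(nn.Module):
    """One (feature extractor, first layer model) pair per azimuth, plus a
    second layer model that weights the first recognition results."""
    def __init__(self, num_views=2, num_classes=10, feat_dim=128, az_dim=8):
        super().__init__()
        self.extractors = nn.ModuleList(
            nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(4), nn.Flatten(1),
                          nn.Linear(16 * 16, feat_dim))
            for _ in range(num_views))
        self.first = nn.ModuleList(
            FirstLayerModel(feat_dim, az_dim, num_classes) for _ in range(num_views))
        self.second_score = nn.Linear(num_classes + az_dim, 1)   # second layer model

    def forward(self, images, az_feats):     # images: list of (1, 3, H, W); az_feats: list of (az_dim,)
        firsts = [first(ext(img), az)
                  for ext, first, img, az in zip(self.extractors, self.first, images, az_feats)]
        z = torch.stack(firsts)              # (num_views, num_classes) first recognition results
        c = torch.stack(list(az_feats))      # (num_views, az_dim) azimuth features
        s = torch.softmax(torch.tanh(self.second_score(torch.cat([z, c], dim=1))), dim=0)
        return (s * z).sum(dim=0)            # final scene recognition result
```

  • a call such as model([front_image, rear_image], [front_azimuth_feature, rear_azimuth_feature]) would return the final class scores for the scene; all of these names are illustrative.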
  • the scene recognition method of the embodiment of the present application adopts a two-layer scene recognition model, collects images from multiple angles and combines the azimuth angles of the images to perform scene recognition using a competition mechanism, and considers the results of both local features and overall features, which can improve the accuracy of recognition in scene recognition without user perception and reduce misjudgment.
  • the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain multiple feature vectors; the first layer model is used to calculate the first weight corresponding to each feature vector in the multiple feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle; the first layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
  • the second layer model is used to calculate the second weight of the first recognition result according to the azimuth angle corresponding to the first recognition result; the second layer model is used to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
  • the first layer model may include an activation function and a softmax function, where the activation function may be a tanh function; the activation function may also be another type of activation function, such as a Sigmoid activation function or a ReLU activation function, and is not limited to those shown in FIG. 5 and FIG. 6.
  • the number of activation functions in the first layer model can be set according to the number of feature vectors extracted by the feature extraction layer, and can be greater than or equal to the number of extracted feature vectors.
  • the activation function is used to determine whether to activate the corresponding neuron according to the feature vector and the azimuth feature, and the feature corresponding to the activation function is extracted as the basis for classification.
  • the activation function and the softmax function are used to calculate the first weight corresponding to each feature vector, such as the s_i calculated by formula (3) above; for the specific process, refer to the description above, which is not repeated here.
  • the azimuth feature can affect the weight s_i calculated for the feature vector y_i: if the calculated s_i is relatively large, the feature vector has a greater influence on the classification result; if the calculated s_i is relatively small, the feature vector has less influence on the classification result. Therefore, the scene recognition model provided by the present application can identify key features according to the azimuth angle, filter irrelevant information, reduce noise, improve recognition accuracy, and realize scene recognition without user perception.
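  • as a hedged illustration of this weighting (formula (3) itself is not reproduced here), the sketch below derives softmax weights s_i for the feature vectors y_i from tanh activations over each concatenation of y_i with the azimuth feature c; the parameters W and b and their shapes are assumptions.

```python
import numpy as np

def first_weights(Y, c, W, b):
    """Y: list of feature vectors y_i; c: azimuth feature; W, b: assumed learned
    parameters of the activation function, with W of shape (k, len(y_i) + len(c))
    and b of shape (k,). Returns one softmax weight s_i per feature vector."""
    scores = np.array([np.tanh(W @ np.concatenate([y, c]) + b).sum() for y in Y])
    e = np.exp(scores - scores.max())        # numerically stable softmax
    return e / e.sum()

# With a different azimuth feature c, the same feature vectors Y receive
# different weights, so features irrelevant to that viewing angle are suppressed.
```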
  • the second-layer model may include an activation function and a softmax function, wherein the activation function may be a tanh function, and the activation function may also adopt other types of activation functions, which are not limited to the examples shown in FIG. 5 and FIG. 6 .
  • the number of activation functions in the second-layer model is related to the number of angles at which images are collected, and the number of activation functions may be equal to or greater than the number of angles at which images are collected.
  • the input of the activation function includes the output result Z_j of the first layer model and the azimuth angle feature C_j;
  • the second layer model also includes a softmax function;
  • the softmax function calculates the weight S_j of the output result Z_j of the first layer model according to the calculation result of the activation function; for the specific calculation process, refer to formula (4) and the description in Part 5 above, which is not repeated here.
  • the azimuth angle feature can affect the weight S_j calculated for the first recognition result Z_j: if the calculated S_j is relatively large, the first recognition result has a greater influence on the classification result; if the calculated S_j is relatively small, the first recognition result has less influence on the classification result. Therefore, the scene recognition model provided by the present application can extract the features of key angles according to the azimuth angle, filter irrelevant angles, improve recognition accuracy, and realize scene recognition without user perception.
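  • a small numeric illustration of this second-layer weighting follows; the values of Z_j and S_j are invented solely to show that a larger S_j gives the corresponding first recognition result more influence on the final result.

```python
import numpy as np

Z = np.array([[0.8, 0.2],    # Z_1: first recognition result from the front view
              [0.3, 0.7]])   # Z_2: first recognition result from the rear view
S = np.array([0.25, 0.75])   # second-layer weights S_1, S_2 (rear view dominates)
final = S @ Z                # weighted combination of the first recognition results
print(final)                 # [0.425 0.575] -> the rear view's class wins
```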
  • the second-layer model presets a third weight corresponding to each of the first recognition results, and the second-layer model is configured to obtain the scene recognition result according to the first recognition result and the third weight corresponding to the first recognition result.
  • the second-layer model can preset a corresponding weight S_j for each azimuth angle;
  • the second-layer model can perform a weighted calculation on each first recognition result Z_j and the corresponding preset weight S_j to obtain the scene recognition result.
  • the second layer model is configured to determine a fourth weight corresponding to each of the first recognition results according to the azimuth angle and a preset rule; wherein the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first recognition result; the second layer model is used to obtain the scene recognition result according to the first recognition result and the fourth weight corresponding to the first recognition result.
  • the second layer model may also preset a weight mapping function according to the azimuth angle, that is, different azimuth angles correspond to different preset weight groups, and each preset weight group may include the weight corresponding to each first recognition result Z_j.
  • the terminal device collects images from two angles through the front camera and the rear camera, and recognizes the images from the two angles to obtain two first recognition results Z_1 and Z_2, whose corresponding weights are S_1 and S_2 respectively;
  • the preset weight mapping function according to the azimuth angle can be as follows:
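  • the extracted text does not reproduce the mapping itself at this point, so it is not restored here; the sketch below is only a hypothetical illustration of a preset rule that selects a weight group (S_1, S_2) from the two azimuth angles, with invented thresholds and values.

```python
def weight_group(azimuth_front, azimuth_rear):
    """Hypothetical preset rule: pick the weight group (S_1, S_2) for the two
    first recognition results Z_1 and Z_2 from the two azimuth angles.
    Thresholds and weights are invented for illustration only."""
    if azimuth_front < 45 and azimuth_rear > 135:
        return 0.3, 0.7      # phone lying flat: trust the rear (downward) view more
    if azimuth_front > 135 and azimuth_rear < 45:
        return 0.7, 0.3      # phone flipped over: trust the front view more
    return 0.5, 0.5          # otherwise weight both views equally
```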
  • FIG. 9 shows a block diagram of a scene recognition apparatus according to an embodiment of the present application.
  • the apparatus may include:
  • the image acquisition module is used to collect images in the same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle when each camera collects the image is the azimuth angle corresponding to the image, and the azimuth angle is the angle between the direction vector when each camera collects the image and the gravity unit vector;
  • a scene recognition module configured to recognize the same scene according to the image and the azimuth angle corresponding to the image, and obtain a scene recognition result.
  • the scene recognition device of the embodiment of the present application uses multiple cameras to capture images of multiple azimuth angles of the same scene, and recognizes the scene in combination with the multiple images and the azimuth angle corresponding to each image; since more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problem that a single camera recognizing a scene has a limited viewing angle range and shooting angle is solved, and the recognition is more accurate.
  • the scene recognition module includes:
  • an azimuth feature extraction module used for extracting the azimuth feature corresponding to the image from the azimuth angle corresponding to the image
  • a scene recognition model configured to recognize the same scene based on the image and the azimuth angle feature corresponding to the image, and obtain a scene recognition result, wherein the scene recognition model is a neural network model.
  • the scene recognition model includes multiple pairs of first feature extraction layers and first layer models, and each pair of a first feature extraction layer and a first layer model is used to process an image of one azimuth angle and the azimuth angle corresponding to the image of the one azimuth angle to obtain a first recognition result; wherein the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result, the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain a feature vector, and the first layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle; the scene recognition model further includes a second layer model, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the azimuth angle corresponding to the first recognition result.
  • the scene recognition device of the embodiment of the present application adopts a two-layer scene recognition model, collects multi-angle images and combines the azimuth angles of the images to perform scene recognition using a competition mechanism, and considers the results of both local features and overall features, which can improve the accuracy of recognition in scene recognition without user perception and reduce misjudgment.
  • the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain multiple feature vectors; the first layer model is used to calculate the first weight corresponding to each feature vector in the multiple feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle; the first layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
  • the second layer model is used to calculate the second weight of the first recognition result according to the azimuth angle corresponding to the first recognition result; the second layer model is used to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
  • the second-layer model presets a third weight corresponding to each of the first recognition results, and the second-layer model is configured to obtain the scene recognition result according to the first recognition result and the third weight corresponding to the first recognition result.
  • the second layer model is configured to determine a fourth weight corresponding to each of the first recognition results according to the azimuth angle and a preset rule; wherein the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first recognition result; the second layer model is used to obtain the scene recognition result according to the first recognition result and the fourth weight corresponding to the first recognition result.
  • the device further includes: an azimuth angle acquisition module, configured to acquire the acceleration on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera collects the image, to obtain the direction vector when each camera collects the image; wherein the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image takes that camera as the origin, the z direction is the direction along the camera, x and y are respectively directions perpendicular to the z direction, and the plane where x and y are located is perpendicular to the z direction; the azimuth angle is calculated according to the direction vector and the gravity unit vector.
  • an azimuth angle acquisition module configured to acquire the acceleration on the coordinate axis of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera collects the image, and obtain each The direction vector when the camera collects the image
  • the corresponding three-dimensional rectangular coordinate system when each camera collects the image takes each camera as the origin, the z direction is the direction along the camera, and x and y
  • the apparatus further includes: an image preprocessing module, configured to perform preprocessing on the image; wherein the preprocessing includes a combination of one or more of the following processes: converting the image format, converting the image channel, unifying the image size, and image normalization; converting the image format refers to converting a color image to a black-and-white image, converting the image channel refers to converting the image to the red, green and blue (RGB) channels, unifying the image size refers to adjusting multiple images to have the same length and the same width, and image normalization refers to normalizing the pixel values of the images.
  • an image preprocessing module configured to perform preprocessing on the image; wherein, the preprocessing includes a combination of one or more of the following processes: converting Image format, convert image channel, unify image size, image normalization, convert image format refers to converting color image to black and white image, convert image channel refers to converting image to red, green and blue RGB channel, and unify image size refers to adjusting Multiple images have the same length and the same width, and image normalization refers to normalizing the
  • the apparatus further includes: a result output module, configured to output the scene recognition result.
  • FIG. 10 shows a schematic structural diagram of a terminal device according to an embodiment of the present application. Taking the terminal device being a mobile phone as an example, FIG. 10 shows a schematic structural diagram of the mobile phone 200.
  • the mobile phone 200 may include a processor 210, an external memory interface 220, an internal memory 221, a USB interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 251, a wireless communication module 252, Audio module 270, speaker 270A, receiver 270B, microphone 270C, headphone jack 270D, sensor module 280, buttons 290, motor 291, indicator 292, camera 293, display screen 294, SIM card interface 295, etc.
  • the sensor module 280 may include a gyroscope sensor 280A, an acceleration sensor 280B, a proximity light sensor 280G, a fingerprint sensor 280H, and a touch sensor 280K (of course, the mobile phone 200 may also include other sensors, such as a temperature sensor, a pressure sensor, a distance sensor, a magnetic sensor, an ambient light sensor, an air pressure sensor, a bone conduction sensor, etc., which are not shown in the figure).
  • the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the mobile phone 200 .
  • the mobile phone 200 may include more or less components than shown, or combine some components, or separate some components, or arrange different components.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 210 may include one or more processing units, for example, the processor 210 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
  • the controller may be the nerve center and command center of the mobile phone 200 . The controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 210 for storing instructions and data.
  • the memory in processor 210 is cache memory.
  • the memory may hold instructions or data that have just been used or recycled by the processor 210 . If the processor 210 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided, and the waiting time of the processor 210 is reduced, thereby improving the efficiency of the system.
  • the processor 210 can run the scene recognition method provided by the embodiment of the present application, so as to recognize the scene in combination with multiple images and the azimuth angle corresponding to each image, obtain more comprehensive scene information, improve the accuracy of image-based scene recognition, solve the problem that a single camera recognizing a scene has a limited viewing angle range and shooting angle, and make the recognition more accurate.
  • the processor 210 may include different devices. For example, when a CPU and a GPU are integrated, the CPU and the GPU may cooperate to execute the scene recognition method provided by the embodiments of the present application. For example, some algorithms in the scene recognition method are executed by the CPU, and another part of the algorithms are executed by the GPU, so as to obtain faster processing efficiency.
  • Display screen 294 is used to display images, videos, and the like.
  • Display screen 294 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), and so on.
  • cell phone 200 may include 1 or N display screens 294, where N is a positive integer greater than 1.
  • the display screen 294 may be used to display information entered by or provided to the user as well as various graphical user interfaces (GUIs).
  • display 294 may display photos, videos, web pages, or documents, and the like.
  • display 294 may display a graphical user interface.
  • the GUI includes a status bar, a hideable navigation bar, a time and weather widget, and an application icon, such as a browser icon.
  • the status bar includes operator name (eg China Mobile), mobile network (eg 4G), time and remaining battery.
  • the navigation bar includes a back button icon, a home button icon, and a forward button icon.
  • the status bar may further include a Bluetooth icon, a Wi-Fi icon, an external device icon, and the like.
  • the graphical user interface may further include a Dock bar, and the Dock bar may include commonly used application icons and the like.
  • the display screen 294 may be an integrated flexible display screen, or a spliced display screen composed of two rigid screens and a flexible screen located between the two rigid screens.
  • Cameras 293 are used to capture still images or video.
  • the camera 293 may include a photosensitive element such as a lens group and an image sensor, wherein the lens group includes a plurality of lenses (convex or concave) for collecting the light signal reflected by the object to be photographed and transmitting the collected light signal to the image sensor.
  • the image sensor generates an original image of the object to be photographed according to the light signal.
  • multiple cameras are used to collect images in the same scene from multiple azimuth angles, so that the scene can be identified in combination with the multiple images and the azimuth angle corresponding to each image, and more comprehensive scene information can be obtained, which improves the accuracy of image-based scene recognition, solves the problem that a single camera recognizing a scene has a limited viewing angle range and shooting angle, and makes the recognition more accurate.
  • Internal memory 221 may be used to store computer executable program code, which includes instructions.
  • the processor 210 executes various functional applications and data processing of the mobile phone 200 by executing the instructions stored in the internal memory 221 .
  • the internal memory 221 may include a storage program area and a storage data area.
  • the storage program area may store operating system, code of application programs (such as camera application, WeChat application, etc.), and the like.
  • the storage data area may store data created during the use of the mobile phone 200 (such as images and videos collected by the camera application) and the like.
  • the internal memory 221 may also store one or more computer programs 1310 corresponding to the scene recognition method provided by the embodiment of the present application.
  • the one or more computer programs 1310 are stored in the aforementioned memory 221 and configured to be executed by the one or more processors 210, and the one or more computer programs 1310 include instructions that may be used to perform the steps in the embodiments of the present application;
  • the computer program 1310 may include: an image acquisition module, used to acquire images in the same scene from multiple azimuth angles through a plurality of cameras; a scene recognition module, used to recognize the same scene according to the image and the azimuth angle corresponding to the image to obtain the scene recognition result; an azimuth angle acquisition module, used to acquire the acceleration on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera collects the image, obtain the direction vector when each camera collects the image, and calculate the azimuth angle according to the direction vector and the gravity unit vector; and an image preprocessing module, used to preprocess the image.
  • the internal memory 221 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the code of the scene recognition method provided by the embodiment of the present application may also be stored in an external memory.
  • the processor 210 may execute the code of the scene recognition method stored in the external memory through the external memory interface 220 .
  • the function of the sensor module 280 is described below.
  • the gyro sensor 280A can be used to determine the movement posture of the mobile phone 200; for example, the angular velocity of the mobile phone 200 about three axes (i.e., the x, y, and z axes) can be determined.
  • the gyro sensor 280A can be used to detect the current motion state of the mobile phone 200, such as shaking or still.
  • the gyro sensor 280A can be used to detect a folding or unfolding operation acting on the display screen 294 .
  • the gyroscope sensor 280A may report the detected folding operation or unfolding operation to the processor 210 as an event to determine the folding state or unfolding state of the display screen 294 .
  • the acceleration sensor 280B can detect the magnitude of the acceleration of the mobile phone 200 in various directions (generally three axes), that is, it can be used to detect the current motion state of the mobile phone 200, such as shaking or still. When the display screen in the embodiment of the present application is a foldable screen, the acceleration sensor 280B can be used to detect a folding or unfolding operation acting on the display screen 294. The acceleration sensor 280B may report the detected folding operation or unfolding operation to the processor 210 as an event to determine the folding state or unfolding state of the display screen 294.
  • the terminal device can obtain, through the acceleration sensor 280B, the acceleration on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image, obtain the direction vector when each camera collects the image, and calculate the azimuth angle according to the direction vector and the gravity unit vector.
  • Proximity light sensor 280G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes.
  • the light emitting diodes may be infrared light emitting diodes.
  • the mobile phone emits infrared light outward through light-emitting diodes.
  • Phones use photodiodes to detect reflected infrared light from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object near the phone. When insufficient reflected light is detected, the phone can determine that there are no objects near the phone.
  • the proximity light sensor 280G can be arranged on the first screen of the foldable display screen 294, and the proximity light sensor 280G can detect the first screen according to the optical path difference of the infrared signal.
  • the gyroscope sensor 280A (or the acceleration sensor 280B) may send the detected motion state information (such as angular velocity) to the processor 210 .
  • the processor 210 determines, based on the motion state information, whether the current state is the hand-held state or the tripod state (for example, when the angular velocity is not 0, it means that the mobile phone 200 is in the hand-held state).
  • the fingerprint sensor 280H is used to collect fingerprints.
  • the mobile phone 200 can use the collected fingerprint characteristics to realize fingerprint unlocking, accessing application locks, taking photos with fingerprints, answering incoming calls with fingerprints, and the like.
  • Touch sensor 280K also called “touch panel”.
  • the touch sensor 280K may be disposed on the display screen 294, and the touch sensor 280K and the display screen 294 form a touch screen, also called a "touch screen”.
  • the touch sensor 280K is used to detect a touch operation on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to touch operations may be provided through display screen 294 .
  • the touch sensor 280K may also be disposed on the surface of the mobile phone 200 , which is different from the location where the display screen 294 is located.
  • the display screen 294 of the mobile phone 200 displays a main interface, and the main interface includes icons of multiple applications (such as a camera application, a WeChat application, etc.).
  • Display screen 294 displays an interface of a camera application, such as a viewfinder interface. Display screen 294 may also be used to display scene recognition results.
  • the wireless communication function of the mobile phone 200 can be realized by the antenna 1, the antenna 2, the mobile communication module 251, the wireless communication module 252, the modulation and demodulation processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in handset 200 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 251 can provide a wireless communication solution including 2G/3G/4G/5G, etc. applied on the mobile phone 200 .
  • the mobile communication module 251 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like.
  • the mobile communication module 251 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
  • the mobile communication module 251 can also amplify the signal modulated by the modulation and demodulation processor, and then turn it into an electromagnetic wave for radiation through the antenna 1 .
  • at least part of the functional modules of the mobile communication module 251 may be provided in the processor 210 .
  • at least part of the functional modules of the mobile communication module 251 may be provided in the same device as at least part of the modules of the processor 210 .
  • the modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then the demodulator transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the low frequency baseband signal is processed by the baseband processor and passed to the application processor.
  • the application processor outputs sound signals through audio devices (not limited to the speaker 270A, the receiver 270B, etc.), or displays images or videos through the display screen 294 .
  • the modem processor may be a stand-alone device.
  • the modulation and demodulation processor may be independent of the processor 210, and may be provided in the same device as the mobile communication module 251 or other functional modules.
  • the wireless communication module 252 can provide applications on the mobile phone 200 including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field communication technology (near field communication, NFC), infrared technology (infrared, IR) and other wireless communication solutions.
  • the wireless communication module 252 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 252 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 210 .
  • the wireless communication module 252 can also receive the signal to be sent from the processor 210 , perform frequency modulation on the signal, amplify the signal, and then convert it into an electromagnetic wave for radiation through the antenna 2 .
  • the wireless communication module 252 is configured to transmit data with other terminal devices under the control of the processor 210 .
  • the mobile phone 200 can implement audio functions, such as music playback and recording, through an audio module 270, a speaker 270A, a receiver 270B, a microphone 270C, an earphone interface 270D, an application processor, and the like.
  • the cell phone 200 can receive key 290 input and generate key signal input related to user settings and function control of the cell phone 200 .
  • the mobile phone 200 can use the motor 291 to generate vibration alerts (eg, vibration alerts for incoming calls).
  • the indicator 292 in the mobile phone 200 may be an indicator light, which may be used to indicate a charging state, a change in power, and may also be used to indicate a message, a missed call, a notification, and the like.
  • the SIM card interface 295 in the mobile phone 200 is used to connect the SIM card. The SIM card can be contacted and separated from the mobile phone 200 by inserting into the SIM card interface 295 or pulling out from the SIM card interface 295 .
  • the mobile phone 200 may include more or less components than those shown in FIG. 10 , which are not limited in this embodiment of the present application.
  • the illustrated handset 200 is merely an example, and the handset 200 may have more or fewer components than those shown, two or more components may be combined, or may have different component configurations.
  • the various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
  • the software system of the terminal device can adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiments of the present application take an Android system with a layered architecture as an example to exemplarily describe the software structure of a terminal device.
  • An embodiment of the present application provides a scene recognition apparatus, including: a processor and a memory for storing instructions executable by the processor; wherein the processor is configured to implement the above method when executing the instructions.
  • Embodiments of the present application provide a non-volatile computer-readable storage medium on which computer program instructions are stored, and when the computer program instructions are executed by a processor, implement the above method.
  • Embodiments of the present application provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein, when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above method.
  • a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital video discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves on which instructions are stored, and any suitable combination of the foregoing.
  • Computer readable program instructions or code described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.
  • the computer program instructions used to perform the operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), can be personalized by utilizing state information of the computer readable program instructions, and the electronic circuits can execute the computer readable program instructions to implement various aspects of the present application.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer readable program instructions may also be stored in a computer readable storage medium; these instructions cause a computer, programmable data processing apparatus and/or other equipment to operate in a specific manner, so that the computer readable medium on which the instructions are stored includes an article of manufacture comprising instructions for implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other equipment, so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, thereby causing the instructions executing on the computer, other programmable data processing apparatus, or other equipment to implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, a segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in hardware (e.g., circuits or application-specific integrated circuits (ASICs)) that performs the corresponding functions or actions, or can be implemented by a combination of hardware and software, such as firmware.

Abstract

An environment identification method and an apparatus. The method comprises: a terminal device captures images in a single environment by means of a plurality of cameras and from a plurality of azimuth angles, wherein the azimuth angle when each camera captures an image is the azimuth angle corresponding to said image, and each azimuth angle is the included angle of a gravity unit vector and a direction vector when each camera captures an image (S700); and the terminal device identifies said single environment according to the images and the azimuth angles corresponding to the images, and obtains an environment identification result (S701). The present method integrates a plurality of images and azimuth angles corresponding to each of the images to perform identification on an environment, and because more comprehensive environment information is obtained, the accuracy of image-based environment identification can be improved, the problem of limitations on an angle of view range and photographing angles when a single camera is identifying an environment is solved, and identification is more accurate and precise.

Description

场景识别方法及装置Scene recognition method and device
本申请要求于2021年02月25日提交中国专利局、申请号为202110215000.3、申请名称为“场景识别方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application with the application number 202110215000.3 and the application name "Scene Recognition Method and Apparatus" filed with the China Patent Office on February 25, 2021, the entire contents of which are incorporated into this application by reference.
技术领域technical field
本申请涉及图像处理技术领域,尤其涉及一种场景识别方法及装置。The present application relates to the technical field of image processing, and in particular, to a scene recognition method and device.
背景技术Background technique
基于手机等移动终端的场景识别技术是一种重要的基础感知能力,能够服务于智慧出行、服务直达、意图决策、耳机智慧降噪、搜索、推荐等多种业务。The scene recognition technology based on mobile terminals such as mobile phones is an important basic perception capability, which can serve a variety of services such as smart travel, direct service, intention decision-making, smart noise reduction of headphones, search, and recommendation.
现有的场景识别技术主要包括:基于图像的场景识别方法、基于传感器(例如,WiFi、蓝牙、位置传感器等)定位的场景识别方法或基于信号指纹比对的场景识别方法。Existing scene recognition technologies mainly include: image-based scene recognition methods, sensor (eg, WiFi, Bluetooth, location sensors, etc.) positioning-based scene recognition methods, or signal fingerprint comparison-based scene recognition methods.
其中,基于图像的场景识别方法受到摄像头视角范围、拍照角度、物体遮挡等因素影响,在复杂场景下的鲁棒性受到极大挑战,难以实现用户无感的实时场景识别。具体地,基于图像场景识别方法通过前置或后置摄像头采集图像进行识别,由于视角范围小、有效特征少,或者,拍摄角度任意,或者,同一图像中存在着大量物体特征信息,噪声信息会淹没主要特征,这些都容易造成误判。举例来说,当图像中出现天花板特征则识别为室内,实际上则可能是地铁或飞机上。因此,本申请要解决的技术问题是,如何提高基于图像的场景识别方法识别准确度。Among them, the image-based scene recognition method is affected by the camera's viewing angle range, camera angle, object occlusion and other factors, the robustness in complex scenes is greatly challenged, and it is difficult to realize real-time scene recognition without user perception. Specifically, based on the image scene recognition method, the front or rear camera is used to collect images for recognition. Due to the small viewing angle range and few effective features, or the shooting angle is arbitrary, or there is a large amount of object feature information in the same image, noise information will be Overwhelmed by key features, these are prone to miscalculation. For example, when a ceiling feature appears in the image, it is recognized as indoors, but it may actually be on a subway or an airplane. Therefore, the technical problem to be solved by this application is how to improve the recognition accuracy of the image-based scene recognition method.
发明内容SUMMARY OF THE INVENTION
有鉴于此,提出了一种场景识别方法及装置,结合多个图像和每个图像对应的方位角对场景进行识别,由于获得了更全面的场景信息,因此,可以提高基于图像的场景识别的准确度,解决了单摄像头识别场景视角范围、拍摄角度受限的问题,识别更精准。In view of this, a scene recognition method and device are proposed, which combine multiple images and the azimuth angle corresponding to each image to recognize the scene. Since more comprehensive scene information is obtained, the image-based scene recognition can be improved. The accuracy solves the problem of limited viewing angle range and shooting angle of single camera recognition scene, and the recognition is more accurate.
第一方面,本申请的实施例提供了一种场景识别方法,所述方法包括:终端设备通过多个摄像头从多个方位角采集同一场景下的图像,其中,每个摄像头采集所述图像时的方位角为所述图像对应的方位角,所述方位角为所述每个摄像头采集所述图像时的方向向量与重力单位向量的夹角;终端设备根据所述图像和所述图像对应的方位角识别所述同一场景,得到场景识别结果。In a first aspect, an embodiment of the present application provides a scene recognition method, the method includes: a terminal device collects images in the same scene from multiple azimuths through multiple cameras, wherein when each camera collects the image The azimuth angle is the azimuth angle corresponding to the image, and the azimuth angle is the angle between the direction vector and the gravity unit vector when each camera collects the image; The same scene is identified by the azimuth, and the scene identification result is obtained.
本申请实施例的场景识别方法,利用多个摄像头拍摄同一场景的多个方位角的图像,结合多个图像和每个图像对应的方位角对场景进行识别,由于获得了更全面的场景信息,因此,可以提高基于图像的场景识别的准确度,解决了单摄像头识别场景视角范围、拍摄角度受限的问题,识别更精准。In the scene recognition method of the embodiment of the present application, multiple cameras are used to shoot images of multiple azimuth angles of the same scene, and the scene is recognized by combining the multiple images and the azimuth angles corresponding to each image. Since more comprehensive scene information is obtained, Therefore, the accuracy of image-based scene recognition can be improved, the problems of limited viewing angle range and shooting angle of a single camera to recognize a scene can be solved, and the recognition can be more accurate.
根据第一方面,在第一种可能的实现方式中,终端设备根据所述图像和所述图像对应的方位角识别所述同一场景,得到场景识别结果,包括:终端设备从所述图像对应的方位角提取所述图像对应的方位角特征;利用场景识别模型基于所述图像和所述图像对应的方位角特 征识别所述同一场景,得到场景识别结果,其中,所述场景识别模型为神经网络模型。According to the first aspect, in a first possible implementation manner, the terminal device recognizes the same scene according to the image and the azimuth angle corresponding to the image, and obtains a scene recognition result, including: The azimuth angle is extracted from the azimuth angle feature corresponding to the image; the scene recognition model is used to identify the same scene based on the image and the azimuth angle feature corresponding to the image, and a scene recognition result is obtained, wherein the scene recognition model is a neural network. Model.
通过为神经网络模型中的卷积核赋予不同的权重(方位角特征),提取不同方位角下与场景最相关的特征,可以获得更准确的预测结果。By assigning different weights (azimuth features) to the convolution kernels in the neural network model, and extracting the features most relevant to the scene at different azimuths, more accurate prediction results can be obtained.
根据第一方面的第一种可能的实现方式,在第二种可能的实现方式中,所述场景识别模型包括多对第一特征提取层和第一层模型,每对第一特征提取层和第一层模型用于对一个方位角的图像和所述一个方位角的图像对应的方位角进行处理,得到第一识别结果;其中,所述一个方位角的图像对应的方位角为所述第一识别结果对应的方位角,所述第一特征提取层用于提取所述一个方位角的图像的特征,得到特征向量,所述第一层模型用于根据所述特征向量和所述一个方位角的图像对应的方位角,得到所述第一识别结果;所述场景识别模型还包括第二层模型,所述第二层模型用于根据所述第一识别结果和第一识别结果对应的方位角,得到所述场景识别结果。According to a first possible implementation manner of the first aspect, in a second possible implementation manner, the scene recognition model includes multiple pairs of the first feature extraction layer and the first layer model, and each pair of the first feature extraction layer and The first layer model is used to process an image of an azimuth angle and an azimuth angle corresponding to the image of the one azimuth angle to obtain a first recognition result; wherein, the azimuth angle corresponding to the image of the one azimuth angle is the first recognition result. an azimuth angle corresponding to the recognition result, the first feature extraction layer is used to extract the feature of the image of the one azimuth angle to obtain a feature vector, and the first layer model is used to extract the feature vector according to the feature vector and the one azimuth The azimuth angle corresponding to the image of the angle is obtained, and the first recognition result is obtained; the scene recognition model further includes a second layer model, and the second layer model is used for the first recognition result and the corresponding first recognition result. The azimuth angle is used to obtain the scene recognition result.
本申请实施例的场景识别方法,采用两层场景识别模型,采集多角度的图像并结合图像的方位角,利用竞争机制进行场景识别的方式,同时考虑了局部特征和整体特征的结果,可以提高用户无感的场景识别中识别的准确度,减少误判。The scene recognition method of the embodiment of the present application adopts a two-layer scene recognition model, collects images from multiple angles and combines the azimuth angles of the images to perform scene recognition by using a competition mechanism, and considers the results of local features and overall features. The accuracy of recognition in scene recognition without user perception reduces misjudgment.
根据第一方面的第二种可能的实现方式,在第三种可能的实现方式中,所述第一特征提取层用于提取所述一个方位角的图像的特征,得到多个特征向量;所述第一层模型用于根据所述一个方位角的图像对应的方位角计算所述多个特征向量中的每个特征向量对应的第一权重;所述第一层模型用于根据所述每个特征向量和每个特征向量对应的第一权重,得到所述第一识别结果。According to the second possible implementation manner of the first aspect, in a third possible implementation manner, the first feature extraction layer is configured to extract the feature of the image at one azimuth angle to obtain a plurality of feature vectors; The first layer model is used to calculate the first weight corresponding to each feature vector in the plurality of feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle; the first layer model is used to calculate the first weight corresponding to each feature vector according to the eigenvectors and a first weight corresponding to each eigenvector to obtain the first recognition result.
根据第一方面的第二种可能的实现方式,在第四种可能的实现方式中,所述第二层模型用于根据所述第一识别结果对应的方位角计算所述第一识别结果的第二权重;所述第二层模型用于根据所述第一识别结果和所述第一识别结果的第二权重,得到所述场景识别结果。According to a second possible implementation manner of the first aspect, in a fourth possible implementation manner, the second layer model is used to calculate the azimuth angle of the first identification result according to the azimuth angle corresponding to the first identification result second weight; the second layer model is configured to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
根据第一方面的第二种可能的实现方式,在第五种可能的实现方式中,所述第二层模型预设了每个所述第一识别结果对应的第三权重,所述第二层模型用于根据所述第一识别结果和所述第一识别结果对应的第三权重,得到所述场景识别结果。According to a second possible implementation manner of the first aspect, in a fifth possible implementation manner, the second layer model presets a third weight corresponding to each of the first recognition results, and the second The layer model is used to obtain the scene recognition result according to the first recognition result and the third weight corresponding to the first recognition result.
根据第一方面的第二种可能的实现方式,在第六种可能的实现方式中,所述第二层模型用于根据所述方位角和预设规则,确定每个所述第一识别结果对应的第四权重;其中,所述预设规则为根据方位角设置的权重组,不同的方位角对应了不同的权重组,每个权重组包括每个第一识别结果对应的第四权重;所述第二层模型用于根据所述第一识别结果和所述第一识别结果对应的第四权重,得到所述场景识别结果。According to a second possible implementation manner of the first aspect, in a sixth possible implementation manner, the second layer model is configured to determine each of the first recognition results according to the azimuth angle and a preset rule A corresponding fourth weight; wherein, the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first identification result; The second-layer model is configured to obtain the scene recognition result according to the first recognition result and the fourth weight corresponding to the first recognition result.
对于第二层模型通过预设的权重或者权重映射函数的形式的时限方式,在训练时,只需要对神经网络模型的其他部分进行训练即可,不需要对第二层模型进行训练,可以提高训练的效率。For the time-limited method in the form of preset weights or weight mapping functions for the second-layer model, during training, only other parts of the neural network model need to be trained, and the second-layer model does not need to be trained, which can improve the training efficiency.
根据第一方面,在第七种可能的实现方式中,所述方法还包括:终端设备获取重力传感器在每个摄像头采集所述图像时对应的三维直角坐标系的坐标轴上的加速度,得到每个摄像头采集所述图像时的方向向量;其中,每个摄像头采集所述图像时对应的三维直角坐标系以所述每个摄像头为原点,z方向为沿着摄像头拍摄的方向,x和y分别为与z方向垂直的方向, 且x和y所在的平面与z方向垂直;根据所述方向向量和所述重力单位向量计算所述方位角。According to the first aspect, in a seventh possible implementation manner, the method further includes: the terminal device acquires the acceleration on the coordinate axis of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera captures the image, and obtains each The direction vector when each camera collects the image; wherein, the corresponding three-dimensional rectangular coordinate system when each camera collects the image takes each camera as the origin, the z direction is the direction along the camera, and x and y are respectively is a direction perpendicular to the z direction, and the plane where x and y are located is perpendicular to the z direction; the azimuth angle is calculated according to the direction vector and the gravity unit vector.
根据第一方面的第一种可能的实现方式,在第八种可能的实现方式中,在利用场景识别模型基于所述图像和所述图像对应的方位角特征识别所述同一场景之前,所述方法还包括:终端设备对所述图像进行预处理;其中,所述预处理包括以下处理中的一种或多种的组合:转换图像格式、转换图像通道、统一图像尺寸、图像归一化,转换图像格式是指将彩色图像转换为黑白图像,转换图像通道是指将图像转换到红绿蓝RGB通道,统一图像尺寸是指调整多个图像的长度相同、宽度相同,图像归一化是指将图像的像素值归一化。According to a first possible implementation manner of the first aspect, in an eighth possible implementation manner, before using a scene recognition model to identify the same scene based on the image and the azimuth feature corresponding to the image, the The method further includes: the terminal device preprocesses the image; wherein, the preprocessing includes one or a combination of the following processes: converting an image format, converting an image channel, unifying the image size, and normalizing the image, Converting image format refers to converting color images to black and white images. Converting image channels refers to converting images to red, green, and blue RGB channels. Unifying image size refers to adjusting multiple images to have the same length and width. Image normalization refers to Normalize the pixel values of the image.
第二方面，本申请的实施例提供了一种场景识别装置，所述装置包括：图像采集模块，用于通过多个摄像头从多个方位角采集同一场景下的图像，其中，每个摄像头采集所述图像时的方位角为所述图像对应的方位角，所述方位角为所述每个摄像头采集所述图像时的方向向量与重力单位向量的夹角；识别模块，用于根据所述图像和所述图像对应的方位角识别所述同一场景，得到场景识别结果。In a second aspect, an embodiment of the present application provides a scene recognition apparatus, the apparatus including: an image acquisition module, configured to capture images of the same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle at which each camera captures the image is the azimuth angle corresponding to that image, and the azimuth angle is the angle between the direction vector of the camera when it captures the image and the gravity unit vector; and a recognition module, configured to recognize the same scene according to the images and the azimuth angles corresponding to the images, to obtain a scene recognition result.
本申请实施例的场景识别装置,利用多个摄像头拍摄同一场景的多个方位角的图像,结合多个图像和每个图像对应的方位角对场景进行识别,由于获得了更全面的场景信息,因此,可以提高基于图像的场景识别的准确度,解决了单摄像头识别场景视角范围、拍摄角度受限的问题,识别更精准。The scene recognition device of the embodiment of the present application uses multiple cameras to capture images of multiple azimuth angles of the same scene, and recognizes the scene in combination with the multiple images and the azimuth angle corresponding to each image. Since more comprehensive scene information is obtained, Therefore, the accuracy of image-based scene recognition can be improved, the problems of limited viewing angle range and shooting angle of a single camera to recognize a scene can be solved, and the recognition can be more accurate.
根据第二方面,在第一种可能的实现方式中,所述场景识别模块包括:方位角特征提取模块,用于从所述图像对应的方位角提取所述图像对应的方位角特征;场景识别模型,用于基于所述图像和所述图像对应的方位角特征识别所述同一场景,得到场景识别结果,其中,所述场景识别模型为神经网络模型。According to the second aspect, in a first possible implementation manner, the scene recognition module includes: an azimuth feature extraction module, configured to extract the azimuth feature corresponding to the image from the azimuth angle corresponding to the image; scene recognition a model for recognizing the same scene based on the image and the azimuth feature corresponding to the image to obtain a scene recognition result, wherein the scene recognition model is a neural network model.
根据第二方面的第一种可能的实现方式，在第二种可能的实现方式中，所述场景识别模型包括多对第一特征提取层和第一层模型，每对第一特征提取层和第一层模型用于对一个方位角的图像和所述一个方位角的图像对应的方位角进行处理，得到第一识别结果；其中，所述一个方位角的图像对应的方位角为所述第一识别结果对应的方位角，所述第一特征提取层用于提取所述一个方位角的图像的特征，得到特征向量，所述第一层模型用于根据所述特征向量和所述一个方位角的图像对应的方位角，得到所述第一识别结果；所述场景识别模型还包括第二层模型，所述第二层模型用于根据所述第一识别结果和第一识别结果对应的方位角，得到所述场景识别结果。According to the first possible implementation manner of the second aspect, in a second possible implementation manner, the scene recognition model includes multiple pairs of a first feature extraction layer and a first-layer model, each pair being used to process an image of one azimuth angle and the azimuth angle corresponding to that image to obtain a first recognition result; wherein the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result, the first feature extraction layer is used to extract features of the image of the one azimuth angle to obtain feature vectors, and the first-layer model is used to obtain the first recognition result according to the feature vectors and the azimuth angle corresponding to the image of the one azimuth angle; the scene recognition model further includes a second-layer model, which is used to obtain the scene recognition result according to the first recognition results and the azimuth angles corresponding to the first recognition results.
本申请实施例的场景识别装置,采用两层场景识别模型,采集多角度的图像并结合图像的方位角,利用竞争机制进行场景识别的方式,同时考虑了局部特征和整体特征的结果,可以提高用户无感的场景识别中识别的准确度,减少误判。The scene recognition device of the embodiment of the present application adopts a two-layer scene recognition model, collects multi-angle images and combines the azimuth angles of the images to perform scene recognition by using a competition mechanism, and considers the results of local features and overall features. The accuracy of recognition in scene recognition without user perception reduces misjudgment.
根据第二方面的第二种可能的实现方式,在第三种可能的实现方式中,所述第一特征提取层用于提取所述一个方位角的图像的特征,得到多个特征向量;According to a second possible implementation manner of the second aspect, in a third possible implementation manner, the first feature extraction layer is configured to extract the feature of the image at one azimuth angle to obtain a plurality of feature vectors;
所述第一层模型用于根据所述一个方位角的图像对应的方位角计算所述多个特征向量中的每个特征向量对应的第一权重;The first layer model is used to calculate the first weight corresponding to each feature vector in the plurality of feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle;
所述第一层模型用于根据所述每个特征向量和每个特征向量对应的第一权重,得到所述第一识别结果。The first layer model is configured to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
根据第二方面的第二种可能的实现方式,在第四种可能的实现方式中,所述第二层模型用于根据所述第一识别结果对应的方位角计算所述第一识别结果的第二权重;所述第二层模型用于根据所述第一识别结果和所述第一识别结果的第二权重,得到所述场景识别结果。According to a second possible implementation manner of the second aspect, in a fourth possible implementation manner, the second layer model is used to calculate the azimuth of the first identification result according to the azimuth angle corresponding to the first identification result second weight; the second layer model is configured to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
根据第二方面的第二种可能的实现方式,在第五种可能的实现方式中,所述第二层模型预设了每个所述第一识别结果对应的第三权重,所述第二层模型用于根据所述第一识别结果和所述第一识别结果对应的第三权重,得到所述场景识别结果。According to a second possible implementation manner of the second aspect, in a fifth possible implementation manner, the second layer model presets a third weight corresponding to each of the first recognition results, and the second The layer model is used to obtain the scene recognition result according to the first recognition result and the third weight corresponding to the first recognition result.
根据第二方面的第二种可能的实现方式，在第六种可能的实现方式中，所述第二层模型用于根据所述方位角和预设规则，确定每个所述第一识别结果对应的第四权重；其中，所述预设规则为根据方位角设置的权重组，不同的方位角对应了不同的权重组，每个权重组包括每个第一识别结果对应的第四权重；所述第二层模型用于根据所述第一识别结果和所述第一识别结果对应的第四权重，得到所述场景识别结果。According to the second possible implementation manner of the second aspect, in a sixth possible implementation manner, the second-layer model is configured to determine, according to the azimuth angle and a preset rule, a fourth weight corresponding to each of the first recognition results; wherein the preset rule is a set of weight groups configured according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes the fourth weight corresponding to each first recognition result; the second-layer model is configured to obtain the scene recognition result according to the first recognition results and the fourth weights corresponding to the first recognition results.
根据第二方面，在第七种可能的实现方式中，所述装置还包括：方位角采集模块，用于获取重力传感器在每个摄像头采集所述图像时对应的三维直角坐标系的坐标轴上的加速度，得到每个摄像头采集所述图像时的方向向量；其中，每个摄像头采集所述图像时对应的三维直角坐标系以所述每个摄像头为原点，z方向为沿着摄像头拍摄的方向，x和y分别为与z方向垂直的方向，且x和y所在的平面与z方向垂直；根据所述方向向量和所述重力单位向量计算所述方位角。According to the second aspect, in a seventh possible implementation manner, the apparatus further includes: an azimuth angle acquisition module, configured to acquire, from the gravity sensor, the acceleration along the coordinate axes of the three-dimensional rectangular coordinate system corresponding to each camera when that camera captures the image, to obtain the direction vector of each camera when it captures the image; wherein the three-dimensional rectangular coordinate system corresponding to each camera takes that camera as the origin, the z direction is the direction along which the camera shoots, x and y are directions perpendicular to the z direction, and the plane formed by x and y is perpendicular to the z direction; the azimuth angle is calculated according to the direction vector and the gravity unit vector.
根据第二方面的第一种可能的实现方式,在第八种可能的实现方式中,所述装置还包括:图像预处理模块,用于对所述图像进行预处理;其中,所述预处理包括以下处理中的一种或多种的组合:转换图像格式、转换图像通道、统一图像尺寸、图像归一化,转换图像格式是指将彩色图像转换为黑白图像,转换图像通道是指将图像转换到红绿蓝RGB通道,统一图像尺寸是指调整多个图像的长度相同、宽度相同,图像归一化是指将图像的像素值归一化。According to a first possible implementation manner of the second aspect, in an eighth possible implementation manner, the apparatus further includes: an image preprocessing module, configured to preprocess the image; wherein the preprocessing Includes one or a combination of the following processes: Convert Image Format, Convert Image Channels, Unify Image Size, Image Normalization, Convert Image Format refers to converting a color image to black and white, Convert Image Channel refers to Converting to red, green, and blue RGB channels, unified image size refers to adjusting the length and width of multiple images to be the same, and image normalization refers to normalizing the pixel values of the images.
第三方面,本申请的实施例提供了一种终端设备,该终端设备可以执行上述第一方面或者第一方面的多种可能的实现方式中的一种或几种的场景识别方法。In a third aspect, embodiments of the present application provide a terminal device, where the terminal device can execute the above first aspect or one or more of the scene recognition methods in multiple possible implementation manners of the first aspect.
第四方面，本申请的实施例提供了一种计算机程序产品，包括计算机可读代码，或者承载有计算机可读代码的非易失性计算机可读存储介质，当所述计算机可读代码在电子设备中运行时，所述电子设备中的处理器执行上述第一方面或者第一方面的多种可能的实现方式中的一种或几种的场景识别方法。In a fourth aspect, embodiments of the present application provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, where, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the scene recognition method of the first aspect or of one or more of the multiple possible implementation manners of the first aspect.
第五方面，本申请的实施例提供了一种非易失性计算机可读存储介质，其上存储有计算机程序指令，其特征在于，所述计算机程序指令被处理器执行时实现上述第一方面或者第一方面的多种可能的实现方式中的一种或几种的场景识别方法。In a fifth aspect, embodiments of the present application provide a non-volatile computer-readable storage medium on which computer program instructions are stored, wherein, when the computer program instructions are executed by a processor, the scene recognition method of the first aspect or of one or more of the multiple possible implementation manners of the first aspect is implemented.
本申请的这些和其他方面在以下(多个)实施例的描述中会更加简明易懂。These and other aspects of the present application will be more clearly understood in the following description of the embodiment(s).
附图说明Description of drawings
包含在说明书中并且构成说明书的一部分的附图与说明书一起示出了本申请的示例性实施例、特征和方面,并且用于解释本申请的原理。The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features and aspects of the application and together with the description, serve to explain the principles of the application.
图1a和图1b分别示出根据本申请一实施例的应用场景的示意图。1a and 1b respectively show schematic diagrams of application scenarios according to an embodiment of the present application.
图2a示出根据本申请一实施例的神经网络模型进行场景识别的示意图。Fig. 2a shows a schematic diagram of scene recognition performed by a neural network model according to an embodiment of the present application.
图2b示出根据本申请一实施例的场景识别方法的流程图。FIG. 2b shows a flowchart of a scene recognition method according to an embodiment of the present application.
图3示出根据本申请一实施例的应用场景的示意图。FIG. 3 shows a schematic diagram of an application scenario according to an embodiment of the present application.
图4a和图4b分别示出根据本申请一实施例的方位角确定方式的示意图。FIG. 4a and FIG. 4b respectively show schematic diagrams of an azimuth angle determination manner according to an embodiment of the present application.
图5示出根据本申请一实施例的神经网络模型的结构的框图。FIG. 5 is a block diagram showing the structure of a neural network model according to an embodiment of the present application.
图6示出根据本申请一实施例的第一层模型的结构示意图。FIG. 6 shows a schematic structural diagram of a first layer model according to an embodiment of the present application.
图7示出根据本申请一实施例的场景识别方法的流程图。FIG. 7 shows a flowchart of a scene recognition method according to an embodiment of the present application.
图8示出根据本申请一实施例的步骤S701的方法的流程图。FIG. 8 shows a flowchart of the method of step S701 according to an embodiment of the present application.
图9示出根据本申请一实施例的场景识别装置的框图。FIG. 9 shows a block diagram of a scene recognition apparatus according to an embodiment of the present application.
图10示出根据本申请一实施例的终端设备的结构示意图。FIG. 10 shows a schematic structural diagram of a terminal device according to an embodiment of the present application.
具体实施方式Detailed ways
以下将参考附图详细说明本申请的各种示例性实施例、特征和方面。附图中相同的附图标记表示功能相同或相似的元件。尽管在附图中示出了实施例的各种方面,但是除非特别指出,不必按比例绘制附图。Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. The same reference numbers in the figures denote elements that have the same or similar functions. While various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
在这里专用的词“示例性”意为“用作例子、实施例或说明性”。这里作为“示例性”所说明的任何实施例不必解释为优于或好于其它实施例。The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
另外,为了更好的说明本申请,在下文的具体实施方式中给出了众多的具体细节。本领域技术人员应当理解,没有某些具体细节,本申请同样可以实施。在一些实例中,对于本领域技术人员熟知的方法、手段、元件和电路未作详细描述,以便于凸显本申请的主旨。In addition, in order to better illustrate the present application, numerous specific details are given in the following detailed description. It should be understood by those skilled in the art that the present application may be practiced without certain specific details. In some instances, methods, means, components and circuits well known to those skilled in the art have not been described in detail so as not to obscure the subject matter of the present application.
名词解释Glossary
方位角:摄像头的方向向量与重力单位向量的夹角。Azimuth: the angle between the direction vector of the camera and the unit vector of gravity.
摄像头的方向向量:以摄像头为原点建立三维直角坐标系,z方向为沿着摄像头拍摄的方向,x和y分别为与z方向垂直的方向,且x和y所在的平面与z方向垂直,重力传感器在x、y和z三个方向的加速度组成的向量为摄像头的方向向量。The direction vector of the camera: establish a three-dimensional rectangular coordinate system with the camera as the origin, the z direction is the direction along the camera, x and y are the directions perpendicular to the z direction, and the plane where x and y are located is perpendicular to the z direction, gravity The vector composed of the acceleration of the sensor in the three directions of x, y and z is the direction vector of the camera.
重力单位向量:(0,0,1)。Gravity unit vector: (0,0,1).
因此，本申请要解决的技术问题是，如何提高基于图像的场景识别方法的识别准确度。现有的基于图像的场景识别方法存在以下问题：一、仅通过单摄像头采集图像进行识别，视角范围小，观察到的有效特征少，场景识别召回率低；二、由于是用户无感的场景识别，摄像头拍摄的角度任意，因此，相似的物体或特征容易产生误判，例如，当图像中出现天花板的特征，则识别为室内，实际上则可能是在地铁或飞机上；三、同一图像中存在着大量的物体特征信息，噪声信息可能淹没主要特征，造成主要特征的误判。Therefore, the technical problem to be solved by this application is how to improve the recognition accuracy of image-based scene recognition methods. Existing image-based scene recognition methods have the following problems: 1. Images are collected by a single camera only, so the viewing angle range is small, few effective features are observed, and the scene recognition recall rate is low; 2. Because the scene recognition is imperceptible to the user and the camera may shoot at an arbitrary angle, similar objects or features easily lead to misjudgment; for example, when a ceiling feature appears in the image, the scene is recognized as indoor, while the user may actually be in a subway or on an airplane; 3. A single image contains a large amount of object feature information, and noise may drown out the main features, causing the main features to be misjudged.
为了解决上述技术问题,本申请提供了一种场景识别方法,利用多个摄像头拍摄同一场景的多个方位角的图像,获取多个摄像头拍摄(采集)图像时的方位角,每个摄像头拍摄(采集)图像时的方位角为图像对应的方位角,结合多个图像和每个图像对应的方位角对场景进行识别,由于获得了更全面的场景信息,因此,可以提高基于图像的场景识别的准确度,解决了单摄像头识别场景视角范围、拍摄角度受限的问题,识别更精准。In order to solve the above-mentioned technical problems, the present application provides a scene recognition method, which uses multiple cameras to shoot images of multiple azimuth angles of the same scene, and obtains the azimuth angles when the multiple cameras shoot (collect) images, and each camera shoots ( The azimuth angle when collecting) images is the azimuth angle corresponding to the image, and the scene is recognized by combining multiple images and the azimuth angle corresponding to each image. Since more comprehensive scene information is obtained, the image-based scene recognition can be improved. The accuracy solves the problem of limited viewing angle range and shooting angle of single camera recognition scene, and the recognition is more accurate.
在一种可能的实现方式中,本申请实施例的多个摄像头可以设置在终端设备上,比如说多个摄像头可以是设置在手机上的前置摄像头和后置摄像头,多个摄像头可以是设置在车身的多个不同方位的摄像头,多个摄像头也可以是设置在无人机的多个不同方向的摄像头,等等。需要说明的是以上应用的场景仅仅是本申请的一些示例,本申请不限于此。In a possible implementation manner, the multiple cameras of the embodiments of the present application may be set on the terminal device. For example, the multiple cameras may be the front camera and the rear camera set on the mobile phone, and the multiple cameras may be set Multiple cameras at different orientations of the vehicle body, multiple cameras can also be multiple cameras set at different orientations of the drone, and so on. It should be noted that the above application scenarios are only some examples of the present application, and the present application is not limited thereto.
图1a和图1b分别示出根据本申请一实施例的应用场景的示意图。如图1a所示,手机上可以设置有前置摄像头和后置摄像头,通过手机的前置摄像头和后置摄像头可以采集两个角度的图像,还可以获取前置摄像头和后置摄像头的方位角,根据两个角度的图像结合对应的方位角可以更准确的识别场景。如图1b所示,自动驾驶汽车的车身上可以设置多个摄像头,多个摄像头可以分别设置在不同的位置,比如说如图1b所示,可以设置在车头、车尾、车身两侧、车顶等等,每个摄像头的方向还可以单独调节,自动驾驶汽车上还可以设置有控制器,控制器可以连接多个摄像头,需要说明的是,自动驾驶汽车上还可以设置有其他传感器,比如GPS、雷达、加速度计、陀螺仪等,所有的传感器和摄像头连接到控制器,控制器可以通过多个摄像头采集不同角度的图像,还可以获得每个摄像头的方位角,根据多个图像和每个图像对应的方位角可以更准确的识别场景。1a and 1b respectively show schematic diagrams of application scenarios according to an embodiment of the present application. As shown in Figure 1a, the mobile phone can be provided with a front camera and a rear camera. Through the front camera and the rear camera of the mobile phone, images from two angles can be collected, and the azimuth angle of the front camera and the rear camera can also be obtained. , the scene can be more accurately recognized according to the images of the two angles combined with the corresponding azimuth angles. As shown in Figure 1b, multiple cameras can be set on the body of an autonomous vehicle, and multiple cameras can be set at different positions. Top and so on, the direction of each camera can also be adjusted individually, and a controller can also be set on the self-driving car, and the controller can be connected to multiple cameras. It should be noted that other sensors can also be set on the self-driving car, such as GPS, radar, accelerometer, gyroscope, etc., all sensors and cameras are connected to the controller, the controller can collect images of different angles through multiple cameras, and can also obtain the azimuth of each camera, according to the multiple images and each The azimuth corresponding to each image can more accurately identify the scene.
需要说明的是,图1a和图1b仅仅是本申请提供的应用场景的示例,本申请不限于此。比如说,本申请还可以应用在无人机采集图像进行场景识别的场景中。It should be noted that FIG. 1 a and FIG. 1 b are only examples of application scenarios provided by the present application, and the present application is not limited thereto. For example, the present application can also be applied to a scene in which a drone collects images for scene recognition.
本申请实施例提供的场景识别方法可以应用于终端设备,举例来说,本申请的终端设备可以是智能手机、上网本、平板电脑、笔记本电脑、可穿戴电子设备(如智能手环、智能手表等)、TV、虚拟现实设备、音响、电子墨水,等等。图10示出根据本申请一实施例的终端设备的结构示意图。以终端设备是手机为例,图10示出了手机200的结构示意图,具体可以参见下文中的具体描述。The scene recognition method provided by the embodiments of the present application can be applied to terminal devices. For example, the terminal devices of the present application can be smart phones, netbooks, tablet computers, notebook computers, wearable electronic devices (such as smart bracelets, smart watches, etc. ), TVs, virtual reality devices, speakers, e-ink, and more. FIG. 10 shows a schematic structural diagram of a terminal device according to an embodiment of the present application. Taking the terminal device as a mobile phone as an example, FIG. 10 shows a schematic structural diagram of the mobile phone 200 , for details, please refer to the specific description below.
在本申请的一种可能的实现方式中,基于手机的前后置摄像头和加速度传感器进行场景识别。其中,前后置摄像头同时采集不同方位角的图像,手机的加速度传感器用于提取当前手机摄像头的朝向与重力方向的夹角,此夹角可以作为摄像头的方位角。In a possible implementation manner of the present application, scene recognition is performed based on the front and rear cameras and acceleration sensors of the mobile phone. Among them, the front and rear cameras simultaneously collect images of different azimuth angles, and the acceleration sensor of the mobile phone is used to extract the angle between the current direction of the mobile phone camera and the direction of gravity, and this angle can be used as the azimuth angle of the camera.
在一种可能的实现方式中，本申请实施例提供的场景识别方法可以采用神经网络模型(场景识别模型)实现，图2a示出根据本申请一实施例的神经网络模型进行场景识别的示意图。如图2a所示，本申请实施例采用的神经网络模型可以包括：多对第一特征提取层和第一层模型，每对第一特征提取层和第一层模型用于对一个方位角的图像和一个方位角的图像对应的方位角进行处理，得到第一识别结果，一个方位角的图像对应的方位角为第一识别结果对应的方位角。其中，第一特征提取层用于提取图像的特征得到特征向量(特征图)，如图2a所示的特征提取层1和特征提取层2，第一层模型用于根据所述特征向量和所述一个方位角的图像对应的方位角，得到所述第一识别结果。In a possible implementation manner, the scene recognition method provided by the embodiments of the present application may be implemented by a neural network model (the scene recognition model). Fig. 2a shows a schematic diagram of scene recognition performed by a neural network model according to an embodiment of the present application. As shown in Fig. 2a, the neural network model used in the embodiments of the present application may include multiple pairs of a first feature extraction layer and a first-layer model; each pair is used to process an image of one azimuth angle and the azimuth angle corresponding to that image to obtain a first recognition result, and the azimuth angle corresponding to the image is the azimuth angle corresponding to the first recognition result. The first feature extraction layer is used to extract features of the image to obtain feature vectors (feature maps), such as feature extraction layer 1 and feature extraction layer 2 shown in Fig. 2a; the first-layer model is used to obtain the first recognition result according to the feature vectors and the azimuth angle corresponding to the image of the one azimuth angle.
如图2a所示,神经网络模型还可以包括第二层模型,第二层模型用于根据第一识别结果和第一识别结果对应的方位角,得到场景识别结果。As shown in FIG. 2a, the neural network model may further include a second layer model, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the azimuth angle corresponding to the first recognition result.
图2a所示的示例包括两对第一特征提取层和第一层模型，图2a所示的应用场景可以应用于双摄(前后双摄)的场景下，图像1和图像2可以为通过不同角度的摄像头采集得到的图像，比如说，图像1为手机的前置摄像头采集得到的，图像2为手机的后置摄像头采集得到的。The example shown in Fig. 2a includes two pairs of a first feature extraction layer and a first-layer model. The application scenario shown in Fig. 2a can be a dual-camera (front and rear) scenario, where image 1 and image 2 are images captured by cameras at different angles; for example, image 1 is captured by the front camera of the mobile phone and image 2 is captured by the rear camera of the mobile phone.
在一种可能的实现方式中,神经网络模型中包括的第一特征提取层和第一层模型的对数可以根据具体应用场景中设置摄像头的角度的数量配置,比如说,第一特征提取层和第一层模型的对数可以大于或者等于角度数量。In a possible implementation manner, the logarithm of the first feature extraction layer and the first layer model included in the neural network model can be configured according to the number of camera angles set in a specific application scenario. For example, the first feature extraction layer The logarithm to the first layer model can be greater than or equal to the number of angles.
在一种可能的实现方式中,第一特征提取层可以采用卷积神经网络(CNN,Convolutional Neural Networks)实现,通过CNN提取输入的图像的特征,得到特征图,比如说,可以采用VGG模型(Visual Geometry Group Network)、Inception模型、MobileNet、ResNet、DenseNet、Transformer等卷积神经网络模型作为第一特征提取层提取特征图,还可以自定义卷积神经网络结构作为第一特征提取层,本申请对此不作限定。第一层模型和第二层模型都可以是基于注意力机制实现,第二层模型还可以通过对各方位角预设加权重或者根据方位角预设权重映射函数的方式实现,本申请对此不作限定。In a possible implementation, the first feature extraction layer can be implemented by using Convolutional Neural Networks (CNN, Convolutional Neural Networks). The features of the input image are extracted through CNN to obtain a feature map. For example, the VGG model ( Visual Geometry Group Network), Inception model, MobileNet, ResNet, DenseNet, Transformer and other convolutional neural network models are used as the first feature extraction layer to extract feature maps, and the convolutional neural network structure can also be customized as the first feature extraction layer. This is not limited. Both the first-layer model and the second-layer model can be implemented based on the attention mechanism, and the second-layer model can also be implemented by presetting weights for each azimuth angle or presetting a weight mapping function according to the azimuth angle. Not limited.
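As a rough illustration of this choice only, the first feature extraction layer could reuse an off-the-shelf convolutional backbone. The snippet below assumes a PyTorch/torchvision environment and uses MobileNetV2 purely as a stand-in for any of the backbones listed above; the names, input size and resulting shapes are illustrative, not part of the original disclosure.

import torch
import torchvision.models as models

# Illustrative backbone: MobileNetV2's convolutional part as the first feature extraction layer.
backbone = models.mobilenet_v2().features

with torch.no_grad():
    image = torch.randn(1, 3, 600, 800)        # one preprocessed image, (N, C, H, W)
    feature_map = backbone(image)              # (1, 1280, 19, 25) for this input size
    y = feature_map.flatten(2).squeeze(0).T    # n feature vectors y_i, one per spatial location
print(y.shape)                                 # torch.Size([475, 1280])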
本申请实施例提供的场景识别方法，使用两层基于竞争机制（注意力机制）的场景识别模型：第一层模型可根据方位角赋予卷积结果（CNN对输入图像进行卷积运算得到的特征向量）不同的权重，激活识别不同场景局部特征的神经元，提取图像在该方位角下的关键特征，进行第一次场景分类，得到第一识别结果；第二层模型通过计算不同场景下不同方位角进行场景分类结果的权重，加权求和得到多张不同视角图像的分类结果，得到最终的场景识别结果。使用两层基于竞争机制的场景识别模型并结合方位角信息能够识别关键特征并过滤无关信息，有效降低误识别概率，例如，从朝上的视角无法分辨飞机和高铁（天花板特征相似、难以区分），而从侧面的视角可以分辨（飞机圆形窗户与高铁方形窗户易区分），本申请实施例的场景识别方法有助于减少误判。The scene recognition method provided by the embodiments of the present application uses a two-layer scene recognition model based on a competition mechanism (attention mechanism): the first-layer model can assign different weights to the convolution results (the feature vectors obtained by the CNN performing convolution operations on the input image) according to the azimuth angle, activate the neurons that recognize local features of different scenes, extract the key features of the image at that azimuth angle, and perform a first scene classification to obtain the first recognition result; the second-layer model computes the weights of the scene classification results obtained at different azimuth angles in different scenes and combines the classification results of the multiple images of different viewing angles by weighted summation to obtain the final scene recognition result. Using a two-layer competition-based scene recognition model together with azimuth information can identify key features and filter out irrelevant information, effectively reducing the probability of misrecognition. For example, an airplane and a high-speed train cannot be distinguished from an upward viewing angle (their ceiling features are similar and hard to tell apart), but they can be distinguished from a side viewing angle (the round windows of an airplane are easy to distinguish from the square windows of a high-speed train); the scene recognition method of the embodiments of the present application therefore helps to reduce misjudgments.
因此,采用两层场景识别模型,采集多角度的图像并结合图像的方位角,利用竞争机制进行场景识别的方式,同时考虑了局部特征和整体特征的结果,可以提高用户无感的场景识别中识别的准确度,减少误判。Therefore, using a two-layer scene recognition model, collecting images from multiple angles and combining the azimuth of the images, using a competitive mechanism for scene recognition, taking into account the results of local features and overall features at the same time, can improve the user's senseless scene recognition. Recognition accuracy and reduce misjudgment.
图2a所示的方位角特征提取(虚线框)可以通过神经网络实现，也就是说，本申请实施例提供的神经网络模型还可以包括第二特征提取层，第二特征提取层也可以通过卷积神经网络模型实现。图2a所示的方位角特征提取(虚线框)也可以根据已有的函数对方位角进行计算得到，本申请对此不作限定。The azimuth feature extraction (dashed box) shown in Fig. 2a can be implemented by a neural network; that is to say, the neural network model provided by the embodiments of the present application may further include a second feature extraction layer, and the second feature extraction layer can also be implemented by a convolutional neural network model. The azimuth feature extraction (dashed box) shown in Fig. 2a can also be obtained by applying an existing function to the azimuth angle, which is not limited in this application.
图2a所示的传感器可以为加速度计、陀螺仪等,通过传感器采集的终端设备的运动数据,可以得到终端设备的姿态,根据终端设备的姿态以及摄像头的方位可以确定摄像头的方向,根据摄像头的方向以及重力方向可以确定摄像头的方位角。The sensor shown in Figure 2a can be an accelerometer, a gyroscope, etc. The posture of the terminal device can be obtained through the motion data of the terminal device collected by the sensor, and the direction of the camera can be determined according to the posture of the terminal device and the orientation of the camera. The orientation, as well as the direction of gravity, can determine the azimuth of the camera.
图2b示出根据本申请一实施例的场景识别方法的流程图,下面结合图2a和图2b对本申请的图像处理方法的流程进行详细说明。Fig. 2b shows a flowchart of a scene recognition method according to an embodiment of the present application. The following describes the flowchart of the image processing method of the present application in detail with reference to Figs. 2a and 2b.
1、图像采集1. Image acquisition
终端设备上设置的多个不同方位的摄像头采集同一场景下的图像,终端设备获得同一场景的不同视角(方位角)的图像。Multiple cameras with different orientations set on the terminal device collect images in the same scene, and the terminal device obtains images from different viewing angles (azimuth angles) of the same scene.
比如说,同时采用手机的前置摄像头和后置摄像头在同一场景下拍摄图像,获得同一场 景的不同视角的图像。其中,所述前置摄像头拍摄的图像可以是单摄像头拍摄的图像,也可以是多个摄像头拍摄的图像合成后的一张图像;后置摄像头拍摄的图像也可以是单摄像头拍摄的图像,也可以是多个摄像头拍摄的图像合成后的一张图像。其中,多个摄像头拍摄的图像合成为一张图像的场景下,这张图像的方位角与单摄像头的方位角相同。For example, the front camera and the rear camera of the mobile phone are used to capture images in the same scene at the same time, and images from different perspectives of the same scene are obtained. The image captured by the front camera may be an image captured by a single camera, or may be an image obtained by combining images captured by multiple cameras; the image captured by the rear camera may also be an image captured by a single camera, or an image captured by a single camera. It can be a composite image of images captured by multiple cameras. Wherein, in a scenario where images captured by multiple cameras are combined into one image, the azimuth angle of this image is the same as the azimuth angle of a single camera.
本申请实施例中摄像头拍摄的图像可以是黑白图像,也可以是RGB(Red,Green,Blue)彩色图像,也可以是RGB-D(RGB-Depth)深度图像(D是指深度信息),也可以是红外图像,本申请不作限定。The images captured by the cameras in the embodiments of the present application may be black and white images, RGB (Red, Green, Blue) color images, or RGB-D (RGB-Depth) depth images (D refers to depth information), or It can be an infrared image, which is not limited in this application.
图3示出根据本申请一实施例的应用场景的示意图。如图3所示,以乘坐地铁的场景为例,用户在地铁上看手机,手机以与地铁地板表面成一定角度倾斜,因此,手机的前置摄像头可以采集到的地铁顶部的画面,手机的后置摄像头可以采集到的地铁地板的画面。FIG. 3 shows a schematic diagram of an application scenario according to an embodiment of the present application. As shown in Figure 3, taking the scene of taking the subway as an example, the user looks at the mobile phone on the subway, and the mobile phone is inclined at a certain angle to the surface of the subway floor. Therefore, the front camera of the mobile phone can capture the picture of the top of the subway, and the The rear camera can capture the picture of the subway floor.
2、方位角采集2. Azimuth acquisition
终端设备可以在采集图像的同时,采集多个视角的摄像头的方位角。摄像头的方位角可以是指摄像头的方向向量与重力单位向量的夹角。The terminal device can collect the azimuth angles of cameras from multiple viewing angles while collecting images. The azimuth angle of the camera may refer to the angle between the direction vector of the camera and the unit vector of gravity.
在本申请的实施例中,摄像头的方向向量可以通过传感器获取,比如,重力传感器、加速度计、陀螺仪等,本申请对此不作限定。重力单位向量为g gravity=(0,0,1)。因此,本申请的实施例通过采集摄像头的方向向量,根据摄像头的方向向量和g gravity=(0,0,1)即可计算得到摄像头的方位角。 In the embodiment of the present application, the direction vector of the camera may be acquired by a sensor, such as a gravity sensor, an accelerometer, a gyroscope, etc., which is not limited in this application. The gravity unit vector is g gravity = (0,0,1). Therefore, in the embodiment of the present application, the azimuth angle of the camera can be obtained by collecting the direction vector of the camera and calculating the azimuth angle of the camera according to the direction vector of the camera and g gravity =(0,0,1).
以手机为例,其中,前置摄像头的方向向量可以通过重力传感器获得。图4a和图4b分别示出根据本申请一实施例的方位角确定方式的示意图。如图4a所示,可以以前置摄像头为原点建立三维直角坐标系,其中,z方向为沿着前置摄像头拍摄的方向、且垂直于手机平面的方向,x和y分别为与手机的边框平行、且与z方向垂直的方向。根据重力传感器在x、y和z三个方向的加速度可以得到前置摄像头的方向向量g camera=(Acc_x,Acc_y,Acc_z),假设前置摄像头的方位角为θ,那么后置摄像头拍摄的方位角可以为π-θ。因此,根据g camera和重力单位向量g gravity=(0,0,1)可以计算得到前置摄像头的方位角θ,即可得到手机的前置摄像头的方位角和后置摄像头的方位角。 Taking a mobile phone as an example, the direction vector of the front camera can be obtained by a gravity sensor. FIG. 4a and FIG. 4b respectively show schematic diagrams of an azimuth angle determination manner according to an embodiment of the present application. As shown in Figure 4a, a three-dimensional rectangular coordinate system can be established with the front camera as the origin, where the z direction is the direction along the front camera and is perpendicular to the plane of the mobile phone, and x and y are respectively parallel to the border of the mobile phone , and the direction perpendicular to the z direction. According to the acceleration of the gravity sensor in the three directions of x, y and z, the direction vector g camera = (Acc_x, Acc_y, Acc_z) of the front camera can be obtained. Assuming that the azimuth of the front camera is θ, then the azimuth of the rear camera The angle can be π-θ. Therefore, according to the g camera and the gravity unit vector g gravity = (0,0,1), the azimuth angle θ of the front camera can be calculated, and the azimuth angle of the front camera and the rear camera of the mobile phone can be obtained.
具体地,如图4b所示,手机前置摄像头的方向向量g camera与重力单位向量g gravity的夹角θ满足公式(1): Specifically, as shown in Figure 4b, the angle θ between the direction vector g camera of the front camera of the mobile phone and the gravity unit vector g gravity satisfies formula (1):
cosθ = (g_camera · g_gravity) / (|g_camera| · |g_gravity|) = Acc_z / √(Acc_x² + Acc_y² + Acc_z²)    (1)
因此,前置摄像头的方位角可以通过公式(2)计算得到:Therefore, the azimuth angle of the front camera can be calculated by formula (2):
θ = arccos( Acc_z / √(Acc_x² + Acc_y² + Acc_z²) )    (2)
因此，手机的后置摄像头的方位角为π-θ。Therefore, the azimuth angle of the rear camera of the mobile phone is π-θ.
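A minimal Python sketch of this azimuth calculation, assuming raw accelerometer readings Acc_x, Acc_y, Acc_z along the camera coordinate axes (all function and variable names are illustrative):

import math

def camera_azimuth(acc_x, acc_y, acc_z):
    """Angle between the camera direction vector (acc_x, acc_y, acc_z) and the
    gravity unit vector (0, 0, 1), i.e. formulas (1) and (2)."""
    norm = math.sqrt(acc_x ** 2 + acc_y ** 2 + acc_z ** 2)
    cos_theta = acc_z / norm                    # dot product with (0, 0, 1) divided by |g_camera|
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding errors
    return math.acos(cos_theta)                 # theta in [0, pi]

theta_front = camera_azimuth(0.2, 0.1, 9.7)    # example accelerometer reading
theta_rear = math.pi - theta_front             # the rear camera points the opposite way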
3、图像预处理3. Image preprocessing
终端设备可以对每个摄像头采集的图像分别进行预处理,其中,预处理可以包括图像格式转换、图像尺寸统一、图像归一化、图像通道转换中的一种或多种处理方式。The terminal device may separately preprocess the images collected by each camera, wherein the preprocessing may include one or more processing methods among image format conversion, image size unification, image normalization, and image channel conversion.
其中,图像格式转换可以是指将彩色图像转换为黑白图像。The image format conversion may refer to converting a color image into a black and white image.
图像通道转换可以是指将彩色图像转换到RGB三通道,三通道依次为Red、Green、Blue。Image channel conversion may refer to converting a color image to three RGB channels, and the three channels are Red, Green, and Blue in turn.
图像尺寸统一可以是指将每个摄像头采集的图像的长、宽尺寸统一,比如说,统一后图像的长为800像素、宽为600像素。Unifying the image size may refer to unifying the length and width of the images collected by each camera, for example, the length of the unified image is 800 pixels and the width is 600 pixels.
图像归一化的目的是保证提取的特征图(特征向量)的均值都在0附近，因此，对于黑白图像，图像归一化可以是指将黑白图像的像素的值减去均值127.5，然后除以128，也就是说，p_Normalized=(p-127.5)/128，其中，p可以表示黑白图像的像素值，p_Normalized可以表示归一化后的像素值；对于彩色图像，例如以RGB格式表示的图像，可以将像素的值减去均值[103.939,116.779,123.68]，即：The purpose of image normalization is to ensure that the mean values of the extracted feature maps (feature vectors) are all around 0. Therefore, for a black-and-white image, image normalization may refer to subtracting the mean value 127.5 from the pixel values and then dividing by 128, that is, p_Normalized=(p-127.5)/128, where p represents a pixel value of the black-and-white image and p_Normalized represents the normalized pixel value; for a color image, for example an image in RGB format, the mean values [103.939, 116.779, 123.68] can be subtracted from the pixel values, that is:
p_R_Normalized=p_R-103.939;p_R_Normalized=p_R-103.939;
p_G_Normalized=p_G-116.779;p_G_Normalized=p_G-116.779;
p_B_Normalized=p_B-123.68;p_B_Normalized=p_B-123.68;
其中,p_R、p_G和p_B分别表示归一化之前的像素值,p_R_Normalized、p_G_Normalized、p_B_Normalized分别表示归一化之后的像素值。Among them, p_R, p_G, and p_B represent the pixel values before normalization, respectively, and p_R_Normalized, p_G_Normalized, and p_B_Normalized represent the pixel values after normalization, respectively.
对于采集到的图像,可以通过以上预处理方式中的一种或多种处理方式进行处理。比如说,在一种可能的实现方式中,预处理的过程可以包括:The collected images may be processed through one or more of the above preprocessing methods. For example, in one possible implementation, the preprocessing process may include:
Step1:图像格式转换,将采集的彩色图像转换为黑白图像,将彩色图像转为黑白图像可以采用相关的灰度公式法或者平均值法根据转换前像素的值计算转换后像素的值;Step1: Image format conversion, convert the collected color image into a black and white image, and convert the color image into a black and white image. The relevant grayscale formula method or the average method can be used to calculate the value of the pixel after conversion according to the value of the pixel before conversion;
Step2:图像尺寸统一,统一图像尺寸的大小,例如,将图像的尺寸统一到:长800、宽600;Step2: Unify the image size, unify the size of the image size, for example, unify the size of the image to: length 800, width 600;
Step3:图像归一化,对于黑白图像p_Normalized=(p-127.5)/128。Step3: Image normalization, for black and white images p_Normalized=(p-127.5)/128.
在另一种可能的实现方式中,预处理的过程可以包括:In another possible implementation, the preprocessing process may include:
Step1:图像通道转换,对于彩色图像可以统一转换为RGB三通道,即通道依次为Red、Green、Blue;如果预处理之前的图像都为RGB格式,则可以省略图像通道转换的过程。Step1: Image channel conversion. For color images, it can be uniformly converted into RGB three channels, that is, the channels are Red, Green, and Blue in turn; if the images before preprocessing are all in RGB format, the process of image channel conversion can be omitted.
Step2:图像尺寸统一,统一图像尺寸的大小,例如,将图像的尺寸统一到:长800、宽600;Step2: Unify the image size, unify the size of the image size, for example, unify the size of the image to: length 800, width 600;
Step3:图像归一化,对于RGB图像,归一化处理方式为:Step3: Image normalization. For RGB images, the normalization processing method is:
R通道:p_R_Normalized=(p_R-103.939)/1.0;R channel: p_R_Normalized=(p_R-103.939)/1.0;
G通道:p_G_Normalized=(p_G-116.779)/1.0;G channel: p_G_Normalized=(p_G-116.779)/1.0;
B通道:p_B_Normalized=(p_B-123.68)/1.0。B channel: p_B_Normalized=(p_B-123.68)/1.0.
在本申请的实施例中,终端设备对每个角度的摄像头采集到的图像的预处理方式可以相同,也可以不同,本申请对此不作限定。通过对采集到的图像进行预处理可以统一图像的格式,有利于后续特征提取和场景识别的过程。In the embodiment of the present application, the preprocessing manner of the terminal device for the images collected by the cameras at each angle may be the same or different, which is not limited in the present application. By preprocessing the collected images, the format of the images can be unified, which is beneficial to the subsequent process of feature extraction and scene recognition.
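A possible preprocessing pipeline along these lines, sketched with OpenCV and NumPy and assuming a BGR input image; the mean values follow the examples above, and the function name is illustrative only:

import cv2
import numpy as np

def preprocess(img_bgr, size=(800, 600), grayscale=False):
    """Unify size, convert channels/format and normalize one captured image."""
    img = cv2.resize(img_bgr, size)                                  # unify image size (width 800, height 600)
    if grayscale:                                                    # convert format: color -> black and white
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
        return (gray - 127.5) / 128.0                                # normalize black-and-white pixels
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32)    # convert channels to R, G, B order
    return rgb - np.array([103.939, 116.779, 123.68], np.float32)    # per-channel mean subtraction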
4、方位角特征提取4. Azimuth feature extraction
对第2步中采集到的方位角进行特征提取得到方位角特征，提取方位角特征可以对方位角进行以下处理方式中的一种或多种的组合：数值归一化、离散化、一位有效编码、三角函数变换，等等。本申请提供了多种不同的特征提取方式，以下是几个示例。Feature extraction is performed on the azimuth angles collected in step 2 to obtain azimuth features. To extract the azimuth features, one or a combination of the following processing methods may be applied to the azimuth angle: numerical normalization, discretization, one-hot encoding, trigonometric function transformation, and so on. This application provides a variety of feature extraction methods; several examples are given below.
示例1，终端设备可以对采集到的方位角离散化，再进行one-hot编码(一位有效编码)。离散化是在不改变数据的相对大小的情况下，将个体映射到有限的空间中，比如说，对于本申请实施例的方位角，可以将方位角离散划分为[0°,45°),[45°,90°),[90°,135°),[135°,180°]四个区间，方位角映射到的区间对应的编码为1，其他区间编码为0，四个区间对应的特征向量分别为[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]。比如说，方位角θ为30°，映射到区间[0°,45°)，那么，对应的方位角特征为[1,0,0,0]。Example 1: The terminal device may discretize the collected azimuth angle and then perform one-hot encoding. Discretization maps individual values into a finite space without changing the relative size of the data. For example, for the azimuth angle in the embodiments of the present application, the azimuth angle can be divided into the four intervals [0°,45°), [45°,90°), [90°,135°), [135°,180°]; the interval to which the azimuth angle maps is encoded as 1 and the other intervals are encoded as 0, so the feature vectors corresponding to the four intervals are [1,0,0,0], [0,1,0,0], [0,0,1,0] and [0,0,0,1], respectively. For example, if the azimuth angle θ is 30°, it maps to the interval [0°,45°), and the corresponding azimuth feature is [1,0,0,0].
示例2,终端设备可以直接对方位角进行三角函数变换,变换后得到的值归一化到[0,1]区间作为方位角特征。其中,三角函数变化可以是指sinθ,cosθ,tanθ等。Example 2, the terminal device may directly perform trigonometric function transformation on the azimuth angle, and the value obtained after the transformation is normalized to the [0,1] interval as the azimuth angle feature. Among them, the trigonometric function change can refer to sinθ, cosθ, tanθ, etc.
示例3,将归一化后的三角函数的值[0,1]区间离散化为[0,0.25),[0.25,0.5),[0.5,0.75),[0.75,1.0]四个区间,终端设备可以先对方位角进行三角函数变换并归一化,根据归一化后的三角函数的值映射到的区间确定方位角特征。比如说,方位角θ为30°,sinθ=1/2,映射到区间[0.5,0.75),因此,方位角θ的方位角特征为[0,0,1,0]。Example 3, discretize the value [0, 1] interval of the normalized trigonometric function into four intervals of [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1.0], terminal The device can first perform trigonometric function transformation on the azimuth and normalize it, and determine the azimuth feature according to the interval to which the value of the normalized trigonometric function is mapped. For example, the azimuth angle θ is 30°, sinθ=1/2, which maps to the interval [0.5, 0.75), so the azimuth angle feature of the azimuth angle θ is [0, 0, 1, 0].
通过为神经网络模型中的卷积核赋予不同的权重(方位角特征),提取不同方位角下与场景最相关的特征,可以获得更准确的预测结果。By assigning different weights (azimuth features) to the convolution kernels in the neural network model, and extracting the features most relevant to the scene at different azimuths, more accurate prediction results can be obtained.
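A small sketch of Example 1 and Example 3 in Python, assuming the azimuth angle is given in degrees in [0°, 180°] and the bin boundaries follow the examples above:

import math
import numpy as np

def azimuth_onehot(theta_deg):
    """Example 1: map the azimuth into [0,45), [45,90), [90,135), [135,180] and one-hot encode."""
    index = min(int(theta_deg // 45), 3)
    feature = np.zeros(4, dtype=np.float32)
    feature[index] = 1.0
    return feature

def azimuth_trig_onehot(theta_deg):
    """Example 3: apply sin(theta), then discretize [0,1] into four equal bins."""
    value = round(math.sin(math.radians(theta_deg)), 6)   # in [0,1] for theta in [0 deg, 180 deg]
    index = min(int(value // 0.25), 3)
    feature = np.zeros(4, dtype=np.float32)
    feature[index] = 1.0
    return feature

print(azimuth_onehot(30))        # [1. 0. 0. 0.]
print(azimuth_trig_onehot(30))   # sin(30 deg) = 0.5 -> third bin -> [0. 0. 1. 0.]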
5、场景识别5. Scene recognition
本申请的实施例中，通过神经网络模型进行场景识别，结合图2a所示的神经网络模型的框架对场景识别的过程进行说明。In the embodiments of the present application, scene recognition is performed by a neural network model; the process of scene recognition is described below with reference to the framework of the neural network model shown in Fig. 2a.
终端设备将第3步预处理后的图像分别作为多个第一特征提取层的输入数据,将第4步得到的方位角特征分别作为多个第一层模型的输入数据,并且每对第一特征提取层和第一层模型接收的图像和方位角是关联的,也就是说,同一方位角的摄像头拍摄的图像和该摄像头对应的方位角特征分别作为一对第一特征提取层和第一层模型的输入数据。The terminal device uses the preprocessed images in step 3 as the input data of multiple first feature extraction layers, respectively uses the azimuth angle features obtained in step 4 as the input data for multiple first-layer models, and each pair of first The image and azimuth angle received by the feature extraction layer and the first layer model are associated, that is, the image captured by the camera with the same azimuth angle and the azimuth angle feature corresponding to the camera are used as a pair of the first feature extraction layer and the first Input data for the layer model.
举例来说,手机的前置摄像头拍摄了图像,并对图像进行预处理后得到图像1,前置摄像头的方位角为θ1,θ1的方位角特征为C1。神经网络模型包括特征提取层1和第一层模型1,特征提取层1的输出为第一层模型1的输入,神经网络模型还可以包括特征提取层2和第一层模型2,特征提取层2的输出为第一层模型2的输入。终端设备可以将图像1作为特征提取层1的输入、将C1作为第一层模型1的输入,或者,终端设备也可以将图像1作为特征提取层2的输入、将C1作为第一层模型2的输入。也就是说,本申请的图像1、图像2、特征提取层1、特征提取层2、第一层模型1以及第一层模型2的序号不作为顺序和对应关系的限定,仅仅是为了区分不同的模块而设置的编号,不解释为对本申请的限定。For example, the front camera of the mobile phone captures an image and preprocesses the image to obtain image 1. The azimuth angle of the front camera is θ1, and the azimuth angle feature of θ1 is C1. The neural network model includes feature extraction layer 1 and the first layer model 1. The output of the feature extraction layer 1 is the input of the first layer model 1. The neural network model can also include the feature extraction layer 2 and the first layer model 2. The feature extraction layer The output of 2 is the input of the first layer model 2. The terminal device can use image 1 as the input of the feature extraction layer 1 and C1 as the input of the first layer model 1, or the terminal device can also use the image 1 as the input of the feature extraction layer 2 and C1 as the first layer model 2 input of. That is to say, the serial numbers of image 1, image 2, feature extraction layer 1, feature extraction layer 2, first-layer model 1, and first-layer model 2 in this application are not intended to limit the order and correspondence, but are only to distinguish different The numbers set for the modules are not to be construed as limitations on this application.
这样,第一特征提取层对图像的特征进行提取得到特征图(特征向量),将特征图(特征向量)输入到第一层模型,第一层模型根据特征图和图像对应的方位角特征,对图像的场景进行识别(分类),可以得到第一识别结果。In this way, the first feature extraction layer extracts the features of the image to obtain a feature map (feature vector), and inputs the feature map (feature vector) into the first layer model. The first recognition result can be obtained by recognizing (classifying) the scene of the image.
终端设备还可以将方位角特征作为第二层模型的输入数据,第二层模型根据第一层模型输出的第一识别结果和对应的方位角特征,可以进一步对场景进行识别(分类),得到场景识别结果。The terminal device can also use the azimuth angle feature as the input data of the second-layer model, and the second-layer model can further identify (classify) the scene according to the first recognition result output by the first-layer model and the corresponding azimuth angle feature, and obtain Scene recognition results.
图5示出根据本申请一实施例的神经网络模型的结构的框图。图6示出根据本申请一实施例的第一层模型的结构示意图。FIG. 5 is a block diagram showing the structure of a neural network model according to an embodiment of the present application. FIG. 6 shows a schematic structural diagram of a first layer model according to an embodiment of the present application.
假设本申请的实施例中,CNN提取的图像的特征向量的个数为n个,本申请的终端设备对J个角度(方位角)进行了图像采集,也就是采用J个角度的摄像头采集了J个角度的图 像。Assuming that in the embodiment of the present application, the number of feature vectors of the image extracted by the CNN is n, the terminal device of the present application performs image acquisition on J angles (azimuth angles), that is, the camera with J angles is used to collect images. Image from J angles.
如图5所示,假设在图5的示例中,J为2,两个角度的摄像头分别采集了图像1和图像2,图像1对应的方位角为θ1,图像2对应的方位角为θ2,方位角θ1的方位角特征为C1,方位角θ2的方位角特征为C2。图像1作为上侧的CNN的输入数据,上侧的CNN对输入的图像1进行特征提取可以得到特征向量y i,i为1到n的正整数。图像2作为下侧的CNN的输入数据,下侧的CNN对输入的图像2进行特征提取可以得到特征向量x iAs shown in Figure 5, assuming that in the example of Figure 5, J is 2, the cameras at two angles have collected image 1 and image 2 respectively, the azimuth angle corresponding to image 1 is θ1, and the azimuth angle corresponding to image 2 is θ2, The azimuth angle characteristic of the azimuth angle θ1 is C1, and the azimuth angle characteristic of the azimuth angle θ2 is C2. Image 1 is used as the input data of the CNN on the upper side, and the CNN on the upper side performs feature extraction on the input image 1 to obtain a feature vector y i , where i is a positive integer from 1 to n. The image 2 is used as the input data of the CNN on the lower side, and the CNN on the lower side performs feature extraction on the input image 2 to obtain a feature vector xi .
终端设备将特征向量y i和方位角特征C1作为上侧的第一层模型的输入数据。如图6所示,第一层模型根据特征向量y i和方位角特征C1,可以计算得到第一识别结果Z j,其中j为1到J的正整数,J表示采集图像的角度(方位角)的数量,在本示例中,J等于2,在图6的示例中,j等于1。第一层模型可以基于注意力机制实现,具体的,可以包括图6所示的激活函数(tanh)、softmax函数以及加权平均过程。其中,tanh是其中一种激活函数,本申请不限于此,还可以采用其他激活函数,如:sigmoid,ReLU,LReLU,ELU函数,等等。 The terminal device takes the feature vector yi and the azimuth angle feature C1 as the input data of the first layer model on the upper side. As shown in Figure 6, the first layer model can calculate the first recognition result Z j according to the feature vector y i and the azimuth angle feature C1, where j is a positive integer from 1 to J, and J represents the angle of the collected image (azimuth angle ), J is equal to 2 in this example, and j is equal to 1 in the example of FIG. 6 . The first layer model can be implemented based on the attention mechanism, and specifically, it can include the activation function (tanh), the softmax function, and the weighted average process shown in FIG. 6 . Among them, tanh is one of the activation functions, and the present application is not limited to this, and other activation functions can also be used, such as: sigmoid, ReLU, LReLU, ELU function, and so on.
具体的计算方式如下公式(3)所示:The specific calculation method is shown in the following formula (3):
m_i = tanh(W_i · [C_j, y_i] + b_i)
s_i = exp(u · m_i) / Σ_k exp(u · m_k)    (3)
其中，C_j表示这个角度的图像对应的方位角特征，y_i为CNN输出的特征向量，W_i和b_i分别为激活函数的权重和偏置值，[C_j, y_i]表示将特征向量和方位角特征拼接得到的向量，u表示softmax函数的参数。根据tanh函数计算得到的m_i可以确定是否激活对应的神经元，提取tanh函数对应的特征作为分类的依据，softmax函数对m_i进行归一化得到每个特征向量的权重。因此，方位角特征可以影响计算得到的特征向量y_i的权重s_i：如果计算得到的s_i比较大，那么该特征向量对分类的结果影响较大；如果计算得到的s_i比较小，那么该特征向量对分类的结果影响较小。因此，本申请提供的场景识别的模型可以根据方位角识别关键特征、过滤无关信息，减少噪声，提高识别准确度，实现用户无感的场景识别。Here, C_j denotes the azimuth feature corresponding to the image at this angle, y_i is a feature vector output by the CNN, W_i and b_i are the weight and bias of the activation function, respectively, [C_j, y_i] denotes the vector obtained by concatenating the feature vector and the azimuth feature, and u denotes the parameter of the softmax function. The value m_i calculated by the tanh function determines whether the corresponding neuron is activated, the features corresponding to the tanh function are extracted as the basis for classification, and the softmax function normalizes m_i to obtain the weight of each feature vector. Therefore, the azimuth feature can influence the weight s_i of the feature vector y_i: if the calculated s_i is relatively large, the feature vector has a greater influence on the classification result; if s_i is relatively small, its influence is smaller. Hence, the scene recognition model provided by the present application can identify key features according to the azimuth angle, filter out irrelevant information, reduce noise, improve recognition accuracy, and realize scene recognition imperceptible to the user.
根据公式(3)计算得到的权重s i可以表示赋予从图像提取的不同特征的权重值,将特征向量和对应的权重进行加权求和,可以得到第一层模型的第一识别结果Z 1The weight si calculated according to formula (3) can represent the weight value assigned to different features extracted from the image, and the feature vector and the corresponding weight are weighted and summed to obtain the first recognition result Z 1 of the first layer model.
终端设备将特征向量x i(i为1到n’的正整数,n’和n可以相等,也可以不相等,本申请对此不作限定)和方位角特征C 2作为下侧的第一层模型的输入数据,根据与上侧的第一层模型相同的方式,可以计算得到第一识别结果Z 2The terminal device uses the feature vector x i (i is a positive integer from 1 to n', and n' and n may be equal or not equal, which is not limited in this application) and the azimuth angle feature C 2 as the first layer of the lower side. The input data of the model can be calculated to obtain the first recognition result Z 2 in the same manner as the first layer model on the upper side.
本申请实施例的神经网络模型通过CNN提取特征图,结合方位角对图像的不同卷积结果赋予不同权重,第一层模型通过竞争性机制识别关键特征并过滤无关信息,能够有效降低误识别概率。The neural network model of the embodiment of the present application extracts feature maps through CNN, and assigns different weights to different convolution results of images in combination with azimuth angles. The first-layer model identifies key features and filters irrelevant information through a competitive mechanism, which can effectively reduce the probability of misrecognition. .
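A simplified NumPy sketch of the first-layer attention of formula (3); it shares a single W, b and u across all feature vectors, which is one possible reading of the description above, and all shapes and names are illustrative:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def first_layer_attention(Y, c, W, b, u):
    """Formula (3): Y is the (n, d) matrix of feature vectors y_i from one image,
    c is the azimuth feature C_j; W, b and u are learned parameters."""
    scores = np.array([u @ np.tanh(W @ np.concatenate([c, y_i]) + b) for y_i in Y])
    s = softmax(scores)                      # weights s_i over the n feature vectors
    return (s[:, None] * Y).sum(axis=0)      # Z_j = sum_i s_i * y_i

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 8))                  # n = 5 feature vectors of dimension 8
c = np.array([1.0, 0.0, 0.0, 0.0])           # one-hot azimuth feature
W, b, u = rng.normal(size=(16, 12)), np.zeros(16), rng.normal(size=16)
Z_1 = first_layer_attention(Y, c, W, b, u)   # first recognition result, shape (8,)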
在一种可能的实现方式中，第二层模型的结构如图5所示，f_1和f_2表示激活函数tanh，其中，第二层模型中tanh函数的个数与采集图像的角度的数量相关，tanh函数的个数可以等于或者大于采集图像的角度的数量。tanh函数的输入包括第一层模型的输出结果Z_j和方位角特征C_j，第二层模型还包括softmax函数，softmax函数根据tanh函数的计算结果计算第一层模型的输出结果Z_j的权重S_j，最后，第二层模型根据计算得到的权重S_j和第一层模型的输出结果Z_j计算最终的场景识别结果Z。具体的计算方式如下公式(4)所示：In a possible implementation manner, the structure of the second-layer model is shown in Fig. 5, where f_1 and f_2 represent the activation function tanh. The number of tanh functions in the second-layer model is related to the number of angles at which images are collected, and can be equal to or greater than that number. The input of the tanh function includes the output result Z_j of the first-layer model and the azimuth feature C_j; the second-layer model further includes a softmax function, which calculates the weight S_j of the output result Z_j of the first-layer model according to the calculation results of the tanh functions. Finally, the second-layer model calculates the final scene recognition result Z according to the calculated weights S_j and the output results Z_j of the first-layer model. The specific calculation is shown in formula (4):
M_j = tanh(W_j · [C_j, Z_j] + b_j)
S_j = exp(v · M_j) / Σ_k exp(v · M_k)    (4)
其中，[C_j, Z_j]表示第一识别结果Z_j和对应的方位角特征C_j拼接得到的向量，W_j和b_j分别表示tanh函数的权重和偏置值，v表示softmax函数的参数。根据tanh函数计算得到的M_j可以确定是否激活对应的神经元，提取tanh函数对应的特征(第一识别结果)作为分类的依据，softmax函数对M_j进行归一化得到每个第一识别结果的权重。因此，方位角特征可以影响计算得到的第一识别结果Z_j的权重S_j：如果计算得到的S_j比较大，那么该第一识别结果对分类的结果影响较大；如果计算得到的S_j比较小，那么该第一识别结果对分类的结果影响较小。因此，本申请提供的场景识别的模型可以根据方位角提取关键角度的特征、过滤无关角度，提高识别准确度，实现用户无感的场景识别。Here, [C_j, Z_j] denotes the vector obtained by concatenating the first recognition result Z_j and the corresponding azimuth feature C_j, W_j and b_j denote the weight and bias of the tanh function, respectively, and v denotes the parameter of the softmax function. The value M_j calculated by the tanh function determines whether the corresponding neuron is activated, the feature corresponding to the tanh function (the first recognition result) is extracted as the basis for classification, and the softmax function normalizes M_j to obtain the weight of each first recognition result. Therefore, the azimuth feature can influence the weight S_j of the first recognition result Z_j: if the calculated S_j is relatively large, that first recognition result has a greater influence on the classification result; if S_j is relatively small, its influence is smaller. Hence, the scene recognition model provided by the present application can extract features of key angles according to the azimuth angle, filter out irrelevant angles, improve recognition accuracy, and realize scene recognition imperceptible to the user.
根据公式(4)计算得到的权重S j可以表示赋予不同的第一识别结果的权重值,将第一识别结果和对应的权重进行加权求和,可以得到第二层模型的场景识别结果Z。 The weight S j calculated according to formula (4) can represent the weight value assigned to different first recognition results, and the first recognition result and the corresponding weight are weighted and summed to obtain the scene recognition result Z of the second layer model.
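A matching NumPy sketch of the second-layer fusion of formula (4), under the same simplifying assumptions (the per-view parameters W_j, b_j are written as a single shared W, b, and v is the softmax parameter):

import numpy as np

def second_layer_attention(Z, C, W, b, v):
    """Formula (4): Z is the (J, k) matrix of first recognition results Z_j,
    C is the (J, a) matrix of azimuth features C_j; W, b and v are learned parameters."""
    scores = np.array([v @ np.tanh(W @ np.concatenate([C[j], Z[j]]) + b) for j in range(len(Z))])
    S = np.exp(scores - scores.max())
    S = S / S.sum()                          # weights S_j over the J views
    return (S[:, None] * Z).sum(axis=0)      # final scene recognition result Z = sum_j S_j * Z_j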
其中,W i、b i
Figure PCTCN2021140833-appb-000008
W j、b j
Figure PCTCN2021140833-appb-000009
均为模型参数,可以采用样本数据对图5所示的神经网络模型进行训练得到参数值,可以采用相关现有技术中的训练方法对本申请的神经网络模型进行训练,本申请对训练的过程不再赘述。
Among them, W i , bi ,
Figure PCTCN2021140833-appb-000008
W j , b j and
Figure PCTCN2021140833-appb-000009
are model parameters, the parameter values can be obtained by training the neural network model shown in FIG. 5 using sample data, and the neural network model of the present application can be trained by using the training methods in the related art. Repeat.
在另一种可能的实现方式中,第二层模型还可以通过对各方位角预设加权重、投票或者根据方位角预设权重映射函数的方式实现。也就是说,在本申请的另一个实施例中,神经网络模型包括多对特征提取层和第一层模型,还包括第二层模型,第二层模型通过对各方位角预设加权重或者根据方位角预设权重映射函数的方式实现。In another possible implementation manner, the second-layer model may also be implemented by presetting weights for each azimuth angle, voting, or presetting a weight mapping function according to the azimuth angle. That is to say, in another embodiment of the present application, the neural network model includes multiple pairs of feature extraction layers and first-layer models, and also includes a second-layer model. It is realized by the way of preset weight mapping function according to azimuth angle.
举例来说，第二层模型可以对各方位角预设加权重，对于第一层模型输出的每个第一识别结果Z_j，预设对应的加权重为S_j，第二层模型根据每个第一识别结果Z_j以及对应的加权重S_j计算可以得到场景识别结果 Z = Σ_j S_j·Z_j。For example, the second-layer model can preset a weight for each azimuth angle: for each first recognition result Z_j output by the first-layer model, a corresponding weight S_j is preset, and the second-layer model calculates the scene recognition result as Z = Σ_j S_j·Z_j from each first recognition result Z_j and its corresponding preset weight S_j.
第二层模型还可以根据方位角预设权重映射函数，也就是不同的方位角对应了不同的预设权重组，预设权重组可以包括每个第一识别结果Z_j对应的权重。举例来说，假设终端设备通过前置摄像头、后置摄像头采集了两个角度的图像，对两个角度的图像进行识别得到两个第一识别结果Z_1和Z_2，对应的权重分别为S_1和S_2，在一个示例中，根据方位角预设的权重映射函数可以如下所示：The second-layer model may also preset a weight mapping function according to the azimuth angle; that is, different azimuth angles correspond to different preset weight groups, and a preset weight group may include the weight corresponding to each first recognition result Z_j. For example, assume that the terminal device collects images at two angles through the front camera and the rear camera and recognizes the two images to obtain two first recognition results Z_1 and Z_2, whose corresponding weights are S_1 and S_2, respectively. In one example, the weight mapping function preset according to the azimuth angle can be as follows:
(1) If the front-camera azimuth angle θ belongs to the interval [0°, 45°), S_1 = 1.0 and S_2 = 0.0;
(2) If the front-camera azimuth angle θ belongs to the interval [45°, 90°), S_1 = 0.7 and S_2 = 0.3;
(3) If the front-camera azimuth angle θ belongs to the interval [90°, 135°), S_1 = 0.3 and S_2 = 0.7;
(4) If the front-camera azimuth angle θ belongs to the interval [135°, 180°], S_1 = 0.0 and S_2 = 1.0.
The above are merely some examples of implementations of the second-layer model, and the present application is not limited thereto.
When the second-layer model is implemented in the form of preset weights or a preset weight mapping function, only the other parts of the neural network model need to be trained during training, and the second-layer model itself does not need to be trained, which can improve training efficiency.
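As an illustration of the preset weight mapping function described above, a minimal Python sketch that reproduces the example intervals for a front camera and a rear camera is given below; the function names are assumptions made for the example.

```python
def azimuth_weight_mapping(theta_front):
    """Map the front-camera azimuth angle theta (in degrees) to the preset weights
    (S_1, S_2) for the front-camera and rear-camera first recognition results."""
    if 0 <= theta_front < 45:
        return 1.0, 0.0
    if 45 <= theta_front < 90:
        return 0.7, 0.3
    if 90 <= theta_front < 135:
        return 0.3, 0.7
    return 0.0, 1.0  # interval [135, 180]

def combine_results(z1, z2, theta_front):
    """Weighted sum Z = S_1*Z_1 + S_2*Z_2 of the two first recognition results."""
    s1, s2 = azimuth_weight_mapping(theta_front)
    return [s1 * a + s2 * b for a, b in zip(z1, z2)]
```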
6. Output results
The terminal device may output the final recognition result according to the scene recognition result and a preset strategy, where the preset strategy may include filtering the scene recognition result according to a confidence threshold and then outputting the final recognition result, or merging multiple categories into a larger category and outputting the final recognition result, and so on.
Example 1: Assuming the confidence threshold is set to threshold = 0.8, when the confidence of the category corresponding to the scene recognition result is greater than or equal to the threshold, the category corresponding to the scene recognition result can be predicted as the final recognition result.
Example 2: Assuming the scene recognition result includes 100 categories, the terminal device may merge several of these categories into a larger category for output; for example, cars and buses may be merged into a broader vehicle category and output as the final recognition result.
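For illustration, a minimal Python sketch of such an output strategy (confidence-threshold filtering followed by optional category merging) is given below; the dictionary-based interface and the example category names are assumptions made for the example.

```python
def output_result(scene_result, threshold=0.8, merge_map=None):
    """Apply a preset output strategy to a scene recognition result.

    scene_result: dict mapping category name -> confidence
    merge_map:    optional dict mapping fine categories to a larger category,
                  e.g. {"car": "vehicle", "bus": "vehicle"} (illustrative names)
    """
    category, confidence = max(scene_result.items(), key=lambda kv: kv[1])
    if confidence < threshold:      # Example 1: filter by confidence threshold
        return None
    if merge_map:                   # Example 2: merge categories into a larger one
        category = merge_map.get(category, category)
    return category
```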
Based on the above examples, the present application provides a scene recognition method. FIG. 7 shows a flowchart of a scene recognition method according to an embodiment of the present application. As shown in FIG. 7, the scene recognition method may include the following steps:
Step S700: the terminal device collects images of the same scene from multiple azimuth angles through multiple cameras, where the azimuth angle at which each camera collects the image is the azimuth angle corresponding to that image, and the azimuth angle is the angle between the direction vector of the camera when it collects the image and the gravity unit vector;
Step S701: the terminal device recognizes the same scene according to the images and the azimuth angles corresponding to the images, and obtains a scene recognition result.
In the scene recognition method of the embodiment of the present application, multiple cameras are used to capture images of the same scene at multiple azimuth angles, and the scene is recognized by combining the multiple images and the azimuth angle corresponding to each image. Because more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problem that a single camera has a limited viewing-angle range and shooting angle when recognizing a scene is solved, and the recognition is more accurate.
The images captured by the cameras in the embodiments of the present application may be black-and-white images, RGB (Red, Green, Blue) color images, RGB-D (RGB-Depth) depth images (where D refers to depth information), or infrared images, which is not limited in this application.
In this embodiment of the present application, an image of one azimuth angle may be an image captured by one camera, or may be a single image synthesized from images captured by multiple cameras. For example, a mobile phone may include multiple rear cameras, and the rear-camera image may be a single image synthesized from the images captured by those cameras.
In a possible implementation manner, before recognizing the same scene according to the images and the azimuth angles corresponding to the images, the method may further include:
the terminal device preprocesses the image, where the preprocessing includes one or a combination of the following operations: converting the image format, converting the image channels, unifying the image size, and image normalization. Converting the image format refers to converting a color image into a black-and-white image; converting the image channels refers to converting the image to the red-green-blue RGB channels; unifying the image size refers to adjusting multiple images to the same length and the same width; and image normalization refers to normalizing the pixel values of the image.
For the specific process, reference may be made to the process described in Section 3 above. Each azimuth angle may use the same preprocessing, or different preprocessing may be used according to the images collected at each azimuth angle, which is not limited in this application.
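For illustration, a minimal Python sketch of one possible preprocessing step is given below; the 224×224 target size and the normalization to [0, 1] are assumptions made for the example rather than values specified above.

```python
import numpy as np
from PIL import Image

def preprocess(path, size=(224, 224)):
    """Convert image channels to RGB, unify the image size, and normalize pixel values."""
    img = Image.open(path).convert("RGB")   # convert image channels to red-green-blue
    img = img.resize(size)                  # unify length and width across images
    arr = np.asarray(img, dtype=np.float32)
    return arr / 255.0                      # normalize pixel values to [0, 1]
```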
In the embodiment of the present application, the terminal device obtains, from the gravity sensor, the acceleration on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to each camera when that camera collects the image, and thereby obtains the direction vector of the camera when it collects the image. The three-dimensional rectangular coordinate system corresponding to each camera when it collects the image takes that camera as the origin, the z direction is the direction along which the camera shoots, and x and y are directions perpendicular to the z direction. The azimuth angle is then calculated according to the direction vector and the gravity unit vector. For the specific process, reference may be made to Section 2 above, and details are not repeated here.
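For illustration, a minimal Python sketch of one way to compute the azimuth angle from the gravity-sensor readings is given below, under the assumption that the camera shoots along the +z axis of its coordinate system; the sign convention and the degree output are assumptions made for the example.

```python
import math

def azimuth_angle(ax, ay, az):
    """Angle between the camera direction vector (+z in the camera coordinate system)
    and the gravity unit vector, from gravity-sensor accelerations (ax, ay, az)."""
    g = math.sqrt(ax * ax + ay * ay + az * az)
    if g == 0:
        raise ValueError("invalid gravity-sensor reading")
    cos_theta = az / g                                   # normalized z component
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))
```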
For step S701, the terminal device may preset a weight corresponding to each azimuth angle and weight the recognition result of the image of each azimuth angle according to that weight to obtain the final scene recognition result. Alternatively, the images and the azimuth angles corresponding to the images may be input into a trained neural network model to recognize the same scene and obtain the scene recognition result. The present application does not limit the specific manner of scene recognition.
FIG. 8 shows a flowchart of the method of step S701 according to an embodiment of the present application. As shown in FIG. 8, in a possible implementation manner, step S701, in which the terminal device recognizes the same scene according to the images and the azimuth angles corresponding to the images and obtains the scene recognition result, may include:
Step S7010: the terminal device extracts the azimuth angle feature corresponding to the image from the azimuth angle corresponding to the image;
Step S7011: the same scene is recognized by a scene recognition model based on the images and the azimuth angle features corresponding to the images to obtain the scene recognition result, where the scene recognition model is a neural network model.
For step S7010, reference may be made to the description in Section 4 above, and details are not repeated here. As shown in FIG. 2a, after the azimuth angle features are extracted, the images and the azimuth angle features corresponding to the images can be input into the scene recognition model, and the scene recognition model recognizes the same scene based on the images and the corresponding azimuth angle features to obtain the scene recognition result.
In the embodiments of the present application, the scene recognition model can be implemented by a variety of different neural network structures. In a possible implementation manner, as shown in FIG. 2a, the scene recognition model includes multiple pairs of a first feature extraction layer and a first-layer model, and each pair of a first feature extraction layer and a first-layer model is used to process the image of one azimuth angle and the azimuth angle corresponding to that image to obtain a first recognition result.
Examples of the first feature extraction layer are the image feature extraction layer 1 and the feature extraction layer 2 shown in FIG. 2a. Only two pairs of first feature extraction layers and first-layer models are drawn in the example of FIG. 2a, but the present application is not limited thereto; the number of pairs of first feature extraction layers and first-layer models included in the neural network model can be configured according to the number of camera angles set in the specific application scenario, and may, for example, be greater than or equal to the number of angles.
The azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result. The first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain a feature vector, and the first-layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle.
As shown in FIG. 2a, azimuth angle feature 1, extracted from the azimuth angle corresponding to image 1, can be used as an input of first-layer model 1, and azimuth angle feature 2, extracted from the azimuth angle corresponding to image 2, can be used as an input of first-layer model 2. Feature extraction layer 1 can extract feature vector 1 of image 1 and output it to first-layer model 1, and feature extraction layer 2 can extract feature vector 2 of image 2 and output it to first-layer model 2. First-layer model 1 can combine feature vector 1 and azimuth angle feature 1 to perform scene recognition and obtain first recognition result 1, and first-layer model 2 can combine feature vector 2 and azimuth angle feature 2 to perform scene recognition and obtain first recognition result 2.
As shown in FIG. 2a, the scene recognition model further includes a second-layer model, and first-layer model 1 and first-layer model 2 can output first recognition result 1 and first recognition result 2, respectively, to the second-layer model. Azimuth angle feature 1, extracted from the azimuth angle corresponding to image 1, can be used as an input of the second-layer model and corresponds to first recognition result 1; azimuth angle feature 2, extracted from the azimuth angle corresponding to image 2, can likewise be used as an input of the second-layer model and corresponds to first recognition result 2. The second-layer model is used to obtain the scene recognition result according to the first recognition results and the azimuth angles corresponding to the first recognition results.
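For illustration, the following minimal Python sketch shows the overall wiring of such a two-layer model, with one (feature extraction layer, first-layer model) pair per azimuth angle feeding a shared second-layer model; the callables passed in are placeholders and do not correspond to concrete implementations in the disclosure.

```python
def scene_recognition_model(images, azimuth_feats,
                            feature_extractors, first_layer_models, second_layer_model):
    """Two-layer scene recognition: per-azimuth first recognition results are computed
    first, then fused by the second-layer model into the scene recognition result."""
    first_results = []
    for image, c, extract, first_model in zip(images, azimuth_feats,
                                              feature_extractors, first_layer_models):
        feature_vectors = extract(image)                       # features of one azimuth image
        first_results.append(first_model(feature_vectors, c))  # first recognition result Z_j
    return second_layer_model(first_results, azimuth_feats)    # scene recognition result Z
```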
The scene recognition method of the embodiment of the present application adopts a two-layer scene recognition model, collects images from multiple angles, combines the azimuth angles of the images, and uses a competition mechanism for scene recognition while considering the results of both local features and overall features, which can improve the recognition accuracy of scene recognition without user perception and reduce misjudgments.
In a possible implementation manner, the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain multiple feature vectors; the first-layer model is used to calculate, according to the azimuth angle corresponding to the image of the one azimuth angle, a first weight corresponding to each of the multiple feature vectors; and the first-layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector. The second-layer model is used to calculate a second weight of each first recognition result according to the azimuth angle corresponding to that first recognition result, and to obtain the scene recognition result according to the first recognition results and their second weights.
As shown in FIG. 5 and FIG. 6, the first-layer model may include an activation function and a softmax function, where the activation function may be a tanh function; other types of activation functions, such as the Sigmoid activation function or the ReLU activation function, may also be used, and the activation function is not limited to the examples shown in FIG. 5 and FIG. 6. The number of activation functions in the first-layer model can be set according to the number of feature vectors extracted by the feature extraction layer and may be greater than or equal to that number. The activation function is used to determine, according to a feature vector and the azimuth angle feature, whether to activate the corresponding neuron, and the feature corresponding to the activation function is extracted as the basis for classification. The activation function and the softmax function are used to calculate the first weight corresponding to each feature vector, that is, the s_i calculated by formula (3) above; the specific process has been described above and is not repeated here.
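For illustration, a minimal Python sketch of this azimuth-conditioned weighting of the feature vectors (one tanh activation per feature vector followed by a softmax, analogous to formula (3)) is given below; the parameter names W, b and v and the aggregation by weighted sum are assumptions made for the example.

```python
import numpy as np

def first_layer_model(feature_vectors, azimuth_feat, W, b, v):
    """Weight the feature vectors y_i of one azimuth image by azimuth-conditioned
    attention and aggregate them into the first recognition result."""
    m = np.array([np.tanh(W @ np.concatenate([y, azimuth_feat]) + b) @ v
                  for y in feature_vectors])             # one activation value per y_i
    s = np.exp(m - m.max())
    s = s / s.sum()                                      # softmax -> first weights s_i
    return sum(w * y for w, y in zip(s, feature_vectors))  # weighted sum of features
```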
The azimuth angle feature can affect the weight s_i computed for the feature vector y_i: if the computed s_i is relatively large, the feature vector has a greater influence on the classification result; if the computed s_i is relatively small, the feature vector has less influence on the classification result. Therefore, the scene recognition model provided by the present application can identify key features according to the azimuth angle, filter out irrelevant information, reduce noise, improve recognition accuracy, and realize scene recognition without user perception.
Similarly, the second-layer model may include an activation function and a softmax function, where the activation function may be a tanh function; other types of activation functions may also be used, and the activation function is not limited to the examples shown in FIG. 5 and FIG. 6. The number of activation functions in the second-layer model is related to the number of angles at which images are collected and may be equal to or greater than that number. The input of an activation function includes the output result Z_j of a first-layer model and the azimuth angle feature C_j. The second-layer model also includes a softmax function, and the softmax function calculates the weight S_j of the output result Z_j of the first-layer model according to the calculation results of the activation functions; for the specific calculation process, reference may be made to formula (4) and the description in Section 5 above, and details are not repeated here.
The azimuth angle feature can affect the weight S_j computed for the first recognition result Z_j: if the computed S_j is relatively large, the first recognition result has a greater influence on the classification result; if the computed S_j is relatively small, the first recognition result has less influence on the classification result. Therefore, the scene recognition model provided by the present application can extract the features of key angles according to the azimuth angle, filter out irrelevant angles, improve recognition accuracy, and realize scene recognition without user perception.
In a possible implementation manner, the second-layer model presets a third weight corresponding to each first recognition result, and the second-layer model is used to obtain the scene recognition result according to the first recognition results and the third weights corresponding to the first recognition results. For example, the second-layer model may preset a weight for each azimuth angle: for each first recognition result Z_j output by a first-layer model, a corresponding weight S_j is preset, and the second-layer model computes the scene recognition result from each first recognition result Z_j and its corresponding weight S_j as Z = Σ_j S_j·Z_j.
In a possible implementation manner, the second-layer model is used to determine, according to the azimuth angle and a preset rule, a fourth weight corresponding to each first recognition result, where the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes the fourth weight corresponding to each first recognition result; the second-layer model is used to obtain the scene recognition result according to the first recognition results and the fourth weights corresponding to the first recognition results.
For example, the second-layer model may also preset a weight mapping function according to the azimuth angle, that is, different azimuth angles correspond to different preset weight groups, and a preset weight group may include the weight corresponding to each first recognition result Z_j. For example, it is assumed that the terminal device collects images from two angles through a front camera and a rear camera, and the two images are recognized to obtain two first recognition results Z_1 and Z_2 whose corresponding weights are S_1 and S_2, respectively. In one example, the weight mapping function preset according to the azimuth angle may be as follows:
(1) If the front-camera azimuth angle θ belongs to the interval [0°, 45°), S_1 = 1.0 and S_2 = 0.0;
(2) If the front-camera azimuth angle θ belongs to the interval [45°, 90°), S_1 = 0.7 and S_2 = 0.3;
(3) If the front-camera azimuth angle θ belongs to the interval [90°, 135°), S_1 = 0.3 and S_2 = 0.7;
(4) If the front-camera azimuth angle θ belongs to the interval [135°, 180°], S_1 = 0.0 and S_2 = 1.0.
The above are merely some examples of implementations of the second-layer model, and the present application is not limited thereto.
When the second-layer model is implemented in the form of preset weights or a preset weight mapping function, only the other parts of the neural network model need to be trained during training, and the second-layer model itself does not need to be trained, which can improve training efficiency.
An embodiment of the present application further provides a scene recognition apparatus. FIG. 9 shows a block diagram of a scene recognition apparatus according to an embodiment of the present application. As shown in FIG. 9, the apparatus may include:
an image acquisition module, configured to collect images of the same scene from multiple azimuth angles through multiple cameras, where the azimuth angle at which each camera collects the image is the azimuth angle corresponding to that image, and the azimuth angle is the angle between the direction vector of the camera when it collects the image and the gravity unit vector;
a scene recognition module, configured to recognize the same scene according to the images and the azimuth angles corresponding to the images, and obtain a scene recognition result.
The scene recognition apparatus of the embodiment of the present application uses multiple cameras to capture images of the same scene at multiple azimuth angles and recognizes the scene by combining the multiple images and the azimuth angle corresponding to each image. Because more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problem that a single camera has a limited viewing-angle range and shooting angle when recognizing a scene is solved, and the recognition is more accurate.
In a possible implementation manner, the scene recognition module includes:
an azimuth angle feature extraction module, configured to extract the azimuth angle feature corresponding to the image from the azimuth angle corresponding to the image;
a scene recognition model, configured to recognize the same scene based on the images and the azimuth angle features corresponding to the images to obtain the scene recognition result, where the scene recognition model is a neural network model.
In a possible implementation manner, the scene recognition model includes multiple pairs of a first feature extraction layer and a first-layer model, and each pair of a first feature extraction layer and a first-layer model is used to process the image of one azimuth angle and the azimuth angle corresponding to that image to obtain a first recognition result, where the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result, the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain a feature vector, and the first-layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle. The scene recognition model further includes a second-layer model, and the second-layer model is used to obtain the scene recognition result according to the first recognition results and the azimuth angles corresponding to the first recognition results.
The scene recognition apparatus of the embodiment of the present application adopts a two-layer scene recognition model, collects images from multiple angles, combines the azimuth angles of the images, and uses a competition mechanism for scene recognition while considering the results of both local features and overall features, which can improve the recognition accuracy of scene recognition without user perception and reduce misjudgments.
In a possible implementation manner, the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain multiple feature vectors; the first-layer model is used to calculate, according to the azimuth angle corresponding to the image of the one azimuth angle, a first weight corresponding to each of the multiple feature vectors; and the first-layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
In a possible implementation manner, the second-layer model is used to calculate a second weight of each first recognition result according to the azimuth angle corresponding to that first recognition result, and to obtain the scene recognition result according to the first recognition results and their second weights.
In a possible implementation manner, the second-layer model presets a third weight corresponding to each first recognition result, and the second-layer model is used to obtain the scene recognition result according to the first recognition results and the third weights corresponding to the first recognition results.
In a possible implementation manner, the second-layer model is used to determine, according to the azimuth angle and a preset rule, a fourth weight corresponding to each first recognition result, where the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes the fourth weight corresponding to each first recognition result; the second-layer model is used to obtain the scene recognition result according to the first recognition results and the fourth weights corresponding to the first recognition results.
In a possible implementation manner, the apparatus further includes an azimuth angle acquisition module, configured to obtain, from the gravity sensor, the acceleration on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to each camera when that camera collects the image, and obtain the direction vector of the camera when it collects the image, where the three-dimensional rectangular coordinate system corresponding to each camera when it collects the image takes that camera as the origin, the z direction is the direction along which the camera shoots, x and y are directions perpendicular to the z direction, and the plane in which x and y lie is perpendicular to the z direction; and to calculate the azimuth angle according to the direction vector and the gravity unit vector.
In a possible implementation manner, the apparatus further includes an image preprocessing module, configured to preprocess the image, where the preprocessing includes one or a combination of the following operations: converting the image format, converting the image channels, unifying the image size, and image normalization. Converting the image format refers to converting a color image into a black-and-white image; converting the image channels refers to converting the image to the red-green-blue RGB channels; unifying the image size refers to adjusting multiple images to the same length and the same width; and image normalization refers to normalizing the pixel values of the images.
In a possible implementation manner, the apparatus further includes a result output module, configured to output the scene recognition result.
FIG. 10 shows a schematic structural diagram of a terminal device according to an embodiment of the present application. Taking a mobile phone as an example of the terminal device, FIG. 10 shows a schematic structural diagram of the mobile phone 200.
手机200可以包括处理器210,外部存储器接口220,内部存储器221,USB接口230,充电管理模块240,电源管理模块241,电池242,天线1,天线2,移动通信模块251,无线通信模块252,音频模块270,扬声器270A,受话器270B,麦克风270C,耳机接口270D,传感器模块280,按键290,马达291,指示器292,摄像头293,显示屏294,以及SIM卡接口295等。其中传感器模块280可以包括陀螺仪传感器280A,加速度传感器280B,接近光传感器280G、指纹传感器280H,触摸传感器280K(当然,手机200还可以包括其它传感器,比如温度传感器,压力传感器、距离传感器、磁传感器、环境光传感器、气压传感器、骨传导传感器等,图中未示出)。The mobile phone 200 may include a processor 210, an external memory interface 220, an internal memory 221, a USB interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 251, a wireless communication module 252, Audio module 270, speaker 270A, receiver 270B, microphone 270C, headphone jack 270D, sensor module 280, buttons 290, motor 291, indicator 292, camera 293, display screen 294, SIM card interface 295, etc. The sensor module 280 may include a gyroscope sensor 280A, an acceleration sensor 280B, a proximity light sensor 280G, a fingerprint sensor 280H, and a touch sensor 280K (of course, the mobile phone 200 may also include other sensors, such as a temperature sensor, a pressure sensor, a distance sensor, and a magnetic sensor. , ambient light sensor, air pressure sensor, bone conduction sensor, etc., not shown in the figure).
可以理解的是,本申请实施例示意的结构并不构成对手机200的具体限定。在本申请另一些实施例中,手机200可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。It can be understood that the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the mobile phone 200 . In other embodiments of the present application, the mobile phone 200 may include more or less components than shown, or combine some components, or separate some components, or arrange different components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 210 may include one or more processing units. For example, the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and so on. Different processing units may be independent devices or may be integrated into one or more processors. The controller may be the nerve center and command center of the mobile phone 200. The controller can generate an operation control signal according to the instruction operation code and a timing signal, and complete the control of fetching and executing instructions.
处理器210中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器210中的存储器为高速缓冲存储器。该存储器可以保存处理器210刚用过或循环使用的指令或数据。如果处理器210需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器210的等待时间,因而提高了系统的效率。A memory may also be provided in the processor 210 for storing instructions and data. In some embodiments, the memory in processor 210 is cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 210 . If the processor 210 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided, and the waiting time of the processor 210 is reduced, thereby improving the efficiency of the system.
处理器210可以运行本申请实施例提供的场景识别方法,以便于结合多个图像和每个图像对应的方位角对场景进行识别,获得更全面的场景信息,提高基于图像的场景识别的准确度,解决了单摄像头识别场景视角范围、拍摄角度受限的问题,识别更精准。处理器210可以包括不同的器件,比如集成CPU和GPU时,CPU和GPU可以配合执行本申请实施例提供的场景识别方法,比如场景识别方法中部分算法由CPU执行,另一部分算法由GPU执行,以得到较快的处理效率。The processor 210 can run the scene recognition method provided by the embodiment of the present application, so as to recognize the scene in combination with multiple images and the azimuth angle corresponding to each image, obtain more comprehensive scene information, and improve the accuracy of image-based scene recognition. , solves the problem of limited viewing angle range and shooting angle of single camera recognition scene, and the recognition is more accurate. The processor 210 may include different devices. For example, when a CPU and a GPU are integrated, the CPU and the GPU may cooperate to execute the scene recognition method provided by the embodiments of the present application. For example, some algorithms in the scene recognition method are executed by the CPU, and another part of the algorithms are executed by the GPU. for faster processing efficiency.
显示屏294用于显示图像,视频等。显示屏294包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,手机200可以包括1个或N个显示屏294,N为大于1的正整数。显示屏294可用于显示由用户输入的信息或提供给用户的信息以及各种图形用户界面(graphical user interface,GUI)。例如,显示器294可以显示照片、视频、网页、或者文件等。再例如,显示器294可 以显示图形用户界面。其中,图形用户界面上包括状态栏、可隐藏的导航栏、时间和天气小组件(widget)、以及应用的图标,例如浏览器图标等。状态栏中包括运营商名称(例如中国移动)、移动网络(例如4G)、时间和剩余电量。导航栏中包括后退(back)键图标、主屏幕(home)键图标和前进键图标。此外,可以理解的是,在一些实施例中,状态栏中还可以包括蓝牙图标、Wi-Fi图标、外接设备图标等。还可以理解的是,在另一些实施例中,图形用户界面中还可以包括Dock栏,Dock栏中可以包括常用的应用图标等。当处理器210检测到用户的手指(或触控笔等)针对某一应用图标的触摸事件后,响应于该触摸事件,打开与该应用图标对应的应用的用户界面,并在显示器294上显示该应用的用户界面。 Display screen 294 is used to display images, videos, and the like. Display screen 294 includes a display panel. The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode or an active-matrix organic light-emitting diode (active-matrix organic light). emitting diode, AMOLED), flexible light-emitting diode (flex light-emitting diode, FLED), Miniled, MicroLed, Micro-oLed, quantum dot light-emitting diode (quantum dot light emitting diodes, QLED) and so on. In some embodiments, cell phone 200 may include 1 or N display screens 294, where N is a positive integer greater than 1. The display screen 294 may be used to display information entered by or provided to the user as well as various graphical user interfaces (GUIs). For example, display 294 may display photos, videos, web pages, or documents, and the like. As another example, display 294 may display a graphical user interface. The GUI includes a status bar, a hideable navigation bar, a time and weather widget, and an application icon, such as a browser icon. The status bar includes operator name (eg China Mobile), mobile network (eg 4G), time and remaining battery. The navigation bar includes a back button icon, a home button icon, and a forward button icon. In addition, it can be understood that, in some embodiments, the status bar may further include a Bluetooth icon, a Wi-Fi icon, an external device icon, and the like. It can also be understood that, in other embodiments, the graphical user interface may further include a Dock bar, and the Dock bar may include commonly used application icons and the like. After the processor 210 detects a touch event of the user's finger (or stylus, etc.) on an application icon, in response to the touch event, the user interface of the application corresponding to the application icon is opened and displayed on the display 294 The user interface of the application.
在本申请实施例中,显示屏294可以是一个一体的柔性显示屏,也可以采用两个刚性屏以及位于两个刚性屏之间的一个柔性屏组成的拼接显示屏。In this embodiment of the present application, the display screen 294 may be an integrated flexible display screen, or a spliced display screen composed of two rigid screens and a flexible screen located between the two rigid screens.
摄像头293(前置摄像头、后置摄像头,前置摄像头和后置摄像头都可以包括一个或多个摄像头)用于捕获静态图像或视频。通常,摄像头293可以包括感光元件比如镜头组和图像传感器,其中,镜头组包括多个透镜(凸透镜或凹透镜),用于采集待拍摄物体反射的光信号,并将采集的光信号传递给图像传感器。图像传感器根据所述光信号生成待拍摄物体的原始图像。在本申请的实施例中,通过多个摄像头从多个方位角采集同一场景下的图像,从而可以结合多个图像和每个图像对应的方位角对场景进行识别,获得更全面的场景信息,提高基于图像的场景识别的准确度,解决了单摄像头识别场景视角范围、拍摄角度受限的问题,识别更精准。Cameras 293 (front camera, rear camera, both front and rear cameras may include one or more cameras) are used to capture still images or video. Generally, the camera 293 may include a photosensitive element such as a lens group and an image sensor, wherein the lens group includes a plurality of lenses (convex or concave) for collecting the light signal reflected by the object to be photographed, and transmitting the collected light signal to the image sensor . The image sensor generates an original image of the object to be photographed according to the light signal. In the embodiment of the present application, multiple cameras are used to collect images in the same scene from multiple azimuth angles, so that the scene can be identified in combination with the multiple images and the azimuth angle corresponding to each image, and more comprehensive scene information can be obtained, Improve the accuracy of image-based scene recognition, solve the problem of limited viewing angle range and shooting angle of single camera recognition scene, and make the recognition more accurate.
内部存储器221可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理器210通过运行存储在内部存储器221的指令,从而执行手机200的各种功能应用以及数据处理。内部存储器221可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,应用程序(比如相机应用,微信应用等)的代码等。存储数据区可存储手机200使用过程中所创建的数据(比如相机应用采集的图像、视频等)等。 Internal memory 221 may be used to store computer executable program code, which includes instructions. The processor 210 executes various functional applications and data processing of the mobile phone 200 by executing the instructions stored in the internal memory 221 . The internal memory 221 may include a storage program area and a storage data area. The storage program area may store operating system, code of application programs (such as camera application, WeChat application, etc.), and the like. The storage data area may store data created during the use of the mobile phone 200 (such as images and videos collected by the camera application) and the like.
内部存储器221还可以存储本申请实施例提供的场景识别方法对应的一个或多个计算机程序1310。该一个或多个计算机程序1304被存储在上述存储器221中并被配置为被该一个或多个处理器210执行,该一个或多个计算机程序1310包括指令,上述指令可以用于执行本申请实施例提供的场景识别方法中的各个步骤,该计算机程序1310可以包括:图像采集模块,用于通过多个摄像头从多个方位角采集同一场景下的图像;场景识别模块,用于根据所述图像和所述图像对应的方位角识别所述同一场景,得到场景识别结果;方位角采集模块,用于获取重力传感器在每个摄像头采集所述图像时对应的三维直角坐标系的坐标轴上的加速度,得到每个摄像头采集所述图像时的方向向量,根据所述方向向量和所述重力单位向量计算所述方位角;图像预处理模块,用于对所述图像进行预处理。The internal memory 221 may also store one or more computer programs 1310 corresponding to the scene recognition method provided by the embodiment of the present application. The one or more computer programs 1304 are stored in the aforementioned memory 221 and configured to be executed by the one or more processors 210, and the one or more computer programs 1310 include instructions that may be used to carry out the implementation of the present application For each step in the scene recognition method provided by the example, the computer program 1310 may include: an image acquisition module for acquiring images under the same scene from multiple azimuths through a plurality of cameras; a scene recognition module for according to the image Recognize the same scene with the azimuth angle corresponding to the image, and obtain the scene recognition result; the azimuth angle acquisition module is used to acquire the acceleration on the coordinate axis of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera collects the image , obtains the direction vector when each camera collects the image, and calculates the azimuth angle according to the direction vector and the gravity unit vector; an image preprocessing module is used to preprocess the image.
此外,内部存储器221可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。In addition, the internal memory 221 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
当然,本申请实施例提供的场景识别方法的代码还可以存储在外部存储器中。这种情况下,处理器210可以通过外部存储器接口220运行存储在外部存储器中的场景识别方法的代码。Certainly, the code of the scene recognition method provided by the embodiment of the present application may also be stored in an external memory. In this case, the processor 210 may execute the code of the scene recognition method stored in the external memory through the external memory interface 220 .
下面介绍传感器模块280的功能。The function of the sensor module 280 is described below.
陀螺仪传感器280A,可以用于确定手机200的运动姿态。在一些实施例中,可以通过陀螺仪传感器280A确定手机200围绕三个轴(即,x,y和z轴)的角速度。即陀螺仪传感器280A可以用于检测手机200当前的运动状态,比如抖动还是静止。The gyro sensor 280A can be used to determine the movement posture of the mobile phone 200 . In some embodiments, the angular velocity of cell phone 200 about three axes (ie, x, y, and z axes) may be determined by gyro sensor 280A. That is, the gyro sensor 280A can be used to detect the current motion state of the mobile phone 200, such as shaking or still.
当本申请实施例中的显示屏为可折叠屏时,陀螺仪传感器280A可用于检测作用于显示屏294上的折叠或者展开操作。陀螺仪传感器280A可以将检测到的折叠操作或者展开操作作为事件上报给处理器210,以确定显示屏294的折叠状态或展开状态。When the display screen in the embodiment of the present application is a foldable screen, the gyro sensor 280A can be used to detect a folding or unfolding operation acting on the display screen 294 . The gyroscope sensor 280A may report the detected folding operation or unfolding operation to the processor 210 as an event to determine the folding state or unfolding state of the display screen 294 .
加速度传感器280B可检测手机200在各个方向上(一般为三轴)加速度的大小。即陀螺仪传感器280A可以用于检测手机200当前的运动状态,比如抖动还是静止。当本申请实施例中的显示屏为可折叠屏时,加速度传感器280B可用于检测作用于显示屏294上的折叠或者展开操作。加速度传感器280B可以将检测到的折叠操作或者展开操作作为事件上报给处理器210,以确定显示屏294的折叠状态或展开状态。The acceleration sensor 280B can detect the magnitude of the acceleration of the mobile phone 200 in various directions (generally three axes). That is, the gyro sensor 280A can be used to detect the current motion state of the mobile phone 200, such as shaking or still. When the display screen in the embodiment of the present application is a foldable screen, the acceleration sensor 280B can be used to detect a folding or unfolding operation acting on the display screen 294 . The acceleration sensor 280B may report the detected folding operation or unfolding operation to the processor 210 as an event to determine the folding state or unfolding state of the display screen 294 .
在本申请的实施例中,终端设备通过加速度传感器280B可以获取在每个摄像头采集所述图像时对应的三维直角坐标系的坐标轴上的加速度,得到每个摄像头采集所述图像时的方向向量,根据所述方向向量和所述重力单位向量计算所述方位角。In the embodiment of the present application, the terminal device can obtain the acceleration on the coordinate axis of the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image through the acceleration sensor 280B, and obtain the direction vector when each camera collects the image , and calculate the azimuth angle according to the direction vector and the gravity unit vector.
接近光传感器280G可以包括例如发光二极管(LED)和光检测器,例如光电二极管。发光二极管可以是红外发光二极管。手机通过发光二极管向外发射红外光。手机使用光电二极管检测来自附近物体的红外反射光。当检测到充分的反射光时,可以确定手机附近有物体。当检测到不充分的反射光时,手机可以确定手机附近没有物体。当本申请实施例中的显示屏为可折叠屏时,接近光传感器280G可以设置在可折叠的显示屏294的第一屏上,接近光传感器280G可根据红外信号的光程差来检测第一屏与第二屏的折叠角度或者展开角度的大小。 Proximity light sensor 280G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes. The light emitting diodes may be infrared light emitting diodes. The mobile phone emits infrared light outward through light-emitting diodes. Phones use photodiodes to detect reflected infrared light from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object near the phone. When insufficient reflected light is detected, the phone can determine that there are no objects near the phone. When the display screen in the embodiment of the present application is a foldable screen, the proximity light sensor 280G can be arranged on the first screen of the foldable display screen 294, and the proximity light sensor 280G can detect the first screen according to the optical path difference of the infrared signal. The size of the folding or unfolding angle between the screen and the second screen.
陀螺仪传感器280A(或加速度传感器280B)可以将检测到的运动状态信息(比如角速度)发送给处理器210。处理器210基于运动状态信息确定当前是手持状态还是脚架状态(比如,角速度不为0时,说明手机200处于手持状态)。The gyroscope sensor 280A (or the acceleration sensor 280B) may send the detected motion state information (such as angular velocity) to the processor 210 . The processor 210 determines, based on the motion state information, whether the current state is the hand-held state or the tripod state (for example, when the angular velocity is not 0, it means that the mobile phone 200 is in the hand-held state).
指纹传感器280H用于采集指纹。手机200可以利用采集的指纹特性实现指纹解锁,访问应用锁,指纹拍照,指纹接听来电等。The fingerprint sensor 280H is used to collect fingerprints. The mobile phone 200 can use the collected fingerprint characteristics to realize fingerprint unlocking, accessing application locks, taking photos with fingerprints, answering incoming calls with fingerprints, and the like.
触摸传感器280K,也称“触控面板”。触摸传感器280K可以设置于显示屏294,由触摸传感器280K与显示屏294组成触摸屏,也称“触控屏”。触摸传感器280K用于检测作用于其上或附近的触摸操作。触摸传感器可以将检测到的触摸操作传递给应用处理器,以确定触摸事件类型。可以通过显示屏294提供与触摸操作相关的视觉输出。在另一些实施例中,触摸传感器280K也可以设置于手机200的表面,与显示屏294所处的位置不同。 Touch sensor 280K, also called "touch panel". The touch sensor 280K may be disposed on the display screen 294, and the touch sensor 280K and the display screen 294 form a touch screen, also called a "touch screen". The touch sensor 280K is used to detect a touch operation on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to touch operations may be provided through display screen 294 . In other embodiments, the touch sensor 280K may also be disposed on the surface of the mobile phone 200 , which is different from the location where the display screen 294 is located.
示例性的,手机200的显示屏294显示主界面,主界面中包括多个应用(比如相机应用、微信应用等)的图标。用户通过触摸传感器280K点击主界面中相机应用的图标,触发处理器210启动相机应用,打开摄像头293。显示屏294显示相机应用的界面,例如取景界面。显示屏294还可以用于显示场景识别结果。Exemplarily, the display screen 294 of the mobile phone 200 displays a main interface, and the main interface includes icons of multiple applications (such as a camera application, a WeChat application, etc.). The user clicks the icon of the camera application in the main interface through the touch sensor 280K, which triggers the processor 210 to start the camera application and turn on the camera 293 . Display screen 294 displays an interface of a camera application, such as a viewfinder interface. Display screen 294 may also be used to display scene recognition results.
手机200的无线通信功能可以通过天线1,天线2,移动通信模块251,无线通信模块252,调制解调处理器以及基带处理器等实现。The wireless communication function of the mobile phone 200 can be realized by the antenna 1, the antenna 2, the mobile communication module 251, the wireless communication module 252, the modulation and demodulation processor, the baseband processor, and the like.
天线1和天线2用于发射和接收电磁波信号。手机200中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in handset 200 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization. For example, the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 251 can provide wireless communication solutions, including 2G/3G/4G/5G, applied on the mobile phone 200. The mobile communication module 251 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), and the like. The mobile communication module 251 can receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 251 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves radiated through the antenna 1. In some embodiments, at least some of the functional modules of the mobile communication module 251 may be disposed in the processor 210. In some embodiments, at least some of the functional modules of the mobile communication module 251 may be disposed in the same device as at least some of the modules of the processor 210.
The modem processor may include a modulator and a demodulator. The modulator is used to modulate the low-frequency baseband signal to be sent into a medium- or high-frequency signal. The demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor. The application processor outputs sound signals through audio devices (not limited to the speaker 270A, the receiver 270B, and the like), or displays images or videos through the display screen 294. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be independent of the processor 210 and disposed in the same device as the mobile communication module 251 or other functional modules.
The wireless communication module 252 can provide wireless communication solutions applied on the mobile phone 200, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR). The wireless communication module 252 may be one or more devices integrating at least one communication processing module. The wireless communication module 252 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 210. The wireless communication module 252 can also receive signals to be sent from the processor 210, perform frequency modulation and amplification on them, and convert them into electromagnetic waves radiated through the antenna 2. In the embodiments of the present application, the wireless communication module 252 is configured to transmit data to and from other terminal devices under the control of the processor 210.
In addition, the mobile phone 200 can implement audio functions, such as music playback and recording, through the audio module 270, the speaker 270A, the receiver 270B, the microphone 270C, the earphone interface 270D, and the application processor. The mobile phone 200 can receive input from the keys 290 and generate key signal input related to user settings and function control of the mobile phone 200. The mobile phone 200 can use the motor 291 to generate vibration alerts (for example, for incoming calls). The indicator 292 in the mobile phone 200 may be an indicator light, which may be used to indicate the charging state and changes in battery level, and may also be used to indicate messages, missed calls, notifications, and the like. The SIM card interface 295 in the mobile phone 200 is used to connect a SIM card. A SIM card can be brought into contact with or separated from the mobile phone 200 by inserting it into or pulling it out of the SIM card interface 295.
It should be understood that, in practical applications, the mobile phone 200 may include more or fewer components than those shown in FIG. 10, which is not limited in the embodiments of the present application. The illustrated mobile phone 200 is merely an example; the mobile phone 200 may have more or fewer components than those shown, may combine two or more components, or may have a different component configuration. The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application-specific integrated circuits.
The software system of the terminal device may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of the present application take the Android system with a layered architecture as an example to describe the software structure of the terminal device.
An embodiment of the present application provides a scene recognition apparatus, including a processor and a memory for storing instructions executable by the processor, wherein the processor is configured to implement the above method when executing the instructions.
Embodiments of the present application provide a non-volatile computer-readable storage medium on which computer program instructions are stored; when the computer program instructions are executed by a processor, the above method is implemented.
Embodiments of the present application provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above method.
A computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove on which instructions are stored, and any suitable combination of the foregoing.
The computer-readable program instructions or code described herein may be downloaded from a computer-readable storage medium to each computing/processing device, or downloaded to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions used to perform the operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), are personalized by utilizing the state information of the computer-readable program instructions, and these electronic circuits can execute the computer-readable program instructions to implement various aspects of the present application.
Aspects of the present application are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present application. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that when the instructions are executed by the processor of the computer or other programmable data processing apparatus, an apparatus for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams is produced. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus, and/or other devices to operate in a specific manner, so that the computer-readable medium storing the instructions includes an article of manufacture that includes instructions for implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices, so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other devices to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other devices implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of apparatuses, systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending on the functions involved.
It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by hardware (for example, a circuit or an ASIC (application-specific integrated circuit)) that performs the corresponding function or act, or can be implemented by a combination of hardware and software, such as firmware.
Although the present invention has been described herein in conjunction with the embodiments, those skilled in the art can, in practicing the claimed invention, understand and implement other variations of the disclosed embodiments by studying the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Various embodiments of the present application have been described above. The foregoing descriptions are exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application or improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (21)

1. A scene recognition method, characterized in that the method comprises:
    a terminal device collecting images of a same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle at which each camera collects the image is the azimuth angle corresponding to the image, and the azimuth angle is the angle between the direction vector of each camera when collecting the image and the gravity unit vector;
    the terminal device recognizing the same scene according to the images and the azimuth angles corresponding to the images, to obtain a scene recognition result.
2. The method according to claim 1, wherein the terminal device recognizing the same scene according to the images and the azimuth angles corresponding to the images to obtain a scene recognition result comprises:
    the terminal device extracting, from the azimuth angle corresponding to the image, an azimuth angle feature corresponding to the image;
    using a scene recognition model to recognize the same scene based on the images and the azimuth angle features corresponding to the images, to obtain the scene recognition result, wherein the scene recognition model is a neural network model.
3. The method according to claim 2, wherein the scene recognition model comprises multiple pairs of a first feature extraction layer and a first-layer model,
    each pair of a first feature extraction layer and a first-layer model is used to process an image of one azimuth angle and the azimuth angle corresponding to the image of the one azimuth angle, to obtain a first recognition result;
    wherein the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result, the first feature extraction layer is used to extract features of the image of the one azimuth angle to obtain a feature vector, and the first-layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle;
    the scene recognition model further comprises a second-layer model, and the second-layer model is used to obtain the scene recognition result according to the first recognition results and the azimuth angles corresponding to the first recognition results.
4. The method according to claim 3, wherein
    the first feature extraction layer is used to extract features of the image of the one azimuth angle, to obtain multiple feature vectors;
    the first-layer model is used to calculate a first weight corresponding to each of the multiple feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle;
    the first-layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
5. The method according to claim 3, wherein
    the second-layer model is used to calculate a second weight of the first recognition result according to the azimuth angle corresponding to the first recognition result;
    the second-layer model is used to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
6. The method according to claim 3, wherein
    the second-layer model presets a third weight corresponding to each first recognition result,
    the second-layer model is used to obtain the scene recognition result according to the first recognition results and the third weights corresponding to the first recognition results.
7. The method according to claim 3, wherein
    the second-layer model is used to determine a fourth weight corresponding to each first recognition result according to the azimuth angle and a preset rule, wherein the preset rule is a set of weight groups configured according to azimuth angles, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first recognition result;
    the second-layer model is used to obtain the scene recognition result according to the first recognition results and the fourth weights corresponding to the first recognition results.
8. The method according to claim 1, wherein the method further comprises:
    the terminal device obtaining the accelerations reported by a gravity sensor on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image, to obtain the direction vector of each camera when collecting the image;
    wherein the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image takes the camera as the origin, the z direction is the direction along which the camera shoots, x and y are directions perpendicular to the z direction, and the plane in which x and y lie is perpendicular to the z direction;
    calculating the azimuth angle according to the direction vector and the gravity unit vector.
9. The method according to claim 2, wherein before using the scene recognition model to recognize the same scene based on the images and the azimuth angle features corresponding to the images, the method further comprises:
    the terminal device preprocessing the images;
    wherein the preprocessing includes one or a combination of more than one of the following: converting the image format, converting the image channels, unifying the image size, and image normalization; converting the image format refers to converting a color image into a black-and-white image, converting the image channels refers to converting an image to the red, green, and blue (RGB) channels, unifying the image size refers to adjusting multiple images to the same length and the same width, and image normalization refers to normalizing the pixel values of an image.
10. A scene recognition apparatus, characterized in that the apparatus comprises:
    an image acquisition module, configured to collect images of a same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle at which each camera collects the image is the azimuth angle corresponding to the image, and the azimuth angle is the angle between the direction vector of each camera when collecting the image and the gravity unit vector;
    a scene recognition module, configured to recognize the same scene according to the images and the azimuth angles corresponding to the images, to obtain a scene recognition result.
11. The apparatus according to claim 10, wherein the scene recognition module comprises:
    an azimuth angle feature extraction module, configured to extract, from the azimuth angle corresponding to the image, an azimuth angle feature corresponding to the image;
    a scene recognition model, configured to recognize the same scene based on the images and the azimuth angle features corresponding to the images, to obtain the scene recognition result, wherein the scene recognition model is a neural network model.
12. The apparatus according to claim 11, wherein the scene recognition model comprises multiple pairs of a first feature extraction layer and a first-layer model,
    each pair of a first feature extraction layer and a first-layer model is used to process an image of one azimuth angle and the azimuth angle corresponding to the image of the one azimuth angle, to obtain a first recognition result;
    wherein the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result, the first feature extraction layer is used to extract features of the image of the one azimuth angle to obtain a feature vector, and the first-layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle;
    the scene recognition model further comprises a second-layer model, and the second-layer model is used to obtain the scene recognition result according to the first recognition results and the azimuth angles corresponding to the first recognition results.
13. The apparatus according to claim 12, wherein
    the first feature extraction layer is used to extract features of the image of the one azimuth angle, to obtain multiple feature vectors;
    the first-layer model is used to calculate a first weight corresponding to each of the multiple feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle;
    the first-layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
14. The apparatus according to claim 12, wherein
    the second-layer model is used to calculate a second weight of the first recognition result according to the azimuth angle corresponding to the first recognition result;
    the second-layer model is used to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
15. The apparatus according to claim 12, wherein
    the second-layer model presets a third weight corresponding to each first recognition result,
    the second-layer model is used to obtain the scene recognition result according to the first recognition results and the third weights corresponding to the first recognition results.
16. The apparatus according to claim 12, wherein
    the second-layer model is used to determine a fourth weight corresponding to each first recognition result according to the azimuth angle and a preset rule;
    wherein the preset rule is a set of weight groups configured according to azimuth angles, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first recognition result;
    the second-layer model is used to obtain the scene recognition result according to the first recognition results and the fourth weights corresponding to the first recognition results.
17. The apparatus according to claim 10, wherein the apparatus further comprises:
    an azimuth angle acquisition module, configured to obtain the accelerations reported by a gravity sensor on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image, to obtain the direction vector of each camera when collecting the image;
    wherein the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image takes the camera as the origin, the z direction is the direction along which the camera shoots, x and y are directions perpendicular to the z direction, and the plane in which x and y lie is perpendicular to the z direction; and the module is configured to calculate the azimuth angle according to the direction vector and the gravity unit vector.
18. The apparatus according to claim 11, wherein the apparatus further comprises:
    an image preprocessing module, configured to preprocess the images;
    wherein the preprocessing includes one or a combination of more than one of the following: converting the image format, converting the image channels, unifying the image size, and image normalization; converting the image format refers to converting a color image into a black-and-white image, converting the image channels refers to converting an image to the red, green, and blue (RGB) channels, unifying the image size refers to adjusting multiple images to the same length and the same width, and image normalization refers to normalizing the pixel values of an image.
19. A computer program product, comprising computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor in the electronic device executes the method according to any one of claims 1 to 9.
20. A non-volatile computer-readable storage medium on which computer program instructions are stored, characterized in that, when the computer program instructions are executed by a processor, the method according to any one of claims 1 to 9 is implemented.
21. A terminal device, characterized in that it comprises:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to implement the method according to any one of claims 1 to 9 when executing the instructions.
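For illustration only (not part of the claims): the azimuth angle of claims 1, 8, and 17 is the angle between a camera's direction vector when capturing an image and the gravity unit vector. The minimal sketch below assumes the gravity sensor reading is available as a 3-axis vector expressed in the camera coordinate system of claim 8 (z along the shooting direction); the function name, the default shooting direction, and the output in degrees are illustrative choices, not taken from the disclosure.

```python
import numpy as np

def azimuth_angle(gravity_xyz, shoot_dir=(0.0, 0.0, 1.0)):
    """Angle (degrees) between the camera's shooting direction and gravity.

    gravity_xyz: accelerations reported by the gravity sensor on the x/y/z
    axes of the camera coordinate system (z = shooting direction).
    shoot_dir:   the shooting direction expressed in the same frame.
    """
    g = np.asarray(gravity_xyz, dtype=float)
    d = np.asarray(shoot_dir, dtype=float)
    g_unit = g / np.linalg.norm(g)          # gravity unit vector
    d_unit = d / np.linalg.norm(d)          # camera direction vector, normalized
    cos_theta = np.clip(np.dot(d_unit, g_unit), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

# Example: a camera whose gravity reading points mostly along -z (camera facing
# roughly upward) yields an azimuth angle close to 180 degrees.
print(azimuth_angle((0.05, -0.02, -9.78)))
```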
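Also for illustration: the following sketch mirrors the two-layer structure of claims 3 and 5 (and the corresponding apparatus claims 12 and 14), with one (first feature extraction layer, first-layer model) pair per azimuth and a second-layer model that weights each per-view first recognition result by a weight derived from its azimuth angle. The convolutional backbone, the azimuth-to-weight mapping, and all layer sizes are placeholder assumptions; claims 4, 6, and 7 describe alternative weighting schemes that are not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerViewBranch(nn.Module):
    """One (first feature extraction layer, first-layer model) pair: extracts
    features from a single-azimuth image and produces a per-view recognition
    result conditioned on that azimuth angle."""
    def __init__(self, num_classes, feat_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(            # first feature extraction layer
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.head = nn.Linear(feat_dim + 1, num_classes)   # first-layer model

    def forward(self, image, azimuth):
        feat = self.backbone(image)                         # feature vector
        x = torch.cat([feat, azimuth.unsqueeze(-1)], dim=-1)  # condition on azimuth
        return F.softmax(self.head(x), dim=-1)              # first recognition result

class TwoLayerSceneModel(nn.Module):
    """Second-layer model: weights each per-view result by a weight derived
    from its azimuth angle and fuses them into the scene recognition result."""
    def __init__(self, num_views, num_classes):
        super().__init__()
        self.branches = nn.ModuleList(PerViewBranch(num_classes) for _ in range(num_views))
        self.weight_net = nn.Linear(1, 1)        # maps azimuth -> unnormalized weight

    def forward(self, images, azimuths):
        # images: list of (B, 3, H, W) tensors; azimuths: (B, num_views) in radians
        per_view = [b(img, azimuths[:, i]) for i, (b, img) in enumerate(zip(self.branches, images))]
        results = torch.stack(per_view, dim=1)               # (B, V, C)
        w = torch.softmax(self.weight_net(azimuths.unsqueeze(-1)), dim=1)  # second weights
        return (w * results).sum(dim=1)                       # fused scene result

# toy usage with two cameras and five scene classes
model = TwoLayerSceneModel(num_views=2, num_classes=5)
imgs = [torch.randn(1, 3, 64, 64) for _ in range(2)]
az = torch.tensor([[0.3, 1.2]])
print(model(imgs, az).shape)   # torch.Size([1, 5])
```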
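Claims 7 and 16 replace the learned second weights with a preset rule that maps azimuth angles to weight groups. One toy reading of that rule for a two-camera device is sketched below; the azimuth ranges and the weight values are invented for illustration and are not taken from the patent.

```python
# Hypothetical weight groups keyed by azimuth-angle range (degrees).
PRESET_WEIGHT_GROUPS = {
    (0.0, 60.0):    [0.7, 0.3],   # e.g. camera pointing mostly upward: trust view 0 more
    (60.0, 120.0):  [0.5, 0.5],
    (120.0, 180.0): [0.3, 0.7],
}

def fourth_weights(azimuth_deg):
    """Select the weight group (one fourth weight per first recognition result)."""
    for (lo, hi), weights in PRESET_WEIGHT_GROUPS.items():
        if lo <= azimuth_deg < hi or (hi == 180.0 and azimuth_deg == 180.0):
            return weights
    raise ValueError("azimuth angle out of range")

def fuse_first_results(first_results, azimuth_deg):
    """Weight each per-view first recognition result and sum into the scene result."""
    w = fourth_weights(azimuth_deg)
    num_classes = len(first_results[0])
    return [sum(w[v] * first_results[v][c] for v in range(len(first_results)))
            for c in range(num_classes)]

# two per-view class-score vectors fused at an azimuth angle of 45 degrees
print(fuse_first_results([[0.8, 0.2], [0.4, 0.6]], 45.0))   # -> [0.68, 0.32]
```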
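Finally, the preprocessing of claims 9 and 18 combines format conversion, channel conversion, size unification, and pixel normalization. A minimal sketch using Pillow and NumPy follows; the 224x224 target size and normalization to the [0, 1] range are assumptions chosen for illustration.

```python
import numpy as np
from PIL import Image

def preprocess(paths, size=(224, 224)):
    """Convert channels to RGB, unify the size, and normalize pixel values.
    (Conversion to a black-and-white image would use .convert("L") instead.)"""
    batch = []
    for p in paths:
        img = Image.open(p).convert("RGB")               # convert image channels to R/G/B
        img = img.resize(size)                            # unify length and width
        arr = np.asarray(img, dtype=np.float32) / 255.0   # normalize pixel values
        batch.append(arr)
    return np.stack(batch)                                # (N, H, W, 3)
```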
PCT/CN2021/140833 2021-02-25 2021-12-23 Environment identification method and apparatus WO2022179281A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110215000.3A CN115049909A (en) 2021-02-25 2021-02-25 Scene recognition method and device
CN202110215000.3 2021-02-25

Publications (1)

Publication Number Publication Date
WO2022179281A1 true WO2022179281A1 (en) 2022-09-01

Family

ID=83048679

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/140833 WO2022179281A1 (en) 2021-02-25 2021-12-23 Environment identification method and apparatus

Country Status (2)

Country Link
CN (1) CN115049909A (en)
WO (1) WO2022179281A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105339841A (en) * 2013-12-06 2016-02-17 华为终端有限公司 Photographing method for dual-camera device and dual-camera device
US10504008B1 (en) * 2016-07-18 2019-12-10 Occipital, Inc. System and method for relocalization and scene recognition
CN109117693A (en) * 2017-06-22 2019-01-01 深圳华智融科技股份有限公司 A kind of method and terminal of the scanning recognition found a view based on wide-angle
US20200265554A1 (en) * 2019-02-18 2020-08-20 Beijing Xiaomi Mobile Software Co., Ltd. Image capturing method and apparatus, and terminal
CN109903393A (en) * 2019-02-22 2019-06-18 清华大学 New Century Planned Textbook Scene Composition methods and device based on deep learning

Also Published As

Publication number Publication date
CN115049909A (en) 2022-09-13

Similar Documents

Publication Publication Date Title
CN108960209B (en) Identity recognition method, identity recognition device and computer readable storage medium
CN109034102B (en) Face living body detection method, device, equipment and storage medium
US20220076000A1 (en) Image Processing Method And Apparatus
US9811910B1 (en) Cloud-based image improvement
WO2020048308A1 (en) Multimedia resource classification method and apparatus, computer device, and storage medium
CN110647865A (en) Face gesture recognition method, device, equipment and storage medium
WO2019033747A1 (en) Method for determining target intelligently followed by unmanned aerial vehicle, unmanned aerial vehicle and remote controller
CN111368811B (en) Living body detection method, living body detection device, living body detection equipment and storage medium
CN110059652B (en) Face image processing method, device and storage medium
US20220262035A1 (en) Method, apparatus, and system for determining pose
JP2021503659A (en) Biodetection methods, devices and systems, electronic devices and storage media
CN110062171B (en) Shooting method and terminal
CN110650379A (en) Video abstract generation method and device, electronic equipment and storage medium
US20230005277A1 (en) Pose determining method and related device
CN113066048A (en) Segmentation map confidence determination method and device
WO2021218695A1 (en) Monocular camera-based liveness detection method, device, and readable storage medium
CN112818979B (en) Text recognition method, device, equipment and storage medium
CN115880213A (en) Display abnormity detection method, device and system
WO2022179281A1 (en) Environment identification method and apparatus
CN110163192B (en) Character recognition method, device and readable medium
CN113468929A (en) Motion state identification method and device, electronic equipment and storage medium
CN111341307A (en) Voice recognition method and device, electronic equipment and storage medium
WO2022105793A1 (en) Image processing method and device
WO2022161011A1 (en) Method for generating image and electronic device
CN111538009A (en) Radar point marking method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21927705; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21927705; Country of ref document: EP; Kind code of ref document: A1)