WO2022179281A1 - Environment identification method and apparatus - Google Patents

Environment identification method and apparatus

Info

Publication number: WO2022179281A1
Application number: PCT/CN2021/140833
Authority: WO (WIPO, PCT)
Prior art keywords: image, azimuth angle, recognition result, scene, feature
Other languages: French (fr), Chinese (zh)
Inventors: 彭璐, 赵安, 黄维
Original assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/08: Learning methods
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: using classification, e.g. of video objects
    • G06V 10/82: using neural networks
    • G06V 20/00: Scenes; scene-specific elements

Definitions

  • the present application relates to the technical field of image processing, and in particular, to a scene recognition method and device.
  • the scene recognition technology based on mobile terminals such as mobile phones is an important basic perception capability, which can serve a variety of services such as smart travel, direct service, intention decision-making, smart noise reduction of headphones, search, and recommendation.
  • Existing scene recognition technologies mainly include: image-based scene recognition methods, sensor (eg, WiFi, Bluetooth, location sensors, etc.) positioning-based scene recognition methods, or signal fingerprint comparison-based scene recognition methods.
  • because the image-based scene recognition method is affected by factors such as the camera's viewing angle range, the shooting angle, and object occlusion, its robustness in complex scenes is greatly challenged, and it is difficult to realize real-time scene recognition without user perception.
  • the front or rear camera is used to collect images for recognition. Because the viewing angle range is small and there are few effective features, or the shooting angle is arbitrary, or there is a large amount of object feature information in the same image so that key features are drowned out by noise information, misjudgment is likely. For example, when a ceiling feature appears in the image, the scene is recognized as indoors, but it may actually be a subway or an airplane. Therefore, the technical problem to be solved by this application is how to improve the recognition accuracy of the image-based scene recognition method.
  • a scene recognition method and device which combine multiple images and the azimuth angle corresponding to each image to recognize the scene. Since more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problem of the limited viewing angle range and shooting angle of single-camera scene recognition is solved, and the recognition is more accurate.
  • an embodiment of the present application provides a scene recognition method, the method including: a terminal device collects images of the same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle at which each camera collects an image is the azimuth angle corresponding to that image, and the azimuth angle is the angle between the direction vector and the gravity unit vector when each camera collects the image; the terminal device recognizes the same scene according to the images and the azimuth angles corresponding to the images, and obtains a scene recognition result.
  • multiple cameras are used to shoot images of multiple azimuth angles of the same scene, and the scene is recognized by combining the multiple images with the azimuth angle corresponding to each image. Since more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problems of the limited viewing angle range and shooting angle of single-camera scene recognition can be solved, and the recognition is more accurate.
  • the terminal device recognizes the same scene according to the image and the azimuth angle corresponding to the image, and obtains a scene recognition result, including: the terminal device extracts the azimuth angle feature corresponding to the image from the azimuth angle corresponding to the image; a scene recognition model is used to recognize the same scene based on the image and the azimuth angle feature corresponding to the image, and a scene recognition result is obtained, wherein the scene recognition model is a neural network model.
  • the scene recognition model includes multiple pairs of the first feature extraction layer and the first layer model, and each pair of the first feature extraction layer and the first layer model is used to process an image of one azimuth angle and the azimuth angle corresponding to the image of the one azimuth angle to obtain a first recognition result; wherein the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result.
  • the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain a feature vector; the first layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle;
  • the scene recognition model further includes a second layer model, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the azimuth angle corresponding to the first recognition result.
  • the scene recognition method of the embodiment of the present application adopts a two-layer scene recognition model, collects images from multiple angles, combines the azimuth angles of the images, and performs scene recognition by using a competition mechanism; it considers the results of both local features and overall features, improves the accuracy of scene recognition without user perception, and reduces misjudgment.
  • the first feature extraction layer is configured to extract the feature of the image at one azimuth angle to obtain a plurality of feature vectors;
  • the first layer model is used to calculate the first weight corresponding to each feature vector in the plurality of feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle;
  • the first layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
  • the second layer model is used to calculate the second weight of the first recognition result according to the azimuth angle corresponding to the first recognition result; the second layer model is configured to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
  • the second layer model presets a third weight corresponding to each of the first recognition results, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the third weight corresponding to the first recognition result.
  • the second layer model is configured to determine the fourth weight corresponding to each of the first recognition results according to the azimuth angle and a preset rule; wherein the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first recognition result;
  • the second-layer model is configured to obtain the scene recognition result according to the first recognition result and the fourth weight corresponding to the first recognition result.
  • the method further includes: the terminal device acquires the accelerations on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera captures the image, and obtains the direction vector of each camera when it collects the image; wherein the three-dimensional rectangular coordinate system corresponding to each camera when it collects the image takes that camera as the origin, the z direction is the direction along the camera, x and y are each directions perpendicular to the z direction, and the plane in which x and y lie is perpendicular to the z direction; the azimuth angle is calculated according to the direction vector and the gravity unit vector.
  • before using the scene recognition model to recognize the same scene based on the image and the azimuth angle feature corresponding to the image, the method further includes: the terminal device preprocesses the image; wherein the preprocessing includes one or a combination of the following processes: converting the image format, converting the image channels, unifying the image size, and normalizing the image. Converting the image format refers to converting a color image into a black and white image; converting the image channels refers to converting the image to the red, green, and blue RGB channels; unifying the image size refers to adjusting multiple images to the same length and width; and image normalization refers to normalizing the pixel values of the image.
  • an embodiment of the present application provides a scene recognition apparatus, the apparatus including: an image acquisition module, configured to collect images of the same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle at which each camera collects an image is the azimuth angle corresponding to that image, and the azimuth angle is the angle between the direction vector and the gravity unit vector when each camera collects the image;
  • a scene recognition module, configured to recognize the same scene according to the image and the azimuth angle corresponding to the image, to obtain a scene recognition result.
  • the scene recognition device of the embodiment of the present application uses multiple cameras to capture images of multiple azimuth angles of the same scene and recognizes the scene by combining the multiple images with the azimuth angle corresponding to each image. Since more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problems of the limited viewing angle range and shooting angle of single-camera scene recognition can be solved, and the recognition is more accurate.
  • the scene recognition module includes: an azimuth angle feature extraction module, configured to extract the azimuth angle feature corresponding to the image from the azimuth angle corresponding to the image; and a scene recognition model, configured to recognize the same scene based on the image and the azimuth angle feature corresponding to the image to obtain a scene recognition result, wherein the scene recognition model is a neural network model.
  • the scene recognition model includes multiple pairs of the first feature extraction layer and the first layer model, and each pair of the first feature extraction layer and the first layer model is used to process an image of one azimuth angle and the azimuth angle corresponding to the image of the one azimuth angle to obtain a first recognition result; wherein the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result.
  • the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain a feature vector; the first layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle;
  • the scene recognition model further includes a second layer model, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the azimuth angle corresponding to the first recognition result.
  • the scene recognition device of the embodiment of the present application adopts a two-layer scene recognition model, collects multi-angle images, combines the azimuth angles of the images, and performs scene recognition by using a competition mechanism; it considers the results of both local features and overall features, improves the accuracy of scene recognition without user perception, and reduces misjudgment.
  • the first feature extraction layer is configured to extract the feature of the image at one azimuth angle to obtain a plurality of feature vectors
  • the first layer model is used to calculate the first weight corresponding to each feature vector in the plurality of feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle;
  • the first layer model is configured to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
  • the second layer model is used to calculate the second weight of the first recognition result according to the azimuth angle corresponding to the first recognition result; the second layer model is configured to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
  • the second layer model presets a third weight corresponding to each of the first recognition results, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the third weight corresponding to the first recognition result.
  • the second layer model is configured to determine the fourth weight corresponding to each of the first recognition results according to the azimuth angle and a preset rule; wherein the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first recognition result;
  • the second-layer model is configured to obtain the scene recognition result according to the first recognition result and the fourth weight corresponding to the first recognition result.
  • the device further includes: an azimuth angle acquisition module, configured to acquire the accelerations on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera collects the image, to obtain the direction vector of each camera when it collects the image; wherein the three-dimensional rectangular coordinate system corresponding to each camera when it collects the image takes that camera as the origin, the z direction is the direction along the camera, x and y are each directions perpendicular to the z direction, and the plane in which x and y lie is perpendicular to the z direction; the azimuth angle is calculated according to the direction vector and the gravity unit vector.
  • the apparatus further includes: an image preprocessing module, configured to preprocess the image; wherein the preprocessing includes one or a combination of the following processes: converting the image format, converting the image channels, unifying the image size, and normalizing the image. Converting the image format refers to converting a color image into a black and white image; converting the image channels refers to converting the image to the red, green, and blue RGB channels; unifying the image size refers to adjusting the length and width of multiple images to be the same; and image normalization refers to normalizing the pixel values of the images.
  • embodiments of the present application provide a terminal device, where the terminal device can execute the above first aspect or one or more of the scene recognition methods in multiple possible implementation manners of the first aspect.
  • embodiments of the present application provide a computer program product, comprising computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code runs in an electronic device, the processor in the electronic device executes the scene recognition method of the first aspect or of one or more of the multiple possible implementation manners of the first aspect.
  • embodiments of the present application provide a non-volatile computer-readable storage medium on which computer program instructions are stored, wherein when the computer program instructions are executed by a processor, the scene recognition method of the above first aspect or of one or more of the multiple possible implementation manners of the first aspect is implemented.
  • FIG. 1a and 1b respectively show schematic diagrams of application scenarios according to an embodiment of the present application.
  • Fig. 2a shows a schematic diagram of scene recognition performed by a neural network model according to an embodiment of the present application.
  • FIG. 2b shows a flowchart of a scene recognition method according to an embodiment of the present application.
  • FIG. 3 shows a schematic diagram of an application scenario according to an embodiment of the present application.
  • FIG. 4a and FIG. 4b respectively show schematic diagrams of an azimuth angle determination manner according to an embodiment of the present application.
  • FIG. 5 is a block diagram showing the structure of a neural network model according to an embodiment of the present application.
  • FIG. 6 shows a schematic structural diagram of a first layer model according to an embodiment of the present application.
  • FIG. 7 shows a flowchart of a scene recognition method according to an embodiment of the present application.
  • FIG. 8 shows a flowchart of the method of step S701 according to an embodiment of the present application.
  • FIG. 9 shows a block diagram of a scene recognition apparatus according to an embodiment of the present application.
  • FIG. 10 shows a schematic structural diagram of a terminal device according to an embodiment of the present application.
  • Azimuth angle: the angle between the direction vector of the camera and the gravity unit vector.
  • Direction vector of the camera: establish a three-dimensional rectangular coordinate system with the camera as the origin, where the z direction is the direction along the camera, x and y are each directions perpendicular to the z direction, and the plane in which x and y lie is perpendicular to the z direction; the vector composed of the accelerations measured by the gravity sensor in the x, y and z directions is the direction vector of the camera.
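  • For illustration only, a minimal sketch of this azimuth computation, assuming the direction vector is given by the gravity-sensor accelerations in the camera coordinate system and that the gravity vector is available in the same coordinate system; the function name and example values are hypothetical:

```python
import numpy as np

def camera_azimuth(direction_vector, gravity_vector):
    """Angle (degrees) between the camera direction vector and the gravity unit vector."""
    v = np.asarray(direction_vector, dtype=float)
    g = np.asarray(gravity_vector, dtype=float)
    g_unit = g / np.linalg.norm(g)              # gravity unit vector
    cos_theta = np.dot(v, g_unit) / np.linalg.norm(v)
    cos_theta = np.clip(cos_theta, -1.0, 1.0)   # guard against rounding error
    return float(np.degrees(np.arccos(cos_theta)))

# Hypothetical accelerations measured along the camera's x, y, z axes:
theta = camera_azimuth(direction_vector=[0.1, 0.2, 9.7], gravity_vector=[0.0, 0.0, 9.81])
```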
  • in the above scenarios, existing image-based scene recognition methods have the following problems: 1. The image is collected by only a single camera for recognition, so the viewing angle range is small, few effective features are observed, and the scene recognition recall rate is low. 2. Similar objects or features are prone to misjudgment; for example, when a ceiling feature appears in the image, the scene is recognized as indoors, but it may actually be a subway or an airplane. 3. There is a large amount of object feature information in the same image, and the noise information may drown out the main features, resulting in misjudgment of the main features.
  • the present application provides a scene recognition method, which uses multiple cameras to shoot images of multiple azimuth angles of the same scene and obtains the azimuth angle at which each of the multiple cameras shoots (collects) its image; the azimuth angle at which each camera shoots (collects) an image is the azimuth angle corresponding to that image, and the scene is recognized by combining the multiple images with the azimuth angle corresponding to each image. Since more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problem of the limited viewing angle range and shooting angle of single-camera scene recognition is solved, and the recognition is more accurate.
  • the multiple cameras of the embodiments of the present application may be set on the terminal device.
  • for example, the multiple cameras may be the front camera and the rear camera set on a mobile phone, multiple cameras set at different orientations of a vehicle body, or multiple cameras set at different orientations of a drone, and so on. It should be noted that the above application scenarios are only some examples of the present application, and the present application is not limited thereto.
  • the mobile phone can be provided with a front camera and a rear camera. Through the front camera and the rear camera of the mobile phone, images from two angles can be collected, and the azimuth angles of the front camera and the rear camera can also be obtained, so that the scene can be recognized more accurately according to the images from the two angles combined with the corresponding azimuth angles.
  • multiple cameras can be set on the body of an autonomous vehicle, and the multiple cameras can be set at different positions of the body, such as the top of the vehicle; the direction of each camera can also be adjusted individually, and a controller can also be set on the self-driving car and connected to the multiple cameras.
  • sensors can also be set on the self-driving car, such as GPS, radar, accelerometer, gyroscope, etc., and all the sensors and cameras are connected to the controller. The controller can collect images of different angles through the multiple cameras and can also obtain the azimuth angle of each camera; according to the multiple images and the azimuth angle corresponding to each image, the scene can be recognized more accurately.
  • FIG. 1 a and FIG. 1 b are only examples of application scenarios provided by the present application, and the present application is not limited thereto.
  • the present application can also be applied to a scene in which a drone collects images for scene recognition.
  • the scene recognition method provided by the embodiments of the present application can be applied to terminal devices.
  • the terminal devices of the present application can be smart phones, netbooks, tablet computers, notebook computers, wearable electronic devices (such as smart bracelets, smart watches, etc. ), TVs, virtual reality devices, speakers, e-ink, and more.
  • FIG. 10 shows a schematic structural diagram of a terminal device according to an embodiment of the present application. Taking the terminal device as a mobile phone as an example, FIG. 10 shows a schematic structural diagram of the mobile phone 200 , for details, please refer to the specific description below.
  • scene recognition is performed based on the front and rear cameras and acceleration sensors of the mobile phone.
  • the front and rear cameras simultaneously collect images of different azimuth angles
  • the acceleration sensor of the mobile phone is used to extract the angle between the current direction of the mobile phone camera and the direction of gravity, and this angle can be used as the azimuth angle of the camera.
  • the scene recognition method provided by this embodiment of the present application may be implemented by using a neural network model (scene recognition model).
  • the neural network model used in this embodiment of the present application may include: multiple pairs of the first feature extraction layer and the first layer model, each pair of the first feature extraction layer and the first layer model is used to The image and the azimuth angle corresponding to the image of one azimuth angle are processed to obtain the first recognition result, and the azimuth angle corresponding to the image of one azimuth angle is the azimuth angle corresponding to the first recognition result.
  • the first feature extraction layer is used to extract the features of the image to obtain a feature vector (feature map); the first layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle.
  • the neural network model may further include a second layer model, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the azimuth angle corresponding to the first recognition result.
  • the example shown in Figure 2a includes two pairs of the first feature extraction layer and the first layer model.
  • the application scenario shown in Figure 2a can be applied to a dual-camera (front and rear dual-camera) scenario.
  • Image 1 and image 2 can be images acquired by cameras at different angles; for example, image 1 is acquired by the front camera of the mobile phone, and image 2 is acquired by the rear camera of the mobile phone.
  • the number of pairs of the first feature extraction layer and the first layer model included in the neural network model can be configured according to the number of camera angles set in a specific application scenario; the number of pairs of the first feature extraction layer and the first layer model can be greater than or equal to the number of angles.
  • the first feature extraction layer can be implemented using a convolutional neural network (CNN, Convolutional Neural Network); the features of the input image are extracted through the CNN to obtain a feature map. For example, convolutional neural network models such as the VGG model (Visual Geometry Group Network), the Inception model, MobileNet, ResNet, DenseNet, or Transformer can be used as the first feature extraction layer to extract feature maps, and a customized convolutional neural network structure can also be used as the first feature extraction layer.
  • Both the first-layer model and the second-layer model can be implemented based on the attention mechanism, and the second-layer model can also be implemented by presetting weights for each azimuth angle or by presetting a weight mapping function according to the azimuth angle, which is not limited in this application.
  • the scene recognition method provided by the embodiment of the present application uses a two-layer scene recognition model based on a competition mechanism (attention mechanism): the first layer model can assign different weights to the convolution results (the feature vectors obtained by the CNN performing convolution operations on an input image) according to the azimuth angle, activate the neurons that identify the local features of different scenes, extract the key features of the image at that azimuth angle, perform a first scene classification, and obtain a first recognition result; the second layer model weights the scene classification results according to the azimuth angle in different scenes, and obtains the classification result of the multiple images from different perspectives by weighted summation, yielding the final scene recognition result.
  • Using a two-layer competition mechanism-based scene recognition model combined with azimuth information can identify key features and filter irrelevant information, effectively reducing the probability of misrecognition.
  • for example, aircraft and high-speed rail cannot be distinguished from an upward perspective (the ceiling features are similar and difficult to distinguish), but can be distinguished from a side view (the round window of an airplane and the square window of a high-speed train are easy to distinguish); the scene recognition method of the embodiment of the present application therefore helps to reduce misjudgments.
  • the azimuth angle feature extraction (dotted line box) shown in FIG. 2a can be realized by a neural network; that is to say, the neural network model provided in this embodiment of the present application may further include a second feature extraction layer, and the second feature extraction layer can also be implemented by a convolutional neural network model.
  • the azimuth angle feature extraction (dotted line frame) shown in FIG. 2a can also be obtained by calculating the azimuth angle according to an existing function, which is not limited in this application.
  • the sensor shown in Figure 2a can be an accelerometer, a gyroscope, etc.
  • the posture of the terminal device can be obtained from the motion data of the terminal device collected by the sensor; the direction of the camera can be determined according to the posture of the terminal device and the orientation of the camera on the device, and the azimuth angle of the camera can then be determined from the camera direction and the direction of gravity.
  • Fig. 2b shows a flowchart of a scene recognition method according to an embodiment of the present application. The following describes the flowchart of the image processing method of the present application in detail with reference to Figs. 2a and 2b.
  • Multiple cameras with different orientations set on the terminal device collect images in the same scene, and the terminal device obtains images from different viewing angles (azimuth angles) of the same scene.
  • the front camera and the rear camera of the mobile phone are used to capture images in the same scene at the same time, and images from different perspectives of the same scene are obtained.
  • the image captured by the front camera may be an image captured by a single camera, or may be an image obtained by combining images captured by multiple cameras; likewise, the image captured by the rear camera may be an image captured by a single camera, or a composite of images captured by multiple cameras. In a scenario where images captured by multiple cameras are combined into one image, the azimuth angle of this image is the same as the azimuth angle of a single camera.
  • the images captured by the cameras in the embodiments of the present application may be black and white images, RGB (Red, Green, Blue) color images, or RGB-D (RGB-Depth) depth images (D refers to depth information), or It can be an infrared image, which is not limited in this application.
  • FIG. 3 shows a schematic diagram of an application scenario according to an embodiment of the present application.
  • the user looks at the mobile phone on the subway, and the mobile phone is inclined at a certain angle to the subway floor; therefore, the front camera of the mobile phone can capture a picture of the top of the subway car, and the rear camera can capture a picture of the subway floor.
  • the terminal device can collect the azimuth angles of cameras from multiple viewing angles while collecting images.
  • the azimuth angle of the camera may refer to the angle between the direction vector of the camera and the unit vector of gravity.
  • the direction vector of the camera may be acquired by a sensor, such as a gravity sensor, an accelerometer, a gyroscope, etc., which is not limited in this application.
  • the direction vector of the front camera can be obtained by a gravity sensor.
  • FIG. 4a and FIG. 4b respectively show schematic diagrams of an azimuth angle determination manner according to an embodiment of the present application.
  • a three-dimensional rectangular coordinate system can be established with the front camera as the origin, where the z direction is the direction along the front camera and is perpendicular to the plane of the mobile phone, and x and y are respectively parallel to the border of the mobile phone , and the direction perpendicular to the z direction.
  • suppose the azimuth angle of the front camera is θ; then the azimuth angle of the front camera can be calculated by formula (2):
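  • The body of formula (2) is not reproduced in this text; one form consistent with the definition of the azimuth angle as the angle between the camera direction vector and the gravity unit vector (this reconstruction is an assumption, not the original notation) is: θ = arccos( (v · ĝ) / ‖v‖ ), where v = (ax, ay, az) is the direction vector obtained from the accelerations measured by the gravity sensor and ĝ is the gravity unit vector.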
  • the terminal device may separately preprocess the images collected by each camera, wherein the preprocessing may include one or more processing methods among image format conversion, image size unification, image normalization, and image channel conversion.
  • the image format conversion may refer to converting a color image into a black and white image.
  • Image channel conversion may refer to converting a color image to three RGB channels, and the three channels are Red, Green, and Blue in turn.
  • Unifying the image size may refer to unifying the length and width of the images collected by each camera, for example, the length of the unified image is 800 pixels and the width is 600 pixels.
  • p_R, p_G, and p_B represent the pixel values before normalization, respectively
  • p_R_Normalized, p_G_Normalized, and p_B_Normalized represent the pixel values after normalization, respectively.
  • the collected images may be processed through one or more of the above preprocessing methods.
  • the preprocessing process may include:
  • Step 1: Image format conversion: convert the collected color image into a black and white image.
  • the relevant grayscale formula method or the average method can be used to calculate the value of the pixel after conversion according to the value of the pixel before conversion;
  • Step 2: Unify the image size, for example, unify the size of the images to a length of 800 and a width of 600;
  • the preprocessing process may include:
  • Step1 Image channel conversion. For color images, it can be uniformly converted into RGB three channels, that is, the channels are Red, Green, and Blue in turn; if the images before preprocessing are all in RGB format, the process of image channel conversion can be omitted.
  • Step 2: Unify the image size, for example, unify the size of the images to a length of 800 and a width of 600;
  • Step3 Image normalization.
  • the normalization processing method is:
  • the preprocessing manner of the terminal device for the images collected by the cameras at each angle may be the same or different, which is not limited in the present application.
  • the format of the images can be unified, which is beneficial to the subsequent process of feature extraction and scene recognition.
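  • For illustration, a minimal sketch of such a preprocessing pipeline, assuming OpenCV-style images (NumPy arrays) and a target size of 800 by 600 as in the example above; dividing by 255 for normalization is an assumed convention, since the normalization formula is not reproduced in this text:

```python
import cv2
import numpy as np

def preprocess(img, to_gray=False, size=(800, 600)):
    """Preprocess one captured image: format/channel conversion, size unification, normalization."""
    if to_gray:
        # Image format conversion: color -> black and white (grayscale formula method)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    else:
        # Image channel conversion: unify the channel order to R, G, B
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # Unify the image size, e.g. length 800, width 600
    img = cv2.resize(img, size)
    # Image normalization: scale pixel values (assumed convention: divide by 255)
    return img.astype(np.float32) / 255.0
```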
  • Part 4: Perform feature extraction on the azimuth angle collected in Part 2 to obtain the azimuth angle feature.
  • one or more of the following processing methods can be performed on the azimuth angle: numerical normalization, discretization, one-hot encoding (one-bit effective encoding), trigonometric function transformation, and so on. This application provides a variety of different feature extraction methods; a few examples follow.
  • Example 1 the terminal device can discretize the collected azimuth, and then perform one-hot encoding (one-bit effective encoding). Discretization is to map individuals into a limited space without changing the relative size of the data.
  • the azimuth angle can be discretized into four intervals, [0°, 45°), [45°, 90°), [90°, 135°), and [135°, 180°]; the code of the interval to which the azimuth angle is mapped is 1 and the codes of the other intervals are 0, so the feature vectors corresponding to the four intervals are [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1], respectively. For example, if the azimuth angle θ is 30°, it is mapped to the interval [0°, 45°), and the corresponding azimuth angle feature is [1, 0, 0, 0].
  • Example 2: the terminal device may directly perform a trigonometric function transformation on the azimuth angle, and the value obtained after the transformation is normalized to the [0, 1] interval and used as the azimuth angle feature; the trigonometric function transformation can be, for example, sin θ, cos θ, tan θ, etc.
  • Example 3: the [0, 1] range of the normalized trigonometric function value is discretized into four intervals, [0, 0.25), [0.25, 0.5), [0.5, 0.75), and [0.75, 1.0]; the terminal device can first perform the trigonometric function transformation on the azimuth angle and normalize it, and then determine the azimuth angle feature according to the interval to which the normalized trigonometric function value is mapped.
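  • Purely for illustration, a minimal sketch of the three example feature extractions described above (one-hot interval encoding, normalized trigonometric transform, and discretized trigonometric transform); the interval boundaries follow the examples in the text, while the function names and the particular normalization are assumptions:

```python
import math

def onehot_azimuth(theta_deg):
    """Example 1: map the azimuth angle to one of four intervals and one-hot encode it."""
    bounds = [45, 90, 135, 180]  # [0,45), [45,90), [90,135), [135,180]
    idx = next(i for i, b in enumerate(bounds) if theta_deg < b or b == 180)
    return [1 if i == idx else 0 for i in range(4)]

def trig_feature(theta_deg):
    """Example 2: trigonometric transform normalized to [0, 1] (here (sin + 1) / 2, an assumed choice)."""
    return (math.sin(math.radians(theta_deg)) + 1.0) / 2.0

def discretized_trig_feature(theta_deg):
    """Example 3: one-hot encode the normalized trig value over [0,.25), [.25,.5), [.5,.75), [.75,1]."""
    v = trig_feature(theta_deg)
    idx = min(int(v / 0.25), 3)
    return [1 if i == idx else 0 for i in range(4)]

# e.g. theta = 30 degrees -> onehot_azimuth(30) == [1, 0, 0, 0], matching the text's example
```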
  • the scene recognition is performed through a neural network model, and the process of scene recognition is described in conjunction with the framework of the neural network model shown in FIG. 2a.
  • the terminal device uses the images preprocessed in Part 3 as the input data of the multiple first feature extraction layers, and uses the azimuth angle features obtained in Part 4 as the input data of the multiple first layer models; the image and the azimuth angle received by each pair of first feature extraction layer and first layer model are associated, that is, the image captured by a camera and the azimuth angle feature corresponding to that same camera are used as the input data of one pair of the first feature extraction layer and the first layer model.
  • the front camera of the mobile phone captures an image and preprocesses the image to obtain image 1.
  • the azimuth angle of the front camera is ⁇ 1, and the azimuth angle feature of ⁇ 1 is C1.
  • the neural network model includes feature extraction layer 1 and the first layer model 1.
  • the output of the feature extraction layer 1 is the input of the first layer model 1.
  • the neural network model can also include the feature extraction layer 2 and the first layer model 2.
  • the feature extraction layer The output of 2 is the input of the first layer model 2.
  • the terminal device can use image 1 as the input of the feature extraction layer 1 and C1 as the input of the first layer model 1; alternatively, the terminal device can use image 1 as the input of the feature extraction layer 2 and C1 as the input of the first layer model 2.
  • the serial numbers of image 1, image 2, feature extraction layer 1, feature extraction layer 2, first-layer model 1, and first-layer model 2 in this application are not intended to limit the order or correspondence; they are only numbers set to distinguish different modules and are not to be construed as limitations on this application.
  • the first feature extraction layer extracts the features of the image to obtain a feature map (feature vector), and inputs the feature map (feature vector) into the first layer model.
  • the first recognition result can be obtained by recognizing (classifying) the scene of the image.
  • the terminal device can also use the azimuth angle feature as the input data of the second-layer model, and the second-layer model can further identify (classify) the scene according to the first recognition result output by the first-layer model and the corresponding azimuth angle feature, and obtain Scene recognition results.
  • FIG. 5 is a block diagram showing the structure of a neural network model according to an embodiment of the present application.
  • FIG. 6 shows a schematic structural diagram of a first layer model according to an embodiment of the present application.
  • the terminal device of the present application performs image acquisition at J angles (azimuth angles), that is, cameras at J angles are used to collect images from the J angles.
  • Image 1 is used as the input data of the CNN on the upper side, and the CNN on the upper side performs feature extraction on the input image 1 to obtain a feature vector y i , where i is a positive integer from 1 to n.
  • image 2 is used as the input data of the CNN on the lower side, and the CNN on the lower side performs feature extraction on the input image 2 to obtain a feature vector xi .
  • the terminal device takes the feature vector yi and the azimuth angle feature C1 as the input data of the first layer model on the upper side.
  • the first layer model can calculate the first recognition result Z j according to the feature vector y i and the azimuth angle feature C1, where j is a positive integer from 1 to J, and J represents the number of angles (azimuth angles) at which images are collected; J is equal to 2 in this example, and j is equal to 1 in the example of FIG. 6.
  • the first layer model can be implemented based on the attention mechanism, and specifically, it can include the activation function (tanh), the softmax function, and the weighted average process shown in FIG. 6 . Among them, tanh is one of the activation functions, and the present application is not limited to this, and other activation functions can also be used, such as: sigmoid, ReLU, LReLU, ELU function, and so on.
  • in formula (3), C j represents the azimuth angle feature corresponding to the image at this angle, y i is a feature vector output by the CNN, W i and b i are the weight and bias value of the activation function, respectively, [C j , y i ] represents the vector obtained by splicing the feature vector and the azimuth angle feature, and the remaining parameter represents the parameters of the softmax function. According to the m i calculated by the tanh function, it can be determined whether to activate the corresponding neuron, and the features corresponding to the activated tanh units are extracted as the basis for classification; the softmax function normalizes the m i to obtain the weight of each feature vector. Therefore, the azimuth angle feature can affect the calculated weight s i of the feature vector y i .
  • the scene recognition model provided by the present application can identify key features according to the azimuth angle, filter irrelevant information, reduce noise, improve recognition accuracy, and realize scene recognition without user perception.
  • the weight si calculated according to formula (3) can represent the weight value assigned to different features extracted from the image, and the feature vector and the corresponding weight are weighted and summed to obtain the first recognition result Z 1 of the first layer model.
  • the terminal device uses the feature vector x i (i is a positive integer from 1 to n', where n' and n may or may not be equal, which is not limited in this application) and the azimuth angle feature C 2 as the input data of the first layer model on the lower side, and the first recognition result Z 2 can be calculated in the same manner as for the first layer model on the upper side.
  • the neural network model of the embodiment of the present application extracts feature maps through CNN, and assigns different weights to different convolution results of images in combination with azimuth angles.
  • the first-layer model identifies key features and filters irrelevant information through a competitive mechanism, which can effectively reduce the probability of misrecognition.
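  • Since the body of formula (3) is not reproduced in this text, the following is only a minimal sketch of a first layer model consistent with the description above (tanh scoring of each spliced pair [C j , y i ], softmax normalization to weights s i , and a weighted sum of the feature vectors); the parameter shapes, function names, and use of NumPy are assumptions, not the original implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def first_layer_model(feature_vectors, azimuth_feature, W, b, u):
    """First layer: weight the CNN feature vectors y_i by azimuth-dependent attention.

    feature_vectors: array of shape (n, d)   -- y_1..y_n from the CNN
    azimuth_feature: array of shape (c,)     -- C_j for this camera angle
    W: (h, c + d), b: (h,), u: (h,)          -- assumed parameter shapes
    """
    feature_vectors = np.asarray(feature_vectors, dtype=float)
    scores = []
    for y_i in feature_vectors:
        m_i = np.tanh(W @ np.concatenate([azimuth_feature, y_i]) + b)  # activation on [C_j, y_i]
        scores.append(u @ m_i)                                          # softmax parameters u
    s = softmax(np.array(scores))                       # weights s_i of each feature vector
    z_j = (s[:, None] * feature_vectors).sum(axis=0)    # weighted sum -> first recognition result Z_j
    return z_j, s
```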
  • the structure of the second layer model is shown in Figure 5; f 1 and f 2 represent the activation function tanh, where the number of tanh functions in the second layer model is related to the number of angles at which images are collected: the number of tanh functions can be equal to or greater than the number of angles at which images are acquired.
  • the input of the tanh function includes the output result Z j of the first layer model and the azimuth angle feature C j
  • the second layer model also includes the softmax function
  • the softmax function calculates the weight S j of the output result Z j of the first layer model according to the calculation result of the tanh function; finally, the second layer model calculates the final scene recognition result Z according to the calculated weight S j and the output result Z j of the first layer model.
  • the specific calculation method is shown in the following formula (4):
  • in formula (4), [C j , Z j ] represents the vector obtained by splicing the first recognition result Z j of an image and the corresponding azimuth angle feature C j , W j and b j represent the weight and bias value of the tanh function, respectively, and the remaining parameter represents the parameters of the softmax function. According to the M j calculated by the tanh function, it can be determined whether to activate the corresponding neuron, and the feature corresponding to the activated tanh unit (the first recognition result) is extracted as the basis for classification; the softmax function normalizes M j to obtain the weight of each first recognition result. Therefore, the azimuth angle feature can affect the calculated weight S j of the first recognition result Z j .
  • the scene recognition model provided by the present application can extract the features of key angles according to the azimuth angle, filter irrelevant angles, improve the recognition accuracy, and realize scene recognition without user perception.
  • the weight S j calculated according to formula (4) can represent the weight value assigned to different first recognition results, and the first recognition result and the corresponding weight are weighted and summed to obtain the scene recognition result Z of the second layer model.
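  • Similarly, the body of formula (4) is not reproduced in this text; the following is only a minimal sketch of a second layer model following the same pattern at the level of first recognition results (tanh over the spliced [C j , Z j ], softmax weights S j , weighted sum to the scene recognition result Z); shapes and names are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def second_layer_model(first_results, azimuth_features, W, b, u):
    """Second layer: weight the J first recognition results Z_j according to their azimuth angles.

    first_results:    array of shape (J, d)  -- Z_1..Z_J from the first layer models
    azimuth_features: array of shape (J, c)  -- C_1..C_J
    W: (h, c + d), b: (h,), u: (h,)          -- assumed parameter shapes
    """
    first_results = np.asarray(first_results, dtype=float)
    scores = []
    for z_j, c_j in zip(first_results, azimuth_features):
        M_j = np.tanh(W @ np.concatenate([c_j, z_j]) + b)   # activation on [C_j, Z_j]
        scores.append(u @ M_j)
    S = softmax(np.array(scores))                  # weights S_j of each first recognition result
    Z = (S[:, None] * first_results).sum(axis=0)   # weighted sum -> final scene recognition result Z
    return Z, S
```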
  • W i , b i , W j , b j and the softmax parameters are model parameters; the parameter values can be obtained by training the neural network model shown in FIG. 5 using sample data. The neural network model of the present application can be trained by using the training methods in the related art, which will not be repeated here.
  • the second-layer model may also be implemented by presetting weights for each azimuth angle, by voting, or by presetting a weight mapping function according to the azimuth angle. That is to say, in another embodiment of the present application, the neural network model includes multiple pairs of feature extraction layers and first-layer models as well as a second-layer model, and the second-layer model is realized by presetting weights for each azimuth angle, by voting, or by a preset weight mapping function according to the azimuth angle.
  • the second-layer model can preset a weight for each azimuth angle; for the azimuth angle corresponding to each first recognition result Z j , the preset corresponding weight is S j , and the second-layer model performs a weighted calculation on each first recognition result Z j and its corresponding weight S j to obtain the scene recognition result.
  • the second-layer model may also preset a weight mapping function according to the azimuth angle, that is, different azimuth angles correspond to different preset weight groups, and each preset weight group may include a weight corresponding to each first recognition result Z j .
  • the terminal device collects images from two angles through the front camera and the rear camera, and recognizes the images from the two angles to obtain two first recognition results Z 1 and Z 2 , whose corresponding weights are S 1 and S 2 , respectively.
  • the preset weight mapping function according to the azimuth angle can be as follows:
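  • The concrete mapping is not reproduced in this text; purely as an illustration of such a preset rule, a hypothetical weight-group mapping for the two-camera case could look like the following sketch, where the interval boundaries and weight values are assumptions, not values from the original:

```python
def preset_weight_group(azimuth_deg):
    """Hypothetical mapping from the front-camera azimuth angle to a weight group (S_1, S_2)."""
    if azimuth_deg < 45:
        return 0.7, 0.3      # illustrative values only
    elif azimuth_deg < 135:
        return 0.5, 0.5
    else:
        return 0.3, 0.7

def fuse(Z1, Z2, azimuth_deg):
    """Weighted sum of the two first recognition results using the selected weight group."""
    S1, S2 = preset_weight_group(azimuth_deg)
    return [S1 * a + S2 * b for a, b in zip(Z1, Z2)]
```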
  • the terminal device may output the final recognition result according to the scene recognition result and a preset strategy, wherein the preset strategy may include filtering the scene recognition result according to a confidence threshold and then outputting the final recognition result, or combining multiple categories into one category and outputting the final recognition result, and so on.
  • Example 2 Assuming that the scene recognition result includes 100 categories, the terminal device can combine several categories into a larger category output, for example, combine cars and buses into the car category as the final recognition result output.
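  • A minimal sketch of these two output strategies (confidence-threshold filtering and merging several categories into a larger category); the threshold value and the category grouping are assumptions used only for illustration:

```python
CATEGORY_MERGE = {"car": "car", "bus": "car"}   # hypothetical grouping: cars and buses -> "car"

def final_result(scene_scores, threshold=0.6):
    """scene_scores: dict mapping category name -> confidence from the scene recognition result."""
    category, confidence = max(scene_scores.items(), key=lambda kv: kv[1])
    if confidence < threshold:                       # filter out low-confidence results
        return None
    return CATEGORY_MERGE.get(category, category)    # merge into a larger category if configured
```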
  • FIG. 7 shows a flowchart of a scene recognition method according to an embodiment of the present application.
  • the scene recognition method may include the following steps:
  • Step S700: the terminal device collects images of the same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle when each camera collects the image is the azimuth angle corresponding to the image, and the azimuth angle is the angle between the direction vector and the gravity unit vector when each camera collects the image.
  • Step S701 the terminal device recognizes the same scene according to the image and the azimuth angle corresponding to the image, and obtains a scene recognition result.
  • multiple cameras are used to shoot images of multiple azimuth angles of the same scene, and the scene is recognized by combining the multiple images with the azimuth angle corresponding to each image. Since more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problems of the limited viewing angle range and shooting angle of single-camera scene recognition can be solved, and the recognition is more accurate.
  • the images captured by the cameras in the embodiments of the present application may be black and white images, RGB (Red, Green, Blue) color images, or RGB-D (RGB-Depth) depth images (D refers to depth information), or It can be an infrared image, which is not limited in this application.
  • an image of an azimuth angle may be an image captured by one camera, or may be an image obtained by combining images captured by multiple cameras.
  • a mobile phone may include multiple rear cameras, and the image captured by the rear camera may be an image obtained by combining images captured by the multiple cameras.
  • the method may further include:
  • the terminal device preprocesses the image; wherein the preprocessing includes one or a combination of the following processes: converting the image format, converting the image channels, unifying the image size, and normalizing the image. Converting the image format refers to converting a color image into a black and white image; converting the image channels refers to converting the image to the red, green, and blue RGB channels; unifying the image size refers to adjusting multiple images to the same length and width; and image normalization refers to normalizing the pixel values of the image.
  • Each azimuth angle can adopt the same preprocessing method, or can adopt different preprocessing methods according to the images collected at each azimuth angle, which is not limited in this application.
  • the terminal device obtains the acceleration on the coordinate axis of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera collects the image, and can obtain the direction vector when each camera collects the image;
  • the corresponding three-dimensional rectangular coordinate system takes each camera as the origin, the z direction is the direction along the camera, and x and y are the directions perpendicular to the z direction respectively;
  • the azimuth angle is calculated according to the direction vector and the gravity unit vector.
  • the specific process please refer to the content of Part 2 above, which will not be repeated.
  • the terminal device may preset a weight corresponding to each azimuth angle, and weight the image recognition result of each azimuth angle according to the weight of each azimuth angle to obtain the final scene recognition result.
  • the image and the azimuth corresponding to the image can also be input into the trained neural network model to identify the same scene, and the scene identification result can be obtained.
  • the present application does not limit the specific manner of scene recognition.
  • FIG. 8 shows a flowchart of the method of step S701 according to an embodiment of the present application.
  • the terminal device recognizes the same scene according to the image and the azimuth angle corresponding to the image, and obtains a scene recognition result, which may include:
  • Step S7010 the terminal device extracts the azimuth angle feature corresponding to the image from the azimuth angle corresponding to the image;
  • Step S7011 using a scene recognition model to recognize the same scene based on the image and the azimuth feature corresponding to the image, to obtain a scene recognition result, wherein the scene recognition model is a neural network model.
  • step S7010 for the specific process, reference may be made to the introduction in Part 4 above, and details are not repeated here.
  • the image and the azimuth angle feature corresponding to the image can be input into the scene recognition model, and the scene recognition model can be used to identify the image based on the image and the azimuth angle feature corresponding to the image. The same scene, get the scene recognition result.
  • the scene recognition model can be implemented through a variety of different neural network structures.
  • the scene recognition model includes multiple pairs of first feature extraction layers and first layer models; each pair of the first feature extraction layer and the first layer model is used to process an image of one azimuth angle and the azimuth angle corresponding to the image of the one azimuth angle to obtain a first recognition result.
  • An example of the first feature extraction layer may be image feature extraction layer 1 and feature extraction layer 2 as shown in FIG. 2a.
  • the number of pairs of the first feature extraction layer and the first layer model included in the neural network model can be configured according to the number of camera angles set in specific application scenarios; the number of pairs of the first feature extraction layer and the first layer model can be greater than or equal to the number of angles.
  • the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result
  • the first feature extraction layer is used to extract the feature of the image of the one azimuth angle to obtain the feature vector
  • the The first layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle.
  • the azimuth angle feature 1 corresponding to image 1, extracted from the azimuth angle corresponding to image 1, can be used as the input of the first layer model 1, and the azimuth angle feature 2 corresponding to image 2, extracted from the azimuth angle corresponding to image 2, can be used as the input of the first layer model 2;
  • the feature extraction layer 1 can extract the feature vector 1 of the image 1 and output it to the first layer model 1, and the feature extraction layer 2 can extract the feature vector of the image 2 2 and output to the first layer model 2.
  • the first layer model 1 can combine the feature vector 1 and the azimuth feature 1 to perform scene recognition to obtain the first recognition result 1
  • the first layer model 2 can combine the feature vector 2 and the azimuth feature 2 to perform scene recognition to obtain the first recognition result 2.
  • the scene recognition model further includes a second layer model, and the first layer model 1 and the first layer model 2 can respectively output the first recognition result 1 and the first recognition result 2 to the second layer model.
  • the azimuth feature 1 corresponding to image 1, extracted from the azimuth angle corresponding to image 1, can be used as an input of the second layer model, and the azimuth feature 1 corresponds to the first recognition result 1;
  • the azimuth feature 2 corresponding to image 2, extracted from the azimuth angle corresponding to image 2, can also be used as an input of the second layer model, and the azimuth feature 2 corresponds to the first recognition result 2.
  • the second layer model is used to obtain the scene recognition result according to the first recognition result and the azimuth angle corresponding to the first recognition result.
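  • for illustration, a hedged PyTorch-style sketch of this two-layer structure follows; the feature extractor, layer sizes, and the tanh-plus-softmax scoring are assumptions and do not reproduce the exact architecture of the present application.

```python
import torch
import torch.nn as nn

class FirstLayerModel(nn.Module):
    """Combines one image's feature vectors with its azimuth feature to
    produce a first recognition result (all sizes are illustrative)."""
    def __init__(self, feat_dim=128, az_dim=8, num_classes=10):
        super().__init__()
        self.score = nn.Linear(feat_dim + az_dim, 1)      # per-feature-vector weight
        self.classify = nn.Linear(feat_dim, num_classes)  # produces the first recognition result

    def forward(self, feats, az):            # feats: (N, feat_dim), az: (az_dim,)
        az = az.expand(feats.size(0), -1)
        scores = torch.tanh(self.score(torch.cat([feats, az], dim=1)))
        w = torch.softmax(scores, dim=0)     # first weights over the N feature vectors
        pooled = (w * feats).sum(dim=0)      # weighted combination of the feature vectors
        return self.classify(pooled)         # first recognition result Z_j

class SceneRecognitionModel(nn.Module):
    """One (feature extractor, first layer model) pair per azimuth, plus a
    second layer model that weights the first recognition results."""
    def __init__(self, num_views=2, num_classes=10, feat_dim=128, az_dim=8):
        super().__init__()
        self.extractors = nn.ModuleList(
            nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(4), nn.Flatten(1),
                          nn.Linear(16 * 16, feat_dim))
            for _ in range(num_views))
        self.first = nn.ModuleList(
            FirstLayerModel(feat_dim, az_dim, num_classes) for _ in range(num_views))
        self.second_score = nn.Linear(num_classes + az_dim, 1)   # second layer model

    def forward(self, images, az_feats):     # images: list of (1, 3, H, W); az_feats: list of (az_dim,)
        firsts = [first(ext(img), az)
                  for ext, first, img, az in zip(self.extractors, self.first, images, az_feats)]
        z = torch.stack(firsts)              # (num_views, num_classes) first recognition results
        c = torch.stack(list(az_feats))      # (num_views, az_dim) azimuth features
        s = torch.softmax(torch.tanh(self.second_score(torch.cat([z, c], dim=1))), dim=0)
        return (s * z).sum(dim=0)            # final scene recognition result
```

  • a call such as model([front_image, rear_image], [front_azimuth_feature, rear_azimuth_feature]) would return the final class scores for the scene; all of these names are illustrative.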
  • the scene recognition method of the embodiment of the present application adopts a two-layer scene recognition model, collects images from multiple angles and combines the azimuth angles of the images to perform scene recognition using a competition mechanism, and considers the results of both local features and overall features, which can improve the accuracy of recognition in scene recognition without user perception and reduce misjudgment.
  • the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain multiple feature vectors; the first layer model is used to calculate the first weight corresponding to each feature vector in the multiple feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle; the first layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
  • the second layer model is used to calculate the second weight of the first recognition result according to the azimuth angle corresponding to the first recognition result; the second layer model is used to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
  • the first layer model may include an activation function and a softmax function, where the activation function may be a tanh function; the activation function may also be another type of activation function, such as a Sigmoid activation function or a ReLU activation function, and is not limited to those shown in FIG. 5 and FIG. 6.
  • the number of activation functions in the first layer model can be set according to the number of feature vectors extracted by the feature extraction layer, and can be greater than or equal to the number of extracted feature vectors.
  • the activation function is used to determine whether to activate the corresponding neuron according to the feature vector and the azimuth feature, and the feature corresponding to the activation function is extracted as the basis for classification.
  • the activation function and the softmax function are used to calculate the first weight corresponding to each feature vector, such as the s_i calculated by formula (3) above; for the specific process, refer to the description above, which is not repeated here.
  • the azimuth feature can affect the weight s_i calculated for the feature vector y_i: if the calculated s_i is relatively large, the feature vector has a greater influence on the classification result; if the calculated s_i is relatively small, the feature vector has less influence on the classification result. Therefore, the scene recognition model provided by the present application can identify key features according to the azimuth angle, filter irrelevant information, reduce noise, improve recognition accuracy, and realize scene recognition without user perception.
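  • as a hedged illustration of this weighting (formula (3) itself is not reproduced here), the sketch below derives softmax weights s_i for the feature vectors y_i from tanh activations over each concatenation of y_i with the azimuth feature c; the parameters W and b and their shapes are assumptions.

```python
import numpy as np

def first_weights(Y, c, W, b):
    """Y: list of feature vectors y_i; c: azimuth feature; W, b: assumed learned
    parameters of the activation function, with W of shape (k, len(y_i) + len(c))
    and b of shape (k,). Returns one softmax weight s_i per feature vector."""
    scores = np.array([np.tanh(W @ np.concatenate([y, c]) + b).sum() for y in Y])
    e = np.exp(scores - scores.max())        # numerically stable softmax
    return e / e.sum()

# With a different azimuth feature c, the same feature vectors Y receive
# different weights, so features irrelevant to that viewing angle are suppressed.
```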
  • the second-layer model may include an activation function and a softmax function, wherein the activation function may be a tanh function, and the activation function may also adopt other types of activation functions, which are not limited to the examples shown in FIG. 5 and FIG. 6 .
  • the number of activation functions in the second-layer model is related to the number of angles at which images are collected, and the number of activation functions may be equal to or greater than the number of angles at which images are collected.
  • the input of the activation function includes the output result Z_j of the first layer model and the azimuth angle feature C_j;
  • the second layer model also includes a softmax function;
  • the softmax function calculates the weight S_j of the output result Z_j of the first layer model according to the calculation result of the activation function; for the specific calculation process, refer to formula (4) and the description in Part 5 above, which is not repeated here.
  • the azimuth angle feature can affect the weight S_j calculated for the first recognition result Z_j: if the calculated S_j is relatively large, the first recognition result has a greater influence on the classification result; if the calculated S_j is relatively small, the first recognition result has less influence on the classification result. Therefore, the scene recognition model provided by the present application can extract the features of key angles according to the azimuth angle, filter irrelevant angles, improve recognition accuracy, and realize scene recognition without user perception.
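  • a small numeric illustration of this second-layer weighting follows; the values of Z_j and S_j are invented solely to show that a larger S_j gives the corresponding first recognition result more influence on the final result.

```python
import numpy as np

Z = np.array([[0.8, 0.2],    # Z_1: first recognition result from the front view
              [0.3, 0.7]])   # Z_2: first recognition result from the rear view
S = np.array([0.25, 0.75])   # second-layer weights S_1, S_2 (rear view dominates)
final = S @ Z                # weighted combination of the first recognition results
print(final)                 # [0.425 0.575] -> the rear view's class wins
```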
  • the second-layer model presets a third weight corresponding to each of the first recognition results, and the second-layer model is configured to obtain the scene recognition result according to the first recognition result and the third weight corresponding to the first recognition result.
  • the second-layer model can preset a corresponding weight S_j for each azimuth angle;
  • the second-layer model can perform a weighted calculation on each first recognition result Z_j and the corresponding preset weight S_j to obtain the scene recognition result.
  • the second layer model is configured to determine a fourth weight corresponding to each of the first recognition results according to the azimuth angle and a preset rule; wherein the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first recognition result; the second layer model is used to obtain the scene recognition result according to the first recognition result and the fourth weight corresponding to the first recognition result.
  • the second layer model may also preset a weight mapping function according to the azimuth angle, that is, different azimuth angles correspond to different preset weight groups, and each preset weight group may include the weight corresponding to each first recognition result Z_j.
  • the terminal device collects images from two angles through the front camera and the rear camera, and recognizes the images from the two angles to obtain two first recognition results Z_1 and Z_2, whose corresponding weights are S_1 and S_2 respectively;
  • the preset weight mapping function according to the azimuth angle can be as follows:
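  • the extracted text does not reproduce the mapping itself at this point, so it is not restored here; the sketch below is only a hypothetical illustration of a preset rule that selects a weight group (S_1, S_2) from the two azimuth angles, with invented thresholds and values.

```python
def weight_group(azimuth_front, azimuth_rear):
    """Hypothetical preset rule: pick the weight group (S_1, S_2) for the two
    first recognition results Z_1 and Z_2 from the two azimuth angles.
    Thresholds and weights are invented for illustration only."""
    if azimuth_front < 45 and azimuth_rear > 135:
        return 0.3, 0.7      # phone lying flat: trust the rear (downward) view more
    if azimuth_front > 135 and azimuth_rear < 45:
        return 0.7, 0.3      # phone flipped over: trust the front view more
    return 0.5, 0.5          # otherwise weight both views equally
```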
  • FIG. 9 shows a block diagram of a scene recognition apparatus according to an embodiment of the present application.
  • the apparatus may include:
  • the image acquisition module is used to collect images in the same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle when each camera collects the image is the azimuth angle corresponding to the image, and the azimuth angle is the angle between the direction vector when each camera collects the image and the gravity unit vector;
  • a scene recognition module configured to recognize the same scene according to the image and the azimuth angle corresponding to the image, and obtain a scene recognition result.
  • the scene recognition device of the embodiment of the present application uses multiple cameras to capture images of multiple azimuth angles of the same scene, and recognizes the scene in combination with the multiple images and the azimuth angle corresponding to each image; since more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problem that a single camera recognizing a scene has a limited viewing angle range and shooting angle is solved, and the recognition is more accurate.
  • the scene recognition module includes:
  • an azimuth feature extraction module used for extracting the azimuth feature corresponding to the image from the azimuth angle corresponding to the image
  • a scene recognition model configured to recognize the same scene based on the image and the azimuth angle feature corresponding to the image, and obtain a scene recognition result, wherein the scene recognition model is a neural network model.
  • the scene recognition model includes multiple pairs of first feature extraction layers and first layer models, and each pair of a first feature extraction layer and a first layer model is used to process an image of one azimuth angle and the azimuth angle corresponding to the image of the one azimuth angle to obtain a first recognition result; wherein the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result, the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain a feature vector, and the first layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle; the scene recognition model further includes a second layer model, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the azimuth angle corresponding to the first recognition result.
  • the scene recognition device of the embodiment of the present application adopts a two-layer scene recognition model, collects multi-angle images and combines the azimuth angles of the images to perform scene recognition using a competition mechanism, and considers the results of both local features and overall features, which can improve the accuracy of recognition in scene recognition without user perception and reduce misjudgment.
  • the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain multiple feature vectors; the first layer model is used to calculate the first weight corresponding to each feature vector in the multiple feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle; the first layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
  • the second layer model is used to calculate the second weight of the first recognition result according to the azimuth angle corresponding to the first recognition result; the second layer model is used to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
  • the second-layer model presets a third weight corresponding to each of the first recognition results, and the second-layer model is configured to obtain the scene recognition result according to the first recognition result and the third weight corresponding to the first recognition result.
  • the second layer model is configured to determine a fourth weight corresponding to each of the first recognition results according to the azimuth angle and a preset rule; wherein the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first recognition result; the second layer model is used to obtain the scene recognition result according to the first recognition result and the fourth weight corresponding to the first recognition result.
  • the device further includes: an azimuth angle acquisition module, configured to acquire the acceleration on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera collects the image, to obtain the direction vector when each camera collects the image; wherein the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image takes that camera as the origin, the z direction is the direction along the camera, x and y are respectively directions perpendicular to the z direction, and the plane where x and y are located is perpendicular to the z direction; the azimuth angle is calculated according to the direction vector and the gravity unit vector.
  • an azimuth angle acquisition module configured to acquire the acceleration on the coordinate axis of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera collects the image, and obtain each The direction vector when the camera collects the image
  • the corresponding three-dimensional rectangular coordinate system when each camera collects the image takes each camera as the origin, the z direction is the direction along the camera, and x and y
  • the apparatus further includes: an image preprocessing module, configured to perform preprocessing on the image; wherein the preprocessing includes a combination of one or more of the following processes: converting the image format, converting the image channel, unifying the image size, and image normalization; converting the image format refers to converting a color image to a black-and-white image, converting the image channel refers to converting the image to the red, green and blue (RGB) channels, unifying the image size refers to adjusting multiple images to have the same length and the same width, and image normalization refers to normalizing the pixel values of the images.
  • an image preprocessing module configured to perform preprocessing on the image; wherein, the preprocessing includes a combination of one or more of the following processes: converting Image format, convert image channel, unify image size, image normalization, convert image format refers to converting color image to black and white image, convert image channel refers to converting image to red, green and blue RGB channel, and unify image size refers to adjusting Multiple images have the same length and the same width, and image normalization refers to normalizing the
  • the apparatus further includes: a result output module, configured to output the scene recognition result.
  • FIG. 10 shows a schematic structural diagram of a terminal device according to an embodiment of the present application. Taking the terminal device being a mobile phone as an example, FIG. 10 shows a schematic structural diagram of the mobile phone 200.
  • the mobile phone 200 may include a processor 210, an external memory interface 220, an internal memory 221, a USB interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 251, a wireless communication module 252, Audio module 270, speaker 270A, receiver 270B, microphone 270C, headphone jack 270D, sensor module 280, buttons 290, motor 291, indicator 292, camera 293, display screen 294, SIM card interface 295, etc.
  • the sensor module 280 may include a gyroscope sensor 280A, an acceleration sensor 280B, a proximity light sensor 280G, a fingerprint sensor 280H, and a touch sensor 280K (of course, the mobile phone 200 may also include other sensors, such as a temperature sensor, a pressure sensor, a distance sensor, a magnetic sensor, an ambient light sensor, an air pressure sensor, a bone conduction sensor, etc., which are not shown in the figure).
  • the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the mobile phone 200 .
  • the mobile phone 200 may include more or less components than shown, or combine some components, or separate some components, or arrange different components.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 210 may include one or more processing units, for example, the processor 210 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
  • the controller may be the nerve center and command center of the mobile phone 200 . The controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 210 for storing instructions and data.
  • the memory in processor 210 is cache memory.
  • the memory may hold instructions or data that have just been used or recycled by the processor 210 . If the processor 210 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided, and the waiting time of the processor 210 is reduced, thereby improving the efficiency of the system.
  • the processor 210 can run the scene recognition method provided by the embodiment of the present application, so as to recognize the scene in combination with multiple images and the azimuth angle corresponding to each image, obtain more comprehensive scene information, improve the accuracy of image-based scene recognition, solve the problem that a single camera recognizing a scene has a limited viewing angle range and shooting angle, and make the recognition more accurate.
  • the processor 210 may include different devices. For example, when a CPU and a GPU are integrated, the CPU and the GPU may cooperate to execute the scene recognition method provided by the embodiments of the present application. For example, some algorithms in the scene recognition method are executed by the CPU, and another part of the algorithms are executed by the GPU, so as to obtain faster processing efficiency.
  • Display screen 294 is used to display images, videos, and the like.
  • Display screen 294 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), and so on.
  • cell phone 200 may include 1 or N display screens 294, where N is a positive integer greater than 1.
  • the display screen 294 may be used to display information entered by or provided to the user as well as various graphical user interfaces (GUIs).
  • display 294 may display photos, videos, web pages, or documents, and the like.
  • display 294 may display a graphical user interface.
  • the GUI includes a status bar, a hideable navigation bar, a time and weather widget, and an application icon, such as a browser icon.
  • the status bar includes operator name (eg China Mobile), mobile network (eg 4G), time and remaining battery.
  • the navigation bar includes a back button icon, a home button icon, and a forward button icon.
  • the status bar may further include a Bluetooth icon, a Wi-Fi icon, an external device icon, and the like.
  • the graphical user interface may further include a Dock bar, and the Dock bar may include commonly used application icons and the like.
  • the display screen 294 may be an integrated flexible display screen, or a spliced display screen composed of two rigid screens and a flexible screen located between the two rigid screens.
  • Cameras 293 are used to capture still images or video.
  • the camera 293 may include a photosensitive element such as a lens group and an image sensor, wherein the lens group includes a plurality of lenses (convex or concave) for collecting the light signal reflected by the object to be photographed and transmitting the collected light signal to the image sensor.
  • the image sensor generates an original image of the object to be photographed according to the light signal.
  • multiple cameras are used to collect images in the same scene from multiple azimuth angles, so that the scene can be identified in combination with the multiple images and the azimuth angle corresponding to each image, and more comprehensive scene information can be obtained, which improves the accuracy of image-based scene recognition, solves the problem that a single camera recognizing a scene has a limited viewing angle range and shooting angle, and makes the recognition more accurate.
  • Internal memory 221 may be used to store computer executable program code, which includes instructions.
  • the processor 210 executes various functional applications and data processing of the mobile phone 200 by executing the instructions stored in the internal memory 221 .
  • the internal memory 221 may include a storage program area and a storage data area.
  • the storage program area may store operating system, code of application programs (such as camera application, WeChat application, etc.), and the like.
  • the storage data area may store data created during the use of the mobile phone 200 (such as images and videos collected by the camera application) and the like.
  • the internal memory 221 may also store one or more computer programs 1310 corresponding to the scene recognition method provided by the embodiment of the present application.
  • the one or more computer programs 1310 are stored in the aforementioned memory 221 and configured to be executed by the one or more processors 210, and the one or more computer programs 1310 include instructions that may be used to perform the steps in the embodiments of the present application;
  • the computer program 1310 may include: an image acquisition module, used to acquire images in the same scene from multiple azimuth angles through a plurality of cameras; a scene recognition module, used to recognize the same scene according to the image and the azimuth angle corresponding to the image to obtain the scene recognition result; an azimuth angle acquisition module, used to acquire the acceleration on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera collects the image, obtain the direction vector when each camera collects the image, and calculate the azimuth angle according to the direction vector and the gravity unit vector; and an image preprocessing module, used to preprocess the image.
  • the internal memory 221 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the code of the scene recognition method provided by the embodiment of the present application may also be stored in an external memory.
  • the processor 210 may execute the code of the scene recognition method stored in the external memory through the external memory interface 220 .
  • the function of the sensor module 280 is described below.
  • the gyro sensor 280A can be used to determine the movement posture of the mobile phone 200; for example, the angular velocity of the mobile phone 200 about three axes (i.e., the x, y, and z axes) can be determined.
  • the gyro sensor 280A can be used to detect the current motion state of the mobile phone 200, such as shaking or still.
  • the gyro sensor 280A can be used to detect a folding or unfolding operation acting on the display screen 294 .
  • the gyroscope sensor 280A may report the detected folding operation or unfolding operation to the processor 210 as an event to determine the folding state or unfolding state of the display screen 294 .
  • the acceleration sensor 280B can detect the magnitude of the acceleration of the mobile phone 200 in various directions (generally three axes), that is, it can be used to detect the current motion state of the mobile phone 200, such as shaking or still. When the display screen in the embodiment of the present application is a foldable screen, the acceleration sensor 280B can be used to detect a folding or unfolding operation acting on the display screen 294. The acceleration sensor 280B may report the detected folding operation or unfolding operation to the processor 210 as an event to determine the folding state or unfolding state of the display screen 294.
  • the terminal device can obtain, through the acceleration sensor 280B, the acceleration on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image, obtain the direction vector when each camera collects the image, and calculate the azimuth angle according to the direction vector and the gravity unit vector.
  • Proximity light sensor 280G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes.
  • the light emitting diodes may be infrared light emitting diodes.
  • the mobile phone emits infrared light outward through light-emitting diodes.
  • Phones use photodiodes to detect reflected infrared light from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object near the phone. When insufficient reflected light is detected, the phone can determine that there are no objects near the phone.
  • the proximity light sensor 280G can be arranged on the first screen of the foldable display screen 294, and the proximity light sensor 280G can detect the first screen according to the optical path difference of the infrared signal.
  • the gyroscope sensor 280A (or the acceleration sensor 280B) may send the detected motion state information (such as angular velocity) to the processor 210 .
  • the processor 210 determines, based on the motion state information, whether the current state is the hand-held state or the tripod state (for example, when the angular velocity is not 0, it means that the mobile phone 200 is in the hand-held state).
  • the fingerprint sensor 280H is used to collect fingerprints.
  • the mobile phone 200 can use the collected fingerprint characteristics to realize fingerprint unlocking, accessing application locks, taking photos with fingerprints, answering incoming calls with fingerprints, and the like.
  • Touch sensor 280K also called “touch panel”.
  • the touch sensor 280K may be disposed on the display screen 294, and the touch sensor 280K and the display screen 294 form a touch screen, also called a "touch screen”.
  • the touch sensor 280K is used to detect a touch operation on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to touch operations may be provided through display screen 294 .
  • the touch sensor 280K may also be disposed on the surface of the mobile phone 200 , which is different from the location where the display screen 294 is located.
  • the display screen 294 of the mobile phone 200 displays a main interface, and the main interface includes icons of multiple applications (such as a camera application, a WeChat application, etc.).
  • Display screen 294 displays an interface of a camera application, such as a viewfinder interface. Display screen 294 may also be used to display scene recognition results.
  • the wireless communication function of the mobile phone 200 can be realized by the antenna 1, the antenna 2, the mobile communication module 251, the wireless communication module 252, the modulation and demodulation processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in handset 200 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 251 can provide a wireless communication solution including 2G/3G/4G/5G, etc. applied on the mobile phone 200 .
  • the mobile communication module 251 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like.
  • the mobile communication module 251 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
  • the mobile communication module 251 can also amplify the signal modulated by the modulation and demodulation processor, and then turn it into an electromagnetic wave for radiation through the antenna 1 .
  • at least part of the functional modules of the mobile communication module 251 may be provided in the processor 210 .
  • at least part of the functional modules of the mobile communication module 251 may be provided in the same device as at least part of the modules of the processor 210 .
  • the modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then the demodulator transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the low frequency baseband signal is processed by the baseband processor and passed to the application processor.
  • the application processor outputs sound signals through audio devices (not limited to the speaker 270A, the receiver 270B, etc.), or displays images or videos through the display screen 294 .
  • the modem processor may be a stand-alone device.
  • the modulation and demodulation processor may be independent of the processor 210, and may be provided in the same device as the mobile communication module 251 or other functional modules.
  • the wireless communication module 252 can provide applications on the mobile phone 200 including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field communication technology (near field communication, NFC), infrared technology (infrared, IR) and other wireless communication solutions.
  • the wireless communication module 252 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 252 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 210 .
  • the wireless communication module 252 can also receive the signal to be sent from the processor 210 , perform frequency modulation on the signal, amplify the signal, and then convert it into an electromagnetic wave for radiation through the antenna 2 .
  • the wireless communication module 252 is configured to transmit data with other terminal devices under the control of the processor 210 .
  • the mobile phone 200 can implement audio functions, such as music playback and recording, through an audio module 270, a speaker 270A, a receiver 270B, a microphone 270C, an earphone interface 270D, an application processor, and the like.
  • the cell phone 200 can receive key 290 input and generate key signal input related to user settings and function control of the cell phone 200 .
  • the mobile phone 200 can use the motor 291 to generate vibration alerts (eg, vibration alerts for incoming calls).
  • the indicator 292 in the mobile phone 200 may be an indicator light, which may be used to indicate a charging state, a change in power, and may also be used to indicate a message, a missed call, a notification, and the like.
  • the SIM card interface 295 in the mobile phone 200 is used to connect the SIM card. The SIM card can be contacted and separated from the mobile phone 200 by inserting into the SIM card interface 295 or pulling out from the SIM card interface 295 .
  • the mobile phone 200 may include more or less components than those shown in FIG. 10 , which are not limited in this embodiment of the present application.
  • the illustrated handset 200 is merely an example, and the handset 200 may have more or fewer components than those shown, two or more components may be combined, or may have different component configurations.
  • the various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
  • the software system of the terminal device can adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiments of the present application take an Android system with a layered architecture as an example to exemplarily describe the software structure of a terminal device.
  • An embodiment of the present application provides a scene recognition apparatus, including: a processor and a memory for storing instructions executable by the processor; wherein the processor is configured to implement the above method when executing the instructions.
  • Embodiments of the present application provide a non-volatile computer-readable storage medium on which computer program instructions are stored, and when the computer program instructions are executed by a processor, implement the above method.
  • Embodiments of the present application provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein, when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above method.
  • a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital video discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves on which instructions are stored, and any suitable combination of the foregoing.
  • Computer readable program instructions or code described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.
  • the computer program instructions used to perform the operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), can be personalized by utilizing state information of the computer readable program instructions, and the electronic circuits can execute the computer readable program instructions to implement various aspects of the present application.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer readable program instructions may also be stored in a computer readable storage medium; these instructions cause a computer, programmable data processing apparatus and/or other equipment to operate in a specific manner, so that the computer readable medium on which the instructions are stored includes an article of manufacture comprising instructions for implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other equipment, so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, thereby causing the instructions executing on the computer, other programmable data processing apparatus, or other equipment to implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, a segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in hardware (e.g., circuits or application-specific integrated circuits (ASICs)) that performs the corresponding functions or actions, or can be implemented by a combination of hardware and software, such as firmware.

Abstract

An environment identification method and an apparatus. The method comprises: a terminal device captures images in a single environment by means of a plurality of cameras and from a plurality of azimuth angles, wherein the azimuth angle when each camera captures an image is the azimuth angle corresponding to said image, and each azimuth angle is the included angle of a gravity unit vector and a direction vector when each camera captures an image (S700); and the terminal device identifies said single environment according to the images and the azimuth angles corresponding to the images, and obtains an environment identification result (S701). The present method integrates a plurality of images and azimuth angles corresponding to each of the images to perform identification on an environment, and because more comprehensive environment information is obtained, the accuracy of image-based environment identification can be improved, the problem of limitations on an angle of view range and photographing angles when a single camera is identifying an environment is solved, and identification is more accurate and precise.

Description

场景识别方法及装置Scene recognition method and device
本申请要求于2021年02月25日提交中国专利局、申请号为202110215000.3、申请名称为“场景识别方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application with the application number 202110215000.3 and the application name "Scene Recognition Method and Apparatus" filed with the China Patent Office on February 25, 2021, the entire contents of which are incorporated into this application by reference.
技术领域technical field
本申请涉及图像处理技术领域,尤其涉及一种场景识别方法及装置。The present application relates to the technical field of image processing, and in particular, to a scene recognition method and device.
背景技术Background technique
基于手机等移动终端的场景识别技术是一种重要的基础感知能力,能够服务于智慧出行、服务直达、意图决策、耳机智慧降噪、搜索、推荐等多种业务。The scene recognition technology based on mobile terminals such as mobile phones is an important basic perception capability, which can serve a variety of services such as smart travel, direct service, intention decision-making, smart noise reduction of headphones, search, and recommendation.
现有的场景识别技术主要包括:基于图像的场景识别方法、基于传感器(例如,WiFi、蓝牙、位置传感器等)定位的场景识别方法或基于信号指纹比对的场景识别方法。Existing scene recognition technologies mainly include: image-based scene recognition methods, sensor (eg, WiFi, Bluetooth, location sensors, etc.) positioning-based scene recognition methods, or signal fingerprint comparison-based scene recognition methods.
其中,基于图像的场景识别方法受到摄像头视角范围、拍照角度、物体遮挡等因素影响,在复杂场景下的鲁棒性受到极大挑战,难以实现用户无感的实时场景识别。具体地,基于图像场景识别方法通过前置或后置摄像头采集图像进行识别,由于视角范围小、有效特征少,或者,拍摄角度任意,或者,同一图像中存在着大量物体特征信息,噪声信息会淹没主要特征,这些都容易造成误判。举例来说,当图像中出现天花板特征则识别为室内,实际上则可能是地铁或飞机上。因此,本申请要解决的技术问题是,如何提高基于图像的场景识别方法识别准确度。Among them, the image-based scene recognition method is affected by the camera's viewing angle range, camera angle, object occlusion and other factors, the robustness in complex scenes is greatly challenged, and it is difficult to realize real-time scene recognition without user perception. Specifically, based on the image scene recognition method, the front or rear camera is used to collect images for recognition. Due to the small viewing angle range and few effective features, or the shooting angle is arbitrary, or there is a large amount of object feature information in the same image, noise information will be Overwhelmed by key features, these are prone to miscalculation. For example, when a ceiling feature appears in the image, it is recognized as indoors, but it may actually be on a subway or an airplane. Therefore, the technical problem to be solved by this application is how to improve the recognition accuracy of the image-based scene recognition method.
发明内容SUMMARY OF THE INVENTION
有鉴于此,提出了一种场景识别方法及装置,结合多个图像和每个图像对应的方位角对场景进行识别,由于获得了更全面的场景信息,因此,可以提高基于图像的场景识别的准确度,解决了单摄像头识别场景视角范围、拍摄角度受限的问题,识别更精准。In view of this, a scene recognition method and device are proposed, which combine multiple images and the azimuth angle corresponding to each image to recognize the scene. Since more comprehensive scene information is obtained, the image-based scene recognition can be improved. The accuracy solves the problem of limited viewing angle range and shooting angle of single camera recognition scene, and the recognition is more accurate.
第一方面,本申请的实施例提供了一种场景识别方法,所述方法包括:终端设备通过多个摄像头从多个方位角采集同一场景下的图像,其中,每个摄像头采集所述图像时的方位角为所述图像对应的方位角,所述方位角为所述每个摄像头采集所述图像时的方向向量与重力单位向量的夹角;终端设备根据所述图像和所述图像对应的方位角识别所述同一场景,得到场景识别结果。In a first aspect, an embodiment of the present application provides a scene recognition method, the method includes: a terminal device collects images in the same scene from multiple azimuths through multiple cameras, wherein when each camera collects the image The azimuth angle is the azimuth angle corresponding to the image, and the azimuth angle is the angle between the direction vector and the gravity unit vector when each camera collects the image; The same scene is identified by the azimuth, and the scene identification result is obtained.
本申请实施例的场景识别方法,利用多个摄像头拍摄同一场景的多个方位角的图像,结合多个图像和每个图像对应的方位角对场景进行识别,由于获得了更全面的场景信息,因此,可以提高基于图像的场景识别的准确度,解决了单摄像头识别场景视角范围、拍摄角度受限的问题,识别更精准。In the scene recognition method of the embodiment of the present application, multiple cameras are used to shoot images of multiple azimuth angles of the same scene, and the scene is recognized by combining the multiple images and the azimuth angles corresponding to each image. Since more comprehensive scene information is obtained, Therefore, the accuracy of image-based scene recognition can be improved, the problems of limited viewing angle range and shooting angle of a single camera to recognize a scene can be solved, and the recognition can be more accurate.
根据第一方面,在第一种可能的实现方式中,终端设备根据所述图像和所述图像对应的方位角识别所述同一场景,得到场景识别结果,包括:终端设备从所述图像对应的方位角提取所述图像对应的方位角特征;利用场景识别模型基于所述图像和所述图像对应的方位角特 征识别所述同一场景,得到场景识别结果,其中,所述场景识别模型为神经网络模型。According to the first aspect, in a first possible implementation manner, the terminal device recognizes the same scene according to the image and the azimuth angle corresponding to the image, and obtains a scene recognition result, including: The azimuth angle is extracted from the azimuth angle feature corresponding to the image; the scene recognition model is used to identify the same scene based on the image and the azimuth angle feature corresponding to the image, and a scene recognition result is obtained, wherein the scene recognition model is a neural network. Model.
通过为神经网络模型中的卷积核赋予不同的权重(方位角特征),提取不同方位角下与场景最相关的特征,可以获得更准确的预测结果。By assigning different weights (azimuth features) to the convolution kernels in the neural network model, and extracting the features most relevant to the scene at different azimuths, more accurate prediction results can be obtained.
根据第一方面的第一种可能的实现方式,在第二种可能的实现方式中,所述场景识别模型包括多对第一特征提取层和第一层模型,每对第一特征提取层和第一层模型用于对一个方位角的图像和所述一个方位角的图像对应的方位角进行处理,得到第一识别结果;其中,所述一个方位角的图像对应的方位角为所述第一识别结果对应的方位角,所述第一特征提取层用于提取所述一个方位角的图像的特征,得到特征向量,所述第一层模型用于根据所述特征向量和所述一个方位角的图像对应的方位角,得到所述第一识别结果;所述场景识别模型还包括第二层模型,所述第二层模型用于根据所述第一识别结果和第一识别结果对应的方位角,得到所述场景识别结果。According to a first possible implementation manner of the first aspect, in a second possible implementation manner, the scene recognition model includes multiple pairs of the first feature extraction layer and the first layer model, and each pair of the first feature extraction layer and The first layer model is used to process an image of an azimuth angle and an azimuth angle corresponding to the image of the one azimuth angle to obtain a first recognition result; wherein, the azimuth angle corresponding to the image of the one azimuth angle is the first recognition result. an azimuth angle corresponding to the recognition result, the first feature extraction layer is used to extract the feature of the image of the one azimuth angle to obtain a feature vector, and the first layer model is used to extract the feature vector according to the feature vector and the one azimuth The azimuth angle corresponding to the image of the angle is obtained, and the first recognition result is obtained; the scene recognition model further includes a second layer model, and the second layer model is used for the first recognition result and the corresponding first recognition result. The azimuth angle is used to obtain the scene recognition result.
本申请实施例的场景识别方法,采用两层场景识别模型,采集多角度的图像并结合图像的方位角,利用竞争机制进行场景识别的方式,同时考虑了局部特征和整体特征的结果,可以提高用户无感的场景识别中识别的准确度,减少误判。The scene recognition method of the embodiment of the present application adopts a two-layer scene recognition model, collects images from multiple angles and combines the azimuth angles of the images to perform scene recognition by using a competition mechanism, and considers the results of local features and overall features. The accuracy of recognition in scene recognition without user perception reduces misjudgment.
根据第一方面的第二种可能的实现方式,在第三种可能的实现方式中,所述第一特征提取层用于提取所述一个方位角的图像的特征,得到多个特征向量;所述第一层模型用于根据所述一个方位角的图像对应的方位角计算所述多个特征向量中的每个特征向量对应的第一权重;所述第一层模型用于根据所述每个特征向量和每个特征向量对应的第一权重,得到所述第一识别结果。According to the second possible implementation manner of the first aspect, in a third possible implementation manner, the first feature extraction layer is configured to extract the feature of the image at one azimuth angle to obtain a plurality of feature vectors; The first layer model is used to calculate the first weight corresponding to each feature vector in the plurality of feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle; the first layer model is used to calculate the first weight corresponding to each feature vector according to the eigenvectors and a first weight corresponding to each eigenvector to obtain the first recognition result.
根据第一方面的第二种可能的实现方式,在第四种可能的实现方式中,所述第二层模型用于根据所述第一识别结果对应的方位角计算所述第一识别结果的第二权重;所述第二层模型用于根据所述第一识别结果和所述第一识别结果的第二权重,得到所述场景识别结果。According to a second possible implementation manner of the first aspect, in a fourth possible implementation manner, the second layer model is used to calculate the azimuth angle of the first identification result according to the azimuth angle corresponding to the first identification result second weight; the second layer model is configured to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
根据第一方面的第二种可能的实现方式,在第五种可能的实现方式中,所述第二层模型预设了每个所述第一识别结果对应的第三权重,所述第二层模型用于根据所述第一识别结果和所述第一识别结果对应的第三权重,得到所述场景识别结果。According to a second possible implementation manner of the first aspect, in a fifth possible implementation manner, the second layer model presets a third weight corresponding to each of the first recognition results, and the second The layer model is used to obtain the scene recognition result according to the first recognition result and the third weight corresponding to the first recognition result.
根据第一方面的第二种可能的实现方式,在第六种可能的实现方式中,所述第二层模型用于根据所述方位角和预设规则,确定每个所述第一识别结果对应的第四权重;其中,所述预设规则为根据方位角设置的权重组,不同的方位角对应了不同的权重组,每个权重组包括每个第一识别结果对应的第四权重;所述第二层模型用于根据所述第一识别结果和所述第一识别结果对应的第四权重,得到所述场景识别结果。According to a second possible implementation manner of the first aspect, in a sixth possible implementation manner, the second layer model is configured to determine each of the first recognition results according to the azimuth angle and a preset rule A corresponding fourth weight; wherein, the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first identification result; The second-layer model is configured to obtain the scene recognition result according to the first recognition result and the fourth weight corresponding to the first recognition result.
对于第二层模型通过预设的权重或者权重映射函数的形式的时限方式,在训练时,只需要对神经网络模型的其他部分进行训练即可,不需要对第二层模型进行训练,可以提高训练的效率。For the time-limited method in the form of preset weights or weight mapping functions for the second-layer model, during training, only other parts of the neural network model need to be trained, and the second-layer model does not need to be trained, which can improve the training efficiency.
根据第一方面,在第七种可能的实现方式中,所述方法还包括:终端设备获取重力传感器在每个摄像头采集所述图像时对应的三维直角坐标系的坐标轴上的加速度,得到每个摄像头采集所述图像时的方向向量;其中,每个摄像头采集所述图像时对应的三维直角坐标系以所述每个摄像头为原点,z方向为沿着摄像头拍摄的方向,x和y分别为与z方向垂直的方向, 且x和y所在的平面与z方向垂直;根据所述方向向量和所述重力单位向量计算所述方位角。According to the first aspect, in a seventh possible implementation manner, the method further includes: the terminal device acquires the acceleration on the coordinate axis of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera captures the image, and obtains each The direction vector when each camera collects the image; wherein, the corresponding three-dimensional rectangular coordinate system when each camera collects the image takes each camera as the origin, the z direction is the direction along the camera, and x and y are respectively is a direction perpendicular to the z direction, and the plane where x and y are located is perpendicular to the z direction; the azimuth angle is calculated according to the direction vector and the gravity unit vector.
根据第一方面的第一种可能的实现方式,在第八种可能的实现方式中,在利用场景识别模型基于所述图像和所述图像对应的方位角特征识别所述同一场景之前,所述方法还包括:终端设备对所述图像进行预处理;其中,所述预处理包括以下处理中的一种或多种的组合:转换图像格式、转换图像通道、统一图像尺寸、图像归一化,转换图像格式是指将彩色图像转换为黑白图像,转换图像通道是指将图像转换到红绿蓝RGB通道,统一图像尺寸是指调整多个图像的长度相同、宽度相同,图像归一化是指将图像的像素值归一化。According to a first possible implementation manner of the first aspect, in an eighth possible implementation manner, before using a scene recognition model to identify the same scene based on the image and the azimuth feature corresponding to the image, the The method further includes: the terminal device preprocesses the image; wherein, the preprocessing includes one or a combination of the following processes: converting an image format, converting an image channel, unifying the image size, and normalizing the image, Converting image format refers to converting color images to black and white images. Converting image channels refers to converting images to red, green, and blue RGB channels. Unifying image size refers to adjusting multiple images to have the same length and width. Image normalization refers to Normalize the pixel values of the image.
第二方面，本申请的实施例提供了一种场景识别装置，所述装置包括：图像采集模块，用于通过多个摄像头从多个方位角采集同一场景下的图像，其中，每个摄像头采集所述图像时的方位角为所述图像对应的方位角，所述方位角为所述每个摄像头采集所述图像时的方向向量与重力单位向量的夹角；识别模块，用于根据所述图像和所述图像对应的方位角识别所述同一场景，得到场景识别结果。In a second aspect, an embodiment of the present application provides a scene recognition apparatus, the apparatus including: an image acquisition module, configured to capture images of the same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle at which each camera captures the image is the azimuth angle corresponding to that image, and the azimuth angle is the angle between the direction vector of the camera when it captures the image and the gravity unit vector; and a recognition module, configured to recognize the same scene according to the images and the azimuth angles corresponding to the images, to obtain a scene recognition result.
本申请实施例的场景识别装置,利用多个摄像头拍摄同一场景的多个方位角的图像,结合多个图像和每个图像对应的方位角对场景进行识别,由于获得了更全面的场景信息,因此,可以提高基于图像的场景识别的准确度,解决了单摄像头识别场景视角范围、拍摄角度受限的问题,识别更精准。The scene recognition device of the embodiment of the present application uses multiple cameras to capture images of multiple azimuth angles of the same scene, and recognizes the scene in combination with the multiple images and the azimuth angle corresponding to each image. Since more comprehensive scene information is obtained, Therefore, the accuracy of image-based scene recognition can be improved, the problems of limited viewing angle range and shooting angle of a single camera to recognize a scene can be solved, and the recognition can be more accurate.
根据第二方面,在第一种可能的实现方式中,所述场景识别模块包括:方位角特征提取模块,用于从所述图像对应的方位角提取所述图像对应的方位角特征;场景识别模型,用于基于所述图像和所述图像对应的方位角特征识别所述同一场景,得到场景识别结果,其中,所述场景识别模型为神经网络模型。According to the second aspect, in a first possible implementation manner, the scene recognition module includes: an azimuth feature extraction module, configured to extract the azimuth feature corresponding to the image from the azimuth angle corresponding to the image; scene recognition a model for recognizing the same scene based on the image and the azimuth feature corresponding to the image to obtain a scene recognition result, wherein the scene recognition model is a neural network model.
根据第二方面的第一种可能的实现方式，在第二种可能的实现方式中，所述场景识别模型包括多对第一特征提取层和第一层模型，每对第一特征提取层和第一层模型用于对一个方位角的图像和所述一个方位角的图像对应的方位角进行处理，得到第一识别结果；其中，所述一个方位角的图像对应的方位角为所述第一识别结果对应的方位角，所述第一特征提取层用于提取所述一个方位角的图像的特征，得到特征向量，所述第一层模型用于根据所述特征向量和所述一个方位角的图像对应的方位角，得到所述第一识别结果；所述场景识别模型还包括第二层模型，所述第二层模型用于根据所述第一识别结果和第一识别结果对应的方位角，得到所述场景识别结果。According to the first possible implementation manner of the second aspect, in a second possible implementation manner, the scene recognition model includes multiple pairs of a first feature extraction layer and a first-layer model, each pair being used to process an image of one azimuth angle and the azimuth angle corresponding to that image to obtain a first recognition result; wherein the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result, the first feature extraction layer is used to extract features of the image of the one azimuth angle to obtain feature vectors, and the first-layer model is used to obtain the first recognition result according to the feature vectors and the azimuth angle corresponding to the image of the one azimuth angle; the scene recognition model further includes a second-layer model, which is used to obtain the scene recognition result according to the first recognition results and the azimuth angles corresponding to the first recognition results.
本申请实施例的场景识别装置,采用两层场景识别模型,采集多角度的图像并结合图像的方位角,利用竞争机制进行场景识别的方式,同时考虑了局部特征和整体特征的结果,可以提高用户无感的场景识别中识别的准确度,减少误判。The scene recognition device of the embodiment of the present application adopts a two-layer scene recognition model, collects multi-angle images and combines the azimuth angles of the images to perform scene recognition by using a competition mechanism, and considers the results of local features and overall features. The accuracy of recognition in scene recognition without user perception reduces misjudgment.
根据第二方面的第二种可能的实现方式,在第三种可能的实现方式中,所述第一特征提取层用于提取所述一个方位角的图像的特征,得到多个特征向量;According to a second possible implementation manner of the second aspect, in a third possible implementation manner, the first feature extraction layer is configured to extract the feature of the image at one azimuth angle to obtain a plurality of feature vectors;
所述第一层模型用于根据所述一个方位角的图像对应的方位角计算所述多个特征向量中的每个特征向量对应的第一权重;The first layer model is used to calculate the first weight corresponding to each feature vector in the plurality of feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle;
所述第一层模型用于根据所述每个特征向量和每个特征向量对应的第一权重,得到所述第一识别结果。The first layer model is configured to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
根据第二方面的第二种可能的实现方式,在第四种可能的实现方式中,所述第二层模型用于根据所述第一识别结果对应的方位角计算所述第一识别结果的第二权重;所述第二层模型用于根据所述第一识别结果和所述第一识别结果的第二权重,得到所述场景识别结果。According to a second possible implementation manner of the second aspect, in a fourth possible implementation manner, the second layer model is used to calculate the azimuth of the first identification result according to the azimuth angle corresponding to the first identification result second weight; the second layer model is configured to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
根据第二方面的第二种可能的实现方式,在第五种可能的实现方式中,所述第二层模型预设了每个所述第一识别结果对应的第三权重,所述第二层模型用于根据所述第一识别结果和所述第一识别结果对应的第三权重,得到所述场景识别结果。According to a second possible implementation manner of the second aspect, in a fifth possible implementation manner, the second layer model presets a third weight corresponding to each of the first recognition results, and the second The layer model is used to obtain the scene recognition result according to the first recognition result and the third weight corresponding to the first recognition result.
根据第二方面的第二种可能的实现方式，在第六种可能的实现方式中，所述第二层模型用于根据所述方位角和预设规则，确定每个所述第一识别结果对应的第四权重；其中，所述预设规则为根据方位角设置的权重组，不同的方位角对应了不同的权重组，每个权重组包括每个第一识别结果对应的第四权重；所述第二层模型用于根据所述第一识别结果和所述第一识别结果对应的第四权重，得到所述场景识别结果。According to the second possible implementation manner of the second aspect, in a sixth possible implementation manner, the second-layer model is configured to determine, according to the azimuth angle and a preset rule, a fourth weight corresponding to each of the first recognition results; wherein the preset rule is a set of weight groups configured according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes the fourth weight corresponding to each first recognition result; the second-layer model is configured to obtain the scene recognition result according to the first recognition results and the fourth weights corresponding to the first recognition results.
根据第二方面，在第七种可能的实现方式中，所述装置还包括：方位角采集模块，用于获取重力传感器在每个摄像头采集所述图像时对应的三维直角坐标系的坐标轴上的加速度，得到每个摄像头采集所述图像时的方向向量；其中，每个摄像头采集所述图像时对应的三维直角坐标系以所述每个摄像头为原点，z方向为沿着摄像头拍摄的方向，x和y分别为与z方向垂直的方向，且x和y所在的平面与z方向垂直；根据所述方向向量和所述重力单位向量计算所述方位角。According to the second aspect, in a seventh possible implementation manner, the apparatus further includes: an azimuth angle acquisition module, configured to acquire, from the gravity sensor, the acceleration along the coordinate axes of the three-dimensional rectangular coordinate system corresponding to each camera when that camera captures the image, to obtain the direction vector of each camera when it captures the image; wherein the three-dimensional rectangular coordinate system corresponding to each camera takes that camera as the origin, the z direction is the direction along which the camera shoots, x and y are directions perpendicular to the z direction, and the plane formed by x and y is perpendicular to the z direction; the azimuth angle is calculated according to the direction vector and the gravity unit vector.
根据第二方面的第一种可能的实现方式,在第八种可能的实现方式中,所述装置还包括:图像预处理模块,用于对所述图像进行预处理;其中,所述预处理包括以下处理中的一种或多种的组合:转换图像格式、转换图像通道、统一图像尺寸、图像归一化,转换图像格式是指将彩色图像转换为黑白图像,转换图像通道是指将图像转换到红绿蓝RGB通道,统一图像尺寸是指调整多个图像的长度相同、宽度相同,图像归一化是指将图像的像素值归一化。According to a first possible implementation manner of the second aspect, in an eighth possible implementation manner, the apparatus further includes: an image preprocessing module, configured to preprocess the image; wherein the preprocessing Includes one or a combination of the following processes: Convert Image Format, Convert Image Channels, Unify Image Size, Image Normalization, Convert Image Format refers to converting a color image to black and white, Convert Image Channel refers to Converting to red, green, and blue RGB channels, unified image size refers to adjusting the length and width of multiple images to be the same, and image normalization refers to normalizing the pixel values of the images.
第三方面,本申请的实施例提供了一种终端设备,该终端设备可以执行上述第一方面或者第一方面的多种可能的实现方式中的一种或几种的场景识别方法。In a third aspect, embodiments of the present application provide a terminal device, where the terminal device can execute the above first aspect or one or more of the scene recognition methods in multiple possible implementation manners of the first aspect.
第四方面，本申请的实施例提供了一种计算机程序产品，包括计算机可读代码，或者承载有计算机可读代码的非易失性计算机可读存储介质，当所述计算机可读代码在电子设备中运行时，所述电子设备中的处理器执行上述第一方面或者第一方面的多种可能的实现方式中的一种或几种的场景识别方法。In a fourth aspect, embodiments of the present application provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, where, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the scene recognition method of the first aspect or of one or more of the multiple possible implementation manners of the first aspect.
第五方面，本申请的实施例提供了一种非易失性计算机可读存储介质，其上存储有计算机程序指令，其特征在于，所述计算机程序指令被处理器执行时实现上述第一方面或者第一方面的多种可能的实现方式中的一种或几种的场景识别方法。In a fifth aspect, embodiments of the present application provide a non-volatile computer-readable storage medium on which computer program instructions are stored, wherein, when the computer program instructions are executed by a processor, the scene recognition method of the first aspect or of one or more of the multiple possible implementation manners of the first aspect is implemented.
本申请的这些和其他方面在以下(多个)实施例的描述中会更加简明易懂。These and other aspects of the present application will be more clearly understood in the following description of the embodiment(s).
附图说明Description of drawings
包含在说明书中并且构成说明书的一部分的附图与说明书一起示出了本申请的示例性实施例、特征和方面,并且用于解释本申请的原理。The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features and aspects of the application and together with the description, serve to explain the principles of the application.
图1a和图1b分别示出根据本申请一实施例的应用场景的示意图。1a and 1b respectively show schematic diagrams of application scenarios according to an embodiment of the present application.
图2a示出根据本申请一实施例的神经网络模型进行场景识别的示意图。Fig. 2a shows a schematic diagram of scene recognition performed by a neural network model according to an embodiment of the present application.
图2b示出根据本申请一实施例的场景识别方法的流程图。FIG. 2b shows a flowchart of a scene recognition method according to an embodiment of the present application.
图3示出根据本申请一实施例的应用场景的示意图。FIG. 3 shows a schematic diagram of an application scenario according to an embodiment of the present application.
图4a和图4b分别示出根据本申请一实施例的方位角确定方式的示意图。FIG. 4a and FIG. 4b respectively show schematic diagrams of an azimuth angle determination manner according to an embodiment of the present application.
图5示出根据本申请一实施例的神经网络模型的结构的框图。FIG. 5 is a block diagram showing the structure of a neural network model according to an embodiment of the present application.
图6示出根据本申请一实施例的第一层模型的结构示意图。FIG. 6 shows a schematic structural diagram of a first layer model according to an embodiment of the present application.
图7示出根据本申请一实施例的场景识别方法的流程图。FIG. 7 shows a flowchart of a scene recognition method according to an embodiment of the present application.
图8示出根据本申请一实施例的步骤S701的方法的流程图。FIG. 8 shows a flowchart of the method of step S701 according to an embodiment of the present application.
图9示出根据本申请一实施例的场景识别装置的框图。FIG. 9 shows a block diagram of a scene recognition apparatus according to an embodiment of the present application.
图10示出根据本申请一实施例的终端设备的结构示意图。FIG. 10 shows a schematic structural diagram of a terminal device according to an embodiment of the present application.
具体实施方式Detailed ways
以下将参考附图详细说明本申请的各种示例性实施例、特征和方面。附图中相同的附图标记表示功能相同或相似的元件。尽管在附图中示出了实施例的各种方面,但是除非特别指出,不必按比例绘制附图。Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. The same reference numbers in the figures denote elements that have the same or similar functions. While various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
在这里专用的词“示例性”意为“用作例子、实施例或说明性”。这里作为“示例性”所说明的任何实施例不必解释为优于或好于其它实施例。The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
另外,为了更好的说明本申请,在下文的具体实施方式中给出了众多的具体细节。本领域技术人员应当理解,没有某些具体细节,本申请同样可以实施。在一些实例中,对于本领域技术人员熟知的方法、手段、元件和电路未作详细描述,以便于凸显本申请的主旨。In addition, in order to better illustrate the present application, numerous specific details are given in the following detailed description. It should be understood by those skilled in the art that the present application may be practiced without certain specific details. In some instances, methods, means, components and circuits well known to those skilled in the art have not been described in detail so as not to obscure the subject matter of the present application.
名词解释Glossary
方位角:摄像头的方向向量与重力单位向量的夹角。Azimuth: the angle between the direction vector of the camera and the unit vector of gravity.
摄像头的方向向量:以摄像头为原点建立三维直角坐标系,z方向为沿着摄像头拍摄的方向,x和y分别为与z方向垂直的方向,且x和y所在的平面与z方向垂直,重力传感器在x、y和z三个方向的加速度组成的向量为摄像头的方向向量。The direction vector of the camera: establish a three-dimensional rectangular coordinate system with the camera as the origin, the z direction is the direction along the camera, x and y are the directions perpendicular to the z direction, and the plane where x and y are located is perpendicular to the z direction, gravity The vector composed of the acceleration of the sensor in the three directions of x, y and z is the direction vector of the camera.
重力单位向量:(0,0,1)。Gravity unit vector: (0,0,1).
因此，本申请要解决的技术问题是，如何提高基于图像的场景识别方法的识别准确度。现有的基于图像的场景识别方法存在以下问题：一、仅通过单摄像头采集图像进行识别，视角范围小，观察到的有效特征少，场景识别召回率低；二、由于是用户无感的场景识别，摄像头拍摄的角度任意，因此，相似的物体或特征容易产生误判，例如，当图像中出现天花板的特征，则识别为室内，实际上则可能是在地铁或飞机上；三、同一图像中存在着大量的物体特征信息，噪声信息可能淹没主要特征，造成主要特征的误判。Therefore, the technical problem to be solved by this application is how to improve the recognition accuracy of image-based scene recognition methods. Existing image-based scene recognition methods have the following problems: 1. Images are collected by a single camera only, so the viewing angle range is small, few effective features are observed, and the scene recognition recall rate is low; 2. Because the scene recognition is imperceptible to the user and the camera may shoot at an arbitrary angle, similar objects or features easily lead to misjudgment; for example, when a ceiling feature appears in the image, the scene is recognized as indoor, while the user may actually be in a subway or on an airplane; 3. A single image contains a large amount of object feature information, and noise may drown out the main features, causing the main features to be misjudged.
为了解决上述技术问题,本申请提供了一种场景识别方法,利用多个摄像头拍摄同一场景的多个方位角的图像,获取多个摄像头拍摄(采集)图像时的方位角,每个摄像头拍摄(采集)图像时的方位角为图像对应的方位角,结合多个图像和每个图像对应的方位角对场景进行识别,由于获得了更全面的场景信息,因此,可以提高基于图像的场景识别的准确度,解决了单摄像头识别场景视角范围、拍摄角度受限的问题,识别更精准。In order to solve the above-mentioned technical problems, the present application provides a scene recognition method, which uses multiple cameras to shoot images of multiple azimuth angles of the same scene, and obtains the azimuth angles when the multiple cameras shoot (collect) images, and each camera shoots ( The azimuth angle when collecting) images is the azimuth angle corresponding to the image, and the scene is recognized by combining multiple images and the azimuth angle corresponding to each image. Since more comprehensive scene information is obtained, the image-based scene recognition can be improved. The accuracy solves the problem of limited viewing angle range and shooting angle of single camera recognition scene, and the recognition is more accurate.
在一种可能的实现方式中,本申请实施例的多个摄像头可以设置在终端设备上,比如说多个摄像头可以是设置在手机上的前置摄像头和后置摄像头,多个摄像头可以是设置在车身的多个不同方位的摄像头,多个摄像头也可以是设置在无人机的多个不同方向的摄像头,等等。需要说明的是以上应用的场景仅仅是本申请的一些示例,本申请不限于此。In a possible implementation manner, the multiple cameras of the embodiments of the present application may be set on the terminal device. For example, the multiple cameras may be the front camera and the rear camera set on the mobile phone, and the multiple cameras may be set Multiple cameras at different orientations of the vehicle body, multiple cameras can also be multiple cameras set at different orientations of the drone, and so on. It should be noted that the above application scenarios are only some examples of the present application, and the present application is not limited thereto.
图1a和图1b分别示出根据本申请一实施例的应用场景的示意图。如图1a所示,手机上可以设置有前置摄像头和后置摄像头,通过手机的前置摄像头和后置摄像头可以采集两个角度的图像,还可以获取前置摄像头和后置摄像头的方位角,根据两个角度的图像结合对应的方位角可以更准确的识别场景。如图1b所示,自动驾驶汽车的车身上可以设置多个摄像头,多个摄像头可以分别设置在不同的位置,比如说如图1b所示,可以设置在车头、车尾、车身两侧、车顶等等,每个摄像头的方向还可以单独调节,自动驾驶汽车上还可以设置有控制器,控制器可以连接多个摄像头,需要说明的是,自动驾驶汽车上还可以设置有其他传感器,比如GPS、雷达、加速度计、陀螺仪等,所有的传感器和摄像头连接到控制器,控制器可以通过多个摄像头采集不同角度的图像,还可以获得每个摄像头的方位角,根据多个图像和每个图像对应的方位角可以更准确的识别场景。1a and 1b respectively show schematic diagrams of application scenarios according to an embodiment of the present application. As shown in Figure 1a, the mobile phone can be provided with a front camera and a rear camera. Through the front camera and the rear camera of the mobile phone, images from two angles can be collected, and the azimuth angle of the front camera and the rear camera can also be obtained. , the scene can be more accurately recognized according to the images of the two angles combined with the corresponding azimuth angles. As shown in Figure 1b, multiple cameras can be set on the body of an autonomous vehicle, and multiple cameras can be set at different positions. Top and so on, the direction of each camera can also be adjusted individually, and a controller can also be set on the self-driving car, and the controller can be connected to multiple cameras. It should be noted that other sensors can also be set on the self-driving car, such as GPS, radar, accelerometer, gyroscope, etc., all sensors and cameras are connected to the controller, the controller can collect images of different angles through multiple cameras, and can also obtain the azimuth of each camera, according to the multiple images and each The azimuth corresponding to each image can more accurately identify the scene.
需要说明的是,图1a和图1b仅仅是本申请提供的应用场景的示例,本申请不限于此。比如说,本申请还可以应用在无人机采集图像进行场景识别的场景中。It should be noted that FIG. 1 a and FIG. 1 b are only examples of application scenarios provided by the present application, and the present application is not limited thereto. For example, the present application can also be applied to a scene in which a drone collects images for scene recognition.
本申请实施例提供的场景识别方法可以应用于终端设备,举例来说,本申请的终端设备可以是智能手机、上网本、平板电脑、笔记本电脑、可穿戴电子设备(如智能手环、智能手表等)、TV、虚拟现实设备、音响、电子墨水,等等。图10示出根据本申请一实施例的终端设备的结构示意图。以终端设备是手机为例,图10示出了手机200的结构示意图,具体可以参见下文中的具体描述。The scene recognition method provided by the embodiments of the present application can be applied to terminal devices. For example, the terminal devices of the present application can be smart phones, netbooks, tablet computers, notebook computers, wearable electronic devices (such as smart bracelets, smart watches, etc. ), TVs, virtual reality devices, speakers, e-ink, and more. FIG. 10 shows a schematic structural diagram of a terminal device according to an embodiment of the present application. Taking the terminal device as a mobile phone as an example, FIG. 10 shows a schematic structural diagram of the mobile phone 200 , for details, please refer to the specific description below.
在本申请的一种可能的实现方式中,基于手机的前后置摄像头和加速度传感器进行场景识别。其中,前后置摄像头同时采集不同方位角的图像,手机的加速度传感器用于提取当前手机摄像头的朝向与重力方向的夹角,此夹角可以作为摄像头的方位角。In a possible implementation manner of the present application, scene recognition is performed based on the front and rear cameras and acceleration sensors of the mobile phone. Among them, the front and rear cameras simultaneously collect images of different azimuth angles, and the acceleration sensor of the mobile phone is used to extract the angle between the current direction of the mobile phone camera and the direction of gravity, and this angle can be used as the azimuth angle of the camera.
在一种可能的实现方式中，本申请实施例提供的场景识别方法可以采用神经网络模型(场景识别模型)实现，图2a示出根据本申请一实施例的神经网络模型进行场景识别的示意图。如图2a所示，本申请实施例采用的神经网络模型可以包括：多对第一特征提取层和第一层模型，每对第一特征提取层和第一层模型用于对一个方位角的图像和一个方位角的图像对应的方位角进行处理，得到第一识别结果，一个方位角的图像对应的方位角为第一识别结果对应的方位角。其中，第一特征提取层用于提取图像的特征得到特征向量(特征图)，如图2a所示的特征提取层1和特征提取层2，第一层模型用于根据所述特征向量和所述一个方位角的图像对应的方位角，得到所述第一识别结果。In a possible implementation manner, the scene recognition method provided by the embodiments of the present application may be implemented by a neural network model (the scene recognition model). Fig. 2a shows a schematic diagram of scene recognition performed by a neural network model according to an embodiment of the present application. As shown in Fig. 2a, the neural network model used in the embodiments of the present application may include multiple pairs of a first feature extraction layer and a first-layer model; each pair is used to process an image of one azimuth angle and the azimuth angle corresponding to that image to obtain a first recognition result, and the azimuth angle corresponding to the image is the azimuth angle corresponding to the first recognition result. The first feature extraction layer is used to extract features of the image to obtain feature vectors (feature maps), such as feature extraction layer 1 and feature extraction layer 2 shown in Fig. 2a; the first-layer model is used to obtain the first recognition result according to the feature vectors and the azimuth angle corresponding to the image of the one azimuth angle.
如图2a所示,神经网络模型还可以包括第二层模型,第二层模型用于根据第一识别结果和第一识别结果对应的方位角,得到场景识别结果。As shown in FIG. 2a, the neural network model may further include a second layer model, and the second layer model is used to obtain the scene recognition result according to the first recognition result and the azimuth angle corresponding to the first recognition result.
图2a所示的示例包括两对第一特征提取层和第一层模型，图2a所示的应用场景可以应用于双摄(前后双摄)的场景下，图像1和图像2可以为通过不同角度的摄像头采集得到的图像，比如说，图像1为手机的前置摄像头采集得到的，图像2为手机的后置摄像头采集得到的。The example shown in Fig. 2a includes two pairs of a first feature extraction layer and a first-layer model. The application scenario shown in Fig. 2a can be a dual-camera (front and rear) scenario, where image 1 and image 2 are images captured by cameras at different angles; for example, image 1 is captured by the front camera of the mobile phone and image 2 is captured by the rear camera of the mobile phone.
在一种可能的实现方式中,神经网络模型中包括的第一特征提取层和第一层模型的对数可以根据具体应用场景中设置摄像头的角度的数量配置,比如说,第一特征提取层和第一层模型的对数可以大于或者等于角度数量。In a possible implementation manner, the logarithm of the first feature extraction layer and the first layer model included in the neural network model can be configured according to the number of camera angles set in a specific application scenario. For example, the first feature extraction layer The logarithm to the first layer model can be greater than or equal to the number of angles.
在一种可能的实现方式中,第一特征提取层可以采用卷积神经网络(CNN,Convolutional Neural Networks)实现,通过CNN提取输入的图像的特征,得到特征图,比如说,可以采用VGG模型(Visual Geometry Group Network)、Inception模型、MobileNet、ResNet、DenseNet、Transformer等卷积神经网络模型作为第一特征提取层提取特征图,还可以自定义卷积神经网络结构作为第一特征提取层,本申请对此不作限定。第一层模型和第二层模型都可以是基于注意力机制实现,第二层模型还可以通过对各方位角预设加权重或者根据方位角预设权重映射函数的方式实现,本申请对此不作限定。In a possible implementation, the first feature extraction layer can be implemented by using Convolutional Neural Networks (CNN, Convolutional Neural Networks). The features of the input image are extracted through CNN to obtain a feature map. For example, the VGG model ( Visual Geometry Group Network), Inception model, MobileNet, ResNet, DenseNet, Transformer and other convolutional neural network models are used as the first feature extraction layer to extract feature maps, and the convolutional neural network structure can also be customized as the first feature extraction layer. This is not limited. Both the first-layer model and the second-layer model can be implemented based on the attention mechanism, and the second-layer model can also be implemented by presetting weights for each azimuth angle or presetting a weight mapping function according to the azimuth angle. Not limited.
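As a rough illustration of this choice only, the first feature extraction layer could reuse an off-the-shelf convolutional backbone. The snippet below assumes a PyTorch/torchvision environment and uses MobileNetV2 purely as a stand-in for any of the backbones listed above; the names, input size and resulting shapes are illustrative, not part of the original disclosure.

import torch
import torchvision.models as models

# Illustrative backbone: MobileNetV2's convolutional part as the first feature extraction layer.
backbone = models.mobilenet_v2().features

with torch.no_grad():
    image = torch.randn(1, 3, 600, 800)        # one preprocessed image, (N, C, H, W)
    feature_map = backbone(image)              # (1, 1280, 19, 25) for this input size
    y = feature_map.flatten(2).squeeze(0).T    # n feature vectors y_i, one per spatial location
print(y.shape)                                 # torch.Size([475, 1280])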
本申请实施例提供的场景识别方法，使用两层基于竞争机制（注意力机制）的场景识别模型：第一层模型可根据方位角赋予卷积结果（CNN对输入图像进行卷积运算得到的特征向量）不同的权重，激活识别不同场景局部特征的神经元，提取图像在该方位角下的关键特征，进行第一次场景分类，得到第一识别结果；第二层模型通过计算不同场景下不同方位角进行场景分类结果的权重，加权求和得到多张不同视角图像的分类结果，得到最终的场景识别结果。使用两层基于竞争机制的场景识别模型并结合方位角信息能够识别关键特征并过滤无关信息，有效降低误识别概率，例如，从朝上的视角无法分辨飞机和高铁（天花板特征相似、难以区分），而从侧面的视角可以分辨（飞机圆形窗户与高铁方形窗户易区分），本申请实施例的场景识别方法有助于减少误判。The scene recognition method provided by the embodiments of the present application uses a two-layer scene recognition model based on a competition mechanism (attention mechanism): the first-layer model can assign different weights to the convolution results (the feature vectors obtained by the CNN performing convolution operations on the input image) according to the azimuth angle, activate the neurons that recognize local features of different scenes, extract the key features of the image at that azimuth angle, and perform a first scene classification to obtain the first recognition result; the second-layer model computes the weights of the scene classification results obtained at different azimuth angles in different scenes and combines the classification results of the multiple images of different viewing angles by weighted summation to obtain the final scene recognition result. Using a two-layer competition-based scene recognition model together with azimuth information can identify key features and filter out irrelevant information, effectively reducing the probability of misrecognition. For example, an airplane and a high-speed train cannot be distinguished from an upward viewing angle (their ceiling features are similar and hard to tell apart), but they can be distinguished from a side viewing angle (the round windows of an airplane are easy to distinguish from the square windows of a high-speed train); the scene recognition method of the embodiments of the present application therefore helps to reduce misjudgments.
因此,采用两层场景识别模型,采集多角度的图像并结合图像的方位角,利用竞争机制进行场景识别的方式,同时考虑了局部特征和整体特征的结果,可以提高用户无感的场景识别中识别的准确度,减少误判。Therefore, using a two-layer scene recognition model, collecting images from multiple angles and combining the azimuth of the images, using a competitive mechanism for scene recognition, taking into account the results of local features and overall features at the same time, can improve the user's senseless scene recognition. Recognition accuracy and reduce misjudgment.
图2a所示的方位角特征提取(虚线框)可以通过神经网络实现，也就是说，本申请实施例提供的神经网络模型还可以包括第二特征提取层，第二特征提取层也可以通过卷积神经网络模型实现。图2a所示的方位角特征提取(虚线框)也可以根据已有的函数对方位角进行计算得到，本申请对此不作限定。The azimuth feature extraction (dashed box) shown in Fig. 2a can be implemented by a neural network; that is to say, the neural network model provided by the embodiments of the present application may further include a second feature extraction layer, and the second feature extraction layer can also be implemented by a convolutional neural network model. The azimuth feature extraction (dashed box) shown in Fig. 2a can also be obtained by applying an existing function to the azimuth angle, which is not limited in this application.
图2a所示的传感器可以为加速度计、陀螺仪等,通过传感器采集的终端设备的运动数据,可以得到终端设备的姿态,根据终端设备的姿态以及摄像头的方位可以确定摄像头的方向,根据摄像头的方向以及重力方向可以确定摄像头的方位角。The sensor shown in Figure 2a can be an accelerometer, a gyroscope, etc. The posture of the terminal device can be obtained through the motion data of the terminal device collected by the sensor, and the direction of the camera can be determined according to the posture of the terminal device and the orientation of the camera. The orientation, as well as the direction of gravity, can determine the azimuth of the camera.
图2b示出根据本申请一实施例的场景识别方法的流程图,下面结合图2a和图2b对本申请的图像处理方法的流程进行详细说明。Fig. 2b shows a flowchart of a scene recognition method according to an embodiment of the present application. The following describes the flowchart of the image processing method of the present application in detail with reference to Figs. 2a and 2b.
1、图像采集1. Image acquisition
终端设备上设置的多个不同方位的摄像头采集同一场景下的图像,终端设备获得同一场景的不同视角(方位角)的图像。Multiple cameras with different orientations set on the terminal device collect images in the same scene, and the terminal device obtains images from different viewing angles (azimuth angles) of the same scene.
比如说,同时采用手机的前置摄像头和后置摄像头在同一场景下拍摄图像,获得同一场 景的不同视角的图像。其中,所述前置摄像头拍摄的图像可以是单摄像头拍摄的图像,也可以是多个摄像头拍摄的图像合成后的一张图像;后置摄像头拍摄的图像也可以是单摄像头拍摄的图像,也可以是多个摄像头拍摄的图像合成后的一张图像。其中,多个摄像头拍摄的图像合成为一张图像的场景下,这张图像的方位角与单摄像头的方位角相同。For example, the front camera and the rear camera of the mobile phone are used to capture images in the same scene at the same time, and images from different perspectives of the same scene are obtained. The image captured by the front camera may be an image captured by a single camera, or may be an image obtained by combining images captured by multiple cameras; the image captured by the rear camera may also be an image captured by a single camera, or an image captured by a single camera. It can be a composite image of images captured by multiple cameras. Wherein, in a scenario where images captured by multiple cameras are combined into one image, the azimuth angle of this image is the same as the azimuth angle of a single camera.
本申请实施例中摄像头拍摄的图像可以是黑白图像,也可以是RGB(Red,Green,Blue)彩色图像,也可以是RGB-D(RGB-Depth)深度图像(D是指深度信息),也可以是红外图像,本申请不作限定。The images captured by the cameras in the embodiments of the present application may be black and white images, RGB (Red, Green, Blue) color images, or RGB-D (RGB-Depth) depth images (D refers to depth information), or It can be an infrared image, which is not limited in this application.
图3示出根据本申请一实施例的应用场景的示意图。如图3所示,以乘坐地铁的场景为例,用户在地铁上看手机,手机以与地铁地板表面成一定角度倾斜,因此,手机的前置摄像头可以采集到的地铁顶部的画面,手机的后置摄像头可以采集到的地铁地板的画面。FIG. 3 shows a schematic diagram of an application scenario according to an embodiment of the present application. As shown in Figure 3, taking the scene of taking the subway as an example, the user looks at the mobile phone on the subway, and the mobile phone is inclined at a certain angle to the surface of the subway floor. Therefore, the front camera of the mobile phone can capture the picture of the top of the subway, and the The rear camera can capture the picture of the subway floor.
2、方位角采集2. Azimuth acquisition
终端设备可以在采集图像的同时,采集多个视角的摄像头的方位角。摄像头的方位角可以是指摄像头的方向向量与重力单位向量的夹角。The terminal device can collect the azimuth angles of cameras from multiple viewing angles while collecting images. The azimuth angle of the camera may refer to the angle between the direction vector of the camera and the unit vector of gravity.
在本申请的实施例中,摄像头的方向向量可以通过传感器获取,比如,重力传感器、加速度计、陀螺仪等,本申请对此不作限定。重力单位向量为g gravity=(0,0,1)。因此,本申请的实施例通过采集摄像头的方向向量,根据摄像头的方向向量和g gravity=(0,0,1)即可计算得到摄像头的方位角。 In the embodiment of the present application, the direction vector of the camera may be acquired by a sensor, such as a gravity sensor, an accelerometer, a gyroscope, etc., which is not limited in this application. The gravity unit vector is g gravity = (0,0,1). Therefore, in the embodiment of the present application, the azimuth angle of the camera can be obtained by collecting the direction vector of the camera and calculating the azimuth angle of the camera according to the direction vector of the camera and g gravity =(0,0,1).
以手机为例,其中,前置摄像头的方向向量可以通过重力传感器获得。图4a和图4b分别示出根据本申请一实施例的方位角确定方式的示意图。如图4a所示,可以以前置摄像头为原点建立三维直角坐标系,其中,z方向为沿着前置摄像头拍摄的方向、且垂直于手机平面的方向,x和y分别为与手机的边框平行、且与z方向垂直的方向。根据重力传感器在x、y和z三个方向的加速度可以得到前置摄像头的方向向量g camera=(Acc_x,Acc_y,Acc_z),假设前置摄像头的方位角为θ,那么后置摄像头拍摄的方位角可以为π-θ。因此,根据g camera和重力单位向量g gravity=(0,0,1)可以计算得到前置摄像头的方位角θ,即可得到手机的前置摄像头的方位角和后置摄像头的方位角。 Taking a mobile phone as an example, the direction vector of the front camera can be obtained by a gravity sensor. FIG. 4a and FIG. 4b respectively show schematic diagrams of an azimuth angle determination manner according to an embodiment of the present application. As shown in Figure 4a, a three-dimensional rectangular coordinate system can be established with the front camera as the origin, where the z direction is the direction along the front camera and is perpendicular to the plane of the mobile phone, and x and y are respectively parallel to the border of the mobile phone , and the direction perpendicular to the z direction. According to the acceleration of the gravity sensor in the three directions of x, y and z, the direction vector g camera = (Acc_x, Acc_y, Acc_z) of the front camera can be obtained. Assuming that the azimuth of the front camera is θ, then the azimuth of the rear camera The angle can be π-θ. Therefore, according to the g camera and the gravity unit vector g gravity = (0,0,1), the azimuth angle θ of the front camera can be calculated, and the azimuth angle of the front camera and the rear camera of the mobile phone can be obtained.
具体地,如图4b所示,手机前置摄像头的方向向量g camera与重力单位向量g gravity的夹角θ满足公式(1): Specifically, as shown in Figure 4b, the angle θ between the direction vector g camera of the front camera of the mobile phone and the gravity unit vector g gravity satisfies formula (1):
cosθ = (g_camera · g_gravity) / (|g_camera| · |g_gravity|) = Acc_z / √(Acc_x² + Acc_y² + Acc_z²)    (1)
因此,前置摄像头的方位角可以通过公式(2)计算得到:Therefore, the azimuth angle of the front camera can be calculated by formula (2):
θ = arccos( Acc_z / √(Acc_x² + Acc_y² + Acc_z²) )    (2)
因此，手机的后置摄像头的方位角为π-θ。Therefore, the azimuth angle of the rear camera of the mobile phone is π-θ.
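A minimal Python sketch of this azimuth calculation, assuming raw accelerometer readings Acc_x, Acc_y, Acc_z along the camera coordinate axes (all function and variable names are illustrative):

import math

def camera_azimuth(acc_x, acc_y, acc_z):
    """Angle between the camera direction vector (acc_x, acc_y, acc_z) and the
    gravity unit vector (0, 0, 1), i.e. formulas (1) and (2)."""
    norm = math.sqrt(acc_x ** 2 + acc_y ** 2 + acc_z ** 2)
    cos_theta = acc_z / norm                    # dot product with (0, 0, 1) divided by |g_camera|
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding errors
    return math.acos(cos_theta)                 # theta in [0, pi]

theta_front = camera_azimuth(0.2, 0.1, 9.7)    # example accelerometer reading
theta_rear = math.pi - theta_front             # the rear camera points the opposite way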
3、图像预处理3. Image preprocessing
终端设备可以对每个摄像头采集的图像分别进行预处理,其中,预处理可以包括图像格式转换、图像尺寸统一、图像归一化、图像通道转换中的一种或多种处理方式。The terminal device may separately preprocess the images collected by each camera, wherein the preprocessing may include one or more processing methods among image format conversion, image size unification, image normalization, and image channel conversion.
其中,图像格式转换可以是指将彩色图像转换为黑白图像。The image format conversion may refer to converting a color image into a black and white image.
图像通道转换可以是指将彩色图像转换到RGB三通道,三通道依次为Red、Green、Blue。Image channel conversion may refer to converting a color image to three RGB channels, and the three channels are Red, Green, and Blue in turn.
图像尺寸统一可以是指将每个摄像头采集的图像的长、宽尺寸统一,比如说,统一后图像的长为800像素、宽为600像素。Unifying the image size may refer to unifying the length and width of the images collected by each camera, for example, the length of the unified image is 800 pixels and the width is 600 pixels.
图像归一化的目的是保证提取的特征图(特征向量)的均值都在0附近，因此，对于黑白图像，图像归一化可以是指将黑白图像的像素的值减去均值127.5，然后除以128，也就是说，p_Normalized=(p-127.5)/128，其中，p可以表示黑白图像的像素值，p_Normalized可以表示归一化后的像素值；对于彩色图像，例如以RGB格式表示的图像，可以将像素的值减去均值[103.939,116.779,123.68]，即：The purpose of image normalization is to ensure that the mean values of the extracted feature maps (feature vectors) are all around 0. Therefore, for a black-and-white image, image normalization may refer to subtracting the mean value 127.5 from the pixel values and then dividing by 128, that is, p_Normalized=(p-127.5)/128, where p represents a pixel value of the black-and-white image and p_Normalized represents the normalized pixel value; for a color image, for example an image in RGB format, the mean values [103.939, 116.779, 123.68] can be subtracted from the pixel values, that is:
p_R_Normalized=p_R-103.939;p_R_Normalized=p_R-103.939;
p_G_Normalized=p_G-116.779;p_G_Normalized=p_G-116.779;
p_B_Normalized=p_B-123.68;p_B_Normalized=p_B-123.68;
其中,p_R、p_G和p_B分别表示归一化之前的像素值,p_R_Normalized、p_G_Normalized、p_B_Normalized分别表示归一化之后的像素值。Among them, p_R, p_G, and p_B represent the pixel values before normalization, respectively, and p_R_Normalized, p_G_Normalized, and p_B_Normalized represent the pixel values after normalization, respectively.
对于采集到的图像,可以通过以上预处理方式中的一种或多种处理方式进行处理。比如说,在一种可能的实现方式中,预处理的过程可以包括:The collected images may be processed through one or more of the above preprocessing methods. For example, in one possible implementation, the preprocessing process may include:
Step1:图像格式转换,将采集的彩色图像转换为黑白图像,将彩色图像转为黑白图像可以采用相关的灰度公式法或者平均值法根据转换前像素的值计算转换后像素的值;Step1: Image format conversion, convert the collected color image into a black and white image, and convert the color image into a black and white image. The relevant grayscale formula method or the average method can be used to calculate the value of the pixel after conversion according to the value of the pixel before conversion;
Step2:图像尺寸统一,统一图像尺寸的大小,例如,将图像的尺寸统一到:长800、宽600;Step2: Unify the image size, unify the size of the image size, for example, unify the size of the image to: length 800, width 600;
Step3:图像归一化,对于黑白图像p_Normalized=(p-127.5)/128。Step3: Image normalization, for black and white images p_Normalized=(p-127.5)/128.
在另一种可能的实现方式中,预处理的过程可以包括:In another possible implementation, the preprocessing process may include:
Step1:图像通道转换,对于彩色图像可以统一转换为RGB三通道,即通道依次为Red、Green、Blue;如果预处理之前的图像都为RGB格式,则可以省略图像通道转换的过程。Step1: Image channel conversion. For color images, it can be uniformly converted into RGB three channels, that is, the channels are Red, Green, and Blue in turn; if the images before preprocessing are all in RGB format, the process of image channel conversion can be omitted.
Step2:图像尺寸统一,统一图像尺寸的大小,例如,将图像的尺寸统一到:长800、宽600;Step2: Unify the image size, unify the size of the image size, for example, unify the size of the image to: length 800, width 600;
Step3:图像归一化,对于RGB图像,归一化处理方式为:Step3: Image normalization. For RGB images, the normalization processing method is:
R通道:p_R_Normalized=(p_R-103.939)/1.0;R channel: p_R_Normalized=(p_R-103.939)/1.0;
G通道:p_G_Normalized=(p_G-116.779)/1.0;G channel: p_G_Normalized=(p_G-116.779)/1.0;
B通道:p_B_Normalized=(p_B-123.68)/1.0。B channel: p_B_Normalized=(p_B-123.68)/1.0.
在本申请的实施例中,终端设备对每个角度的摄像头采集到的图像的预处理方式可以相同,也可以不同,本申请对此不作限定。通过对采集到的图像进行预处理可以统一图像的格式,有利于后续特征提取和场景识别的过程。In the embodiment of the present application, the preprocessing manner of the terminal device for the images collected by the cameras at each angle may be the same or different, which is not limited in the present application. By preprocessing the collected images, the format of the images can be unified, which is beneficial to the subsequent process of feature extraction and scene recognition.
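A possible preprocessing pipeline along these lines, sketched with OpenCV and NumPy and assuming a BGR input image; the mean values follow the examples above, and the function name is illustrative only:

import cv2
import numpy as np

def preprocess(img_bgr, size=(800, 600), grayscale=False):
    """Unify size, convert channels/format and normalize one captured image."""
    img = cv2.resize(img_bgr, size)                                  # unify image size (width 800, height 600)
    if grayscale:                                                    # convert format: color -> black and white
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
        return (gray - 127.5) / 128.0                                # normalize black-and-white pixels
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32)    # convert channels to R, G, B order
    return rgb - np.array([103.939, 116.779, 123.68], np.float32)    # per-channel mean subtraction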
4、方位角特征提取4. Azimuth feature extraction
对第2步中采集到的方位角进行特征提取得到方位角特征，提取方位角特征可以对方位角进行以下处理方式中的一种或多种的组合：数值归一化、离散化、一位有效编码、三角函数变换，等等。本申请提供了多种不同的特征提取方式，以下是几个示例。Feature extraction is performed on the azimuth angles collected in step 2 to obtain azimuth features. To extract the azimuth features, one or a combination of the following processing methods may be applied to the azimuth angle: numerical normalization, discretization, one-hot encoding, trigonometric function transformation, and so on. This application provides a variety of feature extraction methods; several examples are given below.
示例1，终端设备可以对采集到的方位角离散化，再进行one-hot编码(一位有效编码)。离散化是在不改变数据的相对大小的情况下，将个体映射到有限的空间中，比如说，对于本申请实施例的方位角，可以将方位角离散划分为[0°,45°),[45°,90°),[90°,135°),[135°,180°]四个区间，方位角映射到的区间对应的编码为1，其他区间编码为0，四个区间对应的特征向量分别为[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]。比如说，方位角θ为30°，映射到区间[0°,45°)，那么，对应的方位角特征为[1,0,0,0]。Example 1: The terminal device may discretize the collected azimuth angle and then perform one-hot encoding. Discretization maps individual values into a finite space without changing the relative size of the data. For example, for the azimuth angle in the embodiments of the present application, the azimuth angle can be divided into the four intervals [0°,45°), [45°,90°), [90°,135°), [135°,180°]; the interval to which the azimuth angle maps is encoded as 1 and the other intervals are encoded as 0, so the feature vectors corresponding to the four intervals are [1,0,0,0], [0,1,0,0], [0,0,1,0] and [0,0,0,1], respectively. For example, if the azimuth angle θ is 30°, it maps to the interval [0°,45°), and the corresponding azimuth feature is [1,0,0,0].
示例2,终端设备可以直接对方位角进行三角函数变换,变换后得到的值归一化到[0,1]区间作为方位角特征。其中,三角函数变化可以是指sinθ,cosθ,tanθ等。Example 2, the terminal device may directly perform trigonometric function transformation on the azimuth angle, and the value obtained after the transformation is normalized to the [0,1] interval as the azimuth angle feature. Among them, the trigonometric function change can refer to sinθ, cosθ, tanθ, etc.
示例3,将归一化后的三角函数的值[0,1]区间离散化为[0,0.25),[0.25,0.5),[0.5,0.75),[0.75,1.0]四个区间,终端设备可以先对方位角进行三角函数变换并归一化,根据归一化后的三角函数的值映射到的区间确定方位角特征。比如说,方位角θ为30°,sinθ=1/2,映射到区间[0.5,0.75),因此,方位角θ的方位角特征为[0,0,1,0]。Example 3, discretize the value [0, 1] interval of the normalized trigonometric function into four intervals of [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1.0], terminal The device can first perform trigonometric function transformation on the azimuth and normalize it, and determine the azimuth feature according to the interval to which the value of the normalized trigonometric function is mapped. For example, the azimuth angle θ is 30°, sinθ=1/2, which maps to the interval [0.5, 0.75), so the azimuth angle feature of the azimuth angle θ is [0, 0, 1, 0].
通过为神经网络模型中的卷积核赋予不同的权重(方位角特征),提取不同方位角下与场景最相关的特征,可以获得更准确的预测结果。By assigning different weights (azimuth features) to the convolution kernels in the neural network model, and extracting the features most relevant to the scene at different azimuths, more accurate prediction results can be obtained.
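A small sketch of Example 1 and Example 3 in Python, assuming the azimuth angle is given in degrees in [0°, 180°] and the bin boundaries follow the examples above:

import math
import numpy as np

def azimuth_onehot(theta_deg):
    """Example 1: map the azimuth into [0,45), [45,90), [90,135), [135,180] and one-hot encode."""
    index = min(int(theta_deg // 45), 3)
    feature = np.zeros(4, dtype=np.float32)
    feature[index] = 1.0
    return feature

def azimuth_trig_onehot(theta_deg):
    """Example 3: apply sin(theta), then discretize [0,1] into four equal bins."""
    value = round(math.sin(math.radians(theta_deg)), 6)   # in [0,1] for theta in [0 deg, 180 deg]
    index = min(int(value // 0.25), 3)
    feature = np.zeros(4, dtype=np.float32)
    feature[index] = 1.0
    return feature

print(azimuth_onehot(30))        # [1. 0. 0. 0.]
print(azimuth_trig_onehot(30))   # sin(30 deg) = 0.5 -> third bin -> [0. 0. 1. 0.]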
5、场景识别5. Scene recognition
本申请的实施例中，通过神经网络模型进行场景识别，结合图2a所示的神经网络模型的框架对场景识别的过程进行说明。In the embodiments of the present application, scene recognition is performed by a neural network model; the process of scene recognition is described below with reference to the framework of the neural network model shown in Fig. 2a.
终端设备将第3步预处理后的图像分别作为多个第一特征提取层的输入数据,将第4步得到的方位角特征分别作为多个第一层模型的输入数据,并且每对第一特征提取层和第一层模型接收的图像和方位角是关联的,也就是说,同一方位角的摄像头拍摄的图像和该摄像头对应的方位角特征分别作为一对第一特征提取层和第一层模型的输入数据。The terminal device uses the preprocessed images in step 3 as the input data of multiple first feature extraction layers, respectively uses the azimuth angle features obtained in step 4 as the input data for multiple first-layer models, and each pair of first The image and azimuth angle received by the feature extraction layer and the first layer model are associated, that is, the image captured by the camera with the same azimuth angle and the azimuth angle feature corresponding to the camera are used as a pair of the first feature extraction layer and the first Input data for the layer model.
举例来说,手机的前置摄像头拍摄了图像,并对图像进行预处理后得到图像1,前置摄像头的方位角为θ1,θ1的方位角特征为C1。神经网络模型包括特征提取层1和第一层模型1,特征提取层1的输出为第一层模型1的输入,神经网络模型还可以包括特征提取层2和第一层模型2,特征提取层2的输出为第一层模型2的输入。终端设备可以将图像1作为特征提取层1的输入、将C1作为第一层模型1的输入,或者,终端设备也可以将图像1作为特征提取层2的输入、将C1作为第一层模型2的输入。也就是说,本申请的图像1、图像2、特征提取层1、特征提取层2、第一层模型1以及第一层模型2的序号不作为顺序和对应关系的限定,仅仅是为了区分不同的模块而设置的编号,不解释为对本申请的限定。For example, the front camera of the mobile phone captures an image and preprocesses the image to obtain image 1. The azimuth angle of the front camera is θ1, and the azimuth angle feature of θ1 is C1. The neural network model includes feature extraction layer 1 and the first layer model 1. The output of the feature extraction layer 1 is the input of the first layer model 1. The neural network model can also include the feature extraction layer 2 and the first layer model 2. The feature extraction layer The output of 2 is the input of the first layer model 2. The terminal device can use image 1 as the input of the feature extraction layer 1 and C1 as the input of the first layer model 1, or the terminal device can also use the image 1 as the input of the feature extraction layer 2 and C1 as the first layer model 2 input of. That is to say, the serial numbers of image 1, image 2, feature extraction layer 1, feature extraction layer 2, first-layer model 1, and first-layer model 2 in this application are not intended to limit the order and correspondence, but are only to distinguish different The numbers set for the modules are not to be construed as limitations on this application.
这样,第一特征提取层对图像的特征进行提取得到特征图(特征向量),将特征图(特征向量)输入到第一层模型,第一层模型根据特征图和图像对应的方位角特征,对图像的场景进行识别(分类),可以得到第一识别结果。In this way, the first feature extraction layer extracts the features of the image to obtain a feature map (feature vector), and inputs the feature map (feature vector) into the first layer model. The first recognition result can be obtained by recognizing (classifying) the scene of the image.
终端设备还可以将方位角特征作为第二层模型的输入数据,第二层模型根据第一层模型输出的第一识别结果和对应的方位角特征,可以进一步对场景进行识别(分类),得到场景识别结果。The terminal device can also use the azimuth angle feature as the input data of the second-layer model, and the second-layer model can further identify (classify) the scene according to the first recognition result output by the first-layer model and the corresponding azimuth angle feature, and obtain Scene recognition results.
图5示出根据本申请一实施例的神经网络模型的结构的框图。图6示出根据本申请一实施例的第一层模型的结构示意图。FIG. 5 is a block diagram showing the structure of a neural network model according to an embodiment of the present application. FIG. 6 shows a schematic structural diagram of a first layer model according to an embodiment of the present application.
假设本申请的实施例中,CNN提取的图像的特征向量的个数为n个,本申请的终端设备对J个角度(方位角)进行了图像采集,也就是采用J个角度的摄像头采集了J个角度的图 像。Assuming that in the embodiment of the present application, the number of feature vectors of the image extracted by the CNN is n, the terminal device of the present application performs image acquisition on J angles (azimuth angles), that is, the camera with J angles is used to collect images. Image from J angles.
如图5所示,假设在图5的示例中,J为2,两个角度的摄像头分别采集了图像1和图像2,图像1对应的方位角为θ1,图像2对应的方位角为θ2,方位角θ1的方位角特征为C1,方位角θ2的方位角特征为C2。图像1作为上侧的CNN的输入数据,上侧的CNN对输入的图像1进行特征提取可以得到特征向量y i,i为1到n的正整数。图像2作为下侧的CNN的输入数据,下侧的CNN对输入的图像2进行特征提取可以得到特征向量x iAs shown in Figure 5, assuming that in the example of Figure 5, J is 2, the cameras at two angles have collected image 1 and image 2 respectively, the azimuth angle corresponding to image 1 is θ1, and the azimuth angle corresponding to image 2 is θ2, The azimuth angle characteristic of the azimuth angle θ1 is C1, and the azimuth angle characteristic of the azimuth angle θ2 is C2. Image 1 is used as the input data of the CNN on the upper side, and the CNN on the upper side performs feature extraction on the input image 1 to obtain a feature vector y i , where i is a positive integer from 1 to n. The image 2 is used as the input data of the CNN on the lower side, and the CNN on the lower side performs feature extraction on the input image 2 to obtain a feature vector xi .
终端设备将特征向量y i和方位角特征C1作为上侧的第一层模型的输入数据。如图6所示,第一层模型根据特征向量y i和方位角特征C1,可以计算得到第一识别结果Z j,其中j为1到J的正整数,J表示采集图像的角度(方位角)的数量,在本示例中,J等于2,在图6的示例中,j等于1。第一层模型可以基于注意力机制实现,具体的,可以包括图6所示的激活函数(tanh)、softmax函数以及加权平均过程。其中,tanh是其中一种激活函数,本申请不限于此,还可以采用其他激活函数,如:sigmoid,ReLU,LReLU,ELU函数,等等。 The terminal device takes the feature vector yi and the azimuth angle feature C1 as the input data of the first layer model on the upper side. As shown in Figure 6, the first layer model can calculate the first recognition result Z j according to the feature vector y i and the azimuth angle feature C1, where j is a positive integer from 1 to J, and J represents the angle of the collected image (azimuth angle ), J is equal to 2 in this example, and j is equal to 1 in the example of FIG. 6 . The first layer model can be implemented based on the attention mechanism, and specifically, it can include the activation function (tanh), the softmax function, and the weighted average process shown in FIG. 6 . Among them, tanh is one of the activation functions, and the present application is not limited to this, and other activation functions can also be used, such as: sigmoid, ReLU, LReLU, ELU function, and so on.
具体的计算方式如下公式(3)所示:The specific calculation method is shown in the following formula (3):
m_i = tanh(W_i · [C_j, y_i] + b_i)
s_i = exp(u · m_i) / Σ_k exp(u · m_k)    (3)
其中，C_j表示这个角度的图像对应的方位角特征，y_i为CNN输出的特征向量，W_i和b_i分别为激活函数的权重和偏置值，[C_j, y_i]表示将特征向量和方位角特征拼接得到的向量，u表示softmax函数的参数。根据tanh函数计算得到的m_i可以确定是否激活对应的神经元，提取tanh函数对应的特征作为分类的依据，softmax函数对m_i进行归一化得到每个特征向量的权重。因此，方位角特征可以影响计算得到的特征向量y_i的权重s_i：如果计算得到的s_i比较大，那么该特征向量对分类的结果影响较大；如果计算得到的s_i比较小，那么该特征向量对分类的结果影响较小。因此，本申请提供的场景识别的模型可以根据方位角识别关键特征、过滤无关信息，减少噪声，提高识别准确度，实现用户无感的场景识别。Here, C_j denotes the azimuth feature corresponding to the image at this angle, y_i is a feature vector output by the CNN, W_i and b_i are the weight and bias of the activation function, respectively, [C_j, y_i] denotes the vector obtained by concatenating the feature vector and the azimuth feature, and u denotes the parameter of the softmax function. The value m_i calculated by the tanh function determines whether the corresponding neuron is activated, the features corresponding to the tanh function are extracted as the basis for classification, and the softmax function normalizes m_i to obtain the weight of each feature vector. Therefore, the azimuth feature can influence the weight s_i of the feature vector y_i: if the calculated s_i is relatively large, the feature vector has a greater influence on the classification result; if s_i is relatively small, its influence is smaller. Hence, the scene recognition model provided by the present application can identify key features according to the azimuth angle, filter out irrelevant information, reduce noise, improve recognition accuracy, and realize scene recognition imperceptible to the user.
根据公式(3)计算得到的权重s i可以表示赋予从图像提取的不同特征的权重值,将特征向量和对应的权重进行加权求和,可以得到第一层模型的第一识别结果Z 1The weight si calculated according to formula (3) can represent the weight value assigned to different features extracted from the image, and the feature vector and the corresponding weight are weighted and summed to obtain the first recognition result Z 1 of the first layer model.
终端设备将特征向量x i(i为1到n’的正整数,n’和n可以相等,也可以不相等,本申请对此不作限定)和方位角特征C 2作为下侧的第一层模型的输入数据,根据与上侧的第一层模型相同的方式,可以计算得到第一识别结果Z 2The terminal device uses the feature vector x i (i is a positive integer from 1 to n', and n' and n may be equal or not equal, which is not limited in this application) and the azimuth angle feature C 2 as the first layer of the lower side. The input data of the model can be calculated to obtain the first recognition result Z 2 in the same manner as the first layer model on the upper side.
本申请实施例的神经网络模型通过CNN提取特征图,结合方位角对图像的不同卷积结果赋予不同权重,第一层模型通过竞争性机制识别关键特征并过滤无关信息,能够有效降低误识别概率。The neural network model of the embodiment of the present application extracts feature maps through CNN, and assigns different weights to different convolution results of images in combination with azimuth angles. The first-layer model identifies key features and filters irrelevant information through a competitive mechanism, which can effectively reduce the probability of misrecognition. .
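A simplified NumPy sketch of the first-layer attention of formula (3); it shares a single W, b and u across all feature vectors, which is one possible reading of the description above, and all shapes and names are illustrative:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def first_layer_attention(Y, c, W, b, u):
    """Formula (3): Y is the (n, d) matrix of feature vectors y_i from one image,
    c is the azimuth feature C_j; W, b and u are learned parameters."""
    scores = np.array([u @ np.tanh(W @ np.concatenate([c, y_i]) + b) for y_i in Y])
    s = softmax(scores)                      # weights s_i over the n feature vectors
    return (s[:, None] * Y).sum(axis=0)      # Z_j = sum_i s_i * y_i

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 8))                  # n = 5 feature vectors of dimension 8
c = np.array([1.0, 0.0, 0.0, 0.0])           # one-hot azimuth feature
W, b, u = rng.normal(size=(16, 12)), np.zeros(16), rng.normal(size=16)
Z_1 = first_layer_attention(Y, c, W, b, u)   # first recognition result, shape (8,)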
在一种可能的实现方式中，第二层模型的结构如图5所示，f_1和f_2表示激活函数tanh，其中，第二层模型中tanh函数的个数与采集图像的角度的数量相关，tanh函数的个数可以等于或者大于采集图像的角度的数量。tanh函数的输入包括第一层模型的输出结果Z_j和方位角特征C_j，第二层模型还包括softmax函数，softmax函数根据tanh函数的计算结果计算第一层模型的输出结果Z_j的权重S_j，最后，第二层模型根据计算得到的权重S_j和第一层模型的输出结果Z_j计算最终的场景识别结果Z。具体的计算方式如下公式(4)所示：In a possible implementation manner, the structure of the second-layer model is shown in Fig. 5, where f_1 and f_2 represent the activation function tanh. The number of tanh functions in the second-layer model is related to the number of angles at which images are collected, and can be equal to or greater than that number. The input of the tanh function includes the output result Z_j of the first-layer model and the azimuth feature C_j; the second-layer model further includes a softmax function, which calculates the weight S_j of the output result Z_j of the first-layer model according to the calculation results of the tanh functions. Finally, the second-layer model calculates the final scene recognition result Z according to the calculated weights S_j and the output results Z_j of the first-layer model. The specific calculation is shown in formula (4):
M_j = tanh(W_j · [C_j, Z_j] + b_j)
S_j = exp(v · M_j) / Σ_k exp(v · M_k)    (4)
其中，[C_j, Z_j]表示第一识别结果Z_j和对应的方位角特征C_j拼接得到的向量，W_j和b_j分别表示tanh函数的权重和偏置值，v表示softmax函数的参数。根据tanh函数计算得到的M_j可以确定是否激活对应的神经元，提取tanh函数对应的特征(第一识别结果)作为分类的依据，softmax函数对M_j进行归一化得到每个第一识别结果的权重。因此，方位角特征可以影响计算得到的第一识别结果Z_j的权重S_j：如果计算得到的S_j比较大，那么该第一识别结果对分类的结果影响较大；如果计算得到的S_j比较小，那么该第一识别结果对分类的结果影响较小。因此，本申请提供的场景识别的模型可以根据方位角提取关键角度的特征、过滤无关角度，提高识别准确度，实现用户无感的场景识别。Here, [C_j, Z_j] denotes the vector obtained by concatenating the first recognition result Z_j and the corresponding azimuth feature C_j, W_j and b_j denote the weight and bias of the tanh function, respectively, and v denotes the parameter of the softmax function. The value M_j calculated by the tanh function determines whether the corresponding neuron is activated, the feature corresponding to the tanh function (the first recognition result) is extracted as the basis for classification, and the softmax function normalizes M_j to obtain the weight of each first recognition result. Therefore, the azimuth feature can influence the weight S_j of the first recognition result Z_j: if the calculated S_j is relatively large, that first recognition result has a greater influence on the classification result; if S_j is relatively small, its influence is smaller. Hence, the scene recognition model provided by the present application can extract features of key angles according to the azimuth angle, filter out irrelevant angles, improve recognition accuracy, and realize scene recognition imperceptible to the user.
根据公式(4)计算得到的权重S j可以表示赋予不同的第一识别结果的权重值,将第一识别结果和对应的权重进行加权求和,可以得到第二层模型的场景识别结果Z。 The weight S j calculated according to formula (4) can represent the weight value assigned to different first recognition results, and the first recognition result and the corresponding weight are weighted and summed to obtain the scene recognition result Z of the second layer model.
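A matching NumPy sketch of the second-layer fusion of formula (4), under the same simplifying assumptions (the per-view parameters W_j, b_j are written as a single shared W, b, and v is the softmax parameter):

import numpy as np

def second_layer_attention(Z, C, W, b, v):
    """Formula (4): Z is the (J, k) matrix of first recognition results Z_j,
    C is the (J, a) matrix of azimuth features C_j; W, b and v are learned parameters."""
    scores = np.array([v @ np.tanh(W @ np.concatenate([C[j], Z[j]]) + b) for j in range(len(Z))])
    S = np.exp(scores - scores.max())
    S = S / S.sum()                          # weights S_j over the J views
    return (S[:, None] * Z).sum(axis=0)      # final scene recognition result Z = sum_j S_j * Z_j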
其中,W i、b i
Figure PCTCN2021140833-appb-000008
W j、b j
Figure PCTCN2021140833-appb-000009
均为模型参数,可以采用样本数据对图5所示的神经网络模型进行训练得到参数值,可以采用相关现有技术中的训练方法对本申请的神经网络模型进行训练,本申请对训练的过程不再赘述。
Among them, W i , bi ,
Figure PCTCN2021140833-appb-000008
W j , b j and
Figure PCTCN2021140833-appb-000009
are model parameters, the parameter values can be obtained by training the neural network model shown in FIG. 5 using sample data, and the neural network model of the present application can be trained by using the training methods in the related art. Repeat.
在另一种可能的实现方式中,第二层模型还可以通过对各方位角预设加权重、投票或者根据方位角预设权重映射函数的方式实现。也就是说,在本申请的另一个实施例中,神经网络模型包括多对特征提取层和第一层模型,还包括第二层模型,第二层模型通过对各方位角预设加权重或者根据方位角预设权重映射函数的方式实现。In another possible implementation manner, the second-layer model may also be implemented by presetting weights for each azimuth angle, voting, or presetting a weight mapping function according to the azimuth angle. That is to say, in another embodiment of the present application, the neural network model includes multiple pairs of feature extraction layers and first-layer models, and also includes a second-layer model. It is realized by the way of preset weight mapping function according to azimuth angle.
举例来说，第二层模型可以对各方位角预设加权重，对于第一层模型输出的每个第一识别结果Z_j，预设对应的加权重为S_j，第二层模型根据每个第一识别结果Z_j以及对应的加权重S_j计算可以得到场景识别结果 Z = Σ_j S_j·Z_j。For example, the second-layer model can preset a weight for each azimuth angle: for each first recognition result Z_j output by the first-layer model, a corresponding weight S_j is preset, and the second-layer model calculates the scene recognition result as Z = Σ_j S_j·Z_j from each first recognition result Z_j and its corresponding preset weight S_j.
第二层模型还可以根据方位角预设权重映射函数，也就是不同的方位角对应了不同的预设权重组，预设权重组可以包括每个第一识别结果Z_j对应的权重。举例来说，假设终端设备通过前置摄像头、后置摄像头采集了两个角度的图像，对两个角度的图像进行识别得到两个第一识别结果Z_1和Z_2，对应的权重分别为S_1和S_2，在一个示例中，根据方位角预设的权重映射函数可以如下所示：The second-layer model may also preset a weight mapping function according to the azimuth angle; that is, different azimuth angles correspond to different preset weight groups, and a preset weight group may include the weight corresponding to each first recognition result Z_j. For example, assume that the terminal device collects images at two angles through the front camera and the rear camera and recognizes the two images to obtain two first recognition results Z_1 and Z_2, whose corresponding weights are S_1 and S_2, respectively. In one example, the weight mapping function preset according to the azimuth angle can be as follows:
(1) If the front-camera azimuth angle θ belongs to the interval [0°, 45°), S_1 = 1.0 and S_2 = 0.0;
(2) If the front-camera azimuth angle θ belongs to the interval [45°, 90°), S_1 = 0.7 and S_2 = 0.3;
(3) If the front-camera azimuth angle θ belongs to the interval [90°, 135°), S_1 = 0.3 and S_2 = 0.7;
(4) If the front-camera azimuth angle θ belongs to the interval [135°, 180°], S_1 = 0.0 and S_2 = 1.0.
The above are merely some examples of implementations of the second-layer model, and the present application is not limited thereto.
When the second-layer model is implemented in the form of preset weights or a preset weight mapping function, only the other parts of the neural network model need to be trained during training, and the second-layer model itself does not need to be trained, which can improve training efficiency.
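As an illustration of the preset weight mapping function described above, a minimal Python sketch that reproduces the example intervals for a front camera and a rear camera is given below; the function names are assumptions made for the example.

```python
def azimuth_weight_mapping(theta_front):
    """Map the front-camera azimuth angle theta (in degrees) to the preset weights
    (S_1, S_2) for the front-camera and rear-camera first recognition results."""
    if 0 <= theta_front < 45:
        return 1.0, 0.0
    if 45 <= theta_front < 90:
        return 0.7, 0.3
    if 90 <= theta_front < 135:
        return 0.3, 0.7
    return 0.0, 1.0  # interval [135, 180]

def combine_results(z1, z2, theta_front):
    """Weighted sum Z = S_1*Z_1 + S_2*Z_2 of the two first recognition results."""
    s1, s2 = azimuth_weight_mapping(theta_front)
    return [s1 * a + s2 * b for a, b in zip(z1, z2)]
```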
6. Output results
The terminal device may output the final recognition result according to the scene recognition result and a preset strategy, where the preset strategy may include filtering the scene recognition result according to a confidence threshold and then outputting the final recognition result, or merging multiple categories into a larger category and outputting the final recognition result, and so on.
Example 1: Assuming the confidence threshold is set to threshold = 0.8, when the confidence of the category corresponding to the scene recognition result is greater than or equal to the threshold, the category corresponding to the scene recognition result can be predicted as the final recognition result.
Example 2: Assuming the scene recognition result includes 100 categories, the terminal device may merge several of these categories into a larger category for output; for example, cars and buses may be merged into a broader vehicle category and output as the final recognition result.
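For illustration, a minimal Python sketch of such an output strategy (confidence-threshold filtering followed by optional category merging) is given below; the dictionary-based interface and the example category names are assumptions made for the example.

```python
def output_result(scene_result, threshold=0.8, merge_map=None):
    """Apply a preset output strategy to a scene recognition result.

    scene_result: dict mapping category name -> confidence
    merge_map:    optional dict mapping fine categories to a larger category,
                  e.g. {"car": "vehicle", "bus": "vehicle"} (illustrative names)
    """
    category, confidence = max(scene_result.items(), key=lambda kv: kv[1])
    if confidence < threshold:      # Example 1: filter by confidence threshold
        return None
    if merge_map:                   # Example 2: merge categories into a larger one
        category = merge_map.get(category, category)
    return category
```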
Based on the above examples, the present application provides a scene recognition method. FIG. 7 shows a flowchart of a scene recognition method according to an embodiment of the present application. As shown in FIG. 7, the scene recognition method may include the following steps:
Step S700: the terminal device collects images of the same scene from multiple azimuth angles through multiple cameras, where the azimuth angle at which each camera collects the image is the azimuth angle corresponding to that image, and the azimuth angle is the angle between the direction vector of the camera when it collects the image and the gravity unit vector;
Step S701: the terminal device recognizes the same scene according to the images and the azimuth angles corresponding to the images, and obtains a scene recognition result.
In the scene recognition method of the embodiment of the present application, multiple cameras are used to capture images of the same scene at multiple azimuth angles, and the scene is recognized by combining the multiple images and the azimuth angle corresponding to each image. Because more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problem that a single camera has a limited viewing-angle range and shooting angle when recognizing a scene is solved, and the recognition is more accurate.
The images captured by the cameras in the embodiments of the present application may be black-and-white images, RGB (Red, Green, Blue) color images, RGB-D (RGB-Depth) depth images (where D refers to depth information), or infrared images, which is not limited in this application.
In this embodiment of the present application, an image of one azimuth angle may be an image captured by one camera, or may be a single image synthesized from images captured by multiple cameras. For example, a mobile phone may include multiple rear cameras, and the rear-camera image may be a single image synthesized from the images captured by those cameras.
In a possible implementation manner, before recognizing the same scene according to the images and the azimuth angles corresponding to the images, the method may further include:
the terminal device preprocesses the image, where the preprocessing includes one or a combination of the following operations: converting the image format, converting the image channels, unifying the image size, and image normalization. Converting the image format refers to converting a color image into a black-and-white image; converting the image channels refers to converting the image to the red-green-blue RGB channels; unifying the image size refers to adjusting multiple images to the same length and the same width; and image normalization refers to normalizing the pixel values of the image.
For the specific process, reference may be made to the process described in Section 3 above. Each azimuth angle may use the same preprocessing, or different preprocessing may be used according to the images collected at each azimuth angle, which is not limited in this application.
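For illustration, a minimal Python sketch of one possible preprocessing step is given below; the 224×224 target size and the normalization to [0, 1] are assumptions made for the example rather than values specified above.

```python
import numpy as np
from PIL import Image

def preprocess(path, size=(224, 224)):
    """Convert image channels to RGB, unify the image size, and normalize pixel values."""
    img = Image.open(path).convert("RGB")   # convert image channels to red-green-blue
    img = img.resize(size)                  # unify length and width across images
    arr = np.asarray(img, dtype=np.float32)
    return arr / 255.0                      # normalize pixel values to [0, 1]
```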
In the embodiment of the present application, the terminal device obtains, from the gravity sensor, the acceleration on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to each camera when that camera collects the image, and thereby obtains the direction vector of the camera when it collects the image. The three-dimensional rectangular coordinate system corresponding to each camera when it collects the image takes that camera as the origin, the z direction is the direction along which the camera shoots, and x and y are directions perpendicular to the z direction. The azimuth angle is then calculated according to the direction vector and the gravity unit vector. For the specific process, reference may be made to Section 2 above, and details are not repeated here.
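For illustration, a minimal Python sketch of one way to compute the azimuth angle from the gravity-sensor readings is given below, under the assumption that the camera shoots along the +z axis of its coordinate system; the sign convention and the degree output are assumptions made for the example.

```python
import math

def azimuth_angle(ax, ay, az):
    """Angle between the camera direction vector (+z in the camera coordinate system)
    and the gravity unit vector, from gravity-sensor accelerations (ax, ay, az)."""
    g = math.sqrt(ax * ax + ay * ay + az * az)
    if g == 0:
        raise ValueError("invalid gravity-sensor reading")
    cos_theta = az / g                                   # normalized z component
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))
```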
For step S701, the terminal device may preset a weight corresponding to each azimuth angle and weight the recognition result of the image of each azimuth angle according to that weight to obtain the final scene recognition result. Alternatively, the images and the azimuth angles corresponding to the images may be input into a trained neural network model to recognize the same scene and obtain the scene recognition result. The present application does not limit the specific manner of scene recognition.
FIG. 8 shows a flowchart of the method of step S701 according to an embodiment of the present application. As shown in FIG. 8, in a possible implementation manner, step S701, in which the terminal device recognizes the same scene according to the images and the azimuth angles corresponding to the images and obtains the scene recognition result, may include:
Step S7010: the terminal device extracts the azimuth angle feature corresponding to the image from the azimuth angle corresponding to the image;
Step S7011: the same scene is recognized by a scene recognition model based on the images and the azimuth angle features corresponding to the images to obtain the scene recognition result, where the scene recognition model is a neural network model.
For step S7010, reference may be made to the description in Section 4 above, and details are not repeated here. As shown in FIG. 2a, after the azimuth angle features are extracted, the images and the azimuth angle features corresponding to the images can be input into the scene recognition model, and the scene recognition model recognizes the same scene based on the images and the corresponding azimuth angle features to obtain the scene recognition result.
In the embodiments of the present application, the scene recognition model can be implemented by a variety of different neural network structures. In a possible implementation manner, as shown in FIG. 2a, the scene recognition model includes multiple pairs of a first feature extraction layer and a first-layer model, and each pair of a first feature extraction layer and a first-layer model is used to process the image of one azimuth angle and the azimuth angle corresponding to that image to obtain a first recognition result.
Examples of the first feature extraction layer are the image feature extraction layer 1 and the feature extraction layer 2 shown in FIG. 2a. Only two pairs of first feature extraction layers and first-layer models are drawn in the example of FIG. 2a, but the present application is not limited thereto; the number of pairs of first feature extraction layers and first-layer models included in the neural network model can be configured according to the number of camera angles set in the specific application scenario, and may, for example, be greater than or equal to the number of angles.
The azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result. The first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain a feature vector, and the first-layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle.
As shown in FIG. 2a, azimuth angle feature 1, extracted from the azimuth angle corresponding to image 1, can be used as an input of first-layer model 1, and azimuth angle feature 2, extracted from the azimuth angle corresponding to image 2, can be used as an input of first-layer model 2. Feature extraction layer 1 can extract feature vector 1 of image 1 and output it to first-layer model 1, and feature extraction layer 2 can extract feature vector 2 of image 2 and output it to first-layer model 2. First-layer model 1 can combine feature vector 1 and azimuth angle feature 1 to perform scene recognition and obtain first recognition result 1, and first-layer model 2 can combine feature vector 2 and azimuth angle feature 2 to perform scene recognition and obtain first recognition result 2.
As shown in FIG. 2a, the scene recognition model further includes a second-layer model, and first-layer model 1 and first-layer model 2 can output first recognition result 1 and first recognition result 2, respectively, to the second-layer model. Azimuth angle feature 1, extracted from the azimuth angle corresponding to image 1, can be used as an input of the second-layer model and corresponds to first recognition result 1; azimuth angle feature 2, extracted from the azimuth angle corresponding to image 2, can likewise be used as an input of the second-layer model and corresponds to first recognition result 2. The second-layer model is used to obtain the scene recognition result according to the first recognition results and the azimuth angles corresponding to the first recognition results.
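For illustration, the following minimal Python sketch shows the overall wiring of such a two-layer model, with one (feature extraction layer, first-layer model) pair per azimuth angle feeding a shared second-layer model; the callables passed in are placeholders and do not correspond to concrete implementations in the disclosure.

```python
def scene_recognition_model(images, azimuth_feats,
                            feature_extractors, first_layer_models, second_layer_model):
    """Two-layer scene recognition: per-azimuth first recognition results are computed
    first, then fused by the second-layer model into the scene recognition result."""
    first_results = []
    for image, c, extract, first_model in zip(images, azimuth_feats,
                                              feature_extractors, first_layer_models):
        feature_vectors = extract(image)                       # features of one azimuth image
        first_results.append(first_model(feature_vectors, c))  # first recognition result Z_j
    return second_layer_model(first_results, azimuth_feats)    # scene recognition result Z
```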
The scene recognition method of the embodiment of the present application adopts a two-layer scene recognition model, collects images from multiple angles, combines the azimuth angles of the images, and uses a competition mechanism for scene recognition while considering the results of both local features and overall features, which can improve the recognition accuracy of scene recognition without user perception and reduce misjudgments.
In a possible implementation manner, the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain multiple feature vectors; the first-layer model is used to calculate, according to the azimuth angle corresponding to the image of the one azimuth angle, a first weight corresponding to each of the multiple feature vectors; and the first-layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector. The second-layer model is used to calculate a second weight of each first recognition result according to the azimuth angle corresponding to that first recognition result, and to obtain the scene recognition result according to the first recognition results and their second weights.
As shown in FIG. 5 and FIG. 6, the first-layer model may include an activation function and a softmax function, where the activation function may be a tanh function; other types of activation functions, such as the Sigmoid activation function or the ReLU activation function, may also be used, and the activation function is not limited to the examples shown in FIG. 5 and FIG. 6. The number of activation functions in the first-layer model can be set according to the number of feature vectors extracted by the feature extraction layer and may be greater than or equal to that number. The activation function is used to determine, according to a feature vector and the azimuth angle feature, whether to activate the corresponding neuron, and the feature corresponding to the activation function is extracted as the basis for classification. The activation function and the softmax function are used to calculate the first weight corresponding to each feature vector, that is, the s_i calculated by formula (3) above; the specific process has been described above and is not repeated here.
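For illustration, a minimal Python sketch of this azimuth-conditioned weighting of the feature vectors (one tanh activation per feature vector followed by a softmax, analogous to formula (3)) is given below; the parameter names W, b and v and the aggregation by weighted sum are assumptions made for the example.

```python
import numpy as np

def first_layer_model(feature_vectors, azimuth_feat, W, b, v):
    """Weight the feature vectors y_i of one azimuth image by azimuth-conditioned
    attention and aggregate them into the first recognition result."""
    m = np.array([np.tanh(W @ np.concatenate([y, azimuth_feat]) + b) @ v
                  for y in feature_vectors])             # one activation value per y_i
    s = np.exp(m - m.max())
    s = s / s.sum()                                      # softmax -> first weights s_i
    return sum(w * y for w, y in zip(s, feature_vectors))  # weighted sum of features
```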
The azimuth angle feature can affect the weight s_i computed for the feature vector y_i: if the computed s_i is relatively large, the feature vector has a greater influence on the classification result; if the computed s_i is relatively small, the feature vector has less influence on the classification result. Therefore, the scene recognition model provided by the present application can identify key features according to the azimuth angle, filter out irrelevant information, reduce noise, improve recognition accuracy, and realize scene recognition without user perception.
Similarly, the second-layer model may include an activation function and a softmax function, where the activation function may be a tanh function; other types of activation functions may also be used, and the activation function is not limited to the examples shown in FIG. 5 and FIG. 6. The number of activation functions in the second-layer model is related to the number of angles at which images are collected and may be equal to or greater than that number. The input of an activation function includes the output result Z_j of a first-layer model and the azimuth angle feature C_j. The second-layer model also includes a softmax function, and the softmax function calculates the weight S_j of the output result Z_j of the first-layer model according to the calculation results of the activation functions; for the specific calculation process, reference may be made to formula (4) and the description in Section 5 above, and details are not repeated here.
The azimuth angle feature can affect the weight S_j computed for the first recognition result Z_j: if the computed S_j is relatively large, the first recognition result has a greater influence on the classification result; if the computed S_j is relatively small, the first recognition result has less influence on the classification result. Therefore, the scene recognition model provided by the present application can extract the features of key angles according to the azimuth angle, filter out irrelevant angles, improve recognition accuracy, and realize scene recognition without user perception.
In a possible implementation manner, the second-layer model presets a third weight corresponding to each first recognition result, and the second-layer model is used to obtain the scene recognition result according to the first recognition results and the third weights corresponding to the first recognition results. For example, the second-layer model may preset a weight for each azimuth angle: for each first recognition result Z_j output by a first-layer model, a corresponding weight S_j is preset, and the second-layer model computes the scene recognition result from each first recognition result Z_j and its corresponding weight S_j as Z = Σ_j S_j·Z_j.
In a possible implementation manner, the second-layer model is used to determine, according to the azimuth angle and a preset rule, a fourth weight corresponding to each first recognition result, where the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes the fourth weight corresponding to each first recognition result; the second-layer model is used to obtain the scene recognition result according to the first recognition results and the fourth weights corresponding to the first recognition results.
For example, the second-layer model may also preset a weight mapping function according to the azimuth angle, that is, different azimuth angles correspond to different preset weight groups, and a preset weight group may include the weight corresponding to each first recognition result Z_j. For example, it is assumed that the terminal device collects images from two angles through a front camera and a rear camera, and the two images are recognized to obtain two first recognition results Z_1 and Z_2 whose corresponding weights are S_1 and S_2, respectively. In one example, the weight mapping function preset according to the azimuth angle may be as follows:
(1) If the front-camera azimuth angle θ belongs to the interval [0°, 45°), S_1 = 1.0 and S_2 = 0.0;
(2) If the front-camera azimuth angle θ belongs to the interval [45°, 90°), S_1 = 0.7 and S_2 = 0.3;
(3) If the front-camera azimuth angle θ belongs to the interval [90°, 135°), S_1 = 0.3 and S_2 = 0.7;
(4) If the front-camera azimuth angle θ belongs to the interval [135°, 180°], S_1 = 0.0 and S_2 = 1.0.
The above are merely some examples of implementations of the second-layer model, and the present application is not limited thereto.
When the second-layer model is implemented in the form of preset weights or a preset weight mapping function, only the other parts of the neural network model need to be trained during training, and the second-layer model itself does not need to be trained, which can improve training efficiency.
An embodiment of the present application further provides a scene recognition apparatus. FIG. 9 shows a block diagram of a scene recognition apparatus according to an embodiment of the present application. As shown in FIG. 9, the apparatus may include:
an image acquisition module, configured to collect images of the same scene from multiple azimuth angles through multiple cameras, where the azimuth angle at which each camera collects the image is the azimuth angle corresponding to that image, and the azimuth angle is the angle between the direction vector of the camera when it collects the image and the gravity unit vector;
a scene recognition module, configured to recognize the same scene according to the images and the azimuth angles corresponding to the images, and obtain a scene recognition result.
The scene recognition apparatus of the embodiment of the present application uses multiple cameras to capture images of the same scene at multiple azimuth angles and recognizes the scene by combining the multiple images and the azimuth angle corresponding to each image. Because more comprehensive scene information is obtained, the accuracy of image-based scene recognition can be improved, the problem that a single camera has a limited viewing-angle range and shooting angle when recognizing a scene is solved, and the recognition is more accurate.
In a possible implementation manner, the scene recognition module includes:
an azimuth angle feature extraction module, configured to extract the azimuth angle feature corresponding to the image from the azimuth angle corresponding to the image;
a scene recognition model, configured to recognize the same scene based on the images and the azimuth angle features corresponding to the images to obtain the scene recognition result, where the scene recognition model is a neural network model.
In a possible implementation manner, the scene recognition model includes multiple pairs of a first feature extraction layer and a first-layer model, and each pair of a first feature extraction layer and a first-layer model is used to process the image of one azimuth angle and the azimuth angle corresponding to that image to obtain a first recognition result, where the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result, the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain a feature vector, and the first-layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle. The scene recognition model further includes a second-layer model, and the second-layer model is used to obtain the scene recognition result according to the first recognition results and the azimuth angles corresponding to the first recognition results.
The scene recognition apparatus of the embodiment of the present application adopts a two-layer scene recognition model, collects images from multiple angles, combines the azimuth angles of the images, and uses a competition mechanism for scene recognition while considering the results of both local features and overall features, which can improve the recognition accuracy of scene recognition without user perception and reduce misjudgments.
In a possible implementation manner, the first feature extraction layer is used to extract the features of the image of the one azimuth angle to obtain multiple feature vectors; the first-layer model is used to calculate, according to the azimuth angle corresponding to the image of the one azimuth angle, a first weight corresponding to each of the multiple feature vectors; and the first-layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
In a possible implementation manner, the second-layer model is used to calculate a second weight of each first recognition result according to the azimuth angle corresponding to that first recognition result, and to obtain the scene recognition result according to the first recognition results and their second weights.
In a possible implementation manner, the second-layer model presets a third weight corresponding to each first recognition result, and the second-layer model is used to obtain the scene recognition result according to the first recognition results and the third weights corresponding to the first recognition results.
In a possible implementation manner, the second-layer model is used to determine, according to the azimuth angle and a preset rule, a fourth weight corresponding to each first recognition result, where the preset rule is a weight group set according to the azimuth angle, different azimuth angles correspond to different weight groups, and each weight group includes the fourth weight corresponding to each first recognition result; the second-layer model is used to obtain the scene recognition result according to the first recognition results and the fourth weights corresponding to the first recognition results.
In a possible implementation manner, the apparatus further includes an azimuth angle acquisition module, configured to obtain, from the gravity sensor, the acceleration on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to each camera when that camera collects the image, and obtain the direction vector of the camera when it collects the image, where the three-dimensional rectangular coordinate system corresponding to each camera when it collects the image takes that camera as the origin, the z direction is the direction along which the camera shoots, x and y are directions perpendicular to the z direction, and the plane in which x and y lie is perpendicular to the z direction; and to calculate the azimuth angle according to the direction vector and the gravity unit vector.
In a possible implementation manner, the apparatus further includes an image preprocessing module, configured to preprocess the image, where the preprocessing includes one or a combination of the following operations: converting the image format, converting the image channels, unifying the image size, and image normalization. Converting the image format refers to converting a color image into a black-and-white image; converting the image channels refers to converting the image to the red-green-blue RGB channels; unifying the image size refers to adjusting multiple images to the same length and the same width; and image normalization refers to normalizing the pixel values of the images.
In a possible implementation manner, the apparatus further includes a result output module, configured to output the scene recognition result.
FIG. 10 shows a schematic structural diagram of a terminal device according to an embodiment of the present application. Taking a mobile phone as an example of the terminal device, FIG. 10 shows a schematic structural diagram of the mobile phone 200.
手机200可以包括处理器210,外部存储器接口220,内部存储器221,USB接口230,充电管理模块240,电源管理模块241,电池242,天线1,天线2,移动通信模块251,无线通信模块252,音频模块270,扬声器270A,受话器270B,麦克风270C,耳机接口270D,传感器模块280,按键290,马达291,指示器292,摄像头293,显示屏294,以及SIM卡接口295等。其中传感器模块280可以包括陀螺仪传感器280A,加速度传感器280B,接近光传感器280G、指纹传感器280H,触摸传感器280K(当然,手机200还可以包括其它传感器,比如温度传感器,压力传感器、距离传感器、磁传感器、环境光传感器、气压传感器、骨传导传感器等,图中未示出)。The mobile phone 200 may include a processor 210, an external memory interface 220, an internal memory 221, a USB interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 251, a wireless communication module 252, Audio module 270, speaker 270A, receiver 270B, microphone 270C, headphone jack 270D, sensor module 280, buttons 290, motor 291, indicator 292, camera 293, display screen 294, SIM card interface 295, etc. The sensor module 280 may include a gyroscope sensor 280A, an acceleration sensor 280B, a proximity light sensor 280G, a fingerprint sensor 280H, and a touch sensor 280K (of course, the mobile phone 200 may also include other sensors, such as a temperature sensor, a pressure sensor, a distance sensor, and a magnetic sensor. , ambient light sensor, air pressure sensor, bone conduction sensor, etc., not shown in the figure).
可以理解的是,本申请实施例示意的结构并不构成对手机200的具体限定。在本申请另一些实施例中,手机200可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。It can be understood that the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the mobile phone 200 . In other embodiments of the present application, the mobile phone 200 may include more or less components than shown, or combine some components, or separate some components, or arrange different components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 210 may include one or more processing units. For example, the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and so on. Different processing units may be independent devices or may be integrated into one or more processors. The controller may be the nerve center and command center of the mobile phone 200. The controller can generate an operation control signal according to the instruction operation code and a timing signal, and complete the control of fetching and executing instructions.
处理器210中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器210中的存储器为高速缓冲存储器。该存储器可以保存处理器210刚用过或循环使用的指令或数据。如果处理器210需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器210的等待时间,因而提高了系统的效率。A memory may also be provided in the processor 210 for storing instructions and data. In some embodiments, the memory in processor 210 is cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 210 . If the processor 210 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided, and the waiting time of the processor 210 is reduced, thereby improving the efficiency of the system.
处理器210可以运行本申请实施例提供的场景识别方法,以便于结合多个图像和每个图像对应的方位角对场景进行识别,获得更全面的场景信息,提高基于图像的场景识别的准确度,解决了单摄像头识别场景视角范围、拍摄角度受限的问题,识别更精准。处理器210可以包括不同的器件,比如集成CPU和GPU时,CPU和GPU可以配合执行本申请实施例提供的场景识别方法,比如场景识别方法中部分算法由CPU执行,另一部分算法由GPU执行,以得到较快的处理效率。The processor 210 can run the scene recognition method provided by the embodiment of the present application, so as to recognize the scene in combination with multiple images and the azimuth angle corresponding to each image, obtain more comprehensive scene information, and improve the accuracy of image-based scene recognition. , solves the problem of limited viewing angle range and shooting angle of single camera recognition scene, and the recognition is more accurate. The processor 210 may include different devices. For example, when a CPU and a GPU are integrated, the CPU and the GPU may cooperate to execute the scene recognition method provided by the embodiments of the present application. For example, some algorithms in the scene recognition method are executed by the CPU, and another part of the algorithms are executed by the GPU. for faster processing efficiency.
显示屏294用于显示图像,视频等。显示屏294包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,手机200可以包括1个或N个显示屏294,N为大于1的正整数。显示屏294可用于显示由用户输入的信息或提供给用户的信息以及各种图形用户界面(graphical user interface,GUI)。例如,显示器294可以显示照片、视频、网页、或者文件等。再例如,显示器294可 以显示图形用户界面。其中,图形用户界面上包括状态栏、可隐藏的导航栏、时间和天气小组件(widget)、以及应用的图标,例如浏览器图标等。状态栏中包括运营商名称(例如中国移动)、移动网络(例如4G)、时间和剩余电量。导航栏中包括后退(back)键图标、主屏幕(home)键图标和前进键图标。此外,可以理解的是,在一些实施例中,状态栏中还可以包括蓝牙图标、Wi-Fi图标、外接设备图标等。还可以理解的是,在另一些实施例中,图形用户界面中还可以包括Dock栏,Dock栏中可以包括常用的应用图标等。当处理器210检测到用户的手指(或触控笔等)针对某一应用图标的触摸事件后,响应于该触摸事件,打开与该应用图标对应的应用的用户界面,并在显示器294上显示该应用的用户界面。 Display screen 294 is used to display images, videos, and the like. Display screen 294 includes a display panel. The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode or an active-matrix organic light-emitting diode (active-matrix organic light). emitting diode, AMOLED), flexible light-emitting diode (flex light-emitting diode, FLED), Miniled, MicroLed, Micro-oLed, quantum dot light-emitting diode (quantum dot light emitting diodes, QLED) and so on. In some embodiments, cell phone 200 may include 1 or N display screens 294, where N is a positive integer greater than 1. The display screen 294 may be used to display information entered by or provided to the user as well as various graphical user interfaces (GUIs). For example, display 294 may display photos, videos, web pages, or documents, and the like. As another example, display 294 may display a graphical user interface. The GUI includes a status bar, a hideable navigation bar, a time and weather widget, and an application icon, such as a browser icon. The status bar includes operator name (eg China Mobile), mobile network (eg 4G), time and remaining battery. The navigation bar includes a back button icon, a home button icon, and a forward button icon. In addition, it can be understood that, in some embodiments, the status bar may further include a Bluetooth icon, a Wi-Fi icon, an external device icon, and the like. It can also be understood that, in other embodiments, the graphical user interface may further include a Dock bar, and the Dock bar may include commonly used application icons and the like. After the processor 210 detects a touch event of the user's finger (or stylus, etc.) on an application icon, in response to the touch event, the user interface of the application corresponding to the application icon is opened and displayed on the display 294 The user interface of the application.
在本申请实施例中,显示屏294可以是一个一体的柔性显示屏,也可以采用两个刚性屏以及位于两个刚性屏之间的一个柔性屏组成的拼接显示屏。In this embodiment of the present application, the display screen 294 may be an integrated flexible display screen, or a spliced display screen composed of two rigid screens and a flexible screen located between the two rigid screens.
摄像头293(前置摄像头、后置摄像头,前置摄像头和后置摄像头都可以包括一个或多个摄像头)用于捕获静态图像或视频。通常,摄像头293可以包括感光元件比如镜头组和图像传感器,其中,镜头组包括多个透镜(凸透镜或凹透镜),用于采集待拍摄物体反射的光信号,并将采集的光信号传递给图像传感器。图像传感器根据所述光信号生成待拍摄物体的原始图像。在本申请的实施例中,通过多个摄像头从多个方位角采集同一场景下的图像,从而可以结合多个图像和每个图像对应的方位角对场景进行识别,获得更全面的场景信息,提高基于图像的场景识别的准确度,解决了单摄像头识别场景视角范围、拍摄角度受限的问题,识别更精准。Cameras 293 (front camera, rear camera, both front and rear cameras may include one or more cameras) are used to capture still images or video. Generally, the camera 293 may include a photosensitive element such as a lens group and an image sensor, wherein the lens group includes a plurality of lenses (convex or concave) for collecting the light signal reflected by the object to be photographed, and transmitting the collected light signal to the image sensor . The image sensor generates an original image of the object to be photographed according to the light signal. In the embodiment of the present application, multiple cameras are used to collect images in the same scene from multiple azimuth angles, so that the scene can be identified in combination with the multiple images and the azimuth angle corresponding to each image, and more comprehensive scene information can be obtained, Improve the accuracy of image-based scene recognition, solve the problem of limited viewing angle range and shooting angle of single camera recognition scene, and make the recognition more accurate.
内部存储器221可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理器210通过运行存储在内部存储器221的指令,从而执行手机200的各种功能应用以及数据处理。内部存储器221可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,应用程序(比如相机应用,微信应用等)的代码等。存储数据区可存储手机200使用过程中所创建的数据(比如相机应用采集的图像、视频等)等。 Internal memory 221 may be used to store computer executable program code, which includes instructions. The processor 210 executes various functional applications and data processing of the mobile phone 200 by executing the instructions stored in the internal memory 221 . The internal memory 221 may include a storage program area and a storage data area. The storage program area may store operating system, code of application programs (such as camera application, WeChat application, etc.), and the like. The storage data area may store data created during the use of the mobile phone 200 (such as images and videos collected by the camera application) and the like.
内部存储器221还可以存储本申请实施例提供的场景识别方法对应的一个或多个计算机程序1310。该一个或多个计算机程序1304被存储在上述存储器221中并被配置为被该一个或多个处理器210执行,该一个或多个计算机程序1310包括指令,上述指令可以用于执行本申请实施例提供的场景识别方法中的各个步骤,该计算机程序1310可以包括:图像采集模块,用于通过多个摄像头从多个方位角采集同一场景下的图像;场景识别模块,用于根据所述图像和所述图像对应的方位角识别所述同一场景,得到场景识别结果;方位角采集模块,用于获取重力传感器在每个摄像头采集所述图像时对应的三维直角坐标系的坐标轴上的加速度,得到每个摄像头采集所述图像时的方向向量,根据所述方向向量和所述重力单位向量计算所述方位角;图像预处理模块,用于对所述图像进行预处理。The internal memory 221 may also store one or more computer programs 1310 corresponding to the scene recognition method provided by the embodiment of the present application. The one or more computer programs 1304 are stored in the aforementioned memory 221 and configured to be executed by the one or more processors 210, and the one or more computer programs 1310 include instructions that may be used to carry out the implementation of the present application For each step in the scene recognition method provided by the example, the computer program 1310 may include: an image acquisition module for acquiring images under the same scene from multiple azimuths through a plurality of cameras; a scene recognition module for according to the image Recognize the same scene with the azimuth angle corresponding to the image, and obtain the scene recognition result; the azimuth angle acquisition module is used to acquire the acceleration on the coordinate axis of the three-dimensional rectangular coordinate system corresponding to the gravity sensor when each camera collects the image , obtains the direction vector when each camera collects the image, and calculates the azimuth angle according to the direction vector and the gravity unit vector; an image preprocessing module is used to preprocess the image.
此外,内部存储器221可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。In addition, the internal memory 221 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
当然,本申请实施例提供的场景识别方法的代码还可以存储在外部存储器中。这种情况下,处理器210可以通过外部存储器接口220运行存储在外部存储器中的场景识别方法的代码。Certainly, the code of the scene recognition method provided by the embodiment of the present application may also be stored in an external memory. In this case, the processor 210 may execute the code of the scene recognition method stored in the external memory through the external memory interface 220 .
下面介绍传感器模块280的功能。The function of the sensor module 280 is described below.
陀螺仪传感器280A,可以用于确定手机200的运动姿态。在一些实施例中,可以通过陀螺仪传感器280A确定手机200围绕三个轴(即,x,y和z轴)的角速度。即陀螺仪传感器280A可以用于检测手机200当前的运动状态,比如抖动还是静止。The gyro sensor 280A can be used to determine the movement posture of the mobile phone 200 . In some embodiments, the angular velocity of cell phone 200 about three axes (ie, x, y, and z axes) may be determined by gyro sensor 280A. That is, the gyro sensor 280A can be used to detect the current motion state of the mobile phone 200, such as shaking or still.
当本申请实施例中的显示屏为可折叠屏时,陀螺仪传感器280A可用于检测作用于显示屏294上的折叠或者展开操作。陀螺仪传感器280A可以将检测到的折叠操作或者展开操作作为事件上报给处理器210,以确定显示屏294的折叠状态或展开状态。When the display screen in the embodiment of the present application is a foldable screen, the gyro sensor 280A can be used to detect a folding or unfolding operation acting on the display screen 294 . The gyroscope sensor 280A may report the detected folding operation or unfolding operation to the processor 210 as an event to determine the folding state or unfolding state of the display screen 294 .
加速度传感器280B可检测手机200在各个方向上(一般为三轴)加速度的大小。即陀螺仪传感器280A可以用于检测手机200当前的运动状态,比如抖动还是静止。当本申请实施例中的显示屏为可折叠屏时,加速度传感器280B可用于检测作用于显示屏294上的折叠或者展开操作。加速度传感器280B可以将检测到的折叠操作或者展开操作作为事件上报给处理器210,以确定显示屏294的折叠状态或展开状态。The acceleration sensor 280B can detect the magnitude of the acceleration of the mobile phone 200 in various directions (generally three axes). That is, the gyro sensor 280A can be used to detect the current motion state of the mobile phone 200, such as shaking or still. When the display screen in the embodiment of the present application is a foldable screen, the acceleration sensor 280B can be used to detect a folding or unfolding operation acting on the display screen 294 . The acceleration sensor 280B may report the detected folding operation or unfolding operation to the processor 210 as an event to determine the folding state or unfolding state of the display screen 294 .
在本申请的实施例中,终端设备通过加速度传感器280B可以获取在每个摄像头采集所述图像时对应的三维直角坐标系的坐标轴上的加速度,得到每个摄像头采集所述图像时的方向向量,根据所述方向向量和所述重力单位向量计算所述方位角。In the embodiment of the present application, the terminal device can obtain the acceleration on the coordinate axis of the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image through the acceleration sensor 280B, and obtain the direction vector when each camera collects the image , and calculate the azimuth angle according to the direction vector and the gravity unit vector.
接近光传感器280G可以包括例如发光二极管(LED)和光检测器,例如光电二极管。发光二极管可以是红外发光二极管。手机通过发光二极管向外发射红外光。手机使用光电二极管检测来自附近物体的红外反射光。当检测到充分的反射光时,可以确定手机附近有物体。当检测到不充分的反射光时,手机可以确定手机附近没有物体。当本申请实施例中的显示屏为可折叠屏时,接近光传感器280G可以设置在可折叠的显示屏294的第一屏上,接近光传感器280G可根据红外信号的光程差来检测第一屏与第二屏的折叠角度或者展开角度的大小。 Proximity light sensor 280G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes. The light emitting diodes may be infrared light emitting diodes. The mobile phone emits infrared light outward through light-emitting diodes. Phones use photodiodes to detect reflected infrared light from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object near the phone. When insufficient reflected light is detected, the phone can determine that there are no objects near the phone. When the display screen in the embodiment of the present application is a foldable screen, the proximity light sensor 280G can be arranged on the first screen of the foldable display screen 294, and the proximity light sensor 280G can detect the first screen according to the optical path difference of the infrared signal. The size of the folding or unfolding angle between the screen and the second screen.
陀螺仪传感器280A(或加速度传感器280B)可以将检测到的运动状态信息(比如角速度)发送给处理器210。处理器210基于运动状态信息确定当前是手持状态还是脚架状态(比如,角速度不为0时,说明手机200处于手持状态)。The gyroscope sensor 280A (or the acceleration sensor 280B) may send the detected motion state information (such as angular velocity) to the processor 210 . The processor 210 determines, based on the motion state information, whether the current state is the hand-held state or the tripod state (for example, when the angular velocity is not 0, it means that the mobile phone 200 is in the hand-held state).
指纹传感器280H用于采集指纹。手机200可以利用采集的指纹特性实现指纹解锁,访问应用锁,指纹拍照,指纹接听来电等。The fingerprint sensor 280H is used to collect fingerprints. The mobile phone 200 can use the collected fingerprint characteristics to realize fingerprint unlocking, accessing application locks, taking photos with fingerprints, answering incoming calls with fingerprints, and the like.
触摸传感器280K,也称“触控面板”。触摸传感器280K可以设置于显示屏294,由触摸传感器280K与显示屏294组成触摸屏,也称“触控屏”。触摸传感器280K用于检测作用于其上或附近的触摸操作。触摸传感器可以将检测到的触摸操作传递给应用处理器,以确定触摸事件类型。可以通过显示屏294提供与触摸操作相关的视觉输出。在另一些实施例中,触摸传感器280K也可以设置于手机200的表面,与显示屏294所处的位置不同。 Touch sensor 280K, also called "touch panel". The touch sensor 280K may be disposed on the display screen 294, and the touch sensor 280K and the display screen 294 form a touch screen, also called a "touch screen". The touch sensor 280K is used to detect a touch operation on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to touch operations may be provided through display screen 294 . In other embodiments, the touch sensor 280K may also be disposed on the surface of the mobile phone 200 , which is different from the location where the display screen 294 is located.
示例性的,手机200的显示屏294显示主界面,主界面中包括多个应用(比如相机应用、微信应用等)的图标。用户通过触摸传感器280K点击主界面中相机应用的图标,触发处理器210启动相机应用,打开摄像头293。显示屏294显示相机应用的界面,例如取景界面。显示屏294还可以用于显示场景识别结果。Exemplarily, the display screen 294 of the mobile phone 200 displays a main interface, and the main interface includes icons of multiple applications (such as a camera application, a WeChat application, etc.). The user clicks the icon of the camera application in the main interface through the touch sensor 280K, which triggers the processor 210 to start the camera application and turn on the camera 293 . Display screen 294 displays an interface of a camera application, such as a viewfinder interface. Display screen 294 may also be used to display scene recognition results.
手机200的无线通信功能可以通过天线1,天线2,移动通信模块251,无线通信模块252,调制解调处理器以及基带处理器等实现。The wireless communication function of the mobile phone 200 can be realized by the antenna 1, the antenna 2, the mobile communication module 251, the wireless communication module 252, the modulation and demodulation processor, the baseband processor, and the like.
天线1和天线2用于发射和接收电磁波信号。手机200中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in handset 200 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization. For example, the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 251 can provide wireless communication solutions, including 2G/3G/4G/5G, applied on the mobile phone 200. The mobile communication module 251 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), and the like. The mobile communication module 251 can receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 251 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves radiated through the antenna 1. In some embodiments, at least some of the functional modules of the mobile communication module 251 may be disposed in the processor 210. In some embodiments, at least some of the functional modules of the mobile communication module 251 may be disposed in the same device as at least some of the modules of the processor 210.
The modem processor may include a modulator and a demodulator. The modulator is used to modulate the low-frequency baseband signal to be sent into a medium- or high-frequency signal. The demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor. The application processor outputs sound signals through audio devices (not limited to the speaker 270A, the receiver 270B, and the like), or displays images or videos through the display screen 294. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be independent of the processor 210 and disposed in the same device as the mobile communication module 251 or other functional modules.
The wireless communication module 252 can provide wireless communication solutions applied on the mobile phone 200, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR). The wireless communication module 252 may be one or more devices integrating at least one communication processing module. The wireless communication module 252 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 210. The wireless communication module 252 can also receive signals to be sent from the processor 210, perform frequency modulation and amplification on them, and convert them into electromagnetic waves radiated through the antenna 2. In the embodiments of the present application, the wireless communication module 252 is configured to transmit data to and from other terminal devices under the control of the processor 210.
In addition, the mobile phone 200 can implement audio functions, such as music playback and recording, through the audio module 270, the speaker 270A, the receiver 270B, the microphone 270C, the earphone interface 270D, and the application processor. The mobile phone 200 can receive input from the keys 290 and generate key signal input related to user settings and function control of the mobile phone 200. The mobile phone 200 can use the motor 291 to generate vibration alerts (for example, for incoming calls). The indicator 292 in the mobile phone 200 may be an indicator light, which may be used to indicate the charging state and changes in battery level, and may also be used to indicate messages, missed calls, notifications, and the like. The SIM card interface 295 in the mobile phone 200 is used to connect a SIM card. A SIM card can be brought into contact with or separated from the mobile phone 200 by inserting it into or pulling it out of the SIM card interface 295.
It should be understood that, in practical applications, the mobile phone 200 may include more or fewer components than those shown in FIG. 10, which is not limited in the embodiments of the present application. The illustrated mobile phone 200 is merely an example; the mobile phone 200 may have more or fewer components than those shown, may combine two or more components, or may have a different component configuration. The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application-specific integrated circuits.
The software system of the terminal device may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of the present application take the Android system with a layered architecture as an example to describe the software structure of the terminal device.
An embodiment of the present application provides a scene recognition apparatus, including a processor and a memory for storing instructions executable by the processor, wherein the processor is configured to implement the above method when executing the instructions.
Embodiments of the present application provide a non-volatile computer-readable storage medium on which computer program instructions are stored; when the computer program instructions are executed by a processor, the above method is implemented.
Embodiments of the present application provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above method.
A computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove on which instructions are stored, and any suitable combination of the foregoing.
The computer-readable program instructions or code described herein may be downloaded from a computer-readable storage medium to each computing/processing device, or downloaded to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions used to perform the operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), are personalized by utilizing the state information of the computer-readable program instructions, and these electronic circuits can execute the computer-readable program instructions to implement various aspects of the present application.
Aspects of the present application are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present application. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that when the instructions are executed by the processor of the computer or other programmable data processing apparatus, an apparatus for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams is produced. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus, and/or other devices to operate in a specific manner, so that the computer-readable medium storing the instructions includes an article of manufacture that includes instructions for implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices, so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other devices to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other devices implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of apparatuses, systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending on the functions involved.
It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by hardware (for example, a circuit or an ASIC (application-specific integrated circuit)) that performs the corresponding function or act, or can be implemented by a combination of hardware and software, such as firmware.
Although the present invention has been described herein in conjunction with the embodiments, those skilled in the art can, in practicing the claimed invention, understand and implement other variations of the disclosed embodiments by studying the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Various embodiments of the present application have been described above. The foregoing descriptions are exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application or improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (21)

1. A scene recognition method, characterized in that the method comprises:
    a terminal device collecting images of a same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle at which each camera collects the image is the azimuth angle corresponding to the image, and the azimuth angle is the angle between the direction vector of each camera when collecting the image and the gravity unit vector;
    the terminal device recognizing the same scene according to the images and the azimuth angles corresponding to the images, to obtain a scene recognition result.
2. The method according to claim 1, wherein the terminal device recognizing the same scene according to the images and the azimuth angles corresponding to the images to obtain a scene recognition result comprises:
    the terminal device extracting, from the azimuth angle corresponding to the image, an azimuth angle feature corresponding to the image;
    using a scene recognition model to recognize the same scene based on the images and the azimuth angle features corresponding to the images, to obtain the scene recognition result, wherein the scene recognition model is a neural network model.
3. The method according to claim 2, wherein the scene recognition model comprises multiple pairs of a first feature extraction layer and a first-layer model,
    each pair of a first feature extraction layer and a first-layer model is used to process an image of one azimuth angle and the azimuth angle corresponding to the image of the one azimuth angle, to obtain a first recognition result;
    wherein the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result, the first feature extraction layer is used to extract features of the image of the one azimuth angle to obtain a feature vector, and the first-layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle;
    the scene recognition model further comprises a second-layer model, and the second-layer model is used to obtain the scene recognition result according to the first recognition results and the azimuth angles corresponding to the first recognition results.
4. The method according to claim 3, wherein
    the first feature extraction layer is used to extract features of the image of the one azimuth angle, to obtain multiple feature vectors;
    the first-layer model is used to calculate a first weight corresponding to each of the multiple feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle;
    the first-layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
5. The method according to claim 3, wherein
    the second-layer model is used to calculate a second weight of the first recognition result according to the azimuth angle corresponding to the first recognition result;
    the second-layer model is used to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
6. The method according to claim 3, wherein
    the second-layer model presets a third weight corresponding to each first recognition result,
    the second-layer model is used to obtain the scene recognition result according to the first recognition results and the third weights corresponding to the first recognition results.
7. The method according to claim 3, wherein
    the second-layer model is used to determine a fourth weight corresponding to each first recognition result according to the azimuth angle and a preset rule, wherein the preset rule is a set of weight groups configured according to azimuth angles, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first recognition result;
    the second-layer model is used to obtain the scene recognition result according to the first recognition results and the fourth weights corresponding to the first recognition results.
8. The method according to claim 1, wherein the method further comprises:
    the terminal device obtaining the accelerations reported by a gravity sensor on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image, to obtain the direction vector of each camera when collecting the image;
    wherein the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image takes the camera as the origin, the z direction is the direction along which the camera shoots, x and y are directions perpendicular to the z direction, and the plane in which x and y lie is perpendicular to the z direction;
    calculating the azimuth angle according to the direction vector and the gravity unit vector.
9. The method according to claim 2, wherein before using the scene recognition model to recognize the same scene based on the images and the azimuth angle features corresponding to the images, the method further comprises:
    the terminal device preprocessing the images;
    wherein the preprocessing includes one or a combination of more than one of the following: converting the image format, converting the image channels, unifying the image size, and image normalization; converting the image format refers to converting a color image into a black-and-white image, converting the image channels refers to converting an image to the red, green, and blue (RGB) channels, unifying the image size refers to adjusting multiple images to the same length and the same width, and image normalization refers to normalizing the pixel values of an image.
10. A scene recognition apparatus, characterized in that the apparatus comprises:
    an image acquisition module, configured to collect images of a same scene from multiple azimuth angles through multiple cameras, wherein the azimuth angle at which each camera collects the image is the azimuth angle corresponding to the image, and the azimuth angle is the angle between the direction vector of each camera when collecting the image and the gravity unit vector;
    a scene recognition module, configured to recognize the same scene according to the images and the azimuth angles corresponding to the images, to obtain a scene recognition result.
11. The apparatus according to claim 10, wherein the scene recognition module comprises:
    an azimuth angle feature extraction module, configured to extract, from the azimuth angle corresponding to the image, an azimuth angle feature corresponding to the image;
    a scene recognition model, configured to recognize the same scene based on the images and the azimuth angle features corresponding to the images, to obtain the scene recognition result, wherein the scene recognition model is a neural network model.
12. The apparatus according to claim 11, wherein the scene recognition model comprises multiple pairs of a first feature extraction layer and a first-layer model,
    each pair of a first feature extraction layer and a first-layer model is used to process an image of one azimuth angle and the azimuth angle corresponding to the image of the one azimuth angle, to obtain a first recognition result;
    wherein the azimuth angle corresponding to the image of the one azimuth angle is the azimuth angle corresponding to the first recognition result, the first feature extraction layer is used to extract features of the image of the one azimuth angle to obtain a feature vector, and the first-layer model is used to obtain the first recognition result according to the feature vector and the azimuth angle corresponding to the image of the one azimuth angle;
    the scene recognition model further comprises a second-layer model, and the second-layer model is used to obtain the scene recognition result according to the first recognition results and the azimuth angles corresponding to the first recognition results.
13. The apparatus according to claim 12, wherein
    the first feature extraction layer is used to extract features of the image of the one azimuth angle, to obtain multiple feature vectors;
    the first-layer model is used to calculate a first weight corresponding to each of the multiple feature vectors according to the azimuth angle corresponding to the image of the one azimuth angle;
    the first-layer model is used to obtain the first recognition result according to each feature vector and the first weight corresponding to each feature vector.
14. The apparatus according to claim 12, wherein
    the second-layer model is used to calculate a second weight of the first recognition result according to the azimuth angle corresponding to the first recognition result;
    the second-layer model is used to obtain the scene recognition result according to the first recognition result and the second weight of the first recognition result.
15. The apparatus according to claim 12, wherein
    the second-layer model presets a third weight corresponding to each first recognition result,
    the second-layer model is used to obtain the scene recognition result according to the first recognition results and the third weights corresponding to the first recognition results.
16. The apparatus according to claim 12, wherein
    the second-layer model is used to determine a fourth weight corresponding to each first recognition result according to the azimuth angle and a preset rule;
    wherein the preset rule is a set of weight groups configured according to azimuth angles, different azimuth angles correspond to different weight groups, and each weight group includes a fourth weight corresponding to each first recognition result;
    the second-layer model is used to obtain the scene recognition result according to the first recognition results and the fourth weights corresponding to the first recognition results.
17. The apparatus according to claim 10, wherein the apparatus further comprises:
    an azimuth angle acquisition module, configured to obtain the accelerations reported by a gravity sensor on the coordinate axes of the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image, to obtain the direction vector of each camera when collecting the image;
    wherein the three-dimensional rectangular coordinate system corresponding to each camera when collecting the image takes the camera as the origin, the z direction is the direction along which the camera shoots, x and y are directions perpendicular to the z direction, and the plane in which x and y lie is perpendicular to the z direction; and the module is configured to calculate the azimuth angle according to the direction vector and the gravity unit vector.
18. The apparatus according to claim 11, wherein the apparatus further comprises:
    an image preprocessing module, configured to preprocess the images;
    wherein the preprocessing includes one or a combination of more than one of the following: converting the image format, converting the image channels, unifying the image size, and image normalization; converting the image format refers to converting a color image into a black-and-white image, converting the image channels refers to converting an image to the red, green, and blue (RGB) channels, unifying the image size refers to adjusting multiple images to the same length and the same width, and image normalization refers to normalizing the pixel values of an image.
19. A computer program product, comprising computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor in the electronic device executes the method according to any one of claims 1 to 9.
20. A non-volatile computer-readable storage medium on which computer program instructions are stored, characterized in that, when the computer program instructions are executed by a processor, the method according to any one of claims 1 to 9 is implemented.
21. A terminal device, characterized in that it comprises:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to implement the method according to any one of claims 1 to 9 when executing the instructions.
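For illustration only (not part of the claims): the azimuth angle of claims 1, 8, and 17 is the angle between a camera's direction vector when capturing an image and the gravity unit vector. The minimal sketch below assumes the gravity sensor reading is available as a 3-axis vector expressed in the camera coordinate system of claim 8 (z along the shooting direction); the function name, the default shooting direction, and the output in degrees are illustrative choices, not taken from the disclosure.

```python
import numpy as np

def azimuth_angle(gravity_xyz, shoot_dir=(0.0, 0.0, 1.0)):
    """Angle (degrees) between the camera's shooting direction and gravity.

    gravity_xyz: accelerations reported by the gravity sensor on the x/y/z
    axes of the camera coordinate system (z = shooting direction).
    shoot_dir:   the shooting direction expressed in the same frame.
    """
    g = np.asarray(gravity_xyz, dtype=float)
    d = np.asarray(shoot_dir, dtype=float)
    g_unit = g / np.linalg.norm(g)          # gravity unit vector
    d_unit = d / np.linalg.norm(d)          # camera direction vector, normalized
    cos_theta = np.clip(np.dot(d_unit, g_unit), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

# Example: a camera whose gravity reading points mostly along -z (camera facing
# roughly upward) yields an azimuth angle close to 180 degrees.
print(azimuth_angle((0.05, -0.02, -9.78)))
```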
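Also for illustration: the following sketch mirrors the two-layer structure of claims 3 and 5 (and the corresponding apparatus claims 12 and 14), with one (first feature extraction layer, first-layer model) pair per azimuth and a second-layer model that weights each per-view first recognition result by a weight derived from its azimuth angle. The convolutional backbone, the azimuth-to-weight mapping, and all layer sizes are placeholder assumptions; claims 4, 6, and 7 describe alternative weighting schemes that are not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerViewBranch(nn.Module):
    """One (first feature extraction layer, first-layer model) pair: extracts
    features from a single-azimuth image and produces a per-view recognition
    result conditioned on that azimuth angle."""
    def __init__(self, num_classes, feat_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(            # first feature extraction layer
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.head = nn.Linear(feat_dim + 1, num_classes)   # first-layer model

    def forward(self, image, azimuth):
        feat = self.backbone(image)                         # feature vector
        x = torch.cat([feat, azimuth.unsqueeze(-1)], dim=-1)  # condition on azimuth
        return F.softmax(self.head(x), dim=-1)              # first recognition result

class TwoLayerSceneModel(nn.Module):
    """Second-layer model: weights each per-view result by a weight derived
    from its azimuth angle and fuses them into the scene recognition result."""
    def __init__(self, num_views, num_classes):
        super().__init__()
        self.branches = nn.ModuleList(PerViewBranch(num_classes) for _ in range(num_views))
        self.weight_net = nn.Linear(1, 1)        # maps azimuth -> unnormalized weight

    def forward(self, images, azimuths):
        # images: list of (B, 3, H, W) tensors; azimuths: (B, num_views) in radians
        per_view = [b(img, azimuths[:, i]) for i, (b, img) in enumerate(zip(self.branches, images))]
        results = torch.stack(per_view, dim=1)               # (B, V, C)
        w = torch.softmax(self.weight_net(azimuths.unsqueeze(-1)), dim=1)  # second weights
        return (w * results).sum(dim=1)                       # fused scene result

# toy usage with two cameras and five scene classes
model = TwoLayerSceneModel(num_views=2, num_classes=5)
imgs = [torch.randn(1, 3, 64, 64) for _ in range(2)]
az = torch.tensor([[0.3, 1.2]])
print(model(imgs, az).shape)   # torch.Size([1, 5])
```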
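Claims 7 and 16 replace the learned second weights with a preset rule that maps azimuth angles to weight groups. One toy reading of that rule for a two-camera device is sketched below; the azimuth ranges and the weight values are invented for illustration and are not taken from the patent.

```python
# Hypothetical weight groups keyed by azimuth-angle range (degrees).
PRESET_WEIGHT_GROUPS = {
    (0.0, 60.0):    [0.7, 0.3],   # e.g. camera pointing mostly upward: trust view 0 more
    (60.0, 120.0):  [0.5, 0.5],
    (120.0, 180.0): [0.3, 0.7],
}

def fourth_weights(azimuth_deg):
    """Select the weight group (one fourth weight per first recognition result)."""
    for (lo, hi), weights in PRESET_WEIGHT_GROUPS.items():
        if lo <= azimuth_deg < hi or (hi == 180.0 and azimuth_deg == 180.0):
            return weights
    raise ValueError("azimuth angle out of range")

def fuse_first_results(first_results, azimuth_deg):
    """Weight each per-view first recognition result and sum into the scene result."""
    w = fourth_weights(azimuth_deg)
    num_classes = len(first_results[0])
    return [sum(w[v] * first_results[v][c] for v in range(len(first_results)))
            for c in range(num_classes)]

# two per-view class-score vectors fused at an azimuth angle of 45 degrees
print(fuse_first_results([[0.8, 0.2], [0.4, 0.6]], 45.0))   # -> [0.68, 0.32]
```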
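Finally, the preprocessing of claims 9 and 18 combines format conversion, channel conversion, size unification, and pixel normalization. A minimal sketch using Pillow and NumPy follows; the 224x224 target size and normalization to the [0, 1] range are assumptions chosen for illustration.

```python
import numpy as np
from PIL import Image

def preprocess(paths, size=(224, 224)):
    """Convert channels to RGB, unify the size, and normalize pixel values.
    (Conversion to a black-and-white image would use .convert("L") instead.)"""
    batch = []
    for p in paths:
        img = Image.open(p).convert("RGB")               # convert image channels to R/G/B
        img = img.resize(size)                            # unify length and width
        arr = np.asarray(img, dtype=np.float32) / 255.0   # normalize pixel values
        batch.append(arr)
    return np.stack(batch)                                # (N, H, W, 3)
```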
PCT/CN2021/140833 2021-02-25 2021-12-23 Environment identification method and apparatus WO2022179281A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110215000.3A CN115049909A (en) 2021-02-25 2021-02-25 Scene recognition method and device
CN202110215000.3 2021-02-25

Publications (1)

Publication Number Publication Date
WO2022179281A1 true WO2022179281A1 (en) 2022-09-01

Family

ID=83048679

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/140833 WO2022179281A1 (en) 2021-02-25 2021-12-23 Environment identification method and apparatus

Country Status (2)

Country Link
CN (1) CN115049909A (en)
WO (1) WO2022179281A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105339841A (en) * 2013-12-06 2016-02-17 华为终端有限公司 Photographing method for dual-camera device and dual-camera device
US10504008B1 (en) * 2016-07-18 2019-12-10 Occipital, Inc. System and method for relocalization and scene recognition
CN109117693A (en) * 2017-06-22 2019-01-01 深圳华智融科技股份有限公司 A kind of method and terminal of the scanning recognition found a view based on wide-angle
US20200265554A1 (en) * 2019-02-18 2020-08-20 Beijing Xiaomi Mobile Software Co., Ltd. Image capturing method and apparatus, and terminal
CN109903393A (en) * 2019-02-22 2019-06-18 清华大学 New Century Planned Textbook Scene Composition methods and device based on deep learning

Also Published As

Publication number Publication date
CN115049909A (en) 2022-09-13

Similar Documents

Publication Publication Date Title
CN108960209B (en) Identity recognition method, identity recognition device and computer readable storage medium
CN109034102B (en) Face living body detection method, device, equipment and storage medium
US20220076000A1 (en) Image Processing Method And Apparatus
US9811910B1 (en) Cloud-based image improvement
WO2020048308A1 (en) Multimedia resource classification method and apparatus, computer device, and storage medium
CN110647865A (en) Face gesture recognition method, device, equipment and storage medium
WO2019033747A1 (en) Method for determining target intelligently followed by unmanned aerial vehicle, unmanned aerial vehicle and remote controller
CN111368811B (en) Living body detection method, living body detection device, living body detection equipment and storage medium
CN110059652B (en) Face image processing method, device and storage medium
US20220262035A1 (en) Method, apparatus, and system for determining pose
JP2021503659A (en) Biodetection methods, devices and systems, electronic devices and storage media
CN110062171B (en) Shooting method and terminal
CN110650379A (en) Video abstract generation method and device, electronic equipment and storage medium
US20230005277A1 (en) Pose determining method and related device
CN113066048A (en) Segmentation map confidence determination method and device
WO2021218695A1 (en) Monocular camera-based liveness detection method, device, and readable storage medium
CN112818979B (en) Text recognition method, device, equipment and storage medium
CN115880213A (en) Display abnormity detection method, device and system
WO2022179281A1 (en) Environment identification method and apparatus
CN110163192B (en) Character recognition method, device and readable medium
CN113468929A (en) Motion state identification method and device, electronic equipment and storage medium
CN111341307A (en) Voice recognition method and device, electronic equipment and storage medium
WO2022105793A1 (en) Image processing method and device
WO2022161011A1 (en) Method for generating image and electronic device
CN111538009A (en) Radar point marking method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21927705; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21927705; Country of ref document: EP; Kind code of ref document: A1)