CN116866103A - Indoor household appliance interaction intelligent judgment method based on nerf - Google Patents


Info

Publication number
CN116866103A
CN116866103A (application CN202310941429.XA)
Authority
CN
China
Prior art keywords
nerf
model
household appliance
appliance
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202310941429.XA
Other languages
Chinese (zh)
Inventor
高辰宇
姚旭泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202310941429.XA priority Critical patent/CN116866103A/en
Publication of CN116866103A publication Critical patent/CN116866103A/en
Withdrawn legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00Data switching networks
    • H04L12/28Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/2803Home automation networks
    • H04L12/2807Exchanging configuration information on appliance services in a home automation network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Graphics (AREA)
  • Human Computer Interaction (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a NeRF-based method for intelligently judging interaction with indoor household appliances, which specifically comprises the following steps: S1, scanning with a mobile phone: in this step, the user only needs to shoot a video covering all the IoT devices in the room with a mobile phone. The invention relates to the technical field of intelligent interaction systems. According to this NeRF-based intelligent judging method for indoor household appliance interaction, indoor scanning is performed with a mobile phone, so that the user can acquire information about the indoor environment directly with the phone without purchasing additional hardware, and the indoor environment is reconstructed using NeRF technology. Existing smart home systems are generally based on a preset room layout and device positions; in this scheme, the indoor environment is reconstructed automatically through mobile phone scanning and NeRF technology, and control is based on the NeRF reconstruction information and posture perception, which certainly improves the usability and flexibility of the system, and a digital twin model is used for smart home control.

Description

Indoor household appliance interaction intelligent judgment method based on nerf
Technical Field
The invention relates to the technical field of intelligent interaction systems, in particular to a NeRF-based method for intelligently judging interaction with indoor household appliances.
Background
NeRF is a technique for generating new, synthesized views that is suitable for complex scenes such as indoor and outdoor environments. Its core idea is to learn a neural network that takes a point in 3D space and a viewing direction as input and outputs the color and opacity of that point in that direction. Intelligent voice interaction is a new generation of interaction mode based on voice input: the user obtains feedback by speaking, the typical application scenario being a voice assistant. Intelligent voice interaction applications have developed rapidly since the iPhone 4S introduced Siri, and typical Chinese intelligent voice interaction applications, such as the Worm Hole voice assistant and iFlytek's Yudian, are being accepted by more and more users.
At present, in order to monitor the indoor environment, the user must additionally purchase other equipment, and must manually enter the positions of devices or configure the room layout. The whole operation therefore has certain limitations, and the prior art cannot reflect the indoor environment in real time, so the use efficiency of the indoor smart devices is low.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a NeRF-based intelligent judging method for indoor household appliance interaction, which solves the problem that the use efficiency of indoor smart devices is low because the whole operation has certain limitations.
In order to achieve the above purpose, the invention is realized by the following technical scheme: a NeRF-based intelligent judging method for indoor household appliance interaction, specifically comprising the following steps:
S1, scanning with a mobile phone: in this step, the user only needs to shoot a video covering all IoT devices in the room with a mobile phone;
S2, deducing camera extrinsic parameters:
(1) Feature extraction and matching: once features have been extracted from all the images, COLMAP begins feature matching. Feature matching means finding the same feature points between two or more images; feature descriptors (vectors representing the intensity pattern of the pixels around a feature) play a key role at this stage. Features whose descriptors are most similar (i.e. closest, for example under the Euclidean distance) can be found across different images and are considered matched feature points;
(2) Sparse reconstruction: with the matched feature points, a sparse three-dimensional model can be created by computing the positions of the feature points in space together with the positions and poses of the cameras, by solving an optimization problem called "bundle adjustment", which is essentially a process that minimizes reprojection errors. A reprojection error is the distance between a point observed in the image and the corresponding 3D point projected through the camera model. The algorithm tries to find the set of camera parameters (including the position and pose of each camera) and 3D point positions that minimize all reprojection errors; this optimization problem is usually solved by iterative linearization methods such as the Gauss-Newton method or the Levenberg-Marquardt method. In each iteration, a parameter update is sought that further reduces the reprojection error; when the optimization converges, the positions and poses of the cameras and the three-dimensional coordinates of the matched feature points are obtained;
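The descriptor-matching idea above can be sketched as a mutual nearest-neighbour search under Euclidean distance. This is a toy illustration with made-up 4-D descriptors; real pipelines such as COLMAP use 128-D SIFT descriptors and more robust matching strategies:

```python
import math

def euclidean(d1, d2):
    # Euclidean distance between two feature descriptors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def match_features(desc_a, desc_b):
    """For each descriptor in image A, find the closest descriptor in
    image B, and keep the pair only if the relation also holds in the
    other direction (mutual nearest neighbours)."""
    matches = []
    for i, da in enumerate(desc_a):
        j = min(range(len(desc_b)), key=lambda j: euclidean(da, desc_b[j]))
        # mutual check: A's nearest neighbour of B's descriptor j must be i
        i_back = min(range(len(desc_a)), key=lambda k: euclidean(desc_a[k], desc_b[j]))
        if i_back == i:
            matches.append((i, j))
    return matches

# Toy 4-D descriptors for two images of the same scene
img_a = [[1.0, 0.0, 0.0, 0.1], [0.0, 1.0, 0.2, 0.0]]
img_b = [[0.0, 0.9, 0.2, 0.1], [0.9, 0.1, 0.0, 0.0]]
print(match_features(img_a, img_b))  # [(0, 1), (1, 0)]
```

The mutual-nearest-neighbour check discards ambiguous one-way matches, which is one common way to reduce false correspondences before bundle adjustment.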
S3, reconstructing the indoor environment with NeRF from the images to obtain a NeRF model;
S4, from the NeRF model, an image of the room from any position and any viewing angle can be obtained, and it is used as follows:
(1) Position of the fixed camera in the room
This part of the method is the same as S2: using the existing image sequence supplemented by NeRF renderings, the position parameters of the fixed camera can easily be obtained through feature point matching and sparse reconstruction;
(2) The positions of the home appliances around the room, judged comprehensively by combining two methods:
a. 3D object identification
i. Generating a 3D model: first, using the NeRF model, a 3D model of the room can be generated from multiple perspectives, including the shape, color, and location of objects. This model can cover every corner of the room, so the specific location of each appliance can be obtained from it;
ii. Object recognition: next, the home appliances must be identified in the 3D model. This can be achieved by training a deep learning model that recognizes various appliances in 3D data; for example, a 3D convolutional neural network can be trained to process 3D data and recognize the shapes of appliances;
iii. Household appliance positioning: once an appliance has been identified in the 3D model, its location can be determined; in particular, the center of gravity of the appliance model can be used as its location. This location in 3D space can then guide subsequent appliance control;
b. 2D depth perception judgment:
First, target recognition is performed on the video sequence to determine the positions of the home appliances; then, for the position of each appliance, NeRF is used to render several views of the appliance from different positions, and the spatial information of the appliance is obtained according to the depth perception principle of a binocular camera:
i. Target recognition: first, the various home appliances in the image must be identified. This can be achieved with deep learning object detection algorithms such as YOLO, SSD, or Faster R-CNN, which recognize predefined object categories in an image and give their positions in the image (typically represented by a bounding box);
ii. Spatial localization: next, the position of each appliance in three-dimensional space must be determined. To achieve this, the capability of the NeRF model can be used: for each identified appliance, the depth information of each pixel in its bounding box can be taken as input, and the corresponding three-dimensional coordinates obtained through the NeRF model, giving a rough position of each appliance in three-dimensional space;
iii. Perspective rendering and depth perception: to acquire accurate three-dimensional information about an appliance, images of it can be rendered from different viewing angles using the NeRF model. The principle is similar to a binocular camera or multi-view stereo vision: the depth of an object can be computed by comparing images of the same object from different viewing angles. Since the NeRF model can generate images from arbitrary viewing angles, it can provide enough views for depth perception;
iv. Three-dimensional modeling: after the accurate three-dimensional position of an appliance is obtained, the NeRF model can further be used to build a three-dimensional model of the appliance; specifically, each part of the appliance can be taken as input, and its shape and color in three-dimensional space obtained through the NeRF model. The resulting three-dimensional model of the appliance is very useful for subsequent smart home control;
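The binocular depth principle invoked above reduces, for two parallel views separated by a baseline, to Z = f * B / d, where d is the pixel disparity. A minimal sketch with hypothetical numbers (the focal length, baseline, and pixel coordinates are illustrative, not from the patent):

```python
def depth_from_disparity(f_px, baseline_m, x_left_px, x_right_px):
    """Binocular depth: Z = f * B / d, where d is the horizontal disparity
    (in pixels) of the same point between two parallel views."""
    d = x_left_px - x_right_px
    if d <= 0:
        raise ValueError("point must shift left-to-right between the views")
    return f_px * baseline_m / d

# Two NeRF-rendered views 0.1 m apart, focal length 800 px; the appliance
# corner appears at x=420 in the left view and x=400 in the right view.
z = depth_from_disparity(800.0, 0.1, 420.0, 400.0)
print(z)  # 4.0 metres
```

With NeRF, the "second camera" is free: any number of synthetic viewpoints can be rendered, so disparities can be averaged over many view pairs to stabilize the depth estimate.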
S5, modeling the smart appliances in the room to obtain a digital twin;
S6, to ensure security and privacy, the subsequent reasoning and control parts are deployed at the edge;
S7, the camera judges and infers the posture of the person, and then makes a judgment according to the position of the smart appliance:
a. Human body pose estimation: first, human body pose estimation must be performed on the video stream acquired from the camera. This is an important problem in computer vision: the aim is to find the key points of the human body (such as the head, shoulders, elbows, knees, and feet) in an image or video, and then determine the spatial relationships among these key points so as to represent the pose of the body. Many deep learning methods can be used for human pose estimation, such as OpenPose and AlphaPose; these methods first process the image with a convolutional neural network (CNN) and then find the key points of the body through regression or classification tasks. Such algorithms can handle single-person or multi-person pose estimation, in either 2D or 3D;
b. Pose judgment and reasoning: once the pose information of the person is obtained, further judgment and reasoning can be performed. For example, the behavior of the person (walking, sitting down, raising a hand, etc.) can be judged by analyzing the positions and movements of the key points, and the person's attention can be inferred by observing the direction of the face or the gaze point of the eyes, which is important for understanding the person's intention;
c. Judging according to the position of the appliance: knowing the person's posture and behavior, the locations of the smart appliances must also be considered. For example, if the person is looking at the television, they may want to turn it on; if the person is in the kitchen looking at the refrigerator, they may want to open it. In this process, the location of each appliance must be known; it can be obtained from the earlier NeRF modeling. Then, based on the person's position and posture and the locations of the appliances, the system can determine which appliance the person may want to interact with.
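One simple way to combine the inferred gaze with the appliance positions, as described above, is to pick the appliance whose direction from the person makes the smallest angle with the gaze vector. This is only a sketch of the idea; the coordinates, angle threshold, and appliance names are hypothetical, and the patent does not specify the decision rule:

```python
import math

def pick_appliance(person_pos, gaze_dir, appliances, max_angle_deg=25.0):
    """Return the appliance the person is most likely looking at: the one
    whose direction from the person makes the smallest angle with the gaze
    vector, provided that angle is within max_angle_deg."""
    def norm(v):
        n = math.sqrt(sum(c * c for c in v))
        return [c / n for c in v]
    g = norm(gaze_dir)
    best_name, best_angle = None, max_angle_deg
    for name, pos in appliances.items():
        to_obj = norm([p - q for p, q in zip(pos, person_pos)])
        cosang = max(-1.0, min(1.0, sum(a * b for a, b in zip(g, to_obj))))
        angle = math.degrees(math.acos(cosang))
        if angle <= best_angle:
            best_name, best_angle = name, angle
    return best_name

# Hypothetical digital-twin coordinates (metres)
appliances = {"tv": [3.0, 0.0, 1.0], "fridge": [0.0, 3.0, 1.0]}
person, gaze = [0.0, 0.0, 1.5], [1.0, 0.05, -0.1]
print(pick_appliance(person, gaze, appliances))  # "tv"
```

In practice the threshold would be tuned, and the person's behavior (walking, reaching, etc.) would gate whether any interaction is triggered at all.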
Preferably, in said S2 (1), feature extraction and matching is the first step of the workflow and also a basic step in computer vision. The main goal of this stage is to find and describe special points in the image that are representative of the object or scene, such as corners, edges, or other salient local features. These points have distinctive image properties so that they can be identified and matched between multiple images, and they are often used to recognize images and estimate camera motion. The feature extractor, typically the SIFT algorithm, can find and describe these features at various scales and is invariant to rotation, scale, and illumination.
Preferably, in the step S3, NeRF is a technique for generating new, synthesized views that is suitable for complex scenes such as indoor and outdoor environments. Its core idea is to learn a neural network that takes a point in 3D space and a viewing direction as input and outputs the color and opacity of that point in that direction. The main advantage of NeRF is that it can generate high-quality, realistic synthesized views while finely modeling details of the scene (such as shadows and reflections). However, its main disadvantage is the large computational overhead, since a 5D integration is required for every new view generated.
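The 5D input described above (a 3D point plus a 2D viewing direction) is, in the standard NeRF formulation, first lifted with a sinusoidal positional encoding so the MLP can represent high-frequency detail. A minimal sketch of that encoding (the frequency count is illustrative; the original NeRF uses 10 frequencies for positions and 4 for directions):

```python
import math

def positional_encoding(p, num_freqs=4):
    """NeRF-style encoding: map each coordinate x to
    (sin(2^k * pi * x), cos(2^k * pi * x)) for k = 0..num_freqs-1,
    so a small MLP can fit high-frequency scene detail."""
    out = []
    for x in p:
        for k in range(num_freqs):
            f = (2.0 ** k) * math.pi
            out.extend([math.sin(f * x), math.cos(f * x)])
    return out

enc = positional_encoding([0.5, -0.25, 1.0], num_freqs=4)
print(len(enc))  # 3 coordinates * 4 frequencies * 2 functions = 24
```

The encoded point and encoded viewing direction are then concatenated and fed to the MLP that predicts color and volume density.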
Preferably, in the step S3, the workflow of NeRF is as follows:
a. Input: NeRF receives as input a set of 2D images, which should include multiple views of the scene, along with the camera parameters of those images (including the camera position and orientation of each image), so that global information about the scene can be obtained;
b. Training process: during training, NeRF takes each point in the scene (in 3D space) and a viewing direction as input, then predicts the color and volume density of that point in that direction; the volume density determines how much the light is attenuated as it passes through the point. This process can be seen as a 5D problem (a point in 3D space plus a 2D viewing angle). During training, NeRF uses an integral loss function that measures the error between the network's predicted color and the observed color; this integration essentially simulates the propagation of light through the scene. Through backpropagation and gradient descent, the parameters of NeRF are optimized so that the predicted color is as close as possible to the observed color;
c. Output: after training is completed, NeRF can generate a new, synthesized view. This is achieved by forward-propagating each point in the scene together with a new viewing direction, and then integrating the predicted colors and volume densities.
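The integration of predicted colors and densities described above is, in the standard NeRF formulation, evaluated numerically as alpha compositing along each ray: C = Σᵢ Tᵢ·(1 − exp(−σᵢδᵢ))·cᵢ with transmittance Tᵢ = exp(−Σ_{j<i} σⱼδⱼ). A toy sketch with made-up samples (a real renderer would query the trained MLP for the colors and densities):

```python
import math

def render_ray(colors, densities, deltas):
    """Quadrature form of the NeRF volume-rendering integral:
    C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    where T_i is the transmittance remaining before sample i."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for c, sigma, delta in zip(colors, densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        w = transmittance * alpha               # its compositing weight
        color = [acc + w * ch for acc, ch in zip(color, c)]
        transmittance *= 1.0 - alpha
    return color, transmittance

# Three samples along one ray: empty space, then a nearly opaque red
# surface, then a green surface hidden behind it
colors = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
densities = [0.0, 50.0, 50.0]   # sigma: high density = opaque
deltas = [0.1, 0.1, 0.1]        # step sizes along the ray
c, t = render_ray(colors, densities, deltas)
print([round(x, 3) for x in c])
```

The red surface absorbs almost all the transmittance, so the occluded green sample contributes almost nothing: exactly the occlusion behavior the integral is meant to capture.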
Preferably, in the step S6, the digital twin is converted into a data set composed of the room space and the coordinates and types of the different objects in the room, and subsequent judgments are made using this data set.
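Such a data set might look like the following. The structure and values are hypothetical, since the patent does not fix a schema; the point is that the twin reduces to room extents plus typed, positioned objects that the edge-side reasoning can query:

```python
# Hypothetical digital-twin data set derived from the NeRF reconstruction:
# the room extent plus the coordinates and type of each detected object.
digital_twin = {
    "room": {"size_m": [5.0, 4.0, 2.7]},
    "objects": [
        {"type": "tv",     "position_m": [3.0, 0.2, 1.0]},
        {"type": "fridge", "position_m": [0.3, 3.5, 0.9]},
        {"type": "lamp",   "position_m": [4.5, 3.8, 1.8]},
    ],
}

def objects_of_type(twin, kind):
    """Look up the positions of all objects of a given type,
    e.g. for the subsequent gaze-versus-appliance judgment."""
    return [o["position_m"] for o in twin["objects"] if o["type"] == kind]

print(objects_of_type(digital_twin, "tv"))  # [[3.0, 0.2, 1.0]]
```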
Advantageous effects
The invention provides a NeRF-based method for intelligently judging interaction with indoor household appliances. Compared with the prior art, it has the following beneficial effects:
Indoor scanning is performed with a mobile phone, so the user can acquire information about the indoor environment directly with the phone without purchasing additional hardware, and the indoor environment is reconstructed using NeRF technology. Existing smart home systems are usually based on a preset room layout and device positions; in this scheme, the indoor environment is reconstructed automatically through mobile phone scanning and NeRF technology, and control is based on the NeRF reconstruction information and posture perception, which certainly improves the usability and flexibility of the system.
A digital twin model is used for smart home control, whereas existing smart home systems rarely use a digital twin model for device control. With NeRF technology, the user does not need to manually enter device positions or configure the room layout: the system acquires and models this information automatically, improving the real-time performance and precision of the smart home system. The digital twin model reflects the state of the indoor environment in real time, making device control more accurate and efficient. Through automatic scanning and modeling of the indoor environment, the user gets a simpler and more efficient experience when configuring and using the smart home system; the user can rescan and update the indoor environment with a mobile phone at any time, so the system can adapt to changes in the environment. With a digital twin model updated in real time, the smart home system obtains more accurate device state information, so device control is more accurate.
Drawings
FIG. 1 is a flow chart of the steps of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, the present invention provides a technical solution: a NeRF-based intelligent judging method for indoor household appliance interaction, specifically comprising the following steps:
S1, scanning with a mobile phone: in this step, the user only needs to shoot a video covering all IoT devices in the room with a mobile phone;
S2, deducing camera extrinsic parameters:
(1) Feature extraction and matching: once features have been extracted from all the images, COLMAP begins feature matching. Feature matching means finding the same feature points between two or more images; feature descriptors (vectors representing the intensity pattern of the pixels around a feature) play a key role at this stage. Features whose descriptors are most similar (i.e. closest, for example under the Euclidean distance) can be found across different images and are considered matched feature points;
(2) Sparse reconstruction: with the matched feature points, a sparse three-dimensional model can be created by computing the positions of the feature points in space together with the positions and poses of the cameras, by solving an optimization problem called "bundle adjustment", which is essentially a process that minimizes reprojection errors. A reprojection error is the distance between a point observed in the image and the corresponding 3D point projected through the camera model. The algorithm tries to find the set of camera parameters (including the position and pose of each camera) and 3D point positions that minimize all reprojection errors; this optimization problem is usually solved by iterative linearization methods such as the Gauss-Newton method or the Levenberg-Marquardt method. In each iteration, a parameter update is sought that further reduces the reprojection error; when the optimization converges, the positions and poses of the cameras and the three-dimensional coordinates of the matched feature points are obtained;
S3, reconstructing the indoor environment with NeRF from the images to obtain a NeRF model;
S4, from the NeRF model, an image of the room from any position and any viewing angle can be obtained, and it is used as follows:
(1) Position of the fixed camera in the room
This part of the method is the same as S2: using the existing image sequence supplemented by NeRF renderings, the position parameters of the fixed camera can easily be obtained through feature point matching and sparse reconstruction;
(2) The positions of the home appliances around the room, judged comprehensively by combining two methods:
a. 3D object identification
i. Generating a 3D model: first, using the NeRF model, a 3D model of the room can be generated from multiple perspectives, including the shape, color, and location of objects. This model can cover every corner of the room, so the specific location of each appliance can be obtained from it;
ii. Object recognition: next, the home appliances must be identified in the 3D model. This can be achieved by training a deep learning model that recognizes various appliances in 3D data; for example, a 3D convolutional neural network can be trained to process 3D data and recognize the shapes of appliances;
iii. Household appliance positioning: once an appliance has been identified in the 3D model, its location can be determined; in particular, the center of gravity of the appliance model can be used as its location. This location in 3D space can then guide subsequent appliance control;
b. 2D depth perception judgment:
First, target recognition is performed on the video sequence to determine the positions of the home appliances; then, for the position of each appliance, NeRF is used to render several views of the appliance from different positions, and the spatial information of the appliance is obtained according to the depth perception principle of a binocular camera:
i. Target recognition: first, the various home appliances in the image must be identified. This can be achieved with deep learning object detection algorithms such as YOLO, SSD, or Faster R-CNN, which recognize predefined object categories in an image and give their positions in the image (typically represented by a bounding box);
ii. Spatial localization: next, the position of each appliance in three-dimensional space must be determined. To achieve this, the capability of the NeRF model can be used: for each identified appliance, the depth information of each pixel in its bounding box can be taken as input, and the corresponding three-dimensional coordinates obtained through the NeRF model, giving a rough position of each appliance in three-dimensional space;
iii. Perspective rendering and depth perception: to acquire accurate three-dimensional information about an appliance, images of it can be rendered from different viewing angles using the NeRF model. The principle is similar to a binocular camera or multi-view stereo vision: the depth of an object can be computed by comparing images of the same object from different viewing angles. Since the NeRF model can generate images from arbitrary viewing angles, it can provide enough views for depth perception;
iv. Three-dimensional modeling: after the accurate three-dimensional position of an appliance is obtained, the NeRF model can further be used to build a three-dimensional model of the appliance; specifically, each part of the appliance can be taken as input, and its shape and color in three-dimensional space obtained through the NeRF model. The resulting three-dimensional model of the appliance is very useful for subsequent smart home control;
S5, modeling the smart appliances in the room to obtain a digital twin;
S6, to ensure security and privacy, the subsequent reasoning and control parts are deployed at the edge;
S7, the camera judges and infers the posture of the person, and then makes a judgment according to the position of the smart appliance:
a. Human body pose estimation: first, human body pose estimation must be performed on the video stream acquired from the camera. This is an important problem in computer vision: the aim is to find the key points of the human body (such as the head, shoulders, elbows, knees, and feet) in an image or video, and then determine the spatial relationships among these key points so as to represent the pose of the body. Many deep learning methods can be used for human pose estimation, such as OpenPose and AlphaPose; these methods first process the image with a convolutional neural network (CNN) and then find the key points of the body through regression or classification tasks. Such algorithms can handle single-person or multi-person pose estimation, in either 2D or 3D;
b. Pose judgment and reasoning: once the pose information of the person is obtained, further judgment and reasoning can be performed. For example, the behavior of the person (walking, sitting down, raising a hand, etc.) can be judged by analyzing the positions and movements of the key points, and the person's attention can be inferred by observing the direction of the face or the gaze point of the eyes, which is important for understanding the person's intention;
c. Judging according to the position of the appliance: knowing the person's posture and behavior, the locations of the smart appliances must also be considered. For example, if the person is looking at the television, they may want to turn it on; if the person is in the kitchen looking at the refrigerator, they may want to open it. In this process, the location of each appliance must be known; it can be obtained from the earlier NeRF modeling. Then, based on the person's position and posture and the locations of the appliances, the system can determine which appliance the person may want to interact with.
In the present invention, in S2 (1), feature extraction and matching is the first step of the workflow and also a fundamental step in computer vision. The main goal of this stage is to find and describe special points in the image that are representative of the object or scene, such as corners, edges, or other salient local features. These points have distinctive image attributes so that they can be identified and matched between multiple images, and they are often used to recognize images and estimate camera motion. The feature extractor, typically the SIFT algorithm, can find and describe these features at various scales and is invariant to rotation, scale, and illumination.
In the present invention, in S3, NeRF is a technique for generating new, synthesized views that is suitable for complex scenes such as indoor and outdoor environments. Its core idea is to learn a neural network that takes a point in 3D space and a viewing direction as input and outputs the color and opacity of that point in that direction. The main advantage of NeRF is that it can generate high-quality, realistic synthesized views while finely modeling details of the scene (such as shadows and reflections). However, its main disadvantage is the large computational overhead, since a 5D integration is required for every new view generated.
In the present invention, in S3, the workflow of NeRF is as follows:
a. Input: NeRF receives as input a set of 2D images, which should include multiple views of the scene, along with the camera parameters of those images (including the camera position and orientation of each image), so that global information about the scene can be obtained;
b. Training process: during training, NeRF takes each point in the scene (in 3D space) and a viewing direction as input, then predicts the color and volume density of that point in that direction; the volume density determines how much the light is attenuated as it passes through the point. This process can be seen as a 5D problem (a point in 3D space plus a 2D viewing angle). During training, NeRF uses an integral loss function that measures the error between the network's predicted color and the observed color; this integration essentially simulates the propagation of light through the scene. Through backpropagation and gradient descent, the parameters of NeRF are optimized so that the predicted color is as close as possible to the observed color;
c. Output: after training is completed, NeRF can generate a new, synthesized view. This is achieved by forward-propagating each point in the scene together with a new viewing direction, and then integrating the predicted colors and volume densities.
In the invention, in S6, the digital twin is converted into a data set composed of the room space and the coordinates and types of the different objects in the room, and subsequent judgments are made using this data set.
And all that is not described in detail in this specification is well known to those skilled in the art.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1. An intelligent judgment method for indoor household appliance interaction based on NeRF, characterized by comprising the following steps:
s1, scanning by a mobile phone: in the process, the user only needs to shoot a video covering all IoT devices in the room with the mobile phone;
s2, deducing camera external parameters:
(1) Feature extraction and matching: once features have been extracted from all the images, COLMAP performs feature matching, i.e. finding the same feature points across two or more images. Feature descriptors, vectors representing the intensity pattern of the pixels around each feature, play the key role at this stage: features whose descriptors are most similar (i.e. closest, for example under the Euclidean distance) in different images are considered matched feature points;
(2) Sparse reconstruction: with the matched feature points, a sparse three-dimensional model can be created by computing the positions of the feature points in space and the position and pose of the camera. This is done by solving an optimization problem called bundle adjustment, which is essentially a process of minimizing reprojection error. The reprojection error is the distance between a point observed in an image and the projection of the corresponding 3D point through the camera model. The algorithm searches for the set of camera parameters (including the position and pose of each camera) and 3D point positions that minimize the total reprojection error. This optimization problem is usually solved by iterative linearization methods such as the Gauss-Newton method or the Levenberg-Marquardt method: each iteration searches for a parameter update that further reduces the reprojection error, and when the optimization converges, the position and pose of the camera and the three-dimensional coordinates of the matched feature points are obtained;
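The descriptor-matching step described in (1) can be sketched as follows: a minimal nearest-neighbour matcher under the Euclidean distance, with Lowe's ratio test to reject ambiguous matches. The toy 2-dimensional descriptors stand in for real 128-dimensional SIFT vectors:

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Match each descriptor in desc_a to its nearest neighbour in desc_b,
    keeping only matches that pass Lowe's ratio test."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # Euclidean distance to every descriptor in B
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:      # unambiguous nearest neighbour only
            matches.append((i, int(best)))
    return matches

# toy descriptors: a0 clearly matches b1, a1 clearly matches b2
desc_a = np.array([[1.0, 0.0], [0.5, 0.5]])
desc_b = np.array([[0.0, 1.0], [1.0, 0.05], [0.55, 0.5]])
print(match_descriptors(desc_a, desc_b))  # [(0, 1), (1, 2)]
```

The matched index pairs are exactly the correspondences that bundle adjustment then consumes.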
s3, reconstructing an indoor environment by using the NeRF according to the image to obtain a NeRF model;
s4, according to the NeRF model, an image of any view angle in any direction in the room can be obtained, and the image can be obtained by the method:
(1) Position of the fixed camera in the room:
This step is the same as in S2: using the existing image sequence supplemented with views rendered by NeRF, the position parameters of the fixed camera can easily be obtained through feature point matching and sparse reconstruction;
(2) Positions of the household appliances throughout the room; this step combines two methods:
a. 3D object identification:
i. Generating a 3D model: first, using the NeRF model, a 3D model of the room can be generated from multiple perspectives, including information on the shape, color and location of objects. Since this model can cover every corner of the room, the specific location of an appliance can be obtained from it;
ii. Object recognition: next, the household appliances must be identified in the 3D model. This can be achieved by training a deep learning model that recognizes various appliances in 3D data; for example, a 3D convolutional neural network can be trained to process 3D data and identify the shapes of the appliances;
iii. Household appliance positioning: once an appliance has been identified in the 3D model, its location can be determined; in particular, the center of gravity of the appliance model can be used as the position of the appliance. This position lies in 3D space and can be used to guide subsequent appliance control;
b. Depth-perception judgment from 2D images:
First, target recognition is performed on the video sequence to determine where the appliances appear. Then, for the position of each appliance, NeRF renders several views facing the appliance from different positions, and the spatial information of the appliance is obtained according to the depth perception principle of a binocular camera:
i. Target recognition: first, the various household appliances must be identified in the images. This can be achieved with deep learning object detection algorithms such as YOLO, SSD or Faster R-CNN, which recognize predefined object categories in an image and give their positions in the image (typically as a bounding box);
ii. Spatial localization: next, the position of each appliance in three-dimensional space must be determined. To achieve this, the capability of the NeRF model can be exploited: for each identified appliance, the depth information of each pixel in its bounding box is taken as input, and the corresponding three-dimensional coordinates are obtained through the NeRF model, giving a rough position of each appliance in 3D space;
iii. Perspective rendering and depth perception: to obtain accurate three-dimensional information about an appliance, images of the appliance can be rendered from different viewpoints with the NeRF model. The principle is similar to a binocular camera or multi-view stereo: by comparing images of the same object from different viewpoints, the depth of the object can be computed. Since the NeRF model can generate images from any viewpoint, it can provide enough viewpoints for depth perception;
iv. Three-dimensional modeling: after the accurate three-dimensional position of an appliance has been obtained, the NeRF model can further be used to build a three-dimensional model of the appliance: each part of the appliance is taken as input, and its shape and color in three-dimensional space are obtained through the NeRF model. The resulting three-dimensional model of the appliance is very useful for subsequent smart home control;
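The binocular depth-perception principle invoked above can be illustrated with the standard rectified-stereo relation Z = f·B/d; the focal length, baseline and pixel coordinates below are illustrative values, not parameters of the method:

```python
def stereo_depth(focal_px, baseline_m, x_left_px, x_right_px):
    """Depth from two horizontally offset views of the same point:
    Z = f * B / d, where d is the disparity between the matched pixels."""
    disparity = x_left_px - x_right_px
    if disparity <= 0:
        raise ValueError("point must have positive disparity")
    return focal_px * baseline_m / disparity

# e.g. two NeRF-rendered views 0.2 m apart, focal length 800 px,
# and the appliance's pixel shifts by 40 px between the views
print(stereo_depth(800.0, 0.2, 420.0, 380.0))  # ≈ 4.0 m
```

With NeRF, the "second camera" is free: any number of virtual viewpoints can be rendered, so the baseline can be chosen to suit the depth range of interest.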
s5, modeling the intelligent household appliances in the room, and obtaining a digital twin;
s6, in order to ensure safety and privacy, a follow-up reasoning and control part is arranged at the edge end;
s7, the camera judges and infers the gesture of the person, and then judges according to the arrival of the position of the intelligent household appliance:
a. Human body pose estimation: first, human pose estimation must be performed on the video stream acquired from the camera. This is an important problem in computer vision: the goal is to find the key points of the human body, such as the head, shoulders, elbows, knees and feet, in an image or video, and then to determine the spatial relationship among these key points so as to represent the posture of the body. Many deep learning methods, such as OpenPose and AlphaPose, can be used for human pose estimation: they first process the image with a convolutional neural network (CNN) and then locate the body key points through regression or classification tasks. These algorithms can handle single-person or multi-person pose estimation, in either 2D or 3D;
b. Posture judgment and reasoning: once the pose information of the person has been obtained, further judgment and reasoning can be performed. For example, by analyzing the position and movement of the key points, the person's behavior (walking, sitting down, raising a hand, etc.) can be determined; and by observing the direction of the person's face or the gaze point of the eyes, the person's focus of attention can be inferred, which is important for understanding the person's intention;
c. Judging according to the position of the household appliance: knowing the person's posture and behavior, the location of each smart appliance must also be considered. For example, if the person is looking at the television, he may want to turn on the television; if the person is in the kitchen and looking at the refrigerator, he may want to open the refrigerator. The location of each appliance is obtained from the earlier NeRF modeling; based on the person's position and posture and the positions of the appliances, it can then be determined which appliance the person probably wants to interact with.
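One plausible way to implement the final judgment in step c (a sketch under assumed inputs, not the claimed method itself) is to pick the appliance whose direction from the user's head best aligns with the estimated gaze vector, within an angular tolerance:

```python
import numpy as np

def likely_target(head_pos, gaze_dir, appliances, max_angle_deg=30.0):
    """Return the appliance best aligned with the gaze vector,
    or None if no appliance lies within the gaze cone."""
    gaze = np.asarray(gaze_dir, float)
    gaze /= np.linalg.norm(gaze)
    best_name, best_cos = None, np.cos(np.radians(max_angle_deg))
    for name, pos in appliances.items():
        to_app = np.asarray(pos, float) - head_pos       # direction head -> appliance
        cos = to_app @ gaze / np.linalg.norm(to_app)     # cosine of angle to gaze
        if cos > best_cos:                               # inside cone and best so far
            best_name, best_cos = name, cos
    return best_name

# appliance coordinates from the NeRF-derived digital twin (illustrative values)
appliances = {"television": np.array([3.0, 0.0, 1.0]),
              "refrigerator": np.array([0.0, 3.0, 1.0])}
head = np.array([1.5, 2.0, 1.6])
print(likely_target(head, [1.0, -1.0, -0.2], appliances))  # "television"
```

Returning None when nothing falls inside the gaze cone avoids triggering an appliance on every glance around the room.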
2. The intelligent judgment method for indoor household appliance interaction based on NeRF, characterized in that: in S2 (1), feature extraction and matching is the first step of the workflow and a fundamental step in computer vision. The main goal of this stage is to find and describe special points in the image that are representative of the objects or scene, such as corner points, edges or other significant local features. These features have unique image properties so that they can be identified and matched with each other across multiple images, and the feature points are often used to estimate camera motion. The feature extractor can find and describe these features at various scales, typically using the SIFT algorithm, which is invariant to rotation, scale and illumination.
3. The intelligent judgment method for indoor household appliance interaction based on NeRF, characterized in that: in S3, NeRF is a technique for generating new, synthesized views that is suitable for complex scenes, such as indoor and outdoor environments. Its core idea is to learn a neural network that takes a point in 3D space and a viewing direction as input and outputs the color and opacity of that point in that direction. The main advantage of NeRF is that it can generate high-quality, realistic synthesized views while finely modeling the details of the scene (such as shadows and reflections); its main disadvantage is the large computational overhead, since a 5D integration is required for every new view generated.
4. The intelligent judgment method for indoor household appliance interaction based on NeRF, characterized in that: in S3, the workflow of NeRF is as follows:
a. Input: NeRF receives as input a set of 2D images covering multiple views of the scene, together with the camera parameters of those images (the position and orientation of the camera for each image), so that global information about the scene is available;
b. Training process: during training, NeRF takes as input each point in the scene (in 3D space) together with a viewing direction, and predicts the color and volume density of that point in that direction; the volume density determines how much the light is attenuated as it passes through the point. The problem can therefore be seen as 5D (a 3D point plus a 2D viewing direction). NeRF is trained with a rendering loss that measures the error between the color predicted by the network and the observed color; the rendering integral essentially simulates the propagation of light through the scene. Through back-propagation and gradient descent, the parameters of NeRF are optimized so that the predicted colors match the observed colors as closely as possible;
c. Output: after training is completed, NeRF can generate new, synthesized views. This is achieved by querying the network at each point along a ray in a new viewing direction and integrating the predicted colors and volume densities.
5. The intelligent judgment method for indoor household appliance interaction based on NeRF, characterized in that: in S6, the digital twin is converted into a data set consisting of the room space together with the coordinates and types of the different objects in the room, and this data set is used for subsequent judgment.
CN202310941429.XA 2023-07-28 2023-07-28 Indoor household appliance interaction intelligent judgment method based on nerf Withdrawn CN116866103A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310941429.XA CN116866103A (en) 2023-07-28 2023-07-28 Indoor household appliance interaction intelligent judgment method based on nerf

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310941429.XA CN116866103A (en) 2023-07-28 2023-07-28 Indoor household appliance interaction intelligent judgment method based on nerf

Publications (1)

Publication Number Publication Date
CN116866103A true CN116866103A (en) 2023-10-10

Family

ID=88223412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310941429.XA Withdrawn CN116866103A (en) 2023-07-28 2023-07-28 Indoor household appliance interaction intelligent judgment method based on nerf

Country Status (1)

Country Link
CN (1) CN116866103A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118011893A (en) * 2024-01-09 2024-05-10 西乔科技南京有限公司 Man-machine interaction system based on artificial intelligence



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20231010
