WO2021137349A1 - Environment-based method and system for combining three-dimensional spatial recognition and two-dimensional cognitive region segmentation for real-virtual matching object arrangement - Google Patents

Environment-based method and system for combining three-dimensional spatial recognition and two-dimensional cognitive region segmentation for real-virtual matching object arrangement

Info

Publication number
WO2021137349A1
Authority
WO
WIPO (PCT)
Prior art keywords
dimensional
spatial
information
virtual object
scenario
Prior art date
Application number
PCT/KR2020/000616
Other languages
French (fr)
Korean (ko)
Inventor
한상준
Original Assignee
엔센스코리아주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 엔센스코리아주식회사
Publication of WO2021137349A1 publication Critical patent/WO2021137349A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration

Definitions

  • The present invention relates to a method and system based on real-virtual matching technology capable of synthesizing virtual objects in real space. To achieve real-virtual object matching with a higher sense of presence, it performs cognitive image segmentation on a camera image, fuses the two-dimensional segmentation information thus obtained with three-dimensional spatial information obtained through SLAM, and augments the pre-prepared virtual objects associated with the real-space information identified by the segmentation, providing an improved user experience.
  • Augmented reality technology is a field derived from virtual reality technology in which virtual objects are synthesized and superimposed on real space. It can increase the sense of presence of a virtual object by creating the illusion that the object actually exists in the real space.
  • In one prior method, the position of a target object in virtual space is calculated from a depth image obtained with a depth camera and compared against a reference position database to generate an event execution signal.
  • The present invention tracks the position and orientation of a portable terminal based on Simultaneous Localization And Mapping (SLAM) using a camera and an inertial sensor, and applies a cognitive region segmentation method that infers the real-space image obtained from the camera at the pixel level.
  • SLAM (Simultaneous Localization And Mapping)
  • By combining the two-dimensional region image obtained through the segmentation method with the three-dimensional spatial information of the real space by three-dimensional projection, the components of each space can be recognized cognitively. The purpose is to implement augmented reality with a higher sense of presence and to provide interaction with virtual objects according to the real space.
  • Continuous images are acquired through a camera, feature points are extracted from the t-1 frame and the t frame using a feature point extraction algorithm such as SIFT, SURF, or ORB, and a descriptor is generated for each feature point.
  • The Euclidean distance between feature descriptors is calculated to form nearest-neighbor feature point pairs, and the translation and rotation of the portable terminal in three-dimensional space are computed from the resulting matching information between the t-1 frame and the t frame.
  • A polygon approximating the outermost contour of each class in the two-dimensional cognitive region segmentation image is derived, and each corner point of the polygon is projected from its two-dimensional coordinates onto the map of three-dimensional spatial information; the average of the coordinates of at least three three-dimensional feature points nearest the point of closest intersection is taken, converting the two-dimensional polygon into surface information in three-dimensional space.
  • The present invention can recognize the shape, size, and class of a spatial object, making it possible to select virtual content that creates an interaction between the user and the real space, or to present content according to a pre-planned scenario.
  • The present invention as described above can provide a higher sense of presence by implementing interaction between the real space and virtual objects for the user; in providing augmented reality content, scenarios suited to the real space can be configured and virtual objects can be selectively augmented, enabling expansion into a wider range of businesses.
  • FIG. 2 is a flowchart specifically illustrating the object augmentation processing unit, which performs the associated operations obtained from the virtual object DB and the scenario DB using the spatial object information generated by the spatial recognition processing unit.
  • The spatial map is initialized by moving the camera pose in 3D space, based on the matched feature point pairs between the previous frame and the current frame, to find the camera position that best fits the spatial map and thereby estimate its initial position.
  • A moving average is obtained by integrating the successive inertial sensor values acquired through the inertial sensor of FIG. 1, yielding the rotation and translation of the portable terminal.
  • The spatial information processing unit of FIG. 1 compares the rotation and translation values obtained from the image processing unit with those obtained from the inertial sensor processing unit; when the difference exceeds a threshold, a reliability index is derived for each by comparing the magnitude of the inertial sensor bias with the magnitude of the tracking error of the image feature points obtained with an optical tracking algorithm such as optical flow.
  • The reliability indices are normalized so that they sum to 1, each rotation and translation value is multiplied by its normalized reliability index, and the results are summed as a correction, so that the camera image and inertial sensor values are combined to construct more reliable spatial information and enable camera tracking.
  • The region segmentation inference unit of FIG. 1 performs region segmentation by running inference with a pre-trained model capable of pixel-level cognitive region segmentation, such as Mask R-CNN, on the t frame of the camera image acquired from the image processing unit; the result is a two-dimensional cognitive region segmentation image.
  • The region for each class is first separated, its outline is approximated by a polygon, and the corner coordinates of the polygon are derived.
  • For each corner coordinate, the spatial object generation unit forms the line segment through the camera coordinate and the two-dimensional coordinate and projects it into the three-dimensional spatial information; the Euclidean distance between this line segment and each feature point in the three-dimensional spatial information is computed, at least three feature points that are encountered first from the camera and lie within the distance threshold are selected, and their values are averaged. Using this average, the distance of the two-dimensional coordinate from the camera in three-dimensional space is obtained, and the corner is projected to that depth to convert it into a three-dimensional coordinate.
  • The direction orthogonal to gravity is detected using the gyro sensor, the height of each corner point along the Earth's gravity direction is obtained, and the heights are compared to check whether they are level within a threshold; corner coordinates that exceed the threshold are removed and the polygon is regenerated, producing a polygon with at least three corners, which is created as a spatial object.
  • The following operations are performed to augment a virtual object on the spatial object generated by the spatial recognition processing unit.
  • The spatial object generated by the spatial recognition processing unit is passed to the object augmentation processing unit, which queries the scenario DB with the class of each spatial object.
  • The scenario DB holds the spatial object class, a list of virtual objects requiring interaction, the operation type of each virtual object, the virtual object ID, the virtual object size, and the relative spatial coordinates of the virtual object.
  • The object augmentation processing unit acquires at least one scenario record from the scenario DB, queries the virtual object DB with the virtual object ID from that record to obtain the virtual object's content or physical file, scales the object to the size obtained from the scenario DB, and modifies the three-dimensional spatial coordinates at which the virtual object is to be augmented by adding the relative spatial coordinates from the scenario DB to either a user-specified location selected with the touch screen of FIG. 2 or the center coordinate of the spatial object.
  • The virtual object produced through the above process is composited with the t-frame image obtained from the camera and output on the display, thereby realizing augmented reality and implementing the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Architecture (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

According to the present invention, the position and orientation of a portable terminal are tracked using simultaneous localization and mapping (SLAM), a spatial recognition technique based on a camera and an inertial sensor, and three-dimensional spatial information of the real space is generated. A spatial object is generated from the camera image using cognitive region segmentation and projected onto the three-dimensional space, so that a floor and a table can be distinguished from each other, which the plane recognition of SLAM alone cannot do and which is impossible in the prior art. Virtual content that allows a user to interact with an object in the real space according to a pre-planned scenario DB can therefore be provided, offering a high sense of presence and interaction.

Description

[Correction of 23.03.2020 under Rule 26] Environment-based method and system for combining three-dimensional spatial recognition and two-dimensional cognitive region segmentation for real-virtual matching object arrangement
The present invention relates to a method and system based on real-virtual matching technology capable of synthesizing virtual objects in real space. To achieve real-virtual object matching with a higher sense of presence, cognitive image segmentation is performed on the camera image, the two-dimensional segmentation information thus obtained is fused with three-dimensional spatial information obtained through SLAM, and, using the information obtained from the cognitive image segmentation, virtual objects associated with the real-space information are selected from a set of pre-prepared virtual objects and augmented, providing an improved user experience.
In general, augmented reality (AR) technology is a field derived from virtual reality (VR) technology in which virtual objects are synthesized and superimposed on real space. Compared with virtual reality, it can increase the sense of presence by creating the illusion that a virtual object actually exists in the real space.
As a first example of the prior art, a 3D point cloud map is generated from a depth image obtained with a depth camera, the real-space object onto which augmented content is to be projected is tracked using the 3D point cloud map, and the virtual object is projected directly onto the real space with a display device such as a projector, allowing the virtual content to be superimposed on the real object and to interact with the user.
As a second example, to generate an event by recognizing a user's motion in three-dimensional real space, the position of a target object in virtual space is calculated from a depth image obtained with a depth camera and compared against a reference position database to generate an event execution signal.
However, since the above methods rely only on three-dimensional spatial information obtained from a camera or a sensor, they cannot obtain cognitive information such as which space the user is in or whether the object in front is a table or a floor.
In addition, to augment a virtual object in real space, the user generally places a selected object in the space and adjusts its size or moves it manually, so augmented reality is realized only through the user's explicit intent.
These limitations mean that, when creating content or scenarios for augmented reality, interaction with the real space is ignored or objects can only be presented according to a pre-written script, which is a significant obstacle to providing a user experience with a high sense of presence.
The present invention tracks the position and orientation of a portable terminal based on Simultaneous Localization And Mapping (SLAM) using a camera and an inertial sensor, and combines the two-dimensional region image obtained by a cognitive region segmentation method, which infers the real-space image from the camera at the pixel level, with the three-dimensional spatial information of the real space by three-dimensional projection. Its purpose is to recognize the components of each space cognitively, to implement augmented reality with a higher sense of presence, and to provide interaction with virtual objects according to the real space.
To this end, continuous images are acquired through the camera, feature points are extracted from the t-1 frame and the t frame using a feature point extraction algorithm such as SIFT, SURF, or ORB, and a descriptor is generated for each feature point. The Euclidean distance between descriptors is computed to form nearest-neighbor feature point pairs, and the translation and rotation of the portable terminal in three-dimensional space are obtained from the resulting matching information between the t-1 frame and the t frame. In addition, the inertial sensor values from the acquisition of the t-1 frame to the acquisition of the t frame are accumulated to obtain the terminal's translation and rotation; a confidence score is assigned to each of the two estimates for mutual correction, the translation and rotation values are multiplied by their confidence scores, and the weighted average of the two is used to construct the three-dimensional spatial information.
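For illustration, the following is a minimal sketch of the feature extraction and nearest-neighbor matching step, assuming OpenCV (cv2) and NumPy are available; SIFT stands in for the SIFT/SURF/ORB options named above, and descriptors are paired by Euclidean (L2) distance. It is a sketch of the idea, not the patented implementation.

```python
# Minimal sketch: SIFT feature extraction and nearest-neighbor matching
# between the t-1 and t frames, assuming OpenCV and NumPy.
import cv2
import numpy as np

def match_features(frame_prev, frame_curr):
    """Extract SIFT keypoints in two consecutive grayscale frames and pair
    them by nearest-neighbor Euclidean distance between descriptors."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(frame_prev, None)
    kp2, des2 = sift.detectAndCompute(frame_curr, None)

    # Brute-force matcher with the L2 (Euclidean) norm; crossCheck keeps
    # only mutually nearest pairs, a simple stand-in for the pairing rule.
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(des1, des2)

    pts_prev = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts_curr = np.float32([kp2[m.trainIdx].pt for m in matches])
    return pts_prev, pts_curr
```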
The image of the t frame acquired at this time is also passed through a pre-trained artificial neural network to perform pixel-level cognitive region segmentation. In particular, an artificial neural network model such as Mask R-CNN is trained to distinguish the floor, walls, tables, people, cups, and so on in the image, from which a two-dimensional cognitive region segmentation image is obtained.
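As a rough illustration of the pixel-level segmentation step, the sketch below runs a Mask R-CNN model with PyTorch/torchvision. The patent describes a model trained on classes such as floor, wall, table, person, and cup; a COCO-pretrained model is used here only as a stand-in, so its class set differs from the one described.

```python
# Minimal sketch: Mask R-CNN inference for pixel-level instance masks,
# assuming PyTorch and torchvision are available.
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def segment_frame(frame_rgb_float):
    """frame_rgb_float: torch.Tensor of shape (3, H, W), values in [0, 1].
    Returns per-instance binary masks, class labels, and scores."""
    with torch.no_grad():
        out = model([frame_rgb_float])[0]
    masks = out["masks"] > 0.5   # (N, 1, H, W) boolean masks per instance
    return masks, out["labels"], out["scores"]
```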
To combine these two results, the present invention derives a polygon approximating the outermost contour of each class in the two-dimensional cognitive region segmentation image, and projects each corner point of the polygon from its two-dimensional coordinates toward the map of three-dimensional spatial information. The average of the coordinates of at least three three-dimensional feature points nearest the point of closest intersection is then taken, converting the two-dimensional polygon into surface information in three-dimensional space.
Through the above method, the present invention goes beyond the limitations of conventional plane detection: a detected plane can be assigned a meaningful class such as floor, wall, or table and recognized in three dimensions as a spatial object, which was not possible with prior inventions.
In this way, the present invention can recognize the shape, size, and class of a spatial object, making it possible to select virtual content that creates an interaction between the user and the real space, or to present content according to a pre-planned scenario.
As described above, the present invention can provide a higher sense of presence by implementing interaction between the real space and virtual objects for the user. In providing augmented reality content, scenarios suited to the real space can be configured and virtual objects can be selectively augmented, enabling expansion into a wider range of businesses.
FIG. 1 is a flowchart specifically illustrating the spatial recognition processing unit, showing a method and system that generates spatial information by combining a camera and an inertial sensor, infers cognitive region segmentation, and combines the two to create spatial objects.
FIG. 2 is a flowchart specifically illustrating the object augmentation processing unit, which performs the associated operations obtained from the virtual object DB and the scenario DB using the spatial object information generated by the spatial recognition processing unit.
The present invention is described in detail with reference to the accompanying drawings as follows.
Referring to FIG. 1, a portable terminal having a camera and an inertial sensor extracts feature points from consecutive images, the t-1 frame and the t frame, acquired through the camera using an algorithm such as SIFT, SURF, or ORB, and determines whether the spatial map has been initialized. If the spatial map has not been initialized, the camera pose and origin are estimated using the feature points extracted from the acquired images. Two consecutive images are required for this estimation: to compare the similarity of the feature points extracted from each image, the Euclidean distance between descriptors is computed, and the two feature points with the shortest distance are paired as matching feature points. The relative position of the t frame with respect to the t-1 frame can then be obtained from the geometric relationship of the matched feature point pairs. In this step, the spatial map is initialized by moving the camera pose in three-dimensional space, based on the matched feature point pairs between the previous and current frames, to find the camera position that best fits the spatial map and thereby estimate its initial position.
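The following sketch, under the same assumptions as the earlier matching sketch, shows one common way to recover the relative camera motion between the t-1 and t frames from the matched points via the essential matrix; the intrinsic matrix K is a placeholder, and this is not necessarily the exact estimation scheme used in the invention.

```python
# Minimal sketch: relative camera pose from matched 2D points, assuming
# OpenCV and the match_features() helper sketched earlier.
import cv2
import numpy as np

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])  # assumed pinhole intrinsics, not calibrated values

def relative_pose(pts_prev, pts_curr):
    """Recover rotation R and unit-scale translation t of the camera
    between two frames from matched image points."""
    E, inliers = cv2.findEssentialMat(pts_prev, pts_curr, K,
                                      method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_prev, pts_curr, K, mask=inliers)
    return R, t  # t is known only up to scale in a monocular setting
```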
Next, the successive inertial sensor values acquired through the inertial sensor of FIG. 1 are integrated and a moving average is taken to obtain the rotation and translation of the portable terminal.
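A heavily simplified sketch of accumulating inertial samples between frames is given below, assuming NumPy; bias compensation, gravity removal, and the moving-average smoothing mentioned above are omitted, so it only illustrates the integration idea.

```python
# Minimal sketch: integrate gyroscope and accelerometer samples taken
# between two camera frames into rotation and translation increments.
import numpy as np

def integrate_imu(gyro_samples, accel_samples, dt):
    """gyro_samples, accel_samples: arrays of shape (N, 3); dt: sample period."""
    rotation = np.zeros(3)    # accumulated angle around each axis (rad)
    velocity = np.zeros(3)
    translation = np.zeros(3)
    for w, a in zip(gyro_samples, accel_samples):
        rotation += w * dt            # Euler integration of angular rate
        velocity += a * dt            # gravity assumed already removed
        translation += velocity * dt
    return rotation, translation
```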
Next, the spatial information processing unit of FIG. 1 compares the rotation and translation values obtained from the image processing unit with those obtained from the inertial sensor processing unit. When the difference exceeds a threshold, a reliability index is derived for each by comparing the magnitude of the inertial sensor bias with the magnitude of the tracking error of the image feature points obtained with an optical tracking algorithm such as optical flow. The reliability indices are normalized so that they sum to 1, each rotation and translation value is multiplied by its normalized reliability index, and the results are summed as a correction, so that the camera image and inertial sensor values are combined to construct more reliable spatial information and enable camera tracking.
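The confidence-weighted correction can be illustrated with the minimal sketch below (NumPy assumed); the reliability scores are placeholders computed elsewhere, and linearly blending rotation values is a simplification that holds only for small rotations.

```python
# Minimal sketch: fuse camera-based and IMU-based motion estimates with
# reliability scores normalized to sum to 1.
import numpy as np

def fuse_estimates(rot_cam, trans_cam, rot_imu, trans_imu,
                   score_cam, score_imu):
    """Weighted combination of two rotation/translation estimates."""
    total = score_cam + score_imu
    w_cam, w_imu = score_cam / total, score_imu / total
    rotation = w_cam * np.asarray(rot_cam) + w_imu * np.asarray(rot_imu)
    translation = w_cam * np.asarray(trans_cam) + w_imu * np.asarray(trans_imu)
    return rotation, translation
```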
Next, the region segmentation inference unit of FIG. 1 performs region segmentation by running inference with a pre-trained model capable of pixel-level cognitive region segmentation, such as Mask R-CNN, on the t frame of the camera image acquired from the image processing unit. The result is a two-dimensional cognitive region segmentation image. The spatial object generation unit of FIG. 1 then separates the region for each class, approximates it with a polygon to form an outline, and derives the corner coordinates of the polygon. For each corner coordinate, the spatial object generation unit forms the line segment through the camera coordinate and the two-dimensional coordinate and projects it into the three-dimensional spatial information. The Euclidean distance between this line segment and each feature point in the three-dimensional spatial information is computed; among the points within a threshold, at least three feature points that are encountered first from the camera and lie close to the segment are selected and their values averaged. Using this average, the distance of the two-dimensional coordinate from the camera in three-dimensional space is obtained, and the corner is projected to that depth to convert it into a three-dimensional coordinate.
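The sketch below illustrates, under assumed thresholds and helper names, how a class mask can be approximated by a polygon with OpenCV and how each corner can be lifted into 3D by casting a ray through the pixel and averaging the depth of the nearest map points; map_points and K follow the earlier sketches and are assumptions, not the invention's exact data structures.

```python
# Minimal sketch: class mask -> polygon corners -> 3D corner coordinates.
import cv2
import numpy as np

def mask_to_corners(class_mask, epsilon=5.0):
    """Approximate the outer contour of a binary class mask by a polygon
    and return its corner pixel coordinates as an (N, 2) array."""
    mask_u8 = class_mask.astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask_u8, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)
    poly = cv2.approxPolyDP(largest, epsilon, closed=True)
    return poly.reshape(-1, 2).astype(np.float64)

def lift_corner_to_3d(corner_px, map_points, K, dist_thresh=0.05, k=3):
    """Cast a ray through the pixel, find the k map points nearest the ray
    (within dist_thresh), and place the corner at their average depth."""
    ray = np.linalg.inv(K) @ np.array([corner_px[0], corner_px[1], 1.0])
    ray /= np.linalg.norm(ray)
    proj = map_points @ ray                        # depth of each point along the ray
    perp = np.linalg.norm(map_points - np.outer(proj, ray), axis=1)
    near = np.where((perp < dist_thresh) & (proj > 0))[0]
    if near.size < k:
        return None                                # not enough support in the map
    chosen = near[np.argsort(proj[near])[:k]]      # the k points nearest the camera
    depth = proj[chosen].mean()
    return ray * depth                             # 3D corner in camera coordinates
```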
After all corner points have been converted to three-dimensional coordinates by the method described above, the direction orthogonal to gravity is detected using the gyro sensor, the height of each corner point along the Earth's gravity direction is obtained, and the heights are compared to check whether they are level within a threshold. Corner coordinates that exceed the threshold are removed and the polygon is regenerated, producing a polygon with at least three corners, which is created as a spatial object.
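A minimal sketch of the gravity-based corner filtering, assuming NumPy: the height of each 3D corner is its projection onto the gravity direction, and corners that deviate from the median height beyond a threshold are dropped (the patent compares heights against a threshold; using the median as the reference is an assumption here).

```python
# Minimal sketch: keep only corners whose heights are level within a threshold.
import numpy as np

def filter_corners_by_height(corners_3d, gravity_dir, height_thresh=0.05):
    """corners_3d: (N, 3) array; gravity_dir: unit vector (3,).
    Returns the level corners, or None if fewer than three remain."""
    heights = corners_3d @ gravity_dir
    level = np.abs(heights - np.median(heights)) < height_thresh
    kept = corners_3d[level]
    return kept if len(kept) >= 3 else None
```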
Next, in the present invention, the following operations are performed to augment a virtual object on the spatial object generated by the spatial recognition processing unit, as shown in FIG. 2.
First, the spatial object generated by the spatial recognition processing unit is passed to the object augmentation processing unit, which queries the scenario DB with the class of each spatial object. If a virtual object is currently being augmented, it is passed along as well so that virtual objects can be selected sequentially. The scenario DB holds the spatial object class, a list of virtual objects requiring interaction, the operation type of each virtual object, the virtual object ID, the virtual object size, and the relative spatial coordinates of the virtual object. The object augmentation processing unit acquires at least one scenario record from the scenario DB, queries the virtual object DB with the virtual object ID from that record to obtain the virtual object's content or physical file, scales the object to the size obtained from the scenario DB, and modifies the three-dimensional spatial coordinates at which the virtual object is to be augmented by adding the relative spatial coordinates from the scenario DB to either a user-specified location selected with the touch screen of FIG. 2 or the center coordinate of the spatial object.
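The scenario lookup and placement step can be sketched with plain Python data structures as below; the dictionary schemas and field names are illustrative assumptions, not the actual scenario DB or virtual object DB layout.

```python
# Minimal sketch: look up a scenario for a spatial-object class and compute
# the placement of the corresponding virtual object.
import numpy as np

SCENARIO_DB = {  # hypothetical contents, keyed by spatial-object class
    "table": [{"object_id": "cup_01", "action": "place",
               "size": 0.12, "offset": np.array([0.0, 0.0, 0.05])}],
}
VIRTUAL_OBJECT_DB = {  # hypothetical asset registry
    "cup_01": {"mesh_file": "cup.glb"},
}

def place_virtual_object(space_class, anchor_xyz):
    """Anchor is either the user's touch point or the spatial object's
    center; the scenario's relative offset is added to it."""
    scenarios = SCENARIO_DB.get(space_class, [])
    if not scenarios:
        return None
    scenario = scenarios[0]
    asset = VIRTUAL_OBJECT_DB[scenario["object_id"]]
    position = np.asarray(anchor_xyz) + scenario["offset"]
    return {"asset": asset["mesh_file"],
            "scale": scenario["size"],
            "position": position}
```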
The virtual object produced through the above process is composited with the t-frame image obtained from the camera and output on the display, thereby realizing augmented reality and implementing the present invention.
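Finally, a minimal compositing sketch: the placed object's 3D anchor is projected into the t frame with the intrinsics K from the earlier sketches and marked with a placeholder; a real implementation would render the virtual object's mesh at that pose instead.

```python
# Minimal sketch: project a 3D anchor point into the camera frame and
# draw a placeholder marker there.
import cv2
import numpy as np

def composite(frame_bgr, position_cam, K):
    """position_cam: 3D point in camera coordinates; returns the frame
    with a marker drawn at its image projection."""
    if position_cam[2] <= 0:
        return frame_bgr                     # behind the camera, nothing to draw
    uvw = K @ np.asarray(position_cam)
    u, v = int(uvw[0] / uvw[2]), int(uvw[1] / uvw[2])
    out = frame_bgr.copy()
    cv2.circle(out, (u, v), 12, (0, 255, 0), thickness=-1)
    return out
```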

Claims (5)

  1. A method and system for creating spatial objects in a portable terminal equipped with a camera and an inertial sensor, comprising: acquiring continuous color images for recognizing a three-dimensional space and constructing spatial information; acquiring and recording continuous inertial sensor data; extracting and matching features from the continuous color images; correcting the matching information using the continuous inertial sensor data; performing region segmentation inference on the color image acquired from the camera; and converting the segmented two-dimensional regions into three-dimensional spatial objects to create spatial objects.
  2. The method of claim 1, further comprising: obtaining a polygon that approximates the segmented two-dimensional region on which the region segmentation inference was performed; projecting the corners of the polygon onto the three-dimensional spatial map to obtain the nearest three-dimensional coordinates and distances; comparing with one another the heights of the corners in three-dimensional space along the Earth's gravity direction obtained from the gyro sensor values, and regenerating the polygon by excluding corners that deviate beyond a threshold; and creating a spatial object from the resulting polygon having at least three corners.
  3. A method comprising: querying the scenario DB with the class of the spatial object created in claim 2 to acquire object information; loading an object from the virtual object DB using the acquired object information; setting the size of the loaded virtual object to the object size in the scenario DB; and calculating three-dimensional coordinates for the position of the loaded virtual object by adding or subtracting the object position correction value of the scenario DB with respect to the touch coordinates of the touch screen.
  4. A method comprising: querying the scenario DB with the class of the spatial object created in claim 2 to acquire object information; loading an object from the virtual object DB using the acquired object information; setting the size of the loaded virtual object to the object size in the scenario DB; and calculating three-dimensional coordinates for the position of the loaded virtual object by adding or subtracting the object position correction value of the scenario DB with respect to the center coordinate of the spatial object created in claim 2.
  5. A method and system comprising: transforming the virtual object determined in claims 3 or 4 using the calculated size and position values; compositing the virtual object onto the color image acquired in claim 1; and displaying the composited image on a display.
PCT/KR2020/000616 2019-12-31 2020-01-13 Environment-based method and system for combining three-dimensional spatial recognition and two-dimensional cognitive region segmentation for real-virtual matching object arrangement WO2021137349A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR20190180137 2019-12-31
KR10-2019-0180137 2019-12-31

Publications (1)

Publication Number Publication Date
WO2021137349A1 (en) 2021-07-08

Family

ID=76686595

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/000616 WO2021137349A1 (en) 2019-12-31 2020-01-13 Environment-based method and system for combining three-dimensional spatial recognition and two-dimensional cognitive region segmentation for real-virtual matching object arrangement

Country Status (1)

Country Link
WO (1) WO2021137349A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130136569A (en) * 2011-03-29 2013-12-12 퀄컴 인코포레이티드 System for the rendering of shared digital interfaces relative to each user's point of view
KR20160048874A (en) * 2013-08-30 2016-05-04 퀄컴 인코포레이티드 Method and apparatus for representing physical scene
US20160189432A1 (en) * 2010-11-18 2016-06-30 Microsoft Technology Licensing, Llc Automatic focus improvement for augmented reality displays
US20180045963A1 (en) * 2016-08-11 2018-02-15 Magic Leap, Inc. Automatic placement of a virtual object in a three-dimensional space
KR101989969B1 (en) * 2018-10-11 2019-06-19 대한민국 Contents experience system of architectural sites based augmented reality


Similar Documents

Publication Publication Date Title
KR102317247B1 (en) The bare hand interaction apparatus and method for augmented rearity using rgb-d images
US11462028B2 (en) Information processing device and information processing method to generate a virtual object image based on change in state of object in real space
US20230071839A1 (en) Visual-Inertial Positional Awareness for Autonomous and Non-Autonomous Tracking
CN108406731B (en) Positioning device, method and robot based on depth vision
US9928656B2 (en) Markerless multi-user, multi-object augmented reality on mobile devices
US9058661B2 (en) Method for the real-time-capable, computer-assisted analysis of an image sequence containing a variable pose
US8842162B2 (en) Method and system for improving surveillance of PTZ cameras
CN106896925A (en) The device that a kind of virtual reality is merged with real scene
CN106997618A (en) A kind of method that virtual reality is merged with real scene
US9767611B2 (en) Information processing apparatus and method for estimating depth values using an approximate plane
JP2018522348A (en) Method and system for estimating the three-dimensional posture of a sensor
US10636190B2 (en) Methods and systems for exploiting per-pixel motion conflicts to extract primary and secondary motions in augmented reality systems
KR20160098560A (en) Apparatus and methdo for analayzing motion
US11727637B2 (en) Method for generating 3D skeleton using joint-based calibration acquired from multi-view camera
CN110941996A (en) Target and track augmented reality method and system based on generation of countermeasure network
CN107016730A (en) The device that a kind of virtual reality is merged with real scene
CN106981100A (en) The device that a kind of virtual reality is merged with real scene
US20200211275A1 (en) Information processing device, information processing method, and recording medium
Shinmura et al. Estimation of Human Orientation using Coaxial RGB-Depth Images.
KR101350387B1 (en) Method for detecting hand using depth information and apparatus thereof
Li et al. A hybrid pose tracking approach for handheld augmented reality
WO2021137349A1 (en) Environment-based method and system for combining three-dimensional spatial recognition and two-dimensional cognitive region segmentation for real-virtual matching object arrangement
KR102299902B1 (en) Apparatus for providing augmented reality and method therefor
JP2017182564A (en) Positioning device, positioning method, and positioning computer program
KR20210048798A (en) Method for determining pose of camera provided in user equipment and location calculation server performing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20910650

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20910650

Country of ref document: EP

Kind code of ref document: A1