WO2024010323A1

WO2024010323A1 - Method and device for visual localization

Info

Publication number: WO2024010323A1
Application number: PCT/KR2023/009385
Authority: WO
Inventors: 최성광
Original assignee: 주식회사 브이알크루
Priority date: 2022-07-04
Filing date: 2023-07-04
Publication date: 2024-01-11
Also published as: KR20240004102A

Abstract

A method performed by means of a computing device is disclosed. The method may comprise the steps of: receiving, from a device of a client, a query image and additional information about the device; extracting, from the query image, first key points about the query image by using an artificial intelligence-based key point extraction model; generating, from the query image, a first detection result corresponding to a predetermined movable object in the query image by using an artificial intelligence-based movable object detection model; determining, from among the first key points, on the basis of the comparison result between the first detection result and the first key points, 1-2 key points removed from the query image and 1-1 key points maintained on the query image; and performing, at least partially on the basis of the 1-1 key points on the query image and the additional information about the device, visual localization on the device that has captured the query image. The representative drawing can be figure 3.

Description

Method and apparatus for visual localization

This disclosure relates to image processing and more specifically to visual localization.

As the demand for Location Based Services increases, the need for accurate location information has increased. The most common way to determine location on mobile devices and mobile platforms is GNSS (Global Navigation Satellite System). However, because GNSS signals may be blocked by obstacles in an indoor environment, there is a limitation that they can only be easily used in an outdoor environment.

Although various technologies have been proposed to perform location recognition indoors, the reality is that many indoor location recognition techniques do not go beyond fingerprinting-based location recognition algorithms using wireless signals. In this method, collected data related to Wi-Fi RSS (Received Signal Strength) or MFS (Magnetic Field Strength) is compared with data from a fingerprinting database. Fingerprinting-based systems have the advantage of being easy to build, but it can be difficult to maintain good performance because the signal pattern itself is affected by changes in the system environment. To overcome the deficiencies of these fingerprinting-based systems, many alternatives have been proposed, including optical, RFID (Radio Frequency Identification), Bluetooth beacon, ZigBee, and pseudo-satellite approaches, but these alternatives are also evaluated as having difficulty achieving high accuracy in complex indoor environments.

Recently, research on VPS (Visual Positioning System) has been actively conducted as an alternative for implementing highly accurate position estimation even in indoor environments. VPS technology can also be expressed as visual localization technology, which refers to a technology that estimates the current location or pose of a device using images taken indoors or outdoors. Pose estimation of a device (or camera) refers to determining the translation and rotation information of a dynamically changing camera viewpoint. This camera pose estimation technology is being used in a variety of fields such as mixed reality, augmented reality, robot navigation, and 3-dimensional scene reconstruction. In relation to this, Republic of Korea Patent No. 10-2225093 has been issued.

This disclosure was made in response to the above-described background technology, and is intended to efficiently use computing resources in camera pose estimation.

The present disclosure is intended to increase the accuracy of camera pose estimation and reduce the time required for camera pose estimation.

The technical problems of the present disclosure are not limited to the technical problems mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

According to some embodiments of the present disclosure for solving the problems described above, a method performed by a computing device is disclosed. The method may include: receiving, from a client device, a query image and additional information about the device; extracting first feature points for the query image from the query image using an artificial intelligence-based feature point acquisition model; obtaining a first detection result corresponding to a predetermined dynamic object in the query image from the query image using an artificial intelligence-based dynamic object detection model; determining, based on a comparison result between the first detection result and the first feature points, 1-2 feature points to be removed from the query image and 1-1 feature points to be maintained on the query image among the first feature points; and performing visual localization on the device that captured the query image, based at least in part on the 1-1 feature points on the query image and the additional information about the device.

In one embodiment, the additional information about the device may include location information of the device. In one embodiment, if there is a record of estimating the device's past camera pose, the additional information about the device may include preliminary estimation information for the device's camera pose, calculated from values of the device's Inertial Measurement Unit (IMU).

In one embodiment, the step of performing visual localization on the device that captured the query image may include: determining, using the additional information of the device, candidate reference images that are targets of feature point matching among a plurality of reference images in a database; and performing visual localization including pose estimation for the camera of the device based on feature point matching between the determined candidate reference images and the query image.
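The candidate selection described above can be sketched as follows. This is only an illustration under stated assumptions: the device's location information is taken to be a 2D map position in meters, each reference image is assumed to carry its capture position, and the names `ReferenceImage`, `select_candidates`, and `radius_m` are hypothetical and do not appear in the disclosure.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReferenceImage:
    image_id: str
    position: Tuple[float, float]  # assumed capture position on the map, in meters


def select_candidates(
    refs: List[ReferenceImage],
    device_position: Tuple[float, float],
    radius_m: float,
) -> List[ReferenceImage]:
    """Keep only reference images captured within radius_m of the device."""
    return [
        r for r in refs
        if math.dist(r.position, device_position) <= radius_m
    ]


refs = [
    ReferenceImage("a", (0.0, 0.0)),
    ReferenceImage("b", (3.0, 4.0)),
    ReferenceImage("c", (30.0, 40.0)),
]
candidates = select_candidates(refs, device_position=(0.0, 0.0), radius_m=5.0)
```

Feature point matching then only needs to run against `candidates` rather than against every reference image in the database, which is where the saving in computing resources would come from.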

In one embodiment, the additional information of the device further includes camera information of the device, and a pose estimation algorithm used to perform the visual localization may be determined, among a plurality of pose estimation algorithms, based on the camera information of the device.

In one embodiment, the method may further include: recognizing a first text present in the query image from the query image using an artificial intelligence-based text recognition model; and determining, based on a comparison result between the first detection result and the recognized first text, a 1-2 text to be removed from the query image and a 1-1 text to be maintained on the query image among the first text.

In one embodiment, the step of determining the 1-2 text to be removed from the query image and the 1-1 text to be maintained on the query image among the first text may include determining the first text included in the detection area corresponding to the predetermined dynamic object included in the first detection result as the 1-2 text to be removed from the query image.

In one embodiment, the method may further include detecting a predetermined first landmark present in the query image from the query image using an artificial intelligence-based landmark detection model. In one embodiment, the step of performing visual localization on the device that captured the query image may include performing visual localization on the device that captured the query image based at least in part on the 1-1 feature points on the query image, the additional information of the device, and the first landmark.

In one embodiment, the step of performing visual localization on the device that captured the query image may include: determining, using the additional information of the device and the first landmark, candidate reference images that are targets of feature point matching among a plurality of reference images in a database; and performing visual localization including pose estimation for the camera of the device based on feature point matching between the determined candidate reference images and the query image.

In one embodiment, the step of performing visual localization on the device that captured the query image may include performing visual localization on the device that captured the query image based on a reference image stored in a database, the 1-1 feature points on the query image, and the additional information of the device.

In one embodiment, the step of performing visual localization on the device that captured the query image may include: determining relative camera pose information for the query image according to a comparison result between 3D camera coordinates assigned to a pose reference image, which is determined among a plurality of reference images based on matching between the first feature points of the query image and second feature points of the reference images, and the 2D coordinates of the 1-1 feature points of the query image; determining absolute camera pose information of the query image with respect to an origin, based on the 3D camera coordinates assigned to the pose reference image and the relative camera pose information; and performing visual localization on the device that captured the query image based on the absolute camera pose information.
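One common way to combine a reference pose and a relative pose into an absolute pose is to compose homogeneous 4x4 transforms. The sketch below assumes this representation, which the disclosure does not specify: `t_world_ref` is assumed to map reference-camera coordinates to coordinates relative to the origin, and `t_ref_query` is assumed to map query-camera coordinates to reference-camera coordinates. All function and variable names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def absolute_pose(t_world_ref: Matrix, t_ref_query: Matrix) -> Matrix:
    """Compose the reference pose with the relative pose of the query camera."""
    return matmul4(t_world_ref, t_ref_query)


# Reference camera sits at (2, 0, 1) relative to the origin, with no rotation.
t_world_ref = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Query camera is rotated 90 degrees about z and shifted by (1, 0, 0)
# relative to the reference camera.
t_ref_query = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
t_world_query = absolute_pose(t_world_ref, t_ref_query)
```

In this example the last column of `t_world_query` holds the query camera's position with respect to the origin, and the upper-left 3x3 block holds its rotation.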

In one embodiment, extracting first feature points for the query image from the query image may include obtaining pixel coordinates and a descriptor corresponding to the first feature points, and the step of performing visual localization on the device that captured the query image may include performing visual localization on the device that captured the query image by comparing a descriptor corresponding to the first feature points of the query image with a descriptor corresponding to second feature points of a reference image stored in the database.

In one embodiment, performing visual localization on the device that captured the query image may include: determining a pose reference image for visual localization for the device from among a plurality of reference images by comparing descriptors corresponding to the first feature points of the query image with descriptors corresponding to second feature points of the reference images stored in the database; and performing visual localization including camera pose estimation for the device, using 3D coordinate information mapped to the pose reference image, matching information between the first feature points and the second feature points, and pixel coordinates corresponding to the first feature points of the query image.
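A minimal sketch of descriptor comparison and pose reference image selection follows. It assumes descriptors are fixed-length float vectors compared by Euclidean distance, that a match is accepted when the nearest distance is below a threshold, and that the reference image with the most matches becomes the pose reference image. These criteria and the names `match_descriptors`, `pick_pose_reference`, and `max_dist` are illustrative assumptions, not the matching rule of the disclosure.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Descriptor = Sequence[float]
Match = Tuple[int, int]  # (index in query, index in reference)


def match_descriptors(
    query_desc: List[Descriptor],
    ref_desc: List[Descriptor],
    max_dist: float,
) -> List[Match]:
    """Nearest-neighbor matching with a distance threshold."""
    matches: List[Match] = []
    for qi, q in enumerate(query_desc):
        best_j, best_d = None, float("inf")
        for rj, r in enumerate(ref_desc):
            d = math.dist(q, r)
            if d < best_d:
                best_j, best_d = rj, d
        if best_j is not None and best_d <= max_dist:
            matches.append((qi, best_j))
    return matches


def pick_pose_reference(
    query_desc: List[Descriptor],
    refs: Dict[str, List[Descriptor]],
    max_dist: float,
) -> Tuple[Optional[str], List[Match]]:
    """Return the reference image id with the most matches, plus those matches."""
    best_id: Optional[str] = None
    best_matches: List[Match] = []
    for image_id, ref_desc in refs.items():
        m = match_descriptors(query_desc, ref_desc, max_dist)
        if len(m) > len(best_matches):
            best_id, best_matches = image_id, m
    return best_id, best_matches


query = [(0.0, 0.0), (1.0, 1.0)]
refs = {
    "ref_a": [(5.0, 5.0)],
    "ref_b": [(1.0, 1.1), (0.1, 0.0)],
}
best_id, matches = pick_pose_reference(query, refs, max_dist=0.5)
```

Each returned pair links a query feature point (with known pixel coordinates) to a reference feature point (with mapped 3D coordinates). Such 2D-3D correspondences are the typical input to a perspective-n-point style pose solver, although the disclosure does not name a specific solver.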

In one embodiment, the method may further include: extracting second feature points for the reference image from the reference image using the feature point acquisition model; obtaining a second detection result corresponding to a predetermined dynamic object in the reference image from the reference image using the dynamic object detection model; determining, based on a comparison result between the second detection result and the second feature points, 2-2 feature points to be removed from the reference image and 2-1 feature points to be maintained on the reference image among the second feature points; and obtaining 3D camera coordinates corresponding to the 2-1 feature points from a predefined 3D map.

In one embodiment, the step of determining, based on a comparison result between the first detection result and the first feature points, 1-2 feature points to be removed from the query image and 1-1 feature points to be maintained on the query image among the first feature points may include: determining first feature points included in a detection area corresponding to the predetermined dynamic object included in the first detection result as the 1-2 feature points to be removed from the query image; and determining first feature points not included in the detection area corresponding to the predetermined dynamic object included in the first detection result as the 1-1 feature points to be maintained on the query image.
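As a minimal sketch of this filtering step, assume the first detection result is a list of axis-aligned bounding boxes in pixel coordinates and each first feature point is an (x, y) pixel coordinate. The disclosure does not fix the shape of the detection area, so the box format and the name `split_feature_points` are illustrative assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def in_box(point: Point, box: Box) -> bool:
    """True if the point lies inside the box, borders included."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def split_feature_points(
    points: List[Point], dynamic_boxes: List[Box]
) -> Tuple[List[Point], List[Point]]:
    """Split points into (1-1 points to maintain, 1-2 points to remove)."""
    kept: List[Point] = []
    removed: List[Point] = []
    for p in points:
        if any(in_box(p, b) for b in dynamic_boxes):
            removed.append(p)  # lies on a detected dynamic object
        else:
            kept.append(p)
    return kept, removed


points = [(10.0, 10.0), (50.0, 50.0), (120.0, 80.0)]
dynamic_boxes = [(40.0, 40.0, 60.0, 60.0)]  # e.g. a detected pedestrian
kept, removed = split_feature_points(points, dynamic_boxes)
```

Only `kept` would be passed on to matching and pose estimation, so feature points on objects that may have moved since the reference images were taken do not influence the result.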

In one embodiment, a computer program stored on a computer-readable storage medium is disclosed. The computer program, when executed by a computing device, causes the computing device to perform the following operations, which include: receiving, from a client's device, a query image and additional information of the device; extracting first feature points for the query image from the query image using an artificial intelligence-based feature point acquisition model; obtaining a first detection result corresponding to a predetermined dynamic object within the query image from the query image using an artificial intelligence-based dynamic object detection model; determining, based on a comparison result between the first detection result and the first feature points, 1-2 feature points to be removed from the query image and 1-1 feature points to be maintained on the query image among the first feature points; and performing visual localization on the device that captured the query image, based at least in part on the 1-1 feature points on the query image and the additional information of the device.

In one embodiment, a computing device is disclosed. The computing device includes at least one processor and a memory. The at least one processor may perform: an operation of receiving, from a client device, a query image and additional information about the device; an operation of extracting first feature points for the query image from the query image using an artificial intelligence-based feature point acquisition model; an operation of obtaining a first detection result corresponding to a predetermined dynamic object within the query image from the query image using an artificial intelligence-based dynamic object detection model; an operation of determining, based on a comparison result between the first detection result and the first feature points, 1-2 feature points to be removed from the query image and 1-1 feature points to be maintained on the query image among the first feature points; and an operation of performing visual localization on the device that captured the query image based at least in part on the 1-1 feature points on the query image and the additional information of the device.

Computing resources can be used efficiently in camera pose estimation according to an embodiment of the present disclosure.

One embodiment of the present disclosure can increase the accuracy of camera pose estimation and reduce the time required for camera pose estimation.

The effects that can be obtained from the present disclosure are not limited to the effects mentioned above, and other effects not mentioned can be clearly understood by those skilled in the art from the description below.

Various aspects will now be described with reference to the drawings, where like reference numerals are used to collectively refer to like elements. In the following examples, for purposes of explanation, numerous specific details are set forth to provide a comprehensive understanding of one or more aspects. However, it will be clear that such aspect(s) may be practiced without these specific details.

FIG. 1 schematically shows a block diagram of a computing device according to an embodiment of the present disclosure.

Figure 2 is a schematic diagram illustrating a network function according to an embodiment of the present disclosure.

Figure 3 exemplarily shows a method of performing visual localization on a device that captured a query image according to an embodiment of the present disclosure.

FIG. 4 exemplarily shows a method of storing metadata mapped to a reference image to perform visual localization according to an embodiment of the present disclosure.

Figure 5 exemplarily illustrates the operation of a feature point acquisition model according to an embodiment of the present disclosure.

Figure 6 exemplarily illustrates the operation of a dynamic object detection model and a landmark detection model according to an embodiment of the present disclosure.

Figure 7 exemplarily illustrates the operation of a text recognition model according to an embodiment of the present disclosure.

FIG. 8 exemplarily illustrates the flow of image processing to efficiently perform visual localization according to an embodiment of the present disclosure.

FIG. 9 exemplarily illustrates the flow of image processing to efficiently perform visual localization according to an embodiment of the present disclosure.

Figure 10 exemplarily illustrates the operation of models for performing visual localization according to an embodiment of the present disclosure.

FIG. 11 shows a brief, general schematic diagram of an example computing environment in which embodiments of the present disclosure may be implemented.

Various embodiments are now described with reference to the drawings. In this specification, various descriptions are presented to provide an understanding of the disclosure. However, it is clear that these embodiments may be practiced without these specific descriptions.

As used herein, the terms “component,” “module,” “system,” and the like refer to a computer-related entity, hardware, firmware, software, a combination of software and hardware, or an implementation of software. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, a thread of execution, a program, and/or a computer. For example, both an application running on a computing device and the computing device can be a component. One or more components may reside within a processor and/or thread of execution. A component may be localized within one computer. A component may be distributed between two or more computers. Additionally, these components can execute from various computer-readable media having various data structures stored thereon. Components can transmit signals, for example, with one or more data packets (e.g., data and/or signals from one component interacting with other components in a local system, a distributed system, to other systems and over a network such as the Internet). Depending on the data being transmitted, they may communicate through local and/or remote processes.

Additionally, the term “or” is intended to mean an inclusive “or” and not an exclusive “or.” That is, unless otherwise specified or clear from context, “X uses A or B” is intended to mean any of the natural inclusive permutations. That is, if X uses A, X uses B, or X uses both A and B, then “X uses A or B” applies to any of these cases. Additionally, the term “and/or” as used herein should be understood to refer to and include all possible combinations of one or more of the related listed items.

Additionally, the terms “comprise” and/or “comprising” should be understood to mean that the corresponding feature and/or element is present. However, the terms “comprise” and/or “comprising” should be understood as not excluding the presence or addition of one or more other features, elements and/or groups thereof. Additionally, unless otherwise specified or the context is clear to indicate a singular form, the singular terms herein and in the claims should generally be construed to mean “one or more.”

And, the term “at least one of A or B” should be interpreted to mean “a case containing only A,” “a case containing only B,” and “a case of combining A and B.”

Those skilled in the art will additionally recognize that the various illustrative logical blocks, components, modules, circuits, means, logic, and algorithm steps described in connection with the embodiments disclosed herein may be implemented using electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, blocks, configurations, means, logics, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented in hardware or software will depend on the specific application and design constraints imposed on the overall system. A skilled technician can implement the described functionality in a variety of ways for each specific application. However, such implementation decisions should not be construed as causing a departure from the scope of the present disclosure.

The description of the presented embodiments is provided to enable anyone skilled in the art to use or practice the present invention. Various modifications to these embodiments will be apparent to those skilled in the art. The general principles defined herein may be applied to other embodiments without departing from the scope of the disclosure. Therefore, the present invention is not limited to the embodiments presented herein. The present invention is to be interpreted in the broadest scope consistent with the principles and novel features presented herein.

In the present disclosure, terms represented by N, such as first, second, or third, are used to distinguish a plurality of entities. For example, the entities expressed as first and second may be the same or different from each other. Terms expressed as 1-1, 1-2, and 2-1, 2-2 may also be used to distinguish a plurality of entities.

In the present disclosure, visual localization refers to a technology for estimating the current location or pose of a device using images taken indoors or outdoors. Pose estimation of a device (or camera) refers to determining the translation and rotation information of a dynamically changing camera viewpoint.

The query image in the present disclosure includes an image captured by a device, and the pose of the device corresponding to the query image can be determined or estimated using the query image and a reference image stored in a computing device (e.g., a server).

A reference image in the present disclosure may include a pre-stored image used for pose estimation of the device that captured the query image. The reference image may include actual images taken in a specific area to implement, for example, augmented reality or virtual reality. These reference images may be stored in computing device 100. Metadata mapped to the reference image can also be used to estimate the pose of the device.

Metadata in the present disclosure may refer to additional information used in performing visual localization. For example, metadata may include additional information other than the image, such as information related to the device, pixel coordinate information of feature points, descriptors of feature points, landmark information obtained from the image, and character information obtained from the image. According to an embodiment of the present disclosure, by using various forms of metadata, comparison between query images and reference images can be made more smoothly, and further, camera pose estimation can be performed accurately while using fewer computing resources. For example, by utilizing metadata of the query image and/or reference image, the candidate group of reference images to be matched with the query image may be reduced. As another example, by utilizing metadata of the query image and/or reference image, a more suitable camera pose estimation method may be used.
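To make the kinds of metadata listed above concrete, the following sketch groups them into one record and uses landmark and character information to reduce the candidate group. The field names, the `ImageMetadata` type, and the rule "keep references that share at least one landmark or text with the query" are assumptions chosen for illustration; the disclosure does not prescribe a storage layout or a reduction rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ImageMetadata:
    """Additional, non-image information attached to a query or reference image."""
    device_info: Dict[str, str] = field(default_factory=dict)
    pixel_coords: List[Tuple[float, float]] = field(default_factory=list)
    descriptors: List[List[float]] = field(default_factory=list)
    landmarks: List[str] = field(default_factory=list)
    texts: List[str] = field(default_factory=list)


def reduce_candidates(
    query: ImageMetadata, references: Dict[str, ImageMetadata]
) -> Dict[str, ImageMetadata]:
    """Keep references sharing a landmark or text with the query.

    If the query has no landmark or text, all references are kept.
    """
    query_tags = set(query.landmarks) | set(query.texts)
    if not query_tags:
        return dict(references)
    return {
        ref_id: meta
        for ref_id, meta in references.items()
        if query_tags & (set(meta.landmarks) | set(meta.texts))
    }


query_meta = ImageMetadata(landmarks=["gate_3"], texts=["EXIT"])
reference_meta = {
    "r1": ImageMetadata(landmarks=["gate_3"]),
    "r2": ImageMetadata(texts=["CAFE"]),
    "r3": ImageMetadata(texts=["EXIT"]),
}
reduced = reduce_candidates(query_meta, reference_meta)
```

Here only `r1` and `r3` remain as matching candidates, so descriptor comparison would be skipped for `r2`.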

FIG. 1 schematically shows a block diagram of a computing device 100 according to an embodiment of the present disclosure.

Computing device 100 according to an embodiment of the present disclosure may include a processor 110 and a memory 130.

The configuration of the computing device 100 shown in FIG. 1 is only a simplified example. In one embodiment of the present disclosure, the computing device 100 may include different configurations for performing the computing environment of the computing device 100, and only some of the disclosed configurations may configure the computing device 100.

The computing device 100 in the present disclosure may refer to any type of node constituting a system for implementing embodiments of the present disclosure. Computing device 100 may refer to any type of user terminal or any type of server. The components of the computing device 100 described above are exemplary and some may be excluded or additional components may be included. For example, when the above-described computing device 100 includes a user terminal, an output unit (not shown) and an input unit (not shown) may be included within the scope of the computing device 100.

The client's device in the present disclosure may refer to a device for generating a query image. A query image may be included in information captured by the client device. The query image captured by the client device may be used for comparison with a reference image within a computing device (eg, server), and thus the pose of the client device at the time of capture may be determined.

Hereinafter, for convenience of explanation, the client device and the computing device 100 are distinguished, and a visual localization methodology in which the computing device 100 receives captured information from the client device and estimates the pose of the client device will be explained. However, embodiments that perform visual localization on a client device that generates a query image may also be included within the scope of the present disclosure. In this embodiment, the client device may function as the computing device 100.

The computing device 100 in the present disclosure may perform technical features according to embodiments of the present disclosure, which will be described later. For example, the computing device 100 may extract feature points for the query image from the query image using a feature point acquisition model, obtain a detection result from the query image using a dynamic object detection model, compare the detection result with the feature points to determine feature points to be removed from the query image and feature points to be maintained on the query image, and perform visual localization on the device that captured the query image. As another example, the computing device 100 may acquire a reference image and camera pose information including the position and posture of the camera that captured the reference image, extract feature points for the reference image from the reference image using the feature point acquisition model, obtain a detection result corresponding to a predetermined dynamic object in the reference image from the reference image using the dynamic object detection model, determine, based on a comparison result between the detection result and the feature points, feature points to be removed from the reference image and feature points to be maintained on the reference image among the feature points, and store coordinate information corresponding to the feature points to be maintained, descriptor information of the feature points to be maintained, and the camera pose information as metadata mapped to the reference image.

In one embodiment, the processor 110 may consist of at least one core, and may include a processor for data analysis and/or processing, such as a central processing unit (CPU), a general purpose graphics processing unit (GPGPU), or a tensor processing unit (TPU) of the computing device 100.

The processor 110 may read a computer program stored in the memory 130 and perform visual localization methodologies according to an embodiment of the present disclosure.

According to an embodiment of the present disclosure, the processor 110 may perform operations for learning a neural network. The processor 110 may perform calculations for learning neural networks, such as processing input data for learning in deep learning (DL), extracting features from input data, calculating errors, and updating the weights of the neural network using backpropagation. At least one of the CPU, GPGPU, and TPU of the processor 110 may process learning of the network function. For example, the CPU and GPGPU can work together to process learning of network functions and data classification using network functions. Additionally, in one embodiment of the present disclosure, processors of a plurality of computing devices may be used together to process learning of a network function and data classification using a network function. Additionally, a computer program executed in a computing device according to an embodiment of the present disclosure may be a CPU, GPGPU, or TPU executable program. Operations for a neural network according to an embodiment of the present disclosure will be described later with reference to FIG. 2.

Additionally, the processor 110 may typically handle overall operations of the computing device 100. For example, the processor 110 may provide or process information or functions appropriate to the user by processing data, information, or signals input or output through components included in the computing device 100, or by running an application program stored in the storage.

According to one embodiment of the present disclosure, the memory 130 may store any type of information generated or determined by the processor 110 and any type of information received by the computing device 100. According to an embodiment of the present disclosure, the memory 130 may be a storage medium that stores computer software that allows the processor 110 to perform operations according to embodiments of the present disclosure. Accordingly, the memory 130 may refer to computer readable media for storing software codes required to perform embodiments of the present disclosure, data to be executed by the codes, and execution results of the codes.

According to one embodiment of the present disclosure, the memory 130 may refer to any type of storage medium. For example, the memory 130 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, card type memory (e.g., SD or XD memory), RAM (Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), magnetic memory, magnetic disk, and optical disk. The computing device 100 may operate in connection with web storage that performs the storage function of the memory 130 on the Internet. The description of the memory described above is only an example, and the memory 130 used in the present disclosure is not limited to the examples described above.

The communication unit (not shown) in the present disclosure can be configured regardless of the communication mode, such as wired or wireless, and can be configured with various communication networks such as a personal area network (PAN) and a wide area network (WAN). In addition, the network unit 150 can operate based on the well-known World Wide Web (WWW), and can also use a wireless transmission technology used for short-distance communication such as Infrared Data Association (IrDA) or Bluetooth.

Computing device 100 in the present disclosure may include any type of user terminal and/or any type of server. Accordingly, embodiments of the present disclosure may be performed by a server and/or a user terminal.

A user terminal may include any type of terminal capable of interacting with a server or other computing device. User terminals may include, for example, mobile phones, smart phones, laptop computers, personal digital assistants (PDAs), slate PCs, tablet PCs, and ultrabooks.

The client's device in the present disclosure may include the user terminal described above. The client's device may include modules for generating a depth map. Accordingly, the device may generate an image and/or depth map and transmit it to the computing device 100. In one embodiment, the client's device may refer to any type of equipment for detecting an optical image, converting it into an electrical signal, and transmitting it to the computing device 100. For example, the client's device may include at least one of a camera, scanner, LiDAR, and/or vision sensor. The computing device 100 may include such a device or may be linked to an external device by wire or wirelessly.

Servers may include any type of computing system or computing device, such as, for example, microprocessors, mainframe computers, digital processors, portable devices, and device controllers.

In an additional embodiment, the above-described server may include a storage unit (not shown) that stores and manages a reference image, metadata corresponding to the reference image, 3D map, etc. This storage may be included within the server or may exist under the management of the server. As another example, the storage unit may be implemented in a form that exists outside the server and can communicate with the server. In this case, the storage may be managed and controlled by an external server that is different from the server.

The computing device 100 according to some embodiments of the present disclosure may perform acquisition of feature points, detection of landmarks, detection of dynamic objects, and/or text recognition using various artificial intelligence-based models. For example, the computing device 100 may acquire feature points corresponding to the query image using a feature point acquisition model, which is an artificial intelligence-based model.

In one embodiment, the feature point acquisition model may include a neural network built through deep learning or machine learning. The feature point acquisition model may acquire feature points from a query image and/or depth map included in the input data.

For example, the query image may include at least one red-green-blue (RGB) image and/or at least one grayscale image.

For example, a depth map may include an image or information indicating the relative distances of each pixel within a specific image. Accordingly, the depth map may include information related to the distance from the location where the query image is taken to the surface of the subject.

Figure 2 is a schematic diagram showing a network function according to an embodiment of the present disclosure.

In the present disclosure, at least one of a feature point acquisition model, a dynamic object detection model, a landmark detection model, and/or a text recognition model may correspond to an artificial intelligence-based model.

Throughout this disclosure, the terms artificial intelligence-based model, model, computational model, neural network, and network function may be used interchangeably. A neural network can generally consist of a set of interconnected computational units, which can be referred to as nodes. These nodes may also be referred to as neurons. A neural network consists of at least one node. Nodes (or neurons) that make up neural networks may be interconnected by one or more links.

Within a neural network, one or more nodes connected through a link may form a relative input node and output node relationship. The concepts of input node and output node are relative, and any node in an output node relationship with one node may be in an input node relationship with another node, and vice versa. As described above, input node to output node relationships can be created around links. One or more output nodes can be connected to one input node through a link, and vice versa.

In a relationship between an input node and an output node connected through one link, the value of the data of the output node may be determined based on the data input to the input node. Here, the link connecting the input node and the output node may have a weight. Weights may be variable and may be varied by the user or an algorithm in order for the neural network to perform the desired function. For example, when one or more input nodes are connected to one output node by respective links, the value of the output node can be determined based on the values input to the input nodes connected to the output node and the weights set on the links corresponding to each input node.
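As an illustration of how an output node value may follow from input values and link weights, the following is a minimal sketch. The function name `node_output`, the bias term, and the tanh activation are assumptions made for the example; the disclosure itself only states that the output depends on the inputs and the weights.

```python
import math

def node_output(inputs, weights, bias=0.0, activation=math.tanh):
    """Output-node value from input-node values and link weights.

    The bias term and the tanh activation are illustrative assumptions.
    """
    if len(inputs) != len(weights):
        raise ValueError("each input node needs exactly one link weight")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum + bias)

# Three input nodes linked to one output node.
value = node_output([1.0, 2.0, -1.0], [0.5, 0.25, 1.0])
```

Changing any link weight changes `value`, which is the sense in which weights are varied for the network to perform a desired function.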

As described above, in a neural network, one or more nodes are interconnected through one or more links to form an input node and output node relationship within the neural network. The characteristics of the neural network can be determined according to the number of nodes and links within the neural network, the correlation between the nodes and links, and the value of the weight assigned to each link. For example, two neural networks that have the same number of nodes and links but different link weight values may be recognized as different from each other.

A neural network may consist of a set of one or more nodes. A subset of the nodes that make up a neural network can form a layer. Some of the nodes constituting the neural network may form one layer based on their distances from the initial input node. For example, the set of nodes at a distance n from the initial input node may constitute the n-th layer. The distance from the initial input node can be defined by the minimum number of links that must be passed to reach the node from the initial input node. However, this definition of a layer is arbitrary and given for explanation purposes, and the order of a layer within a neural network may be defined in a different way than described above. For example, a layer of nodes may be defined by distance from the final output node.
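The distance-based definition of a layer can be sketched with a breadth-first search. This is only an illustration under assumptions: the `links` dictionary format, the node names, and the function `layers_by_distance` are hypothetical and not part of the disclosure.

```python
from collections import deque

def layers_by_distance(links, initial_inputs):
    """Group nodes by the minimum number of links from any initial input node.

    `links` maps each node to the nodes it feeds (input node -> output nodes).
    """
    distance = {node: 0 for node in initial_inputs}
    queue = deque(initial_inputs)
    while queue:
        node = queue.popleft()
        for successor in links.get(node, []):
            if successor not in distance:  # first visit is the shortest path
                distance[successor] = distance[node] + 1
                queue.append(successor)
    layers = {}
    for node, d in distance.items():
        layers.setdefault(d, []).append(node)
    return [sorted(layers[d]) for d in sorted(layers)]

links = {"x1": ["h1", "h2"], "x2": ["h1", "h2"], "h1": ["y"], "h2": ["y"]}
layers = layers_by_distance(links, ["x1", "x2"])
```

Here the initial input nodes sit at distance 0, the hidden nodes at distance 1, and the final output node at distance 2.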

The initial input node may refer to one or more nodes in the neural network into which data is directly input without going through links in relationships with other nodes. Alternatively, in the relationships between nodes based on links within a neural network, it may mean nodes that do not have other input nodes connected by links. Similarly, the final output node may refer to one or more nodes that do not have an output node in their relationship with other nodes among the nodes in the neural network. Additionally, hidden nodes may refer to nodes constituting a neural network other than the initial input node and the final output node.

The neural network according to an embodiment of the present disclosure may be a neural network in which the number of nodes in the input layer is the same as the number of nodes in the output layer, and in which the number of nodes decreases and then increases again as it progresses from the input layer to the hidden layer. In addition, the neural network according to another embodiment of the present disclosure may be a neural network in which the number of nodes in the input layer is less than the number of nodes in the output layer, and in which the number of nodes decreases as it progresses from the input layer to the hidden layer. In addition, the neural network according to another embodiment of the present disclosure may be a neural network in which the number of nodes in the input layer is greater than the number of nodes in the output layer, and in which the number of nodes increases as it progresses from the input layer to the hidden layer. A neural network according to another embodiment of the present disclosure may be a neural network that is a combination of the above-described neural networks.

A deep neural network (DNN) may refer to a neural network that includes multiple hidden layers in addition to the input layer and output layer. Deep neural networks can identify latent structures in data. In other words, they can identify the latent structure of a photo, text, video, voice, or music (e.g., what object is in the photo, what the content and emotion of the text are, what the content and emotion of the voice are, etc.). Deep neural networks may include a convolutional neural network (CNN), recurrent neural network (RNN), auto encoder, restricted Boltzmann machine (RBM), deep belief network (DBN), Q network, U network, Siamese network, generative adversarial network (GAN), etc. The description of the deep neural network above is only an example and the present disclosure is not limited thereto.

A neural network may be trained in at least one of supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning. Learning of a neural network may be a process of applying knowledge for the neural network to perform a specific operation to the neural network.

Neural networks can be trained to minimize output errors. In neural network training, training data is repeatedly input into the neural network, the error between the output of the neural network and the target for the training data is calculated, and the error is backpropagated from the output layer toward the input layer in the direction that reduces the error, thereby updating the weight of each node in the neural network. In supervised learning, training data in which each item is labeled with the correct answer (i.e., labeled training data) is used, whereas in unsupervised learning the correct answer may not be labeled in each item of training data. For example, in supervised learning for data classification, the training data may be data in which each item is labeled with a category. Labeled training data is input to the neural network, and the error can be calculated by comparing the output (category) of the neural network with the label of the training data. As another example, in unsupervised learning for data classification, the error can be calculated by comparing the input training data with the neural network output. The calculated error is backpropagated in the reverse direction (i.e., from the output layer to the input layer) in the neural network, and the connection weight of each node in each layer of the neural network can be updated according to backpropagation. The amount of change in each updated connection weight may be determined according to the learning rate. The neural network's calculation on input data and the backpropagation of errors can constitute a learning cycle (epoch). The learning rate may be applied differently depending on the number of repetitions of the learning cycle of the neural network. For example, in the early stages of neural network training, a high learning rate can be used to increase efficiency by allowing the neural network to quickly achieve a certain level of performance, and in the later stages of training, a low learning rate can be used to increase accuracy.
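The interplay between error, weight update, and a learning rate that shrinks over learning cycles can be sketched on a toy model. Everything here is an assumption for illustration: the one-weight model y = w * x (for which backpropagation reduces to a single derivative), the step-decay schedule, and the sample data.

```python
def learning_rate(epoch, initial=0.1, decay=0.5, step=10):
    """Step decay: multiply the rate by `decay` every `step` epochs (assumed schedule)."""
    return initial * (decay ** (epoch // step))

def train(samples, epochs=30):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    for epoch in range(epochs):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        # Early epochs take large steps, later epochs take small ones.
        w -= learning_rate(epoch) * grad
    return w

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # labeled data: target is y = 2x
w = train(samples)
```

With these labeled samples, `w` approaches 2, the value that makes the output error zero.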

In neural network training, the training data is generally a subset of real data (i.e., the data to be processed using the trained neural network), and thus there may be learning cycles in which the error on the training data decreases while the error on the real data increases. Overfitting is a phenomenon in which the error on real data increases due to excessive learning of the training data. For example, a phenomenon in which a neural network that learned what a cat is by being shown yellow cats fails to recognize a non-yellow cat as a cat may be a type of overfitting. Overfitting can cause errors in machine learning algorithms to increase. To prevent such overfitting, various optimization methods can be used, such as increasing the training data, regularization, dropout, which disables some of the network nodes during training, and the use of a batch normalization layer.
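Of the methods listed, dropout is simple to sketch. The following assumes the common "inverted dropout" variant, in which surviving activations are rescaled during training; the disclosure does not specify a variant, and the function name and values are hypothetical.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each activation with probability `drop_prob`
    during training and rescale survivors so the expected value is unchanged."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)  # inference: nodes are never disabled
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)  # seeded only to make the sketch repeatable
hidden = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(hidden, 0.5, rng)
```

Each entry of `dropped` is either 0.0 (node disabled) or twice the original activation (node kept and rescaled by 1 / 0.5).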

A computer-readable medium storing a data structure according to an embodiment of the present disclosure is disclosed.

Data structure can refer to the organization, management, and storage of data to enable efficient access and modification of data. Data structure can refer to the organization of data to solve a specific problem (e.g., retrieving data, storing data, or modifying data in the shortest possible time). A data structure may be defined as a physical or logical relationship between data elements designed to support a specific data processing function. Logical relationships between data elements may include connection relationships between user-defined data elements. Physical relationships between data elements may include actual relationships between data elements that are physically stored in a computer-readable storage medium (e.g., a persistent storage device). A data structure may specifically include a set of data, relationships between data, and functions or instructions applicable to the data. Effectively designed data structures allow computing devices to perform computations while minimizing the use of the computing device's resources. Specifically, computing devices can increase the efficiency of operations, reading, insertion, deletion, comparison, exchange, and search through effectively designed data structures.

Data structures can be divided into linear data structures and non-linear data structures depending on their type. A linear data structure may be a structure in which only one piece of data is connected after one piece of data. Linear data structures may include a list, stack, queue, and deque. A list can refer to a set of data that has an internal order. The list may include a linked list. A linked list may be a data structure in which each piece of data has a pointer and the data are connected in a single line. In a linked list, a pointer may contain connection information to the next or previous data. Depending on its form, a linked list can be expressed as a singly linked list, a doubly linked list, or a circular linked list. A stack may be a data listing structure that allows limited access to data. A stack can be a linear data structure in which data can be processed (for example, inserted or deleted) at only one end of the data structure. Data stored in a stack may follow a last-in, first-out (LIFO) order, in which data that enters later comes out sooner. A queue is also a data listing structure that allows limited access to data. Unlike the stack, it can follow a first-in, first-out (FIFO) order, in which data stored later comes out later. A deque can be a data structure that can process data at both ends of the data structure.
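The access rules of the stack, queue, and deque described above can be sketched with Python's built-in `list` and `collections.deque`. The stored values are arbitrary examples.

```python
from collections import deque

# Stack (LIFO): insert and delete at one end only.
stack = []
for item in ("a", "b", "c"):
    stack.append(item)
last_in = stack.pop()          # the most recently inserted item

# Queue (FIFO): insert at the back, delete at the front.
queue = deque()
for item in ("a", "b", "c"):
    queue.append(item)
first_in = queue.popleft()     # the earliest inserted item

# Deque: both ends accept insertion and deletion.
dq = deque(["b"])
dq.appendleft("a")
dq.append("c")
ends = (dq.popleft(), dq.pop())
```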

A non-linear data structure may be a structure in which multiple pieces of data are connected behind one piece of data. Nonlinear data structures may include graph data structures. A graph data structure can be defined by vertices and edges, and an edge can include a line connecting two different vertices. Graph data structure may include a tree data structure. A tree data structure may be a data structure in which there is only one path connecting two different vertices among a plurality of vertices included in the tree. In other words, it may be a data structure that does not form a loop in the graph data structure.
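The tree property stated above (exactly one path between any two vertices, hence no loop) can be checked as sketched below. The check relies on the standard fact that an undirected graph is a tree when it is connected and has one fewer edge than vertices; the function name and the example graphs are hypothetical.

```python
def is_tree(vertices, edges):
    """Return True if the undirected graph is connected and has no loop."""
    vertices = list(vertices)
    if not vertices or len(edges) != len(vertices) - 1:
        return False
    neighbours = {v: [] for v in vertices}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    # Depth-first traversal from an arbitrary vertex to test connectivity.
    seen, stack = {vertices[0]}, [vertices[0]]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(vertices)

tree_ok = is_tree("ABCD", [("A", "B"), ("A", "C"), ("C", "D")])
has_loop = is_tree("ABCD", [("A", "B"), ("B", "C"), ("C", "A")])
```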

Throughout this specification, computational model, neural network, and network function may be used interchangeably. Below, they are described in a unified manner as a neural network. Data structures may include neural networks. The data structure including the neural network may be stored in a computer-readable medium. A data structure including a neural network may also include data preprocessed for processing by the neural network, data input to the neural network, weights of the neural network, hyperparameters of the neural network, data acquired from the neural network, activation functions associated with each node or layer of the neural network, and a loss function for training the neural network. A data structure containing a neural network may include any of the components disclosed above. In other words, the data structure including the neural network may be configured to include all or any combination of these components. In addition to the configurations described above, a data structure containing a neural network may include any other information that determines the characteristics of the neural network. Additionally, the data structure may include all types of data used or generated in the computational process of a neural network and is not limited to the above. Computer-readable media may include computer-readable recording media and/or computer-readable transmission media. A neural network can generally consist of a set of interconnected computational units, which can be referred to as nodes. These nodes may also be referred to as neurons. A neural network consists of at least one node.

The data structure may include data input to the neural network. A data structure containing data input to a neural network may be stored in a computer-readable medium. Data input to the neural network may include learning data input during the neural network learning process and/or input data input to the neural network on which training has been completed. Data input to the neural network may include data that has undergone pre-processing and/or data subject to pre-processing. Preprocessing may include a data processing process to input data into a neural network. Therefore, the data structure may include data subject to preprocessing and data generated by preprocessing. The above-described data structure is only an example and the present disclosure is not limited thereto.

The data structure may include the weights of the neural network. (In this specification, weights and parameters may be used with the same meaning.) The data structure including the weights of the neural network may be stored in a computer-readable medium. A neural network may include multiple weights. Weights may be variable and may be varied by the user or an algorithm in order for the neural network to perform the desired function. For example, when one or more input nodes are connected to one output node by respective links, the data value output from the output node can be determined based on the values input to the input nodes connected to the output node and the weights set on the links corresponding to each input node. The above-described data structure is only an example and the present disclosure is not limited thereto.

As an example and not a limitation, the weights may include weights that are changed during the neural network learning process and/or weights for which neural network learning has been completed. Weights that change during the neural network learning process may include the weights at the time a learning cycle starts and/or weights that change during the learning cycle. Weights for which neural network training has been completed may include the weights at the time the learning cycle has been completed. Therefore, the data structure including the weights of the neural network may include a data structure including weights that are changed during the neural network learning process and/or weights for which neural network learning has been completed. Accordingly, the above-mentioned weights and/or any combination of them may be included in the data structure including the weights of the neural network. The above-described data structure is only an example and the present disclosure is not limited thereto.

The data structure including the weights of the neural network may be stored in a computer-readable storage medium (e.g., memory, hard disk) after going through a serialization process. Serialization can be the process of converting a data structure into a form that can be stored on the same or a different computing device and later reconstructed and used. Computing devices can transmit and receive data over a network by serializing data structures. A serialized data structure containing the weights of a neural network can be reconstructed on the same computing device or on a different computing device through deserialization. The data structure including the weights of the neural network is not limited to serialization. Furthermore, the data structure including the weights of the neural network may include a data structure (e.g., among non-linear data structures, a B-Tree, Trie, m-way search tree, AVL tree, or Red-Black Tree) for increasing computational efficiency while minimizing the use of computing device resources. The foregoing is merely an example and the present disclosure is not limited thereto.
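A serialization round trip for weights can be sketched as follows. JSON is used purely as an assumed example format (the disclosure does not name one), and the layer names and weight values are hypothetical.

```python
import json

def serialize_weights(weights):
    """Convert a layer-name -> weight-matrix mapping into a JSON string."""
    return json.dumps(weights, sort_keys=True)

def deserialize_weights(payload):
    """Rebuild the mapping, possibly on a different computing device."""
    return json.loads(payload)

weights = {"hidden": [[0.5, -0.25], [1.0, 0.0]], "output": [[0.75, -1.5]]}
payload = serialize_weights(weights)      # storable or transmittable text
restored = deserialize_weights(payload)   # same structure as `weights`
```

The string `payload` can be written to a storage medium or sent over a network, and `restored` compares equal to the original mapping.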

The data structure may include hyperparameters of a neural network. The data structure including the hyperparameters of the neural network can be stored in a computer-readable medium. A hyperparameter may be a variable that can be changed by the user. Hyperparameters may include, for example, learning rate, cost function, number of learning cycle repetitions, weight initialization (e.g., setting the range of weight values subject to weight initialization), and number of hidden units (e.g., number of hidden layers, number of nodes in hidden layers). The above-described data structure is only an example and the present disclosure is not limited thereto.
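One possible way to hold such hyperparameters in a single data structure is sketched below. The class name, field names, and default values are all assumptions chosen to mirror the examples in the text.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Hyperparameters:
    """User-adjustable settings; every default here is an illustrative value."""
    learning_rate: float = 1e-3
    cost_function: str = "cross_entropy"
    epochs: int = 50                       # number of learning cycle repetitions
    weight_init_range: tuple = (-0.1, 0.1)
    hidden_layers: int = 2
    nodes_per_hidden_layer: int = 128

config = Hyperparameters(learning_rate=0.01, epochs=20)
as_dict = asdict(config)  # plain dict, convenient for storage on a medium
```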

The flowchart shown in FIG. 3 may be performed, for example, by computing device 100.

In one embodiment, computing device 100 may receive a query image and additional information about the device from a client's device (310).

Computing device 100 may provide an augmented reality and/or virtual reality platform. The computing device 100 may allow the client to enjoy activities on an augmented reality and/or virtual reality platform by obtaining captured data from the client's device and estimating the current location and direction of the client device.

In one embodiment, the query image may include a two-dimensional RGB image or a grayscale image. In one embodiment, the query image may include a two-dimensional RGB image or grayscale image and a depth map. Additional information about the device may include location information of the device. Location information may include, for example, coordinate values and/or direction values in a geodetic datum or geodetic system (also called a geodetic reference datum, geodetic reference system, or geodetic reference frame), in a spatial reference system (SRS) or coordinate reference system (CRS) used in a specific region or country, or in a geographic coordinate system (GCS). Other examples of location information may include GPS.

Additionally, the additional information of the device may include hardware-related information of the device such as focal length, principal point, and resolution. In addition, if there is a record of estimating the past camera pose of the device, the additional information of the device may include preliminary estimation information about the camera pose of the device calculated from the values of the device's IMU (Inertial Measurement Unit). The preliminary estimation information about the camera pose of the device may include an approximate camera pose estimated using IMU information, and this preliminary estimation information may include approximate location information and direction information about the device. When this preliminary estimation information is used, the candidate group of reference images to be matched with the query image may be reduced.

In one embodiment of the present disclosure, the computing device 100 may use the additional information of the client's device to determine candidate reference images that are subject to feature point matching among a plurality of reference images in the database, and may perform visual localization, including pose estimation for the camera of the device, based on feature point matching between the determined candidate reference images and the query image. In this way, the additional information of the device can be used to determine the candidate reference images that are subject to feature point matching among the plurality of reference images stored in the database. Accordingly, the technique according to an embodiment of the present disclosure can efficiently use the computing resources required for feature point matching.
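A minimal sketch of narrowing reference images with the device's location information follows. It assumes each reference image carries a latitude and longitude and that candidates are chosen within a fixed radius; the record layout, the 100 m radius, the identifiers, and the function names are hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def select_candidates(references, device_lat, device_lon, radius_m=100.0):
    """Keep only reference images captured near the device's reported location."""
    return [ref["id"] for ref in references
            if haversine_m(device_lat, device_lon, ref["lat"], ref["lon"]) <= radius_m]

references = [
    {"id": "ref_001", "lat": 37.5512, "lon": 126.9882},
    {"id": "ref_002", "lat": 37.5515, "lon": 126.9885},
    {"id": "ref_003", "lat": 37.5126, "lon": 127.1025},  # several km away
]
candidates = select_candidates(references, 37.5513, 126.9883)
```

Only the nearby reference images remain for feature point matching, which is where the saving in computing resources would come from.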

In one embodiment of the present disclosure, the additional information of the device may further include camera information of the device, and an example of such camera information may include intrinsic parameters of the camera. The computing device 100 may determine a pose estimation algorithm used to perform the visual localization among a plurality of pose estimation algorithms based on the camera information of the device. For example, pose estimation for a device (eg, camera) can estimate 6DoF in a direction that minimizes reprojection error.
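The reprojection error that pose estimation tries to minimize can be sketched with a pinhole camera model. The model without lens distortion, the function names, and all numeric values are assumptions for illustration.

```python
def project(point_3d, rotation, translation, fx, fy, cx, cy):
    """Project a world point into pixel coordinates with a pinhole model."""
    # Camera-frame coordinates: X_c = R * X_w + t.
    xc, yc, zc = (
        sum(rotation[i][j] * point_3d[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (fx * xc / zc + cx, fy * yc / zc + cy)

def mean_reprojection_error(points_3d, pixels, rotation, translation, intrinsics):
    """Mean Euclidean distance between observed and projected pixels."""
    total = 0.0
    for point, (u, v) in zip(points_3d, pixels):
        pu, pv = project(point, rotation, translation, *intrinsics)
        total += ((pu - u) ** 2 + (pv - v) ** 2) ** 0.5
    return total / len(points_3d)

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
intrinsics = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy (illustrative values)
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
observed = [(320.0, 240.0), (570.0, 240.0)]

error_true_pose = mean_reprojection_error(points, observed, identity, [0.0, 0.0, 0.0], intrinsics)
error_wrong_pose = mean_reprojection_error(points, observed, identity, [0.1, 0.0, 0.0], intrinsics)
```

The correct pose reproduces the observed pixels, while the pose shifted by 0.1 along x leaves a residual of about 25 pixels per point, so an estimator would prefer the former.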

For example, when the camera's internal parameters are not given (i.e., an uncalibrated camera), the pose estimation algorithm may use the Direct Linear Transformation (DLT) methodology. In this case, the reprojection error equation, which has the elements of the rotation matrix, the elements of the translation vector, and the internal parameters of the camera as unknowns, can be converted into a linear equation, and the rotation matrix, translation vector, and camera matrix can be estimated through QR decomposition.

For example, when the camera's internal parameters are given (i.e., a calibrated camera), the pose estimation algorithm may use the Perspective-n-Point (PnP) methodology. In this case, pose estimation can be performed either by transforming the reprojection error, which is a non-linear equation, into a linear equation, similarly to DLT, or by directly solving the non-linear reprojection error using the Gauss-Newton methodology or the Levenberg-Marquardt methodology. In this example, representative PnP methodologies may include the P3P, EPnP, and SQPnP methodologies.

In an additional example, the DLT and PnP methodologies described above may introduce RANSAC to minimize camera pose errors resulting from inaccuracies in the 3D coordinates of reference feature points or inaccuracies in query feature points.
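The RANSAC idea can be sketched on a deliberately simplified problem. Instead of a camera pose, the model below is a 2D shift between matched points, which needs only one match as a minimal sample; in the setting of this disclosure the model would be the pose from DLT or PnP and the residual would be the reprojection error. The function name, threshold, and data are assumptions.

```python
import random

def ransac_translation(matches, threshold=1.0, iterations=50, seed=0):
    """Estimate a 2D shift (dx, dy) from point matches that contain outliers."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        # Minimal sample: one match fully determines a translation.
        (x, y), (u, v) = rng.choice(matches)
        dx, dy = u - x, v - y
        inliers = [
            m for m in matches
            if abs(m[1][0] - m[0][0] - dx) <= threshold
            and abs(m[1][1] - m[0][1] - dy) <= threshold
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on the consensus set by averaging the inlier shifts.
    n = len(best_inliers)
    dx = sum(m[1][0] - m[0][0] for m in best_inliers) / n
    dy = sum(m[1][1] - m[0][1] for m in best_inliers) / n
    return (dx, dy), best_inliers

good = [((float(i), float(2 * i)), (i + 3.0, 2 * i - 2.0)) for i in range(8)]
bad = [((0.0, 0.0), (40.0, 15.0)), ((5.0, 5.0), (-20.0, 30.0))]  # wrong matches
shift, inliers = ransac_translation(good + bad)
```

The two inaccurate matches never gather a consensus, so they are excluded from the final estimate rather than averaged into it.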

Therefore, as described above, the technique according to an embodiment of the present disclosure can apply a different camera pose estimation algorithm depending on the presence or absence of camera information. For example, camera pose estimation can be performed using the DLT or PnP methodology depending on whether camera information is available.
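The selection between the two methodologies can be sketched as a simple dispatch on whether intrinsic parameters were supplied. The function name and the representation of the intrinsics as an optional sequence are hypothetical.

```python
from typing import Optional, Sequence

def choose_pose_algorithm(intrinsics: Optional[Sequence[float]]) -> str:
    """Return "PnP" for a calibrated camera and "DLT" for an uncalibrated one.

    `intrinsics` is an assumed (fx, fy, cx, cy) tuple, or None when the
    client's device did not report camera information.
    """
    return "PnP" if intrinsics else "DLT"

calibrated = choose_pose_algorithm((500.0, 500.0, 320.0, 240.0))
uncalibrated = choose_pose_algorithm(None)
```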

In one embodiment, the computing device 100 may extract first feature points for the query image from the query image using an artificial intelligence-based feature point acquisition model (320).

The computing device 100 may use a pre-learned feature point acquisition model to obtain at least one feature point and/or at least one descriptor corresponding to each of the at least one feature point from the query image. For example, a feature point may be the coordinates for each characteristic part or point on a query image. For example, the descriptor may include at least one of information about the directionality and size of each feature point, and/or the relationship between pixels surrounding each feature point.

In one embodiment, the feature point acquisition model may include a model pre-trained through supervised learning to recognize as a feature point a point where the amount of change in at least one of color or geometric pattern in an object in the image exceeds a predetermined threshold. For example, the feature point acquisition model may correspond to a pre-trained model using an image-based neural network. As another example, the feature point acquisition model may correspond to a pre-trained model using a Transformer-based neural network.

In one embodiment, the computing device 100 may use an artificial intelligence-based dynamic object detection model to obtain, from the query image, a first detection result corresponding to a predetermined dynamic object in the query image (330).

In one embodiment, the dynamic object detection model may correspond to an artificial intelligence-based model that detects predetermined dynamic objects in an image. For example, a dynamic object detection model may include a model pre-trained through supervised learning to receive a query image and detect non-fixed, movable objects (e.g., cars, people, etc.) within the query image. For example, a dynamic object detection model may receive a query image as input and output segmentation results for predetermined movable objects (eg, cars, people, etc.) within the query image.

In one embodiment, the first detection result may include a display result for a region of a non-fixed, movable object (eg, car, tree, etc.) within the query image.

In one embodiment, the dynamic object detection model may be used to remove feature points included in the detected dynamic object, and/or may be used to remove text included in the detected dynamic object, as will be described later. When a dynamic object detection model is used, the number of candidate reference images used to perform matching between the feature points of the query image and the feature points of the reference image can be reduced, so that visual localization on augmented reality and/or virtual reality platforms can be implemented more efficiently.

In one embodiment, the computing device 100 may recognize the first text present in the query image from the query image using a text recognition model. As an example, the text recognition model can perform OCR (Optical Character Recognition). For example, a text recognition model can recognize an area of text on an input query image and recognize what the text is. For example, a text recognition model may include a pre-trained model based on artificial intelligence. Examples of such text recognition models may include RNN-based models, Transformer-based models, and BERT-based models. A text recognition model can recognize, for example, a vehicle license plate number and/or the name on a sign in a query image.

In one embodiment, the computing device 100 may determine, among the first text, 1-2 text to be removed from the query image and 1-1 text to be maintained in the query image, based on a comparison result of the first detection result and the recognized first text. The computing device 100 may combine the output of the dynamic object detection model and the output of the text recognition model to remove text belonging to the area of the dynamic object from the query image. That is, the computing device 100 may determine the first text included in the detection area corresponding to the predetermined dynamic object included in the first detection result as the 1-2 text to be removed from the query image. Accordingly, the computing device 100 removes information about dynamic objects that are unlikely to be included in the reference image, thereby achieving more efficient comparison and matching between the reference image and the query image.
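Splitting recognized text into 1-1 text and 1-2 text can be sketched as follows. This sketch assumes, for simplicity, that both the detected dynamic objects and the recognized text areas are axis-aligned boxes (x0, y0, x1, y1) and that a text belongs to an object when its box centre falls inside the object's box; the disclosure also allows segmentation results, and all names and values here are hypothetical.

```python
def split_texts(texts, dynamic_boxes):
    """Split recognized texts into kept (1-1) and removed (1-2) groups."""
    def inside(box, region):
        x0, y0, x1, y1 = box
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2   # centre of the text area
        rx0, ry0, rx1, ry1 = region
        return rx0 <= cx <= rx1 and ry0 <= cy <= ry1

    kept, removed = [], []
    for text in texts:
        on_object = any(inside(text["box"], region) for region in dynamic_boxes)
        (removed if on_object else kept).append(text["value"])
    return kept, removed

texts = [
    {"value": "12GA3456", "box": (110, 210, 170, 230)},  # plate on a car
    {"value": "EXIT 3", "box": (400, 50, 480, 80)},      # fixed sign
]
dynamic_boxes = [(100, 150, 300, 260)]                    # one detected car
kept_text, removed_text = split_texts(texts, dynamic_boxes)
```

The plate text lies inside the car's detection area and is dropped, while the sign text is retained for comparison with reference images.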

In one embodiment, the computing device 100 may detect a predetermined first landmark present in the query image from the query image using an artificial intelligence-based landmark detection model. A landmark in the present disclosure may refer to a specific object that can represent a specific area and/or location. For example, buildings with highly identifiable appearances, such as Namsan Tower, Lotte Tower, and/or the Statue of Liberty, may be included in these landmarks. For example, a landmark may be selected among fixed objects.

The landmark detection model may correspond to an artificial intelligence-based model that detects predetermined landmark objects in an image. For example, the landmark detection model may include a model pre-trained through supervised learning to receive a query image as input and detect objects corresponding to predetermined landmarks within the query image. For example, a landmark detection model may receive a query image as input and output segmentation results for predetermined landmark objects within the query image.

In one embodiment, a landmark detection model can be used to efficiently perform visual localization. A landmark detection model that receives a query image can determine whether a specific landmark exists in the query image. When the landmark detection model detects a specific landmark, identification information about the landmark may be stored as metadata for the query image. This metadata can be compared with metadata mapped to the reference image and used to reduce the number of candidate reference images to be compared with the query image among the reference images. That is, candidate reference images that include the corresponding landmark as metadata may be determined as reference images to be compared with the query image.
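The metadata-based narrowing of candidate reference images described above could, for example, be sketched as a set intersection over landmark identifiers. The function name `candidates_by_landmark`, the string identifiers, and the fallback to all reference images when no landmark is detected are assumptions for illustration only.

```python
from typing import Dict, List, Set


def candidates_by_landmark(
    query_landmarks: Set[str],
    reference_landmarks: Dict[str, Set[str]],
) -> List[str]:
    """Return IDs of reference images whose metadata shares a landmark with the query.

    `reference_landmarks` maps a reference image ID to the landmark identifiers
    stored as its metadata. If the query has no detected landmark, every
    reference image stays a candidate (an assumed fallback).
    """
    if not query_landmarks:
        return sorted(reference_landmarks)
    return sorted(
        ref_id
        for ref_id, landmarks in reference_landmarks.items()
        if landmarks & query_landmarks
    )
```

Only the returned candidates would then need to go through the more expensive feature point matching.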

In one embodiment, the computing device 100 may perform visual localization on the device that captured the query image, based at least in part on the 1-1 feature points on the query image, additional information about the device, and the first landmark. For example, the computing device 100 may use the additional information about the device and the first landmark to determine, from among a plurality of reference images in the database, candidate reference images that are subject to feature point matching, and may perform visual localization including pose estimation for the camera of the device based on feature point matching between the determined candidate reference images and the query image.

In one embodiment, the computing device 100 may determine, from among the first feature points, 1-2 feature points to be removed from the query image and 1-1 feature points to be maintained on the query image, based on a comparison result between the first detection result obtained from the dynamic object detection model and the first feature points obtained from the feature point acquisition model (340). The computing device 100 may perform visual localization on the device that captured the query image based at least in part on the 1-1 feature points on the query image and additional information about the device.

In one embodiment, the computing device 100 may remove, from the feature points within the query image obtained from the feature point acquisition model, the feature points located within an area corresponding to a dynamic object obtained from the dynamic object detection model. For example, the computing device 100 may determine the first feature points included in the detection area corresponding to the predetermined dynamic object included in the first detection result as the 1-2 feature points to be removed from the query image, and may determine the first feature points not included in that detection area as the 1-1 feature points maintained on the query image. Accordingly, feature point comparison between the reference image and the query image can be performed more efficiently.
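One possible way to express the feature point filtering described above is to test each feature point against a per-pixel mask of the dynamic object areas. The mask representation (a row-major grid of booleans), integer pixel coordinates, and the name `split_feature_points` are assumptions for this sketch; a bounding-box test would serve equally well.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # hypothetical (x, y) pixel coordinates


def split_feature_points(
    points: Sequence[Point],
    dynamic_mask: Sequence[Sequence[bool]],
) -> Tuple[List[Point], List[Point]]:
    """Return (kept, removed) feature points.

    `dynamic_mask[y][x]` is True where a pixel belongs to a detected dynamic
    object. Points outside the mask bounds are treated as kept.
    """
    height = len(dynamic_mask)
    width = len(dynamic_mask[0]) if height else 0
    kept: List[Point] = []
    removed: List[Point] = []
    for x, y in points:
        in_bounds = 0 <= x < width and 0 <= y < height
        if in_bounds and dynamic_mask[y][x]:
            removed.append((x, y))   # corresponds to the 1-2 feature points
        else:
            kept.append((x, y))      # corresponds to the 1-1 feature points
    return kept, removed
```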

In one embodiment, when performing visual localization on the device that captured the query image, the computing device 100 may perform the visual localization based on a reference image stored in the database of the computing device 100, the 1-1 feature points on the query image, and the additional information about the device.

In one embodiment, when performing visual localization on the device that captured the query image, the computing device 100 may select a pose reference image from among a plurality of reference images based on a comparison between the first feature points of the query image and the second feature points of the reference images. Relative camera pose information for the query image can then be determined according to the result of comparison between the 3D camera coordinates assigned to the pose reference image and the 2D coordinates of the 1-1 feature points of the query image. The pose reference image in the present disclosure may refer to a reference image, or a candidate reference image, that is subject to matching with the query image. For example, the pose reference image may refer to the image with the highest matching rate with the query image among a plurality of candidate images. In this example, the computing device 100 may perform visual localization on the query image using the 3D camera coordinates assigned to the pose reference image.

In one embodiment, the computing device 100 may determine absolute camera pose information of the query image with respect to the origin, based on the camera pose information (i.e., 3D camera coordinates) assigned to the pose reference image and the relative camera pose information, and may perform visual localization on the device that captured the query image based on the absolute camera pose information.
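The disclosure does not fix a mathematical representation for these poses. As a sketch under the assumption that both the pose assigned to the pose reference image and the relative pose are expressed as 4x4 homogeneous transforms (reference camera to world, and query camera to reference camera, respectively), the absolute pose could be obtained by composing the two. The names `compose` and `absolute_pose` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # assumed 4x4 homogeneous transform, row-major


def compose(a: Matrix, b: Matrix) -> Matrix:
    """Matrix product a @ b for 4x4 transforms."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def absolute_pose(world_from_reference: Matrix, reference_from_query: Matrix) -> Matrix:
    """Absolute query pose = (reference pose in world) composed with (relative pose)."""
    return compose(world_from_reference, reference_from_query)


# Example: reference camera at (10, 0, 0), rotated 90 degrees about the z axis;
# query camera offset by one unit along the reference camera's x axis.
world_from_reference = [
    [0.0, -1.0, 0.0, 10.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
reference_from_query = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
world_from_query = absolute_pose(world_from_reference, reference_from_query)
```

Under these assumptions, the translation column of `world_from_query` gives the query camera position relative to the origin, and its upper-left 3x3 block gives the orientation.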

Absolute camera pose information in the present disclosure may include, for example, coordinate values and/or direction values in a geodetic datum or geodetic system (also called a geodetic reference datum, geodetic reference system, or geodetic reference frame), a spatial reference system (SRS), a coordinate reference system (CRS), or a geographic coordinate system (GCS) used in a specific region or a specific country. The relative camera pose information in the present disclosure may include camera pose information generated from a comparison between the 3D coordinates of the camera pose information assigned to the reference image and the 2D coordinates of the 1-1 feature points of the query image.

In one embodiment, extracting the first feature points for the query image from the query image using the feature point acquisition model may include obtaining pixel coordinates and a descriptor corresponding to each of the first feature points. The computing device 100 may perform visual localization on the device that captured the query image by comparing the descriptors corresponding to the first feature points of the query image with the descriptors corresponding to the second feature points of the reference images stored in the database. The descriptors mapped to the query image can be compared with the descriptors mapped to the reference images to determine, among the plurality of reference images, a candidate reference image to be compared with the query image. Descriptors can thus be used to efficiently compare reference images with query images. That is, the computing device 100 may determine a pose reference image for visual localization of the device from among a plurality of reference images by comparing the descriptors corresponding to the first feature points of the query image with the descriptors corresponding to the second feature points of the reference images stored in the database, and may perform visual localization including camera pose estimation for the device based on the 3D coordinate information mapped to the pose reference image, matching information between the first feature points and the second feature points, and the pixel coordinates corresponding to the first feature points of the query image.
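The descriptor comparison described above could be sketched, for instance, as a nearest-neighbor search that counts how many query descriptors find a close match in each reference image and then picks the reference image with the most matches. The squared Euclidean distance, the `max_sq_dist` threshold of 0.1, and the function names are assumptions for illustration; the disclosure does not prescribe a particular matching rule.

```python
from typing import Dict, List, Sequence

Descriptor = Sequence[float]


def sq_dist(a: Descriptor, b: Descriptor) -> float:
    """Squared Euclidean distance between two descriptors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def count_matches(
    query: Sequence[Descriptor],
    reference: Sequence[Descriptor],
    max_sq_dist: float = 0.1,  # assumed threshold
) -> int:
    """Number of query descriptors whose nearest reference descriptor is close enough."""
    if not reference:
        return 0
    return sum(
        1 for q in query
        if min(sq_dist(q, r) for r in reference) <= max_sq_dist
    )


def select_pose_reference(
    query: Sequence[Descriptor],
    references: Dict[str, List[Descriptor]],
) -> str:
    """Return the ID of the reference image with the highest match count."""
    return max(references, key=lambda ref_id: count_matches(query, references[ref_id]))
```

Because every reference image is compared against the same query, ranking by match count is equivalent to ranking by matching rate. A production system would typically use an approximate nearest-neighbor index rather than this brute-force loop.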

In a further embodiment, the computing device 100 may determine a pose reference image for visual localization of the device from among a plurality of reference images by comparing the descriptors and pixel coordinates corresponding to the first feature points of the query image with the descriptors and pixel coordinates corresponding to the second feature points of the reference images stored in the database, and may perform visual localization including camera pose estimation for the device based on the 3D coordinate information mapped to the pose reference image, matching information between the first feature points and the second feature points, and the pixel coordinates corresponding to the first feature points of the query image.

As illustrated in FIG. 3 , computing device 100 may perform pose estimation on a query image captured by a client's device using pre-trained model(s) for performing visual localization. The operations of the computing device 100 in FIG. 3 exemplarily represent a model inference process.

The steps performed in FIG. 4 may be performed, for example, by computing device 100. The operations illustrated in FIG. 4 exemplarily represent a method of generating metadata, a method of building a model, and/or a learning method of the computing device 100.

In one embodiment, the computing device 100 may extract second feature points for a reference image from the reference image using a feature point acquisition model, obtain a second detection result corresponding to a predetermined dynamic object within the reference image using a dynamic object detection model, determine, from among the second feature points, 2-2 feature points to be removed from the reference image and 2-1 feature points to be maintained on the reference image based on a comparison result between the second detection result and the second feature points, and obtain 3D camera coordinates corresponding to the 2-1 feature points from a predefined 3D map.

As shown in FIG. 4, the computing device 100 may acquire a reference image and camera pose information including the location and posture of the camera that captured the reference image (410). To implement an augmented reality or virtual reality platform in a specific area, a plurality of reference images may be captured by a device in the area. Metadata for each of the reference images may be generated and stored in the storage of the computing device 100 together with the reference image or the feature points of the reference image. For example, the camera pose information may include absolute camera pose information, based on an absolute coordinate system on Earth, of the device (e.g., camera) that captured the reference image. Absolute camera pose information based on an absolute coordinate system may include, for example, coordinate values and/or direction values in a geodetic datum or geodetic system (also called a geodetic reference datum, geodetic reference system, or geodetic reference frame), a spatial reference system (SRS), a coordinate reference system (CRS), or a geographic coordinate system (GCS) used in a specific region or country. As another example, absolute camera pose information based on the absolute coordinate system may mean information equivalent to a global coordinate system, such as "latitude, longitude, azimuth," "a coordinate system used in a specific region or country," or "items included in GNSS."

In one embodiment, the computing device 100 may extract second feature points for the reference image from the reference image using an artificial intelligence-based feature point acquisition model (420). Here, the feature point acquisition model may correspond to the feature point acquisition model described above with reference to FIG. 3. For example, each of the second feature points output from the feature point acquisition model may include pixel coordinates and descriptors corresponding to the reference image.

In one embodiment, the computing device 100 may obtain a second detection result corresponding to a predetermined dynamic object within the reference image from the reference image using an artificial intelligence-based dynamic object detection model (430). Here, the dynamic object detection model may correspond to the dynamic object detection model described above with reference to FIG. 3.

The second feature points in FIG. 4 represent feature points corresponding to the reference image, and the first feature points in FIG. 3 represent feature points corresponding to the query image. Likewise, the second detection result in Figure 4 represents the output from the dynamic object detection model corresponding to the reference image and the first detection result in Figure 3 represents the output from the dynamic object detection model corresponding to the query image.

Based on the comparison result between the second detection result and the second feature points, the computing device 100 may determine, from among the second feature points, 2-2 feature points to be removed from the reference image and 2-1 feature points to be maintained on the reference image (440).

The computing device 100 may determine the second feature points included in the detection area corresponding to the predetermined dynamic object included in the second detection result as the 2-2 feature points to be removed from the reference image, and may determine the second feature points not included in that detection area as the 2-1 feature points maintained on the reference image. When storing feature points for a reference image as metadata, the computing device 100 may remove feature points included in the dynamic object and store feature points not included in the dynamic object. Accordingly, a resource-efficient comparison can be made between the first feature points of the query image and the second feature points of the reference image.

In one embodiment, the computing device 100 may recognize second text present in the reference image from the reference image using a text recognition model. Based on the comparison result between the second detection result and the recognized second text, the computing device 100 may determine, from among the second text, 2-2 text to be removed from the reference image and 2-1 text to be maintained on the reference image.

The text recognition model in FIG. 4 may correspond to the text recognition model described in FIG. 3. The second text, the 2-2 text, and the 2-1 text in FIG. 4 are texts corresponding to the reference image, and these terms are used to distinguish them from the first text, the 1-2 text, and the 1-1 text corresponding to the query image in FIG. 3.

When determining the 2-2 text to be removed from the reference image and the 2-1 text to be maintained on the reference image among the second text, the computing device 100 may determine the second text included in the detection area corresponding to a predetermined dynamic object included in the second detection result as the 2-2 text to be removed from the reference image. For example, the coordinate information corresponding to the 2-1 feature points may include 3D camera coordinates, obtained from the predefined 3D map, that correspond to the 2-1 feature points remaining after the 2-2 feature points are removed. Accordingly, visual localization may be performed according to a comparison between the 3D camera coordinates of the feature points corresponding to the reference image stored in the storage unit of the computing device 100 and the feature points (2D coordinates) corresponding to the query image.

The metadata mapped to the reference image by the computing device 100 may include the 2-1 text that remains after the 2-2 text is removed from the second text. When a model for visual localization is built through metadata mapped to such a reference image, a faster response to the query image and more resource-efficient visual localization can be achieved through the comparison between the metadata mapped to the query image and the metadata mapped to the reference image during the inference process.

In one embodiment, the computing device 100 may detect a predetermined second landmark present in the reference image from the reference image using an artificial intelligence-based landmark detection model. The landmark detection model in FIG. 4 may correspond to the landmark detection model in FIG. 3. The second landmark in FIG. 4 relates to the reference image, and the first landmark in FIG. 3 relates to the query image. In one embodiment, metadata mapped to the reference image may include the second landmark detected by the landmark detection model. When a model for visual localization is built through metadata mapped to such a reference image, a faster response to the query image and more resource-efficient visual localization can be achieved through the comparison between the metadata mapped to the query image and the metadata mapped to the reference image during the inference process.

In one embodiment, the computing device 100 may store coordinate information corresponding to the 2-1 feature points, descriptor information corresponding to the 2-1 feature points, and the camera pose information as metadata mapped to the reference image (450).

In one embodiment, the computing device 100 may obtain, from a pre-scanned and pre-built three-dimensional map (e.g., a point cloud or mesh), the three-dimensional camera coordinates corresponding to the 2-1 feature points of the reference image. These 3D camera coordinates can also be stored as metadata corresponding to the reference image. This metadata can be compared with the two-dimensional coordinates of the feature points of the query image and used to determine relative pose information for the query image.
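The disclosure does not state how the 3D camera coordinates are looked up in the map. One simple possibility, sketched here under several assumptions, is to project each map point into the reference image with a pinhole camera model and assign to each feature point the nearest projected map point within a pixel tolerance. The sketch assumes the map points are already expressed in the reference camera frame, uses hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and ignores occlusion.

```python
from typing import List, Optional, Sequence, Tuple

Pixel = Tuple[float, float]
Point3D = Tuple[float, float, float]


def project(point: Point3D, fx: float, fy: float, cx: float, cy: float) -> Optional[Pixel]:
    """Pinhole projection of a camera-frame point; None if behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def lookup_camera_coordinates(
    feature_pixels: Sequence[Pixel],
    map_points: Sequence[Point3D],
    fx: float, fy: float, cx: float, cy: float,
    max_pixel_error: float = 2.0,  # assumed tolerance
) -> List[Optional[Point3D]]:
    """For each feature point, return the map point projecting closest to it, or None."""
    projected = [(project(p, fx, fy, cx, cy), p) for p in map_points]
    results: List[Optional[Point3D]] = []
    for u, v in feature_pixels:
        best: Optional[Point3D] = None
        best_sq = max_pixel_error ** 2
        for pixel, point in projected:
            if pixel is None:
                continue
            sq = (pixel[0] - u) ** 2 + (pixel[1] - v) ** 2
            if sq <= best_sq:
                best, best_sq = point, sq
        results.append(best)
    return results
```

Feature points with no map point nearby receive `None` and could be dropped from the stored metadata.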

In one embodiment, the metadata for one reference image may include at least one of: two-dimensional pixel coordinates of the 2-1 feature points, descriptors of the 2-1 feature points, three-dimensional camera coordinates corresponding to the 2-1 feature points, camera pose information and GPS information acquired from the device that captured the reference image, landmark information corresponding to the reference image, and/or text information excluding text on dynamic objects.
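For illustration, the per-image metadata listed above could be held in a record such as the following. The class and field names are hypothetical, and every field other than the image identifier is optional or empty by default, since the disclosure requires only "at least one of" the listed items.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FeaturePoint:
    pixel: Tuple[float, float]                                 # 2D pixel coordinates
    descriptor: List[float]                                    # descriptor vector
    camera_xyz: Optional[Tuple[float, float, float]] = None    # 3D camera coordinates


@dataclass
class ReferenceMetadata:
    image_id: str
    # Only the 2-1 feature points (those outside dynamic objects) are stored.
    feature_points: List[FeaturePoint] = field(default_factory=list)
    camera_pose: Optional[List[List[float]]] = None            # e.g. a 4x4 transform (assumed form)
    gps: Optional[Tuple[float, float]] = None                  # (latitude, longitude)
    landmarks: List[str] = field(default_factory=list)         # detected landmark identifiers
    texts: List[str] = field(default_factory=list)             # 2-1 text only
```

A database of reference images would then be a collection of such records, one per reference image.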

In one embodiment, metadata stored in the database of the computing device 100 may consist of a set of metadata allocated to each reference image unit.

In a further embodiment of the present disclosure, metadata mapped to the query image may include at least one of: pixel coordinates and descriptors corresponding to the 1-1 feature points that are maintained on the query image without being removed from among the first feature points of the query image, location information of the device that captured the query image, a first landmark and first text corresponding to the query image, and/or camera pose information of the device calculated from the IMU value of the device. Therefore, efficient visual localization for the query image can be performed according to a comparison between the metadata mapped to the query image and the metadata mapped to the reference image.

In additional embodiments of the present disclosure, the query image and reference image may further include information related to a depth map. By additionally utilizing information about the depth map, the computing device 100 can implement more accurate visual localization.

FIG. 5 exemplarily illustrates the operation of the feature point acquisition model 500 according to an embodiment of the present disclosure.

The feature point acquisition model has been described in detail with reference to FIGS. 3 and 4, and parts that overlap with that description are omitted from the description of FIG. 5.

In one embodiment, the feature point acquisition model 500 may be a neural network built through deep learning or machine learning. The feature point acquisition model 500 may acquire feature points from images (eg, query images and/or reference images) included in the input data. Additionally, the feature point acquisition model 500 may be pre-trained using a dataset. For example, when images of corresponding areas in different environments are input, the feature point acquisition model 500 may be trained so that feature points obtained from each of the images correspond to each other.

A dataset may refer to a set of data for performing training and validation of a neural network. The dataset may include a training dataset and/or a validation dataset. A training dataset may be a set of data used in the training process of a neural network. For example, the training dataset may be a set of data used in the training process of the feature point acquisition model 500. A validation dataset may be a set of data used to evaluate a neural network. For example, the validation dataset may be a set of data used to evaluate the feature point acquisition model 500. In one embodiment, the landmark detection models, dynamic object detection models, and/or text recognition models described below may also utilize the dataset described above.

As illustrated in FIG. 5 , the feature point acquisition model 500 may generate an output image 510 in response to the input image 200 . The input image 200 below may correspond to a query image in the inference process, and may correspond to a reference image in the learning process or DB construction process. Input image 200 may include an image acquired by a client device. In additional embodiments, input image 200 may include RGB or grayscale images as well as depth map information.

In one embodiment, the output image 510 may include, as a result of the feature point acquisition model 500, a plurality of feature points and a descriptor corresponding to each feature point. Feature points on the output image 510 may include, for example, two-dimensional coordinate values. The feature point acquisition model 500 may output inflection points or color change points of objects on the input image 200 as feature points.

FIG. 6 exemplarily illustrates the operation of a dynamic object detection model 600a and a landmark detection model 600b according to an embodiment of the present disclosure.

The dynamic object detection model 600a and the landmark detection model 600b have been described in detail with reference to FIGS. 3 and 4, and parts that overlap with that description are omitted from the description of FIG. 6.

In one embodiment, the dynamic object detection model 600a and the landmark detection model 600b may each be a neural network built through deep learning or machine learning. The dynamic object detection model 600a and the landmark detection model 600b may acquire feature points from images (e.g., query images and/or reference images) included in the input data 200. Additionally, pre-training of the dynamic object detection model 600a and the landmark detection model 600b may be performed using a dataset. For example, the dynamic object detection model 600a and the landmark detection model 600b may be trained so that, when images of corresponding areas in different environments are input, the dynamic objects or landmarks obtained from each of the images correspond to each other.

In one embodiment, the dynamic object detection model 600a may detect or segment predetermined movable objects in response to the input image 200. In one embodiment, the landmark detection model 600b may detect or segment an object corresponding to a predetermined landmark in response to the input image 200.

In one embodiment, at least one of the dynamic object detection model 600a and the landmark detection model 600b may detect a predetermined object and display a bounding box around the object to distinguish it.

In one embodiment, at least one of the dynamic object detection model 600a and the landmark detection model 600b may include an artificial intelligence-based model for detecting a plurality of objects in an image.

In a further embodiment, the dynamic object detection model 600a and the landmark detection model 600b may be integrated and operated as one model.

As shown in FIG. 6 , the input image 200 may correspond to the query image in FIG. 3 or the reference image in FIG. 4 .

The input image 200 may be input into the dynamic object detection model 600a and the landmark detection model 600b, respectively. In response to the input image 200, the dynamic object detection model 600a may generate an output image 610 including detection results for dynamic objects. The output image 610 may include dynamic objects 610a, 610b, and 610c corresponding to vehicles and a dynamic object 610d corresponding to trees. The illustrations of dynamic objects (610a, 610b, 610c, and 610d) are only examples, and the dynamic object detection model 600a can detect various types of movable objects. The output 610 of the dynamic object detection model 600a can be used to detect feature points and texts that do not belong to the dynamic object, so that the process of comparing the query image and the reference image can be implemented more efficiently.

In response to the input image 200, the landmark detection model 600b may generate an output image 620 containing detection results for object(s) corresponding to the pre-stored landmark. A landmark 620a corresponding to “Seoul High Court” may be included in the output image 620. The identification information for the landmark 620a can be stored as metadata for the query image and/or the reference image, so that the process of comparing the query image and the reference image can be implemented more efficiently.

FIG. 7 exemplarily illustrates the operation of a text recognition model 700 according to an embodiment of the present disclosure.

The text recognition model 700 has been described in detail with reference to FIGS. 3 and 4, and parts that overlap with that description are omitted from the description of FIG. 7.

The input image 200 may be input to the text recognition model 700. As described above, the input image 200 may be input to the feature point acquisition model 500, the dynamic object detection model 600a, the landmark detection model 600b, and the text recognition model 700.

Text recognition model 700 may detect areas of text within input image 200 and/or determine what the text within input image 200 means.

The text recognition model 700 may include any type of model capable of performing optical character recognition (OCR).

The text recognition model 700 may include any type of artificial intelligence-based model for recognizing text. For example, the text recognition model 700 may perform preprocessing to change the brightness and/or color of the input image 200 to facilitate recognition of texts within the input image 200.

Text recognition model 700 can locate texts and generate bounding boxes for these texts. For example, to recognize the positions of texts on the input image 200, the text recognition model 700 may use a CNN (Convolutional Neural Network) series model, which is an image-based deep learning model.

The text recognition model 700 can recognize the content of the text within the bounding box corresponding to the position of the text. For example, the text recognition model 700 may use a Recurrent Neural Network (RNN) series model to recognize text within a bounding box. Additionally, the text recognition model 700 may use a Transformer and/or Attention-based deep learning model to recognize text within a bounding box.

In the example described above, the text recognition model 700 may include a model that combines a first model for locating the text and a second model for recognizing the content of the text at that location. In a further embodiment, the function of determining the location of text may be performed by at least one of the dynamic object detection model 600a and the landmark detection model 600b described above.

As shown in FIG. 7, the output 710 of the text recognition model 700 may be expressed by examples such as building signs 710a and 710b and a vehicle license plate 710c. In this way, the text recognition model 700 can display the location area for the text as a bounding box and generate an output 710 containing recognition information about the text. Recognition information for the text illustrated in FIG. 7 may include “Myeongsan Building” (710a), “Personal Bankruptcy/Rehabilitation Corporate Bankruptcy/Rehabilitation” (710b), and “13bo6436” (710c).

In one embodiment, the text recognition model 700 can recognize text written vertically or horizontally in units of keywords or sentences. In this example, the text recognition model 700 may recognize the text “Myeongsan Building” corresponding to reference number 710a not as the separate fragments “myeong”, “san”, “building” and “ding”, but as the single whole phrase “Myeongsan Building”.

The location area for this text can be compared with the output of the dynamic object detection model 600a, and the text can be divided into text included in a dynamic object and text not included in a dynamic object. Text included in a dynamic object may be removed from the image, and text not included in a dynamic object may be maintained on the image. Additionally, recognition information about text may be mapped to the input image 200 (e.g., query image or reference image) and stored. Recognition information about text can be used as reference information when performing comparison between a query image and a reference image. Accordingly, as reference images having text information corresponding to the text information included in the query image are determined as candidate reference images, computing resources used for comparison between the query image and the reference image may be reduced.
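As an illustrative sketch of the candidate selection described above, reference images could be ranked by how many recognized text strings they share with the query image. The normalization rule (removing whitespace and ignoring case), exact string equality, and the name `rank_by_shared_text` are assumptions for this sketch; a real system might tolerate OCR errors with fuzzy matching.

```python
from typing import Dict, List, Set, Tuple


def normalize(text: str) -> str:
    """Remove all whitespace and ignore case so minor OCR spacing differences match."""
    return "".join(text.split()).casefold()


def rank_by_shared_text(
    query_texts: List[str],
    reference_texts: Dict[str, List[str]],
) -> List[str]:
    """Return reference image IDs sharing at least one text with the query, best first."""
    query_set: Set[str] = {normalize(t) for t in query_texts}
    scored: List[Tuple[int, str]] = []
    for ref_id, texts in reference_texts.items():
        shared = len(query_set & {normalize(t) for t in texts})
        if shared > 0:
            scored.append((shared, ref_id))
    # Sort by descending number of shared texts, then by ID for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [ref_id for _, ref_id in scored]
```

Because text on dynamic objects has already been removed from both sides, a license plate seen in the query image would not pull in unrelated reference images.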

Image processing shown in FIG. 8 may be performed, for example, by the computing device 100.

As shown in FIG. 8, the output image 710 of the text recognition model 700 and the output image 610 of the dynamic object detection model 600a may be compared. For example, a comparison may be made between regions (e.g., segmented regions) detected in the output image 710 of the text recognition model 700 and the output image 610 of the dynamic object detection model 600a. For example, objects in the output image 710 of the text recognition model 700 and the output image 610 of the dynamic object detection model 600a may be compared. According to this comparison, the text areas 710c in the output image 710 of the text recognition model 700 that correspond to the regions 610a, 610b, 610c, and 610d detected as dynamic objects on the output image 610 of the dynamic object detection model 600a may be determined. In this example, the text area corresponding to reference number 710c in the output image 710 of the text recognition model 700 is determined to be a text area included in a dynamic object, and the text areas corresponding to reference numbers 710a and 710b may be determined to be text areas that are not included in a dynamic object.

In one embodiment, in the process of comparing objects in the output image 710 of the text recognition model 700 and the output image 610 of the dynamic object detection model 600a, an intermediate output image 810 can be generated. In this intermediate output image 810, the border areas 710a, 710b, and 710c corresponding to text and the border areas 610a, 610b, 610c, and 610d for dynamic objects may be integrated. On this intermediate output image 810, objects in the output image 710 of the text recognition model 700 and the output image 610 of the dynamic object detection model 600a may be compared. In one embodiment, this intermediate output image 810 is an example image for convenience of explanation, and depending on the implementation, the output image 820 may be generated without the process of generating the intermediate output image 810.

In one embodiment, according to a comparison of the objects in the output image 710 of the text recognition model 700 and the output image 610 of the dynamic object detection model 600a, an output image or output result 820 may be generated in which the text 710c corresponding to a dynamic object is removed and the texts 710a and 710b that do not correspond to a dynamic object are maintained. In one embodiment, based on the comparison between text areas and dynamic object areas on the query image or reference image, the computing device 100 can efficiently construct the reference images to be compared when performing visual localization, and can compare the query image with the reference images effectively.
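The comparison between text areas and dynamic object areas can be sketched as follows. This is a minimal illustration only, assuming that both models output axis-aligned bounding boxes and that a text area counts as "included in" a dynamic object when at least half of it overlaps the object region. The `Box` type, the `min_overlap` value, and all coordinates are hypothetical and are not specified in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in pixel coordinates, from (x0, y0) to (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_ratio(text: Box, dynamic: Box) -> float:
    """Fraction of the text box that is covered by the dynamic-object box."""
    w = min(text.x1, dynamic.x1) - max(text.x0, dynamic.x0)
    h = min(text.y1, dynamic.y1) - max(text.y0, dynamic.y0)
    if w <= 0 or h <= 0 or text.area() == 0:
        return 0.0
    return (w * h) / text.area()


def split_text_regions(texts, dynamic_boxes, min_overlap=0.5):
    """Return (kept, removed) dicts of text regions keyed by label."""
    kept, removed = {}, {}
    for label, box in texts.items():
        on_dynamic = any(overlap_ratio(box, d) >= min_overlap for d in dynamic_boxes)
        (removed if on_dynamic else kept)[label] = box
    return kept, removed


# Hypothetical coordinates loosely mirroring the FIG. 8 example.
texts = {
    "710a": Box(10, 10, 60, 30),
    "710b": Box(200, 15, 260, 35),
    "710c": Box(110, 120, 150, 140),  # lies inside the first dynamic region
}
dynamic_boxes = [Box(100, 100, 180, 220), Box(300, 150, 340, 230)]
kept, removed = split_text_regions(texts, dynamic_boxes)
```

With these assumed coordinates, 710c falls entirely inside a dynamic object region and is removed, while 710a and 710b are kept, matching the outcome described for the output result 820.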

As such, the image processing technique according to an embodiment of the present disclosure selectively stores text information as metadata, enabling efficient management of metadata. Furthermore, because text information about fixed objects is stored as metadata, the metadata extracted from a query image input through a virtual reality and/or augmented reality platform can be compared with the metadata corresponding to the reference images, thereby reducing the number of reference images to be compared with the query image.

Image processing shown in FIG. 9 may be performed, for example, by the computing device 100.

As shown in FIG. 9, output information 510 of the feature point acquisition model 500 and output information 610 of the dynamic object detection model 600a may be compared. For example, the output information 510 of the feature point acquisition model 500 may be compared with the regions (e.g., segmented regions) detected in the output information 610 of the dynamic object detection model 600a to determine whether they correspond to each other. This comparison is illustrated by reference numeral 910 in FIG. 9. Through this comparison, the feature points in the output information 510 that correspond to the areas 610a, 610b, 610c, and 610d detected as dynamic objects in the output information 610 may be determined. In this example, among the output information 510 of the feature point acquisition model 500, feature points located within the areas 610a, 610b, 610c, and 610d of the dynamic object detection model 600a are determined to be feature points included in a dynamic object, and feature points not located within the areas 610a, 610b, 610c, and 610d may be determined to be feature points not included in a dynamic object.

For example, if the degree of correspondence between the location information of feature points and the area information of a dynamic object exceeds a predetermined threshold correspondence value, or if the distance between the location information of the feature points and the area information of the dynamic object is less than a predetermined threshold distance value, those feature points may be determined to be feature points included in the dynamic object.
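The distance-based criterion can be sketched as follows. This is only an illustration, assuming that dynamic object areas are axis-aligned boxes given as `(x0, y0, x1, y1)` and that the distance is the Euclidean distance from a feature point to the nearest edge of a box (zero when the point is inside). The function names, the non-strict comparison `<=` (chosen so that a threshold of zero still removes points inside a box), and the sample values are assumptions, not part of the disclosure.

```python
import math


def distance_to_box(point, box):
    """Euclidean distance from a point to an axis-aligned box (0.0 if inside)."""
    x, y = point
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0.0, x - x1)
    dy = max(y0 - y, 0.0, y - y1)
    return math.hypot(dx, dy)


def split_feature_points(points, dynamic_boxes, max_distance=0.0):
    """Return (kept, removed): points within max_distance of any box are removed."""
    kept, removed = [], []
    for p in points:
        near = any(distance_to_box(p, b) <= max_distance for b in dynamic_boxes)
        (removed if near else kept).append(p)
    return kept, removed


# Hypothetical feature point coordinates and one dynamic object area.
points = [(120.0, 150.0), (183.0, 150.0), (50.0, 40.0), (400.0, 300.0)]
dynamic_boxes = [(100.0, 100.0, 180.0, 220.0)]
kept, removed = split_feature_points(points, dynamic_boxes, max_distance=5.0)
```

Here the first point lies inside the area and the second lies 3 pixels outside its right edge, so both fall under the assumed 5-pixel threshold and are treated as feature points included in the dynamic object; the remaining two are kept.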

As shown by reference number 920, among the output information 510 of the feature point acquisition model 500, feature points located within the areas 610a, 610b, 610c, and 610d of the dynamic object detection model 600a have been determined to be feature points included in a dynamic object, so they may be removed from the output image 920. Conversely, feature points not located within the areas 610a, 610b, 610c, and 610d of the dynamic object detection model 600a have been determined to be feature points not included in a dynamic object, so they may be maintained on the output image 920.

In one embodiment, based on the comparison between feature points and dynamic object areas on the query image or reference image, the computing device 100 can efficiently construct the reference images to be compared when performing visual localization, and can compare the query image with the reference images effectively.

As such, the image processing technique according to an embodiment of the present disclosure selectively stores feature point information as metadata, enabling efficient management of metadata. Furthermore, because feature point information about fixed objects is stored as metadata, the metadata extracted from a query image input through a virtual reality and/or augmented reality platform can be compared with the metadata corresponding to the reference images, thereby reducing the number of reference images to be compared with the query image.

FIG. 10 exemplarily illustrates the operation of a model 1000 for performing visual localization according to an embodiment of the present disclosure.

In one embodiment, the model 1000 for performing visual localization may be included in the computing device 100 and executed by a processor of the computing device 100.

The input image 200 may be input to the model 1000. The input image 200 may be input to a feature point acquisition model 500, a dynamic object detection model 600a, a landmark detection model 600b, and a text recognition model 700.

The first output 510 of the feature point acquisition model 500 and the second output 610 of the dynamic object detection model 600a may be compared. According to this comparison, the model 1000 of the computing device 100 may generate a sixth output 920. The sixth output 920 may include information about feature points not included in the dynamic object. Through this sixth output 920, the computing device 100 can selectively store feature points that are not included in the dynamic object, so that comparison between feature points between the query image and the reference image can be performed more resource-efficiently.

The fourth output 710 of the text recognition model 700 and the second output 610 of the dynamic object detection model 600a may be compared. According to this comparison, the model 1000 of the computing device 100 may generate a fifth output 820. The fifth output 820 may include information about texts not included in the dynamic object. Through this fifth output 820, the computing device 100 can selectively store texts that are not included in the dynamic object. By comparing text information between the query image and the reference images, the computing device 100 can select only some of the reference images as the reference images to be compared with the query image, so that visual localization can be performed more resource-efficiently.

A third output 620 of the landmark detection model 600b may be output by the model 1000 of the computing device 100. The third output 620 includes information corresponding to landmarks and can be used as metadata for reference data and metadata for query data. When a query image including a specific landmark is input, the computing device 100 may select reference image(s) including the specific landmark as a reference image to be compared with the query image.

As shown in FIG. 10, the visual localization technique according to an embodiment of the present disclosure may use at least one of the third output 620, the fifth output 820, and the sixth output 920. Since at least one of these outputs 620, 820, and 920 is used as metadata, the computing device 100 can effectively select a reference image to be compared to the query image without comparing the entire reference image and the query image.
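The metadata-based selection of reference images can be sketched as follows. This is a minimal illustration, assuming that each image's metadata is a dictionary with `landmarks` and `texts` lists and that a reference image becomes a candidate when it shares at least one landmark or one case-insensitive text string with the query. The dictionary layout, the matching rule, and the sample identifiers are hypothetical; the disclosure does not prescribe them.

```python
def select_candidates(query_meta, reference_db):
    """Return ids of reference images sharing a landmark or a text with the query."""
    q_landmarks = set(query_meta.get("landmarks", ()))
    q_texts = {t.lower() for t in query_meta.get("texts", ())}
    candidates = []
    for ref_id, meta in reference_db.items():
        landmarks = set(meta.get("landmarks", ()))
        texts = {t.lower() for t in meta.get("texts", ())}
        if q_landmarks & landmarks or q_texts & texts:
            candidates.append(ref_id)
    return candidates


# Hypothetical metadata built from outputs 620 (landmarks) and 820 (texts).
reference_db = {
    "ref_lobby": {"landmarks": ["clock_tower"], "texts": ["Information"]},
    "ref_cafe": {"landmarks": [], "texts": ["Coffee", "Exit"]},
    "ref_garage": {"landmarks": [], "texts": ["P2"]},
}
query_meta = {"landmarks": ["clock_tower"], "texts": ["EXIT"]}
candidates = select_candidates(query_meta, reference_db)
```

Only the candidates returned here would go on to the more expensive feature point matching, which is the resource saving the paragraph above describes.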

In one embodiment, in the process of storing a reference image in a database, 2D pixel coordinate values of feature points included in the sixth output 920 and/or descriptor values of the feature points may be stored as metadata. In addition, the 3D camera coordinates of the feature points included in the sixth output 920 on the scanned and produced 3D map (e.g., a map in point cloud format or mesh format) may be stored as metadata. Additionally, the relative camera pose or 6DoF value and/or GPS value, with respect to the absolute coordinate system of the 3D map, of the client's device (e.g., camera) that takes the reference image may be stored as metadata. Additionally, information corresponding to the landmark included in the third output 620 may be stored as metadata. Additionally, text information not included in the dynamic object, which is included in the fifth output 820, may be stored as metadata. Additionally, the camera pose and GPS information of the reference image corresponding to the input image 200 may be stored as metadata.
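One possible layout for such a reference image metadata record is sketched below. The class names, field names, units, and sample values are all assumptions made for illustration; the disclosure lists the kinds of information stored but does not define a schema.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FeaturePoint:
    pixel_xy: Tuple[float, float]            # 2D pixel coordinates in the image
    descriptor: List[float]                  # descriptor values of the feature point
    map_xyz: Optional[Tuple[float, float, float]] = None  # 3D coordinates on the 3D map


@dataclass
class ReferenceMetadata:
    image_id: str
    feature_points: List[FeaturePoint] = field(default_factory=list)  # from output 920
    landmarks: List[str] = field(default_factory=list)                # from output 620
    texts: List[str] = field(default_factory=list)                    # from output 820
    camera_pose_6dof: Optional[Tuple[float, ...]] = None  # assumed (x, y, z, roll, pitch, yaw)
    gps: Optional[Tuple[float, float]] = None              # assumed (latitude, longitude)


# Hypothetical record for a single reference image.
meta = ReferenceMetadata(
    image_id="ref_0001",
    feature_points=[FeaturePoint((412.0, 188.5), [0.12, 0.80, 0.05], (3.2, 1.1, 7.4))],
    landmarks=["main_entrance"],
    texts=["EXIT"],
    camera_pose_6dof=(1.0, 0.0, 1.5, 0.0, 0.0, 90.0),
    gps=(37.5665, 126.9780),
)
record = asdict(meta)  # plain dict, ready to be serialized into a database row
```

Keeping only feature points and texts that survived the dynamic object filtering in such a record is what keeps the stored metadata compact.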

In one embodiment, in the process of receiving and processing a query image, 2D pixel coordinate values of feature points included in the sixth output 920 and/or descriptor values of the feature points may be stored as metadata. Additionally, information corresponding to the landmark included in the third output 620 may be stored as metadata. Additionally, text information not included in the dynamic object, which is included in the fifth output 820, may be stored as metadata. Additionally, the camera pose and GPS information of the query image corresponding to the input image 200 may be stored as metadata. In addition, preliminary estimation information of the query image corresponding to the input image 200 (e.g., preliminary estimation information about the camera pose of the query image estimated based on IMU information of the device that captured the query image) may be stored as metadata. At least some of the metadata of the query image can be used to quickly reduce the candidate group of reference images in the database for performing feature point matching. When performing feature point matching between reference image candidates and a query image, pixel coordinates and descriptors of feature points extracted from the two images can be used. Camera pose estimation for the query image can then be performed using the 3D coordinate values of feature points of the reference image that best matches the query image (i.e., has the highest correspondence), the matching information between feature points, the pixel coordinate values of feature points of the query image, and the like.
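Selecting the best-matching reference image by feature point matching can be sketched as follows. This is an illustration under stated assumptions: descriptors are compared by Euclidean distance, a match is accepted with a nearest-neighbour ratio test, and "highest correspondence" is taken to mean the largest number of accepted matches. The disclosure does not name a matching rule, so the ratio test, the `ratio` value, and the toy two-dimensional descriptors are swapped-in choices.

```python
import math


def match_descriptors(query_desc, ref_desc, ratio=0.8):
    """Return (query_index, ref_index) pairs that pass a nearest-neighbour ratio test."""
    matches = []
    for qi, q in enumerate(query_desc):
        dists = sorted((math.dist(q, r), ri) for ri, r in enumerate(ref_desc))
        if not dists:
            continue
        # Accept only when the best match is clearly closer than the runner-up.
        if len(dists) == 1 or dists[0][0] < ratio * dists[1][0]:
            matches.append((qi, dists[0][1]))
    return matches


def best_reference(query_desc, candidates):
    """Pick the candidate reference image with the most descriptor matches."""
    scored = {ref_id: match_descriptors(query_desc, desc)
              for ref_id, desc in candidates.items()}
    best_id = max(scored, key=lambda k: len(scored[k]))
    return best_id, scored[best_id]


# Hypothetical 2-D descriptors; real descriptors have many more dimensions.
query_desc = [(0.0, 1.0), (1.0, 0.0)]
candidates = {
    "ref_a": [(0.0, 0.9), (1.0, 0.1), (0.5, 0.5)],
    "ref_b": [(0.5, 0.5), (0.5, 0.6)],
}
best_id, matches = best_reference(query_desc, candidates)
```

The resulting index pairs link each query pixel coordinate to a reference feature point and hence to its stored 3D coordinate; those 2D-3D pairs would then be passed to a pose estimation algorithm, which this sketch does not implement.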

A component, module, or unit in the present disclosure includes routines, procedures, programs, components, data structures, etc. that perform a specific task or implement a specific abstract data type. Additionally, those of ordinary skill in the art will fully appreciate that the methods presented in this disclosure may be implemented not only on uni-processor or multiprocessor computing devices, minicomputers, and mainframe computers, but also with other computer system configurations, including personal computers, handheld computing devices, and microprocessor-based or programmable consumer electronics (each of which may operate in conjunction with one or more associated devices).

Embodiments described in this disclosure can also be practiced in distributed computing environments where certain tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

Computing devices typically include a variety of computer-readable media. Computer-readable media can be any medium that can be accessed by a computer, and include volatile and non-volatile media, transitory and non-transitory media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may include computer-readable storage media and computer-readable transmission media.

Computer-readable storage media include volatile and non-volatile media, transitory and non-transitory media, and removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital video disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be accessed by a computer and used to store desired information.

A computer-readable transmission medium typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes all information delivery media. The term modulated data signal refers to a signal in which one or more of the characteristics of the signal have been set or changed to encode information within the signal. By way of example, and not limitation, computer-readable transmission media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above are also intended to be included within the scope of computer-readable transmission media.

An example environment 2000 that implements various aspects of the invention is shown, including a computer 2002, which includes a processing unit 2004, a system memory 2006, and a system bus 2008. Computer 2002 herein may be used interchangeably with computing device. System bus 2008 couples system components, including but not limited to system memory 2006, to processing unit 2004. Processing unit 2004 may be any of a variety of commercially available processors. Dual processors and other multiprocessor architectures may also be used as processing unit 2004.

System bus 2008 may be any of several types of bus structures that may further be interconnected to a memory bus, peripheral bus, and local bus using any of a variety of commercial bus architectures. System memory 2006 includes read only memory (ROM) 2010 and random access memory (RAM) 2012. The basic input/output system (BIOS) is stored in non-volatile memory 2010, such as ROM, EPROM, or EEPROM, and contains the basic routines that help transfer information between components within the computer 2002, such as during startup. RAM 2012 may also include high-speed RAM, such as static RAM for caching data.

Computer 2002 also includes an internal hard disk drive (HDD) 2014 (e.g., EIDE, SATA), a magnetic floppy disk drive (FDD) 2016 (e.g., for reading from or writing to a removable diskette 2018), an SSD, and an optical disk drive 2020 (e.g., for reading a CD-ROM disk 2022 or for reading from or writing to other high-capacity optical media, such as DVDs). The hard disk drive 2014, magnetic disk drive 2016, and optical disk drive 2020 may be connected to the system bus 2008 by a hard disk drive interface 2024, a magnetic disk drive interface 2026, and an optical drive interface 2028, respectively. The interface 2024 for implementing an external drive includes, for example, at least one or both of Universal Serial Bus (USB) and IEEE 1394 interface technologies.

These drives and their associated computer-readable media provide non-volatile storage of data, data structures, computer-executable instructions, and the like. For the computer 2002, the drives and media accommodate the storage of any data in a suitable digital format. Although the description of computer-readable storage media above refers to HDDs, removable magnetic disks, and removable optical media such as CDs or DVDs, those skilled in the art will appreciate that other types of computer-readable storage media, such as zip drives, magnetic cassettes, flash memory cards, and cartridges, may also be used in the exemplary operating environment, and that any such media may contain computer-executable instructions for performing the methods of the invention.

A number of program modules may be stored in the drive and RAM 2012, including an operating system 2030, one or more application programs 2032, other program modules 2034, and program data 2036. All or portions of the operating system, applications, modules and/or data may also be cached in RAM 2012. It will be appreciated that the invention may be implemented on various commercially available operating systems or combinations of operating systems.

A user may enter commands and information into the computer 2002 through one or more wired/wireless input devices, such as a keyboard 2038 and a pointing device such as a mouse 2040. Other input devices (not shown) may include microphones, IR remote controls, joysticks, game pads, stylus pens, touch screens, etc. These and other input devices are often connected to the processing unit 2004 through an input device interface 2042 that is connected to the system bus 2008, but may also be connected by other interfaces, such as a parallel port, an IEEE 1394 serial port, a game port, a USB port, or an IR interface.

A monitor 2044 or other type of display device is also connected to system bus 2008 through an interface, such as a video adapter 2046. In addition to the monitor 2044, computers typically include other peripheral output devices (not shown) such as speakers, printers, etc.

Computer 2002 may operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 2048, via wired and/or wireless communications. Remote computer(s) 2048 may be a workstation, server computer, router, personal computer, portable computer, microprocessor-based entertainment device, peer device, or other conventional network node, and generally includes many or all of the components described for computer 2002, although, for simplicity, only memory storage device 2050 is shown. The logical connections depicted include wired/wireless connections to a local area network (LAN) 2052 and/or a larger network, such as a wide area network (WAN) 2054. These LAN and WAN networking environments are common in offices and companies and facilitate enterprise-wide computer networks, such as intranets, all of which can be connected to a worldwide computer network, such as the Internet.

When used in a LAN networking environment, computer 2002 is connected to local network 2052 through a wired and/or wireless communications network interface or adapter 2056. Adapter 2056 may facilitate wired or wireless communication to LAN 2052, which also includes a wireless access point installed thereon for communicating with wireless adapter 2056. When used in a WAN networking environment, the computer 2002 may include a modem 2058, may be connected to a communication server on the WAN 2054, or may have other means of establishing communication over the WAN 2054, such as via the Internet. Modem 2058, which may be internal or external and a wired or wireless device, is coupled to system bus 2008 via serial port interface 2042. In a networked environment, program modules described for computer 2002, or portions thereof, may be stored in remote memory/storage device 2050. It will be appreciated that the network connections shown are exemplary and that other means of establishing a communications link between computers may be used.

Computer 2002 operates to communicate with any wireless device or entity deployed and operating in wireless communication, such as a printer, scanner, desktop and/or portable computer, portable data assistant (PDA), communications satellite, any device or location associated with a wirelessly detectable tag, and a telephone. This includes at least Wi-Fi and Bluetooth wireless technologies. Accordingly, communication may be a predefined structure as in a conventional network or may simply be ad hoc communication between at least two devices.

It is to be understood that the specific order or hierarchy of steps in the processes presented is an example of illustrative approaches. It is to be understood that the specific order or hierarchy of steps in processes may be rearranged within the scope of the present disclosure, based on design priorities. The method claims of this disclosure provide elements of the various steps in a sample order but are not meant to be limited to the specific order or hierarchy presented.

As described above, the relevant content has been described in the best form for carrying out the invention.

The present disclosure can be used in computing devices, systems, and the like for visual localization.

Claims

A method performed by a computing device, comprising:

Receiving a query image and additional information about the device from a client device;

Extracting first feature points for the query image from the query image using an artificial intelligence-based feature point acquisition model that receives the query image;

Obtaining a first detection result corresponding to a predetermined dynamic object in the query image from the query image using an artificial intelligence-based dynamic object detection model that receives the query image;

Recognizing a first text existing in the query image from the query image using a text recognition model that receives the query image;

Determining, based on a first comparison result between the predetermined dynamic object included in the first detection result and the first feature points and a second comparison result between the predetermined dynamic object included in the first detection result and the first text, 1-2 feature points to be removed from the query image and 1-1 feature points to be maintained on the query image from among the first feature points, and a 1-2 text to be removed from the query image and a 1-1 text to be maintained on the query image from among the first text - wherein a text included in a detection area corresponding to the predetermined dynamic object among the first text is determined as the 1-2 text to be removed from the query image -; and

Determining at least one candidate reference image to be compared with the query image by comparing the 1-1 text on the query image with text information assigned as metadata to a reference image, and performing visual localization on the device that captured the query image based at least in part on the determined candidate reference image, the 1-1 feature points on the query image, and the additional information of the device;

Including,

method.
According to claim 1,

The additional information of the device includes location information of the device, and

If there is a record of estimating the past camera pose of the device, the additional information of the device includes preliminary estimation information about the camera pose of the device calculated from the value of the IMU (Inertial Measurement Unit) of the device,

method.
According to claim 2,

The steps of performing visual localization on the device that captured the query image are:

Using additional information from the device, determining candidate reference images that are subject to feature point matching among a plurality of reference images in a database; and

performing visual localization including pose estimation for a camera of the device based on feature point matching between the determined candidate reference images and the query image;

Including,

method.
According to claim 2,

The additional information of the device further includes camera information of the device, and

Further based on the camera information of the device, a pose estimation algorithm used to perform the visual localization among a plurality of pose estimation algorithms is determined.

method.
According to claim 1,

Detecting a predetermined first landmark present in the query image from the query image using an artificial intelligence-based landmark detection model;

It further includes, and

The step of performing visual localization on the device that captured the query image is:

Including a step of performing visual localization on the device that captured the query image based at least in part on the determined candidate reference image, the 1-1 feature points on the query image, the additional information of the device, and the first landmark,

method.
According to claim 5,

The steps of performing visual localization on the device that captured the query image are:

determining candidate reference images that are subject to feature point matching among a plurality of reference images in a database, based additionally on the additional information of the device and the first landmark; and

performing visual localization including pose estimation for a camera of the device based on feature point matching between the determined candidate reference images and the query image;

Including,

method.
According to claim 1,

The steps of performing visual localization on the device that captured the query image are:

Determining relative camera pose information for the query image according to a comparison result between 3D camera coordinates assigned to a pose reference image, which is determined from among the plurality of reference images based on matching between the first feature points of the query image and second feature points of the reference image, and 2D coordinates of the 1-1 feature points of the query image;

determining absolute camera pose information of the query image with respect to an origin, based on 3D camera coordinates assigned to the pose reference image and the relative camera pose information; and

performing visual localization on the device that captured the query image based on the absolute camera pose information;

Including,

method.
According to claim 1,

The step of extracting first feature points for the query image from the query image includes:

It includes obtaining pixel coordinates and a descriptor corresponding to the first feature points, and

The step of performing visual localization on the device that captured the query image is:

Comprising a step of performing visual localization on the device that captured the query image by comparing a descriptor corresponding to the first feature points of the query image with a descriptor corresponding to the second feature points of a reference image stored in a database,

method.
According to claim 8,

The steps of performing visual localization on the device that captured the query image are:

Determining a pose reference image for visual localization for the device from among a plurality of reference images by comparing the descriptor corresponding to the first feature points of the query image with the descriptor corresponding to the second feature points of the reference image stored in the database; and

Performing visual localization including camera pose estimation for the device based on 3D coordinate information mapped to the pose reference image, matching information between the first feature points and the second feature points, and pixel coordinates corresponding to the first feature points of the query image;

Including,

method.
According to claim 1,

extracting second feature points for the reference image from the reference image using the feature point acquisition model;

Using the dynamic object detection model, obtaining a second detection result corresponding to a predetermined dynamic object in the reference image from the reference image; and

Determining, based on a comparison result between the second detection result and the second feature points, 2-2 feature points to be removed from the reference image and 2-1 feature points to be maintained on the reference image among the second feature points;

Further including,

method.
According to claim 1,

The step of determining, based on a comparison result between the first detection result and the first feature points, 1-2 feature points to be removed from the query image and 1-1 feature points to be maintained on the query image among the first feature points is:

Determining first feature points included in the detection area corresponding to the predetermined dynamic object included in the first detection result as 1-2 feature points to be removed from the query image, and determining first feature points not included in the detection area corresponding to the predetermined dynamic object included in the first detection result as 1-1 feature points maintained on the query image;

Including,

method.
A computer program stored in a computer-readable storage medium, wherein the computer program, when executed by a computing device, causes the computing device to perform the following operations, the operations being:

Receiving a query image and additional information about the device from a client device;

extracting first feature points for the query image from the query image using an artificial intelligence-based feature point acquisition model that receives the query image;

Obtaining a first detection result corresponding to a predetermined dynamic object within the query image from the query image using an artificial intelligence-based dynamic object detection model that receives the query image;

Recognizing a first text present in the query image from the query image using a text recognition model that receives the query image;

An operation of determining, based on a first comparison result between the predetermined dynamic object included in the first detection result and the first feature points and a second comparison result between the predetermined dynamic object included in the first detection result and the first text, 1-2 feature points to be removed from the query image and 1-1 feature points to be maintained on the query image from among the first feature points, and a 1-2 text to be removed from the query image and a 1-1 text to be maintained on the query image from among the first text - wherein a text included in a detection area corresponding to the predetermined dynamic object among the first text is determined as the 1-2 text to be removed from the query image -; and

An operation of determining at least one candidate reference image to be compared with the query image by comparing the 1-1 text on the query image with text information assigned as metadata to a reference image, and performing visual localization on the device that captured the query image based at least in part on the determined candidate reference image, the 1-1 feature points on the query image, and the additional information of the device;

Including,

A computer program stored on a computer-readable storage medium.
As a computing device,

at least one processor; and

Memory;

Includes,

The at least one processor:

Receiving a query image and additional information about the device from a client device;

extracting first feature points for the query image from the query image using an artificial intelligence-based feature point acquisition model that receives the query image;

Obtaining a first detection result corresponding to a predetermined dynamic object within the query image from the query image using an artificial intelligence-based dynamic object detection model that receives the query image;

Recognizing a first text present in the query image from the query image using a text recognition model that receives the query image;

An operation of determining, based on a first comparison result between the predetermined dynamic object included in the first detection result and the first feature points and a second comparison result between the predetermined dynamic object included in the first detection result and the first text, 1-2 feature points to be removed from the query image and 1-1 feature points to be maintained on the query image from among the first feature points, and a 1-2 text to be removed from the query image and a 1-1 text to be maintained on the query image from among the first text - wherein a text included in a detection area corresponding to the predetermined dynamic object among the first text is determined as the 1-2 text to be removed from the query image -; and

An operation of determining at least one candidate reference image to be compared with the query image by comparing the 1-1 text on the query image with text information assigned as metadata to a reference image, and performing visual localization on the device that captured the query image based at least in part on the determined candidate reference image, the 1-1 feature points on the query image, and the additional information of the device;

To perform,

Computing device.