WO2022203342A1 - Method for processing image acquired from imaging device linked with computing device, and system using same

Info

Publication number: WO2022203342A1
Application number: PCT/KR2022/003965
Authority: WO
Inventors: 이충열
Original assignee: 이충열
Priority date: 2021-03-22
Filing date: 2022-03-22
Publication date: 2022-09-29

Abstract

An image processing method performed by a computing device including a processor according to several embodiments disclosed herein comprises the steps of: acquiring an image; acquiring, from the image, analysis information corresponding to an object included in the image, wherein the analysis information is acquired using an object analysis model; acquiring posture information about the object from the analysis information corresponding to the object by using a posture discrimination model; and acquiring behavioral information about the object from the posture information about the object by using a behavior discrimination model, the behavioral information including an N-dimensional movement vector corresponding to the object.

Description

A method for processing an image obtained from a photographing device interworking with a computing device and a system using the same

The present disclosure discloses a method for processing an image obtained from a photographing device and a system using the same. Specifically, according to the method of the present disclosure, the computing device acquires an entire image from a photographing device that is integrated with or interworks with the computing device, detects one or more objects appearing in the entire image, classifies each detected object to calculate its category, and generates, as a result of analyzing each object, detailed classification information including the characteristics and state of the object.

A portable computing apparatus generally refers to a device equipped with a processor, a display, a microphone, and a speaker, and some of them may be used as a portable terminal, which is a type of communication device. The mobile terminal has conventionally received a user's command through an input device that requires user contact, such as a keypad and a touch display. Through this, it is possible to receive a user's command from a distance and to interact according to the command.

However, since the portable computing device does not have its own mobility, the range of input/output is limited according to its physical location.

As an example of the physical limitation of an input device mounted on the portable computing device, the camera, which is a non-contact input device, is mounted on the front or back of the mobile phone and has a limited field of view. For the camera to capture a desired image, the user must hold the device by hand and manually change the composition (angle of view or field of view; FOV). As another example of such a physical limitation, the reception sensitivity of a microphone mounted on a portable computing device decreases depending on the direction of the sound source.

Not only the input but also the output is limited by physical location. As an example, the touch display, which is an output device, is attached to only some of the six surfaces of the mobile terminal, whose shape is close to a plane, so the user must hold the mobile terminal and use it with the touch display facing the user's face. As another example, in an infrared projector for obtaining a three-dimensional shape, the irradiation direction and angle of the infrared rays are limited to a range in one direction from the mobile terminal, so the user must hold the mobile terminal and adjust the irradiation direction according to guidance.

In this regard, Korean Patent Laid-Open Publication Nos. 10-2011-0032244, 10-2019-0085464, 10-2019-0074011, 10-2019-0098091, 10-2019-0106943, and 10-2018-0109499 have been devised, as have the following non-patent documents:

(Non-Patent Document 1) Zhou, Y., et al. (2019). Learning to Reconstruct 3D Manhattan Wireframes From a Single Image. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South) (pp. 7697-7706). https://doi.org/10.1109/ICCV.2019.00779
(Non-Patent Document 2) Shin, D., & Kim, I. (2018). Deep Image Understanding Using Multilayered Contexts. Mathematical Problems in Engineering, 2018, 1-11. https://doi.org/10.1155/2018/5847460
(Non-Patent Document 3) Mo, K., Zhu, S., Chang, A. X., Yi, L., Tripathi, S., Guibas, L. J., & Su, H. (2019). PartNet: A Large-Scale Benchmark for Fine-Grained and Hierarchical Part-Level 3D Object Understanding. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. https://doi.org/10.1109/cvpr.2019.00100
(Non-Patent Document 4) Babaee, M., Li, L., & Rigoll, G. (2019). Person identification from partial gait cycle using fully convolutional neural networks. Neurocomputing, 338, 116-125.
(Non-Patent Document 5) Muhammad, U. R., Svanera, M., Leonardi, R., & Benini, S. (2018). Hair detection, segmentation, and hairstyle classification in the wild. Image and Vision Computing, 71, 25-37. https://doi.org/10.1016/j.imavis.2018.02.001
(Non-Patent Document 6) Mougeot, G., Li, D., & Jia, S. (2019). A Deep Learning Approach for Dog Face Verification and Recognition. In PRICAI 2019: Trends in Artificial Intelligence (pp. 418-430). Springer International Publishing. https://doi.org/10.1007/978-3-030-29894-4_34
(Non-Patent Document 7) Raduly, Z., Sulyok, C., Vadaszi, Z., & Zolde, A. (2018). Dog Breed Identification Using Deep Learning. In 2018 IEEE 16th International Symposium on Intelligent Systems and Informatics (SISY). IEEE. https://doi.org/10.1109/sisy.2018.8524715
(Non-Patent Document 8) Wu, Z., Yao, T., Fu, Y., & Jiang, Y.-G. (2017). Deep learning for video classification and captioning. In Frontiers of Multimedia Research (pp. 3-29). ACM. https://doi.org/10.1145/3122865.3122867
(Non-Patent Document 9) Wu, C.-Y., Girshick, R., He, K., Feichtenhofer, C., & Krahenbuhl, P. (2020). A Multigrid Method for Efficiently Training Video Models. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. https://doi.org/10.1109/cvpr42600.2020.00023
(Non-Patent Document 10) Ullah, A., Ahmad, J., Muhammad, K., Sajjad, M., & Baik, S. W. (2018). Action Recognition in Video Sequences using Deep Bi-Directional LSTM With CNN Features. IEEE Access, 6, 1155-1166. https://doi.org/10.1109/access.2017.2778011

The present disclosure aims to suggest a method capable of overcoming the limitations regarding image input among the aforementioned limitations of prior-art portable computing devices. Specifically, it proposes a technical method in which a gimbal is combined with an image capturing device, such as a camera, that interworks with the portable computing device, and the computing device controls the gimbal to rotate the image capturing device about one or more axes, so that the image capturing device can receive and process images more dynamically.

The present disclosure aims to solve the problems of the prior art by presenting an image processing method that enables a portable computing device to recognize and track an object in an image captured by a camera or the like and to actively acquire information about the object and its environment. In particular, the method enables remote input through an image, for example by identifying relative position information and spatial information of an object in a coordinate system centered on the system of the present disclosure and thereby identifying the position of the object in space.

A method of processing an image performed by a computing device including a processor, the method comprising: acquiring an image; obtaining, from the image, analysis information corresponding to the object included in the image by using the object analysis model; obtaining posture information about the object from the analysis information corresponding to the object by using the posture determination model; and obtaining, from the posture information on the object, the behavior information on the object, including an N-dimensional motion vector corresponding to the object, by using the behavior discrimination model.
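The steps above can be read as a staged pipeline. The following is a minimal sketch under stated assumptions: the three `toy_*` functions are hypothetical stand-ins for the object analysis, posture determination, and behavior discrimination models (the disclosure does not specify their internals here), and the data classes and field names are illustrative only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Analysis:
    category: str                           # classification information (assumed field)
    box: Tuple[float, float, float, float]  # location as (x, y, width, height)


@dataclass
class Posture:
    label: str


@dataclass
class Behavior:
    label: str
    movement: Tuple[float, ...]             # N-dimensional movement vector


def process_image(image, analyze, estimate_posture, classify_behavior):
    """Run the stages in the order stated in the method, once per detected object."""
    results = []
    for analysis in analyze(image):            # image -> analysis information
        posture = estimate_posture(analysis)   # analysis -> posture information
        results.append(classify_behavior(posture))  # posture -> behavior information
    return results


# Hypothetical stand-in models, used only to make the sketch runnable.
def toy_analyze(image):
    return [Analysis("person", (10.0, 20.0, 40.0, 80.0))]


def toy_posture(analysis):
    _, _, width, height = analysis.box
    return Posture("upright" if height > width else "horizontal")


def toy_behavior(posture):
    return Behavior("idle", (0.0, 0.0))


behaviors = process_image(None, toy_analyze, toy_posture, toy_behavior)
```

Passing the models as arguments keeps the order of the stages fixed while leaving each model replaceable.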

Alternatively, the analysis information may include at least one of classification information indicating the category of the object, location information indicating the location of the object, or importance information indicating the priority of the object in the image.

Alternatively, in the posture determination model, a posture determination method to be applied to the object among a plurality of posture determination methods may be determined based on the analysis information corresponding to the object.

Alternatively, the posture determination model may apply a different posture determination method according to classification information indicating the category of the object.
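One way to picture the selection of a posture determination method by category is a lookup table with a fallback. This is only a sketch: the category names, the method functions, and the dictionary-shaped analysis information are all assumptions made for illustration.

```python
# Hypothetical posture determination methods; each returns a label only.
def human_posture(analysis):
    return "human-skeleton method"


def quadruped_posture(analysis):
    return "quadruped-skeleton method"


def generic_posture(analysis):
    return "bounding-box method"


# Assumed mapping from object category to posture determination method.
POSTURE_METHODS = {
    "person": human_posture,
    "dog": quadruped_posture,
}


def select_posture_method(analysis):
    """Pick the method from the category in the analysis information."""
    return POSTURE_METHODS.get(analysis["category"], generic_posture)


method = select_posture_method({"category": "dog"})
```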

Alternatively, the posture determination model may generate posture information about the object, including information about an N-dimensional posture corresponding to the object, from the analysis information.

Alternatively, the posture information on the object may include information on the temporally continuous posture with respect to the object.

Alternatively, the movement vector may include a two-dimensional movement vector including at least one of a two-dimensional direction or a velocity of the object.
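As an illustration of a two-dimensional movement vector carrying both direction and speed, the sketch below derives them from two image positions of the same object and the time between them. The function name and the choice of pixel coordinates are assumptions, not part of the disclosure.

```python
import math


def movement_vector_2d(p0, p1, dt):
    """Return (unit direction, speed) for an object that moved from p0 to p1 in dt seconds."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    distance = math.hypot(dx, dy)
    if distance == 0:
        return (0.0, 0.0), 0.0          # no movement: direction is undefined, report zeros
    return (dx / distance, dy / distance), distance / dt


direction, speed = movement_vector_2d((0.0, 0.0), (3.0, 4.0), 0.5)
```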

Alternatively, the movement vector may include a three-dimensional movement vector of the object calculated based on at least one of a position, a speed, and an acceleration of a photographing device that has captured the image.
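A simple way to see why the motion of the photographing device matters is to add the device's own displacement back to the displacement observed relative to the device. The sketch below assumes, purely for illustration, that the device does not rotate during the interval, that its acceleration is constant, and that object positions relative to the device are already known in three dimensions; none of these assumptions come from the disclosure.

```python
def camera_displacement(velocity, acceleration, dt):
    """Displacement of the device over dt under constant acceleration: v*dt + a*dt^2/2."""
    return tuple(v * dt + 0.5 * a * dt * dt for v, a in zip(velocity, acceleration))


def object_velocity_3d(rel_before, rel_after, cam_velocity, cam_acceleration, dt):
    """Average object velocity in a fixed frame.

    rel_before / rel_after: object position relative to the device at the
    start and end of the interval (assumed known, device not rotating).
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    shift = camera_displacement(cam_velocity, cam_acceleration, dt)
    return tuple(
        (after - before + s) / dt
        for before, after, s in zip(rel_before, rel_after, shift)
    )


# An object that stays 1 m ahead of a device moving at 2 m/s is itself moving at 2 m/s.
velocity = object_velocity_3d((1.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                              (2.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)
```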

Alternatively, the behavior determination model may generate behavior information of the object with respect to the context of the object, including an N-dimensional behavior classification corresponding to the object, from the posture information of the object.

Alternatively, the context of the object may include at least one of the object's state or purpose of action.

Alternatively, the behavior classification may be calculated by identifying a position of a partial object included in the object, determining a posture of the object based on the position of the partial object, and determining the behavior classification of the object based on the posture of the object.
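The three-stage chain described here (partial-object positions, then posture, then behavior classification) can be sketched with simple hand-written rules. The part names, posture labels, behavior labels, and thresholds below are hypothetical; a real system would presumably use trained models rather than these rules.

```python
def posture_from_parts(parts):
    """Determine a coarse posture from positions of partial objects.

    parts: dict mapping a part name to its (x, y) image position (assumed format).
    """
    head, foot = parts["head"], parts["foot"]
    vertical = abs(head[1] - foot[1])
    horizontal = abs(head[0] - foot[0])
    return "upright" if vertical >= horizontal else "horizontal"


def behavior_from_postures(postures):
    """Classify behavior from a temporally ordered list of postures."""
    if not postures:
        raise ValueError("at least one posture is required")
    first, last = postures[0], postures[-1]
    if first == "upright" and last == "horizontal":
        return "lying down"
    if first == "horizontal" and last == "upright":
        return "getting up"
    return "other"


frames = [
    {"head": (50, 10), "foot": (52, 90)},   # head above foot: upright
    {"head": (20, 80), "foot": (95, 85)},   # head beside foot: horizontal
]
behavior = behavior_from_postures([posture_from_parts(f) for f in frames])
```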

Alternatively, the partial object may include at least one of a part of the object or an attribute belonging to the object.

Alternatively, the behavior discrimination model may apply a different behavior discrimination method according to classification information indicating the category of the object.

A non-transitory computer readable medium including a computer program for solving the above problems, the computer program causing a computing device to perform a method for processing an image, the method comprising: acquiring an image; obtaining, from the image, analysis information corresponding to the object included in the image by using the object analysis model; obtaining posture information about the object from the analysis information corresponding to the object by using the posture determination model; and obtaining, from the posture information on the object, the behavior information on the object, including an N-dimensional motion vector corresponding to the object, by using the behavior discrimination model.

A computing device for solving the above problems comprises: a processor; and a communication unit, wherein the processor acquires an image; obtains, from the image, analysis information corresponding to an object included in the image by using an object analysis model; obtains posture information about the object from the analysis information corresponding to the object by using a posture determination model; and obtains, from the posture information about the object, behavior information about the object, including an N-dimensional motion vector corresponding to the object, by using a behavior discrimination model.

According to an exemplary embodiment of the present disclosure, one or more objects can be recognized and tracked using an image, and information about objects and environments can be actively acquired. In particular, state information of an object can be acquired from an image at a distance, the object with which an object interacts can be determined from an image, a higher-resolution detailed image of a part of an object can be obtained from a distance, and characters printed on an object or output by other means such as a display can be recognized from a distance. As a result, remote input using an image becomes possible on a portable computing device.

The accompanying drawings used to describe the embodiments of the present invention show only some of the embodiments of the present invention, and a person of ordinary skill in the art to which the present invention belongs (hereinafter referred to as "a person skilled in the art") may obtain other drawings based on these drawings without inventive effort.

FIG. 1 is a conceptual diagram schematically illustrating an exemplary configuration of a computing device performing a method of processing an image (hereinafter referred to as an “image processing method”) by a computing device according to an embodiment of the present disclosure.

FIG. 2 is a conceptual diagram exemplarily illustrating an overall hardware and software architecture including a computing device, a photographing device, and a gimbal as a system for performing an image processing method according to an embodiment of the present disclosure.

FIG. 3 is a flowchart exemplarily illustrating an image processing method according to an embodiment of the present disclosure, and FIG. 4 is a block diagram exemplarily showing modules performing each step of the image processing method according to an embodiment of the present disclosure.

FIG. 5 is a block diagram exemplarily illustrating machine learning models used in modules for an image processing method according to an embodiment of the present disclosure.

FIGS. 6A to 6D are flowcharts exemplarily illustrating methods that may be used to detect a floor plane of an object in the image processing method of the present disclosure.

FIG. 7A is a diagram exemplarily illustrating object segmentation obtained by an image processing method according to an embodiment of the present disclosure.

FIG. 7B is an exemplary diagram for explaining steps of detecting two or more floor planes in an image processing method according to an embodiment of the present disclosure.

FIG. 7C is a conceptual diagram for explaining a method of generating and using a reference plane circle and a measurement plane circle in an image processing method according to an embodiment of the present disclosure.

FIGS. 8A to 8C are flowcharts exemplarily illustrating methods that may be used to determine a target position in the image processing method of the present disclosure.

FIGS. 9A and 9B are flowcharts exemplarily illustrating methods that may be used to control the orientation of a photographing apparatus in the image processing method of the present disclosure.

FIGS. 10A to 10C are flowcharts exemplarily illustrating methods that may be used to perform OCR in the image processing method of the present disclosure.

FIGS. 11A to 11D are diagrams for describing methods of performing OCR in the image processing method of the present disclosure.

All prior publications cited in this disclosure are incorporated by reference in their entirety as if they were all set forth in this disclosure. Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art. Terms such as those defined in a commonly used dictionary should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present specification.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The following detailed description of the present invention refers to the accompanying drawings, which show by way of illustration a specific embodiment in which the present invention may be practiced, in order to clarify the objects, technical solutions and advantages of the present invention. These embodiments are described in sufficient detail to enable those skilled in the art to practice the present invention. In the description with reference to the accompanying drawings, the same components are assigned the same reference numerals regardless of the drawing in which they appear, and overlapping descriptions thereof will be omitted.

Specific structural or functional descriptions of the embodiments are disclosed for purposes of illustration only, and may be changed and implemented in various forms. Accordingly, the embodiments are not limited to a specific disclosure form, and the scope of the present specification includes changes, equivalents, or substitutes included in the technical spirit.

Although terms such as "first" or "second" may be used to describe various elements, these terms should be interpreted only for the purpose of distinguishing one element from another, because they do not imply any order. For example, a first component may be termed a second component, and similarly, a second component may also be termed a first component.

When a component is referred to as being “connected” to another component, it may be directly connected or connected to the other component, but it should be understood that another component may exist in between.

The singular expression includes the plural expression unless the context clearly dictates otherwise. In the present specification, terms such as "comprise" or "have" are intended to designate that the described feature, number, step, operation, component, part, or combination thereof exists, but it should be understood that they do not preclude the possibility of the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Also, "part" or "portion" of an object may mean only a part, but not all, of the object, but should be understood to include the whole of the object unless the context dictates otherwise. This is the same as the concept that a subset of a set includes the set itself.

In the present disclosure, a 'module' may mean hardware capable of performing functions and operations according to each name described in the present disclosure, may mean computer program code capable of performing specific functions and operations, or may refer to a recording medium on which computer program code capable of performing a specific function and operation is loaded. In other words, a module may mean a functional and/or structural combination of hardware for carrying out the spirit of the present disclosure and/or software for driving the hardware.

Strictly speaking, a 'model' refers to a function configured to produce output data from input data as trained by machine learning. Such a 'model' may be used by the aforementioned 'module' as a kind of data structure or function.

However, some practicing engineers in fields where artificial intelligence is applied habitually mix the terms 'module' and 'model'. Accordingly, in this disclosure, 'module' and 'model' may be used interchangeably where possible, since those skilled in the art can readily understand them without confusing the two concepts.

In this disclosure, 'training' and 'learning' are terms that refer to performing machine learning through procedural computing, and those skilled in the art will understand that they are not intended to refer to mental actions such as human educational activities. As is common in the field of statistics, the term 'machine learning' is often used to refer to a series of processes that create a target function (f) that maps input variables (X) to output variables (Y). Calculating the output variables from the input variables by the target function is referred to as 'prediction', and 'mapped well' means that the difference between the true value and the predicted value is reasonably reduced. The difference is reasonably reduced rather than minimized because optimization may cause the so-called overfitting problem, that is, poor prediction when applied to real data that deviate from the training data, and appropriate empirical means are devised to address this.
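The distinction between minimizing and reasonably reducing the prediction error can be illustrated with a small, deterministic example. The sketch below is not taken from the disclosure: it compares a hypothetical model that memorizes the training data (zero training error) with an ordinary least-squares line, using made-up data generated from y = 2x with alternating noise of ±0.5.

```python
def fit_line(points):
    """Ordinary least-squares fit; returns a target function f(x) = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def memorize(points):
    """Return a function that answers with the y of the nearest training x."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mse(f, points):
    """Mean squared difference between predicted and true values."""
    return sum((f(x) - y) ** 2 for x, y in points) / len(points)


# Training data: y = 2x with alternating +0.5 / -0.5 noise (illustrative values).
train = [(0, 0.5), (1, 1.5), (2, 4.5), (3, 5.5), (4, 8.5), (5, 9.5)]
# Held-out data: noise-free samples of y = 2x at inputs not seen in training.
held_out = [(0.4, 0.8), (1.4, 2.8), (2.4, 4.8), (3.4, 6.8), (4.4, 8.8)]

line, lookup = fit_line(train), memorize(train)
errors = {
    "line": (mse(line, train), mse(line, held_out)),
    "memorized": (mse(lookup, train), mse(lookup, held_out)),
}
```

In this example the memorizing model has the smaller training error but the larger held-out error, which is the overfitting pattern the paragraph describes.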

Also, in the present disclosure, 'inference' is a term referring to a process of calculating output data from input data by a machine-learning model, and in particular, it is used to refer to a mechanical imitation of a human mental action. Similarly, in the present disclosure, 'analysis' by a machine is used to refer to a mechanical imitation of a human's mental action, such as reasoning.

In this disclosure, 'Manhattan space' is used in the sense described in the non-patent literature paper Y. Zhou et al., "Learning to Reconstruct 3D Manhattan Wireframes From a Single Image," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 7697-7706, doi: 10.1109/ICCV.2019.00779.

Moreover, the present invention encompasses all possible combinations of embodiments indicated in the present disclosure. It should be understood that various embodiments of the present invention are different but need not be mutually exclusive. For example, certain shapes, structures, and characteristics described herein in relation to one embodiment may be implemented in other embodiments without departing from the spirit and scope of the invention. In addition, it should be understood that the position or arrangement of individual components in each disclosed embodiment may be changed without departing from the spirit and scope of the present invention. Accordingly, the detailed description set forth below is not intended to be taken in a limiting sense. Like reference numerals in the drawings refer to the same or similar functions throughout the various aspects.

Unless otherwise indicated herein or clearly contradicted by context, items referred to in the singular encompass the plural. In addition, in describing the present invention, if it is determined that a detailed description of a related known configuration or function may obscure the gist of the present invention, the detailed description thereof will be omitted.

Hereinafter, in order to enable those skilled in the art to easily practice the present invention, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram schematically illustrating an exemplary configuration of a computing device performing an image processing method according to an embodiment of the present disclosure.

Referring to FIG. 1, a computing device 100 according to an embodiment of the present disclosure includes a communication unit 110 and a processor 120, and may communicate directly or indirectly with an external computing device (not shown) through the communication unit 110.

Specifically, the computing device 100 may achieve the desired system performance by using a combination of typical computer hardware (e.g., a computer; a device that may include a processor, memory, storage, input and output devices, and other components of a conventional computing device; electronic communication devices such as routers and switches; electronic information storage systems such as network-attached storage (NAS) and storage area networks (SAN)) and computer software (i.e., instructions that cause the computing device to function in a specific manner). The storage may include a storage device such as a hard disk or a Universal Serial Bus (USB) memory, as well as network-based storage such as a cloud server. Here, the memory may be DDR2, DDR3, DDR4, SDP, DDP, QDP, magnetic hard disk, flash memory, etc., but is not limited thereto.

The communication unit 110 of such a computing device may transmit and receive requests and responses to and from another interworking computing device, for example, a mobile terminal. As an example, such a request and response may be made over the same Transmission Control Protocol (TCP) session, but are not limited thereto, and may be transmitted and received as, for example, User Datagram Protocol (UDP) datagrams.

Specifically, the communication unit 110 may be implemented in the form of a communication module including a communication interface. For example, the communication interface may include wireless Internet interfaces such as WLAN (Wireless LAN), WiFi (Wireless Fidelity) Direct, DLNA (Digital Living Network Alliance), WiBro (Wireless Broadband), WiMax (World Interoperability for Microwave Access), HSDPA (High Speed Downlink Packet Access), 4G, and 5G, and short-range communication interfaces such as Bluetooth™, RFID (Radio Frequency IDentification), IrDA (Infrared Data Association), UWB (Ultra-WideBand), ZigBee, and NFC (Near Field Communication). In addition, the communication interface may represent any interface (e.g., a wired interface) capable of performing communication with the outside.

For example, the communication unit 110 may transmit and receive data to and from other computing devices through an appropriate communication interface as described above. In addition, in a broad sense, the communication unit 110 may include, or may be linked with, external input devices for receiving commands or instructions, such as a keyboard, a mouse, a touch sensor, an input unit of a touch screen, a microphone, a video camera, a LIDAR, a radar, a switch, a button, and a joystick, and external output devices such as a sound card, a graphics card, a printing device, and a display, for example, a display unit of a touch screen. To enable interaction with a user of the computing device, for example a portable terminal, by displaying an appropriate user interface, the computing device 100 may have a built-in display device or may be linked with an external display device through the communication unit 110. For example, such a display device may be a touch screen capable of receiving touch input. A touch screen may detect an object, such as a finger or a stylus pen, in contact with or in proximity to the display, capacitively, inductively, or optically, and determine the position of the detected object on the display.

The input device may include a microphone. The type of the microphone may include a dynamic microphone, a condenser microphone, and the like, and a microphone having characteristics such as omni-directional, unidirectional, and super-directional may be used. A beamforming microphone and a microphone array may also be used, but are not limited thereto. A microphone array refers to two or more microphones used to detect the direction of a sound source.
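To make the role of a microphone array concrete, the sketch below estimates the direction of a sound source from the arrival-time difference between two microphones. It is an illustration only and relies on assumptions not stated in the disclosure: a far-field source, exactly two microphones, a known spacing, and a speed of sound of about 343 m/s. Under those assumptions the angle from the broadside direction satisfies sin(theta) = c * delay / spacing.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees Celsius


def arrival_angle_deg(delay_s, spacing_m, speed=SPEED_OF_SOUND_M_S):
    """Angle of the source from broadside, in degrees, for a two-microphone array.

    delay_s: arrival-time difference between the two microphones (seconds).
    spacing_m: distance between the microphones (metres).
    """
    if spacing_m <= 0:
        raise ValueError("spacing must be positive")
    ratio = speed * delay_s / spacing_m
    ratio = max(-1.0, min(1.0, ratio))   # clamp measurement noise into asin's domain
    return math.degrees(math.asin(ratio))


# A delay equal to half the maximum possible delay corresponds to 30 degrees.
angle = arrival_angle_deg(0.5 * 0.1 / SPEED_OF_SOUND_M_S, 0.1)
```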

The output device may include a speaker. The type of speaker may include, but is not limited to, an omni-directional speaker, a directional speaker, and a super-directional speaker using ultrasonic waves.

In addition, the processor 120 of the computing device may include a hardware configuration such as a micro processing unit (MPU), a central processing unit (CPU), a graphics processing unit (GPU), a neural processing unit (NPU), an ASIC, CISC, RISC, FPGA, or SoC chip, or a tensor processing unit (TPU), as well as a cache memory and a data bus. It may further include a software configuration such as an operating system and an application for performing a specific purpose. According to an embodiment of the present disclosure, the processor 120 may perform operations for training neural networks of various models. The processor 120 may perform calculations for training a neural network, such as processing input data for learning in deep learning (DL), extracting features from the input data, calculating an error, and updating the weights of the neural network using backpropagation. At least one of a CPU, a general-purpose graphics processing unit (GPGPU), and/or a TPU of the processor 120 may process learning of a network function. For example, the CPU and the GPGPU can together process learning of a network function and data classification using the network function. Also, in an embodiment of the present disclosure, learning of a network function and data classification using the network function may be processed by using the processors of a plurality of computing devices together. In addition, the computer program executed in the computing device 100 according to an embodiment of the present disclosure may be a CPU, GPGPU, or TPU executable program.
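The training calculations listed above (forward computation, error calculation, weight update) can be shown in their simplest form with a single linear node trained by gradient descent. This is a sketch under assumptions, not the training procedure of the disclosure: the model, the learning rate, the number of epochs, and the sample data are all chosen for illustration, and with only one node the backpropagated gradient reduces to a one-step chain rule.

```python
def train_linear_node(samples, learning_rate=0.1, epochs=500):
    """Fit prediction = w * x + b to (x, y) samples with per-sample gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            prediction = w * x + b          # forward computation
            error = prediction - y          # error against the true value
            # Gradients of 0.5 * error**2 with respect to w and b (chain rule).
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Illustrative samples drawn from y = 2x + 1.
weight, bias = train_linear_node([(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)])
```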

Throughout this specification, computational model, neural network, and network function may be used interchangeably. A neural network may be composed of a set of interconnected computational units, which may generally be referred to as nodes. These nodes may also be referred to as neurons. A neural network is configured to include one or more nodes. Nodes (or neurons) constituting the neural networks may be interconnected by one or more links.

In the neural network, one or more nodes connected through a link may relatively form a relationship between an input node and an output node. The concepts of an input node and an output node are relative, and any node in an output node relationship with respect to one node may be in an input node relationship in a relationship with another node, and vice versa. As described above, an input node-to-output node relationship may be created around a link. One or more output nodes may be connected to one input node through a link, and vice versa.

In the relationship between the input node and the output node connected through one link, the value of the data of the output node may be determined based on data input to the input node. Here, a link interconnecting the input node and the output node may have a weight. The weight may be variable, and may be changed by the user or an algorithm in order for the neural network to perform a desired function. For example, when one or more input nodes are interconnected to one output node by respective links, the output node may determine its value based on the values input to the input nodes connected to it and the weights set on the links corresponding to the respective input nodes.
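A common concrete form of this relationship is a weighted sum of the input values followed by an activation function. The sketch below assumes that form; the bias term and the choice of tanh as the default activation are illustrative additions that the text above does not mention.

```python
import math


def node_output(inputs, weights, bias=0.0, activation=math.tanh):
    """Value of an output node given input-node values and the weights of the links."""
    if len(inputs) != len(weights):
        raise ValueError("each input node needs exactly one link weight")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


# Two input nodes with link weights 0.5 and -0.25; the identity activation
# exposes the raw weighted sum: 1.0 * 0.5 + 2.0 * (-0.25) = 0.0.
value = node_output([1.0, 2.0], [0.5, -0.25], activation=lambda z: z)
```

Changing a weight changes the output for the same inputs, which is why two networks with identical nodes and links but different weights behave as different networks.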

As described above, in a neural network, one or more nodes are interconnected through one or more links to form an input node and an output node relationship in the neural network. The characteristics of the neural network may be determined according to the number of nodes and links in the neural network, the correlation between the nodes and the links, and the value of a weight assigned to each of the links. For example, when the same number of nodes and links exist and there are two neural networks having different weight values of the links, the two neural networks may be recognized as different from each other.

A neural network may consist of a set of one or more nodes. A subset of nodes constituting the neural network may constitute a layer. Some of the nodes constituting the neural network may configure one layer based on distances from the initial input node. For example, the set of nodes at a distance n from the initial input node may constitute the n-th layer. The distance from the initial input node may be defined by the minimum number of links that must be traversed to reach the corresponding node from the initial input node. However, the definition of such a layer is arbitrary for description, and the order of the layer in the neural network may be defined in a different way from the above. For example, a layer of nodes may be defined by a distance from the final output node.
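Because the distance is defined as the minimum number of links from an initial input node, layers under this definition can be computed with a breadth-first search. The sketch below assumes the network is given as a list of directed (input node, output node) links; the node names are made up for illustration.

```python
from collections import deque


def layers_by_distance(links, initial_inputs):
    """Group nodes by the minimum number of links from any initial input node."""
    successors = {}
    for source, target in links:
        successors.setdefault(source, []).append(target)

    distance = {node: 0 for node in initial_inputs}
    queue = deque(initial_inputs)
    while queue:                              # breadth-first search
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in distance:           # first visit gives the minimum distance
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    layers = {}
    for node, d in distance.items():
        layers.setdefault(d, []).append(node)
    return {d: sorted(nodes) for d, nodes in sorted(layers.items())}


links = [("x1", "h1"), ("x2", "h1"), ("x1", "h2"), ("h1", "y"), ("h2", "y")]
layers = layers_by_distance(links, ["x1", "x2"])
```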

The initial input node may mean one or more nodes, among the nodes in the neural network, to which data is directly input without passing through a link in a relationship with other nodes. Alternatively, in the link-based relationships between nodes in a neural network, it may mean nodes that have no other input nodes connected to them by a link. Similarly, the final output node may refer to one or more nodes, among the nodes in the neural network, that have no output node in relation to other nodes. In addition, a hidden node may mean a node constituting the neural network other than the initial input node and the final output node.

The neural network according to an embodiment of the present disclosure may be a neural network in which the number of nodes in the input layer is the same as the number of nodes in the output layer, and the number of nodes decreases and then increases again when progressing from the input layer through the hidden layers. In addition, the neural network according to another embodiment of the present disclosure may be a neural network in which the number of nodes in the input layer is less than the number of nodes in the output layer, and the number of nodes decreases when progressing from the input layer to the hidden layer. In addition, the neural network according to another embodiment of the present disclosure may be a neural network in which the number of nodes in the input layer is greater than the number of nodes in the output layer, and the number of nodes increases when progressing from the input layer to the hidden layer. The neural network according to another embodiment of the present disclosure may be a neural network in a combined form of the aforementioned neural networks.

A deep neural network (DNN) may refer to a neural network including a plurality of hidden layers in addition to an input layer and an output layer. Deep neural networks can be used to identify the latent structures of data. In other words, they can identify the latent structure of photos, texts, videos, voices, and music (e.g., what objects are in a photo, what the content and emotion of a text are, what the content and emotion of a voice are, etc.). Deep neural networks may include a convolutional neural network (CNN), a recurrent neural network (RNN), an autoencoder, a restricted Boltzmann machine (RBM), a deep belief network (DBN), a Q network, a U network, a Siamese network, a generative adversarial network (GAN), and the like. The description of the deep neural network given above is only an example, and the present disclosure is not limited thereto.

The neural network may be trained using at least one of supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Learning of the neural network may be a process of applying knowledge for the neural network to perform a specific operation to the neural network.

A neural network can be trained in a way that minimizes output errors. Training a neural network is the process of iteratively inputting training data into the neural network, calculating the error between the output of the neural network and the target for the training data, and backpropagating that error from the output layer of the neural network toward the input layer so as to update the weight of each node in the direction that reduces the error. In the case of supervised learning, training data in which each item is labeled with the correct answer is used (i.e., labeled training data), and in the case of unsupervised learning, the correct answer may not be labeled in each item of training data. That is, for example, the training data for supervised learning of data classification may be data in which each item is labeled with a category. The labeled training data is input to the neural network, and an error can be calculated by comparing the output (category) of the neural network with the label of the training data. As another example, in the case of unsupervised learning of data classification, an error may be calculated by comparing the input training data with the output of the neural network. The calculated error is backpropagated in the reverse direction (i.e., from the output layer to the input layer) in the neural network, and the connection weight of each node of each layer of the neural network may be updated according to the backpropagation. The amount of change of the connection weight of each node to be updated may be determined according to a learning rate. The computation of the neural network on the input data and the backpropagation of errors can constitute a learning cycle (epoch). The learning rate may be applied differently depending on the number of repetitions of the learning cycle of the neural network. For example, in the early stage of training a neural network, a high learning rate can be used so that the neural network quickly reaches a certain level of performance, thereby increasing efficiency, and a low learning rate can be used in the late stage of training to increase accuracy.
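The training cycle and the learning-rate schedule described above can be illustrated with a deliberately tiny model. The sketch below fits a single weight `w` in `y = w * x` by gradient descent on the squared error, which stands in for backpropagation in a multi-layer network; the function name `train`, the decay factor, and the data are assumptions for illustration, not values taken from the disclosure.

```python
from typing import List, Tuple


def train(data: List[Tuple[float, float]], epochs: int = 50,
          initial_lr: float = 0.1, decay: float = 0.95) -> float:
    """Fit y = w * x with a learning rate that is high early and lower later."""
    w = 0.0
    lr = initial_lr
    for _ in range(epochs):              # one pass over the data = one learning cycle
        for x, target in data:
            output = w * x               # forward computation
            error = output - target      # compare output with the label
            gradient = 2.0 * error * x   # derivative of error**2 with respect to w
            w -= lr * gradient           # update scaled by the learning rate
        lr *= decay                      # reduce the learning rate each epoch
    return w


# Labeled training data generated by the rule y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(data)
```

With these settings the weight converges close to 2; a larger `initial_lr` can make the updates diverge, which is one reason a schedule is tuned rather than fixed in advance.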

In training a neural network, the training data is generally a subset of the real data (that is, the data to be processed using the trained neural network), so there may be learning cycles in which the error on the training data decreases while the error on the real data increases. Overfitting is the phenomenon in which errors on real data increase because of such excessive learning on the training data. For example, a neural network that has learned what a cat is only from yellow cats and fails to recognize a cat of another color as a cat exhibits a type of overfitting. Overfitting can increase the errors of a machine learning algorithm. Various optimization methods can be used to prevent overfitting. For example, methods such as increasing the amount of training data, regularization, dropout, which deactivates some of the nodes of the network during training, and the use of a batch normalization layer may be applied.
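Of the methods listed above, dropout is the simplest to sketch. The following function deactivates each node value with probability `rate` during training. Rescaling the surviving values by `1 / (1 - rate)` (so-called inverted dropout) is a common convention assumed here, not something stated in the disclosure, and the function name is hypothetical.

```python
import random
from typing import List, Optional


def dropout(values: List[float], rate: float,
            rng: Optional[random.Random] = None) -> List[float]:
    """Zero each value with probability `rate`; rescale the rest (assumed convention)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    rng = rng or random.Random()
    keep = 1.0 - rate
    # A node is kept when the random draw falls below the keep probability.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, rate=0.5, rng=random.Random(0))
```

At inference time dropout is normally disabled, which corresponds to calling the function with `rate=0.0`.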

A computer-readable medium storing a data structure is disclosed according to an embodiment of the present disclosure.

The data structure may refer to the organization, management, and storage of data that enables efficient access and modification of data. A data structure may refer to an organization of data to solve a specific problem (eg, data retrieval, data storage, and data modification in the shortest time). A data structure may be defined as a physical or logical relationship between data elements designed to support a particular data processing function. The logical relationship between data elements may include a connection relationship between user-defined data elements. Physical relationships between data elements may include actual relationships between data elements physically stored on a computer-readable storage medium (eg, persistent storage). A data structure may specifically include a set of data, relationships between data, and functions or instructions applicable to data. Through an effectively designed data structure, a computing device can perform an operation while using the computing device's resources to a minimum. Specifically, the computing device may increase the efficiency of operations, reads, insertions, deletions, comparisons, exchanges, and retrievals through effectively designed data structures.

A data structure may be classified as a linear data structure or a non-linear data structure according to its type. A linear data structure may be a structure in which only one piece of data is connected after each piece of data. Linear data structures may include a list, a stack, a queue, and a deque. A list may mean a set of data in which an order exists internally. The list may include a linked list. A linked list may be a data structure in which each piece of data has a pointer and the data is linked in a line. In a linked list, a pointer may contain information about the link to the next or previous piece of data. A linked list may be a singly linked list, a doubly linked list, or a circular linked list according to its form. A stack may be a data listing structure with limited access to data. A stack may be a linear data structure in which data can be processed (e.g., inserted or deleted) at only one end of the data structure. The data stored in a stack follows a last-in, first-out (LIFO) order, in which data stored later comes out first. A queue is also a data listing structure with limited access to data, but unlike a stack, it follows a first-in, first-out (FIFO) order, in which data stored later comes out later. A deque may be a data structure that can process data at both ends of the data structure.
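The access rules of the stack, queue, and deque described above can be shown with Python's built-in `list` and `collections.deque`. This is only an illustration of the LIFO and FIFO orders, not a structure prescribed by the disclosure.

```python
from collections import deque

# Stack: insert and delete at one end only (last in, first out).
stack = []
for item in ("a", "b", "c"):
    stack.append(item)
last_in = stack.pop()          # the most recently stored item comes out first

# Queue: insert at the back, delete from the front (first in, first out).
queue = deque(["a", "b", "c"])
first_in = queue.popleft()     # the earliest stored item comes out first

# Deque: data can be processed at both ends.
both_ends = deque(["b"])
both_ends.appendleft("a")
both_ends.append("c")
front, back = both_ends.popleft(), both_ends.pop()
```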

A non-linear data structure may be a structure in which a plurality of pieces of data are connected after one piece of data. Non-linear data structures may include a graph data structure. A graph data structure may be defined by vertices and edges, and an edge may be a line connecting two different vertices. Graph data structures may include a tree data structure. A tree data structure may be a data structure in which exactly one path connects any two different vertices among the plurality of vertices included in the tree. That is, it may be a graph data structure that does not form a loop.
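The tree property described above (exactly one path between any two vertices, hence no loop) can be checked for an undirected graph as sketched below. The check relies on the standard fact that a graph is a tree when it is connected and has one fewer edge than vertices; the function name `is_tree` is hypothetical.

```python
from typing import Hashable, Iterable, List, Tuple


def is_tree(vertices: Iterable[Hashable],
            edges: Iterable[Tuple[Hashable, Hashable]]) -> bool:
    """Return True if the undirected graph is connected and has no loop."""
    vertex_list: List[Hashable] = list(vertices)
    edge_list = list(edges)
    if not vertex_list or len(edge_list) != len(vertex_list) - 1:
        return False
    adjacency = {v: [] for v in vertex_list}
    for a, b in edge_list:
        adjacency[a].append(b)
        adjacency[b].append(a)
    # Depth-first traversal from an arbitrary vertex to test connectivity.
    seen = {vertex_list[0]}
    pending = [vertex_list[0]]
    while pending:
        for nxt in adjacency[pending.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                pending.append(nxt)
    return len(seen) == len(vertex_list)


tree_ok = is_tree("abcd", [("a", "b"), ("a", "c"), ("c", "d")])
has_loop = is_tree("abc", [("a", "b"), ("b", "c"), ("c", "a")])
```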

Throughout this specification, computational model, neural network, and network function may be used interchangeably. Hereinafter, the term neural network is used uniformly. The data structure may include a neural network, and the data structure including the neural network may be stored in a computer-readable medium. The data structure including the neural network may also include preprocessed data for processing by the neural network, data input to the neural network, weights of the neural network, hyperparameters of the neural network, data obtained from the neural network, activation functions associated with each node or layer of the neural network, and a loss function for training the neural network. A data structure including a neural network may include any of the components disclosed above. That is, the data structure including the neural network may be configured to include all of these components or any combination thereof. In addition to the above-described components, a data structure including a neural network may include any other information that determines a characteristic of the neural network. In addition, the data structure may include all types of data used or generated in the operation of the neural network, and is not limited to the foregoing. Computer-readable media may include computer-readable recording media and/or computer-readable transmission media. A neural network may be composed of a set of interconnected computational units, which may generally be referred to as nodes. These nodes may also be referred to as neurons. A neural network includes at least one node.

The data structure may include data input to the neural network. A data structure including data input to the neural network may be stored in a computer-readable medium. The data input to the neural network may include learning data input in a neural network learning process and/or input data input to the neural network in which learning is completed. Data input to the neural network may include pre-processing data and/or pre-processing target data. The preprocessing may include a data processing process for inputting data into the neural network. Accordingly, the data structure may include data to be pre-processed and data generated by pre-processing. The above-described data structure is merely an example, and the present disclosure is not limited thereto.

The data structure may include the weights of the neural network. (In this specification, a weight and a parameter may be used interchangeably.) A data structure including the weights of a neural network may be stored in a computer-readable medium. The neural network may include a plurality of weights. The weights may be variable, and may be changed by a user or an algorithm in order for the neural network to perform a desired function. For example, when one or more input nodes are interconnected to one output node by respective links, the data value output from the output node may be determined based on the values input to the input nodes connected to the output node and the weights set on the links corresponding to the respective input nodes. The above-described data structure is merely an example, and the present disclosure is not limited thereto.

By way of example and not limitation, the weights may include weights that vary during the neural network training process and/or weights for which neural network training is completed. The weights that vary during the training process may include the weights at the start of a learning cycle and/or the weights that change during the learning cycle. The weights for which neural network training is completed may include the weights at the completion of the learning cycle. Accordingly, the data structure including the weights of the neural network may include a data structure including the weights that vary during the training process and/or the weights for which training is completed. Therefore, the above-described weights and/or combinations of weights are assumed to be included in the data structure including the weights of the neural network. The above-described data structure is merely an example, and the present disclosure is not limited thereto.

The data structure including the weights of the neural network may be stored in a computer-readable storage medium (e.g., memory, hard disk) after being serialized. Serialization may be the process of converting a data structure into a form that can be stored on the same or a different computing device and later reconstructed and used. The computing device may serialize the data structure to send and receive data over a network. The serialized data structure including the weights of the neural network may be reconstructed on the same computing device or on another computing device through deserialization. The data structure including the weights of the neural network is not limited to serialization. Furthermore, the data structure including the weights of the neural network may include a data structure (e.g., a B-tree, trie, m-way search tree, AVL tree, or red-black tree) for increasing the efficiency of computation while using the resources of the computing device to a minimum. The foregoing is merely an example, and the present disclosure is not limited thereto.
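Serialization and deserialization of weights, as described above, can be sketched with Python's standard `json` module. The layer names, the nested-list layout of the weights, and the choice of JSON as the storage format are assumptions for illustration; the disclosure does not specify a format.

```python
import json
from typing import Dict, List

Weights = Dict[str, List[List[float]]]


def serialize_weights(weights: Weights) -> bytes:
    """Convert the weight structure into bytes that can be stored or transmitted."""
    return json.dumps(weights, sort_keys=True).encode("utf-8")


def deserialize_weights(payload: bytes) -> Weights:
    """Reconstruct the weight structure on the same or a different device."""
    return json.loads(payload.decode("utf-8"))


# Hypothetical weights of a small two-layer network.
weights = {"layer1": [[0.1, -0.2], [0.3, 0.4]], "layer2": [[0.5], [-0.6]]}
payload = serialize_weights(weights)
restored = deserialize_weights(payload)
```

The byte string `payload` is what would be written to a storage medium or sent over a network; `restored` is equal to the original structure.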

The data structure may include hyperparameters of the neural network. In addition, the data structure including the hyperparameters of the neural network may be stored in a computer-readable medium. A hyperparameter may be a variable that can be changed by a user. Hyperparameters may include, for example, the learning rate, the cost function, the number of iterations of the learning cycle, weight initialization (e.g., setting the range of weight values to be initialized), and the number of hidden units (e.g., the number of hidden layers and the number of nodes in each hidden layer). The above-described data structure is merely an example, and the present disclosure is not limited thereto.

An overview of the configuration of the method and apparatus according to the present invention is given with reference to FIG. 2. The computing device 100 may include the photographing device 200, or may interwork with an external photographing device 200 wirelessly or by wire. In addition, the computing device 100 may interwork, wirelessly or by wire, with a gimbal 300 that controls the posture of the photographing device 200, or may include the gimbal 300. In order to control the posture of the photographing device 200, the gimbal 300 may include the photographing device 200 or may include a predetermined mechanism (e.g., a suction cup) capable of fixing the photographing device 200.

In order to control the posture, the gimbal 300 may have one or more rotation axes, and an example thereof can be found in Korean Patent Application Laid-Open No. 10-2019-0036323. The gimbal 300 may actively improve the input range of the photographing apparatus 200 through posture control.

When the gimbal 300 has one axis of rotation, the axis of rotation may be a yaw axis (Y). The yaw axis enables the photographing apparatus 200 to interact with an object in space with minimal rotation.

When the gimbal 300 has two rotation axes, the rotation axes may be a yaw axis and a pitch axis P. Also, when the gimbal 300 has three rotation axes, the rotation axes may be a yaw axis, a pitch axis, and a roll axis R.

The gimbal 300 may include a power supply unit 310 as a component of its hardware. The power supply unit 310 may be supplied with external power by wire or wirelessly and by direct current or alternating current. The power supplied to the power supply unit 310 may be used in the gimbal 300 or the computing device 100 . In addition, the power supply unit 310 may be used to charge a battery built into the gimbal 300 or a battery built into the computing device.

Also, the gimbal 300 may include at least one gimbal motor 330 (not shown) as a component of its hardware. Each gimbal motor 330 is configured to change the direction of the photographing device 200, or of the computing device 100 in which the photographing device 200 is embedded, about the above-described rotation axes. The gimbal motor 330 may be a DC motor, a stepper motor, or a brushless motor, but is not limited thereto. In addition to the gimbal motor 330, the gimbal 300 may further include a gear for converting the torque of the motor.

The motor 330 of the gimbal 300 is for orienting the photographing device 200 or the computing device 100 attached to the gimbal toward a specific object. Those of ordinary skill in the art will readily understand that it is preferable to arrange the respective rotation axes parallel to the corresponding axes of the photographing device 200 or the computing device 100, such as the yaw, pitch, and roll axes, but the arrangement need not be limited thereto.

The gimbal 300 may further include at least one sensor 340 (not shown) as a hardware component. The sensor 340 may detect one or more of a position, an angular position, a displacement, an angular displacement, a speed, an angular velocity, an acceleration, and an angular acceleration relative to the fixed part of the gimbal 300 or the motor 330. Types of the sensor 340 include an acceleration sensor, a gyro sensor, a magnetic sensor such as a geomagnetic sensor, a Hall sensor, a pressure sensor, an infrared sensor, a proximity sensor, a motion sensor, a photosensitive sensor, an image (video) sensor, a GPS sensor, a temperature sensor, a humidity sensor, a barometric pressure sensor, a LIDAR sensor, and the like, but are not limited thereto.

A sensor 340 that is not mounted on the computing device 100 because of restrictions on the weight and volume that the computing device 100, particularly a portable computing device, can have may instead be mounted on the gimbal 300, and the gimbal 300 may use it to obtain surrounding information.

Now, specific functions and effects of the present invention that can be achieved by the individual components schematically described with reference to FIG. 2 will be described in detail below with reference to FIGS. 3 to 11D. Although the components shown in FIG. 2 are illustrated as being realized in one computing device for convenience of description, it will be understood that the computing device 100 performing the method of the present disclosure may be configured as a plurality of devices that interwork with each other. For example, the gimbal 300 may be configured as an independent computing device, so that the gimbal 300 and the computing device 100, for example a portable computing device such as a portable terminal, interwork with each other. In that case, the gimbal 300 may perform at least some of the functions otherwise performed by the portable terminal 100. That is, a person skilled in the art will be able to configure a plurality of devices to perform the method of the present disclosure by interworking with each other in various ways.

FIG. 3 is a flowchart exemplarily illustrating an image processing method according to an embodiment of the present disclosure, and FIG. 4 is a block diagram exemplarily showing modules that perform each step of the image processing method according to an embodiment of the present disclosure. FIG. 5 is a block diagram exemplarily illustrating machine learning models used in the modules for an image processing method according to an embodiment of the present disclosure.

Referring to FIG. 3, the image processing method according to the present disclosure first includes an image acquisition step (S1000), in which the image input module 4100 implemented by the computing device 100 acquires the entire image from the photographing device 200 that is included in the computing device 100 or interworks with it through the communication unit 110 of the computing device 100.

Here, the 'entire image' is a term used in contrast to an image corresponding to only a portion of the entire image, such as an object image, which will be described later.

Next, the image processing method further includes a category classification step (S2000), in which the object analysis module 4200 implemented by the computing device 100 detects one or more objects appearing in the entire image and performs classification to determine the category of each detected object.

Here, the category of the object refers to the result of classifying the object into a person, a tree, a dog, and the like.

In the category classification step ( S2000 ), the position of each of one or more objects in the entire image may be calculated as part of the two-dimensional measurement at the same time as the classification.

The two-dimensional measurement mentioned here and the three-dimensional measurement described later differ in that the two-dimensional measurement is based on the two-dimensional coordinate system of the image without considering three-dimensional depth information, whereas the three-dimensional measurement considers depth information in addition to the two-dimensional coordinates in the image.

Specifically, the category classification step (S2000) includes resizing the entire image to a resolution lower than its original resolution (S2100), and passing the resized image to an object analysis model (M420) to calculate the category, location, and importance of each of the objects (S2200).
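The resizing of step S2100 can be sketched as follows. Nearest-neighbor sampling on a nested list of pixel values is an assumption chosen to keep the example self-contained; the disclosure does not specify the interpolation method, and the function name `resize_nearest` is hypothetical.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values; a stand-in for a real image type


def resize_nearest(image: Image, new_width: int, new_height: int) -> Image:
    """Downscale (or upscale) an image by nearest-neighbor sampling."""
    height, width = len(image), len(image[0])
    return [
        [image[y * height // new_height][x * width // new_width]
         for x in range(new_width)]
        for y in range(new_height)
    ]


# A 4x4 "entire image" reduced to 2x2 before it is passed to the analysis model.
full = [[row * 4 + col for col in range(4)] for row in range(4)]
small = resize_nearest(full, 2, 2)
```

The 2x2 result keeps only one of every four pixels, which shows both why the model has less to compute and why part of the original image data is lost.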

For example, the processor 120 of the computing device 100 may obtain analysis information corresponding to an object included in the image from the image by using the object analysis model M420 . The analysis information may include at least one of classification information indicating a category of an object, location information indicating a location of an object, and/or importance information indicating a priority of an object in an image.

The object analysis model M420 is a model for performing analysis on an object included in an image, and may include an object classification model, a localization model, an object detection model, a segmentation model, and the like.

According to an embodiment of the present disclosure, the processor 120 may classify a class of an object in an image given as an input by using the object analysis model M420 that classifies the object. For example, when there is a person in the image given as an input, the processor 120 may obtain an output of “the type of the input image is a person” by using the object analysis model M420 for the input image. The foregoing is merely an example, and the present disclosure is not limited thereto.

According to an embodiment of the present disclosure, the processor 120 may also output location information indicating where the object in the image is located in the image by using the object analysis model M420 for classifying and localizing. For example, when using the object analysis model M420 for classifying and localizing, the processor 120 may recognize an object in an image using a bounding box and output location information. The bounding box may return object location information by outputting the left, right, upper, and lower coordinates of the box. The foregoing is merely an example, and the present disclosure is not limited thereto.
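The analysis information and bounding box described above can be represented, for example, with small data classes. The class and field names below (`BoundingBox`, `AnalysisInfo`, `importance`, and so on) are hypothetical and do not reflect an actual output format of the object analysis model M420.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BoundingBox:
    """Left, upper, right, and lower coordinates of a box in image pixels."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


@dataclass(frozen=True)
class AnalysisInfo:
    """Classification, location, and importance information for one object."""
    category: str
    box: BoundingBox
    importance: float


info = AnalysisInfo("person", BoundingBox(10, 20, 110, 220), importance=0.9)
```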

According to an embodiment of the present disclosure, the processor 120 may detect at least one object using the object analysis model M420 that detects objects. Using the object analysis model M420 that detects objects, the processor 120 may simultaneously classify and localize at least one object, thereby detecting a plurality of objects and extracting their location information. The foregoing is merely an example, and the present disclosure is not limited thereto.

According to an embodiment of the present disclosure, the processor 120 may use the object analysis model M420 that performs segmentation to classify pixels, distinguishing the boundary of an object in the image from the background in order to detect the object. The foregoing is merely an example, and the present disclosure is not limited thereto.

It is well known to those skilled in the art that the resizing as in step S2100 is for improving the processing speed of the object analysis module 4200 by reducing the amount of computation by the object analysis model M420.

The importance resulting from step S2200 may be used as a measure for determining the priority given to the object. The priority will be described later in detail.

For an object whose importance is greater than or equal to a predetermined boundary value, the computing device 100 may extract and sample an object image, which is an image of the object, from the entire image (crop feed step; S2500). Because the resized image is used in step S2200, part of the data of the entire image is lost; the crop feed step S2500 makes the image information before that loss available for objects of relatively high importance.
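The crop feed step can be sketched as mapping a box found in the resized image back to the coordinates of the entire image and cutting out that region at the original resolution. The function name `crop_feed`, the integer box layout, and the threshold values are assumptions for illustration only.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]
# left, top, right, bottom in resized-image pixels (right and bottom exclusive)
Box = Tuple[int, int, int, int]


def crop_feed(full_image: Image, box: Box, resized_size: Tuple[int, int],
              importance: float, threshold: float) -> Optional[Image]:
    """Return the object image at original resolution, or None if unimportant."""
    if importance < threshold:
        return None
    full_h, full_w = len(full_image), len(full_image[0])
    resized_w, resized_h = resized_size
    left, top, right, bottom = box
    # Scale the box from resized-image coordinates to entire-image coordinates.
    x0, x1 = left * full_w // resized_w, right * full_w // resized_w
    y0, y1 = top * full_h // resized_h, bottom * full_h // resized_h
    return [row[x0:x1] for row in full_image[y0:y1]]


full = [[row * 4 + col for col in range(4)] for row in range(4)]
crop = crop_feed(full, (1, 0, 2, 1), (2, 2), importance=0.8, threshold=0.5)
skipped = crop_feed(full, (1, 0, 2, 1), (2, 2), importance=0.2, threshold=0.5)
```

A one-pixel box in the 2x2 resized image corresponds to a 2x2 region of the 4x4 entire image, so the crop recovers detail that the resized image no longer contains.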

Continuing to refer to FIG. 3, the image processing method according to the present disclosure further includes a detailed classification step (S3000), in which the detailed classification module 4300 implemented by the computing device 100 generates, as a result of analyzing each detected object, detailed classification information including the characteristics and state of the object.

Here, the object is a concept including a spatial object that is an object corresponding to the 'space' itself in which the entire image is captured, and detailed classification information of the spatial object may include information about the corresponding space.

Characteristics and State of Objects

A characteristic of the object refers to a property of the object that is largely invariant over time, whereas a state of the object refers to a property of the object that can change substantially over time.

Specifically, the characteristic of the object may include information on a partial object that forms a part of the object or is a component belonging to the object. As an example, if a person is detected as an object in the entire image, parts of the person such as the arms, legs, and eyes, as well as the clothes and shoes worn by the person, are partial objects of that object.

For the detection of such partial objects, the detailed classification step (S3000) may include: attempting to detect a partial object that forms a part of the object or is a component of the object (S3920); and when the partial object is detected, further generating an analysis result of characteristics and states of the partial object as a part of the detailed classification information (S3940).

In addition, the characteristic of the object may include at least one of: a main color of the object; a general object, which is information indicating the partial object of the object; a subject, which is information indicating another object when the object is a partial object of that other object; the size of the object; one or more materials of the object, including the main material of the object; the transparency of the object; text displayed on the surface of the object; and whether the object can move by itself.

Here, the size of the object may be a size measured by two-dimensional measurement or three-dimensional measurement. In addition, the transparency of the object is a property that applies when the object has a transparent portion, such as a glass window. For example, an opaque object may have a value of 0, and an object made of a transparent material such as glass may have a positive value.

Meanwhile, the state of the object may include at least one of a position of the object, a posture of the object, an action of the object, a direction of the object, whether the object is in contact with the floor, and a speed of the object.

Here, the position of the object may be a position measured by two-dimensional measurement or three-dimensional measurement. The posture of the object may be inferred based on location information of partial objects of the object, and the behavior of the object may be inferred from the temporally continuous posture.

Also, the direction of the object may be inferred based on location information of the object or an action of the object.

Whether the object is in contact with the floor indicates whether the object touches the floor plane of the spatial object to which it belongs. For example, a chair, a desk, a power pole, and a tire of a car have a true value.

The detailed classification information of the object may further include, in addition to the characteristics and state, an attribute indicating information related to system input/output of the object. The attribute of the object may include at least one of a data input time, including the time when raw data of the object was first input, and an operation right for the system according to the present disclosure granted to the object.

In order to generate the detailed classification information of the object as described above, the detailed classification step (S3000) may include: for each individual object, among the objects, whose importance is equal to or greater than a second predetermined value, selecting (S3200) a detailed classification model (M430) corresponding to the individual category to which the individual object belongs, the detailed classification model being a set of at least one model trained in advance to fit the individual category so as to obtain the characteristics and the state; and inputting an individual object image, which is an image of the individual object, into the selected detailed classification model (M430) to generate an identifier of the individual object and an object record that is attributed to the individual object through the identifier and includes the detailed classification information (S3400).

Here, the detailed classification model M430 is for distinguishing one or more objects from each other. In other words, an identifier by which each of the objects can be distinguished from the others may be assigned as part of the detailed classification information generated by the detailed classification model M430.

Also, the object record herein refers to a record including information attributed to the identifier of each object. As an example of information attributed to the identifier of each object, if the object is a person, it may include the person's face shape, height, gait aspect, tattoo, hair style, etc., and if the object is a dog, it may include the shape of the head, the shape and color of the hair, the breed, and the like. The object record may also include information on other objects owned by each object, which may be the identifiers of those other objects.
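The object record described above can be pictured as a small data structure keyed by an identifier. The following is a minimal sketch only; the class name ObjectRecord, its field names, and the example identifiers are hypothetical and do not appear in the present disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ObjectRecord:
    """Hypothetical container for information attributed to one object identifier."""
    identifier: str
    category: str
    # Detailed classification information, e.g. {"hair_style": "short"}.
    details: Dict[str, str] = field(default_factory=dict)
    # Identifiers of other objects owned by this object.
    owned_object_ids: List[str] = field(default_factory=list)


# Example: a person record that refers to a dog record through the dog's identifier.
person = ObjectRecord("obj-001", "person", {"hair_style": "short", "gait": "fast"})
dog = ObjectRecord("obj-002", "dog", {"breed": "poodle", "hair_color": "white"})
person.owned_object_ids.append(dog.identifier)
```

Storing only the identifier of an owned object, rather than a nested copy, keeps each record attributable to exactly one identifier.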

Among them, in particular, human gait patterns can be classified by artificial intelligence methodologies, as disclosed, for example, in the non-patent paper Babaee, M., Li, L., & Rigoll, G. (2019). Person identification from partial gait cycle using fully convolutional neural networks. Neurocomputing, 338, 116-125.

In addition, it is reported that human hair styles can be classified by AI methodologies, as disclosed, for example, in the non-patent paper Muhammad, U. R., Svanera, M., Leonardi, R., & Benini, S. (2018). Hair detection, segmentation, and hairstyle classification in the wild. Image and Vision Computing, 71, 25-37. https://doi.org/10.1016/j.imavis.2018.02.001.

However, those of ordinary skill in the art will be able to understand that information other than those shown in such prior documents may also be obtained by the artificial intelligence methodology.

Selecting the detailed classification model M430 for each individual object (S3200) may be performed by the classification model selection module 4320 implemented by the computing device 100. Once the category of the object has been obtained, the classification model selection module 4320 selects a detailed classification model suitable for that category, together with an algorithm applied to the detailed classification model.

For example, if the category of the object is a person, the classification model selection module 4320 may select a detailed classification model that generates detailed classification information that can specify a person, such as the person's face shape, height, gait aspect, tattoo, and hair style. If the category of the object is a dog, it may select a detailed classification model that generates detailed classification information that can specify a dog, such as the dog's head shape, hair shape and color, and breed.
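The category-based selection performed in step S3200 can be illustrated as a lookup from category to model. This is a minimal sketch under stated assumptions: the registry DETAILED_MODELS, the function select_detailed_model, and the two placeholder models are hypothetical stand-ins for trained models and are not the implementation of the classification model selection module 4320.

```python
from typing import Callable, Dict

DetailedModel = Callable[[bytes], Dict[str, str]]


def classify_person(image: bytes) -> Dict[str, str]:
    # Placeholder for a model trained in advance for the "person" category.
    return {"category": "person", "outputs": "face shape, height, gait, tattoo, hair style"}


def classify_dog(image: bytes) -> Dict[str, str]:
    # Placeholder for a model trained in advance for the "dog" category.
    return {"category": "dog", "outputs": "head shape, hair shape and color, breed"}


# Hypothetical registry mapping each category to its detailed classification model.
DETAILED_MODELS: Dict[str, DetailedModel] = {
    "person": classify_person,
    "dog": classify_dog,
}


def select_detailed_model(category: str) -> DetailedModel:
    """Return the detailed classification model registered for the category."""
    try:
        return DETAILED_MODELS[category]
    except KeyError:
        raise ValueError(f"no detailed classification model for category {category!r}")


model = select_detailed_model("dog")
result = model(b"")  # an empty byte string stands in for an object image
```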

It is reported that not only human face shape but also dog head shape can be classified by AI methodology, as disclosed, for example, in the non-patent paper Mougeot, G., Li, D., & Jia, S. (2019). A Deep Learning Approach for Dog Face Verification and Recognition. In PRICAI 2019: Trends in Artificial Intelligence (pp. 418-430). Springer International Publishing. https://doi.org/10.1007/978-3-030-29894-4_34.

In addition, regarding the AI methodology for classifying dog breeds, see the non-patent paper Raduly, Z., Sulyok, C., Vadaszi, Z., & Zolde, A. (2018). Dog Breed Identification Using Deep Learning. In 2018 IEEE 16th International Symposium on Intelligent Systems and Informatics (SISY). IEEE. https://doi.org/10.1109/sisy.2018.8524715.

Meanwhile, the detailed classification model M430, as a set of models, may include a measurement model M431 that performs at least one of two-dimensional measurement and three-dimensional measurement of the object to calculate at least one of the position of the object, whether the object is in contact with the floor, the direction of the object, the velocity of the object, the posture of the object, and the size of the object. Here, the size of the object may include at least one of a one-dimensional dimension including at least one of height, width, and depth; a two-dimensional dimension including the surface area of the object; and a three-dimensional dimension including the volume of the object.

For example, the volume of the object may be calculated based on at least one (one or more) of object segmentation of the object, posture and/or depth information of the object.

When the object is a spatial object, the measurement model M431 may calculate at least one of: a system position that is the origin of at least one of an orientation and coordinates of the spatial object; a Manhattan space that is a volume space including the objects included in the space of the spatial object; a floor plane detected in the space; a vector of gravity applied to the space; an empty volume space, which is the space excluding the objects included in it; a partial object of the spatial object; and a direction of the space. The space may be either an indoor or an outdoor space.

Here, a partial object of a spatial object refers to a partial object constituting the space of the spatial object, and the partial object is fixed in the space. Examples of partial objects of such spatial objects may include glass windows, doors, walls, kitchens, roads, overpasses, and the like.

A system location refers to a location of a system according to the present disclosure that serves as an origin of azimuth or coordinates within an indoor or outdoor space.

In addition, the direction of the space refers to the direction of the object that is the spatial object, and may be based on the system or part of the space of the present disclosure.

Depth information in 3D measurement may be predicted by an artificial intelligence methodology for deriving depth information from the image or may be supplementally provided with respect to the image from other sensors 340 such as lidar, radar, and ultrasonic sensors.

Specifically, in step S3400, the three-dimensional measurement may include a process (S3410) of identifying a length reference object and measuring a two-dimensional length of the length reference object, and a process (S3420) of detecting a floor plane of the object. Here, the length reference object is an object satisfying the condition that the deviation of the length of at least one part of the object, or of a partial object that forms a part of the object or is a component belonging to the object, is smaller than a predetermined criterion.

The purpose of this three-dimensional measurement is to determine the relative position between the system of the present disclosure and the object and/or the absolute position of the object using information of one or more objects.

For example, the height of an adult male, which is an example of a length reference object, may be used as a reference length of an object belonging to another category, such as a door, a pencil, and a cup, which may also be used for distance measurement.

In addition, as another example of the length reference object, when a plurality of doors are detected in one entire image or space (or spatial object), doors of the same design in one space have similar heights, so they can be used as length reference objects for distance determination.

In addition, a partial object can also be used as a length reference object when the deviation of the length of the partial object is relatively small. For example, the horizontal length of the human eye can be used as a length reference object because its standard deviation is relatively small.

In measuring the two-dimensional length of the length reference object in the process S3410, if there is posture information of the length reference object, a corrected length may be calculated by reflecting the inclination due to the posture of the length reference object. The posture of the object and its calculation will be described later.
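One way to picture the posture-based correction is to undo the foreshortening caused by a known tilt. The following is a minimal sketch, assuming an orthographic approximation in which an object tilted out of the image plane by an angle appears shortened by the cosine of that angle. The function name corrected_length and this specific model are assumptions; the present disclosure does not state the correction formula.

```python
import math


def corrected_length(measured_2d_length: float, tilt_deg: float) -> float:
    """Undo foreshortening: measured = true * cos(tilt), so true = measured / cos(tilt).

    tilt_deg is the assumed inclination of the length reference object
    out of the image plane, taken from its posture information.
    """
    cos_tilt = math.cos(math.radians(tilt_deg))
    if cos_tilt <= 1e-6:
        raise ValueError("tilt too close to 90 degrees; length cannot be recovered")
    return measured_2d_length / cos_tilt


# A reference object measured at 100 px while leaning 60 degrees away from the image plane.
length_px = corrected_length(100.0, 60.0)
```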

Floor plane

There may be various methods for detecting the floor plane of the object in the process S3420. For example, the floor plane touching the object may be detected in the vector direction of gravity acting on the object.

Other methods of detecting the floor plane of the object include a method using Manhattan space detection; a method of generating a floor plane between the length reference objects and extending it when there are two or more length reference objects; and, when a length reference object has moved, a method of creating a floor plane between the length reference object before the movement and the length reference object after the movement and extending it.

Referring to FIG. 6A, a first embodiment of detecting the floor plane of an object (S3420) starts with a step (S3422a) of detecting at least one size-like object corresponding to a size-like category. The size-like category is a category, among at least one category, for which the deviation in size of the objects belonging to it, or of at least one part of a partial object of such an object, is smaller than a predetermined criterion. Here, the size may be one or more of width, height, and depth.

For example, a category satisfying the condition that the size deviation is smaller than a predetermined criterion may be a desk category. Certain types of desks have relatively small variations in height. Assuming that the height of a certain kind of desk is generally 70 to 74 cm, the reference length of the corresponding size-like object can be set to 72 cm.

Optionally, the size-like object may be a floor reference object at the same time. The floor reference object is an object used for detection of a floor plane, and refers to an object in which at least a portion of the object segmentation generally touches the floor. For example, desks and chairs correspond to such floor reference objects.

After step S3422a, the first embodiment of the process S3420 includes a step (S3424a) of detecting a floor contact point, which is a contact point at the lowermost end of the object segmentation of the size-like object.

FIG. 7A is a diagram exemplarily illustrating object segmentation obtained by the image processing method of the present disclosure.

Methods for obtaining object segmentation are known to those skilled in the art. For example, reference may be made to the non-patent paper Mo, K., Zhu, S., Chang, A. X., Yi, L., Tripathi, S., Guibas, L. J., & Su, H. (2019). PartNet: A Large-Scale Benchmark for Fine-Grained and Hierarchical Part-Level 3D Object Understanding. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.

Referring to FIG. 7A, in the exemplified object segmentation of the chair, the lowermost ends of the regions 712, 714, 716, and 718 corresponding to the chair legs may be considered to be in contact with the floor, so that floor contact points 720 can be detected there. As another example, since the lowermost end of the object segmentation of a door contacts the floor, a floor contact point can likewise be detected for the door.

After step S3424a, the first embodiment of the process S3420 includes determining a three-dimensional distance of the size-like object based on the two-dimensional length of the size-like object (S3426a). For example, since a chair positioned farther away is measured with a smaller two-dimensional length (height), width, and/or depth than a chair positioned closer, the three-dimensional distance may be determined using this difference.

After step S3426a, the first embodiment of the process S3420 further includes generating a three-dimensional reference plane 730 by connecting the floor contact points 720 to each other, and generating a floor plane 740 by expanding the three-dimensional reference plane (S3428a). Referring to FIG. 7A, a three-dimensional reference plane may be generated by connecting the lowermost ends of the chair legs to each other, and this plane may be further extended to form the floor plane 740.

Here, the extension from the three-dimensional reference plane is performed up to, for example, the boundary of an edge of a gravity horizontal object, the starting point of a wall, or the lowest end of a gravity horizontal object, and the result is the floor plane. A gravity horizontal object generally refers to an object that stands upright, parallel to the direction of gravity, for example, a glass window or a wall surface. Specifically, the gravity horizontal object may be an object (eg, a door, a window, etc.) having lines and/or faces disposed parallel to the gravity vector. A gravity horizontal object may be an object (eg, a door frame, a wall, etc.) with one or more lines and/or faces touching the floor plane. A gravity horizontal object may have a horizontal-to-vertical ratio (a value obtained by dividing a horizontal length by a vertical length) of 1 or less.
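Connecting floor contact points into a three-dimensional reference plane can be illustrated with basic vector geometry. The following is a minimal sketch, assuming the contact points have already been placed in a three-dimensional coordinate system with the y-axis pointing up. The function names and the example coordinates are hypothetical, and a practical system would likely fit a plane to more than three noisy points.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plane_from_points(p0: Vec3, p1: Vec3, p2: Vec3) -> Tuple[Vec3, float]:
    """Return (unit normal n, offset d) of the plane n . x = d through three points."""
    n = _cross(_sub(p1, p0), _sub(p2, p0))
    norm = math.sqrt(sum(c * c for c in n))
    if norm == 0.0:
        raise ValueError("contact points are collinear; no unique plane")
    unit = (n[0] / norm, n[1] / norm, n[2] / norm)
    d = sum(unit[i] * p0[i] for i in range(3))
    return unit, d


def signed_distance(point: Vec3, plane: Tuple[Vec3, float]) -> float:
    """Signed distance from a point to the (infinitely extended) plane."""
    n, d = plane
    return sum(n[i] * point[i] for i in range(3)) - d


# Three chair-leg contact points lying on the floor (y = 0).
reference_plane = plane_from_points((0.0, 0.0, 2.0), (0.4, 0.0, 2.0), (0.0, 0.0, 2.4))
fourth_leg = signed_distance((0.4, 0.0, 2.4), reference_plane)   # lies on the plane
seat_height = abs(signed_distance((0.2, 0.45, 2.2), reference_plane))
```

Because the plane is stored as a normal and an offset, it is unbounded, which corresponds to extending the reference plane; the limits given by walls or gravity horizontal objects would have to be applied separately.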

Meanwhile, the second embodiment of the detection of the floor plane using a plurality of objects ( S3420 ) starts with the step of detecting a plurality of similar objects having a degree of similarity equal to or greater than a predetermined value ( S3422b ). For example, a door, a desk, a chair, etc. of a certain design may be such a similar object.

Considering that the image processing method of the present disclosure can be performed continuously and repeatedly, if the image processing method has already been applied to a space in which the similar objects exist, the similar objects can be detected in step S3422b with reference to the position and length of the previously detected similar objects.

After the step S3422b, the second embodiment of the process S3420 further includes the step of detecting at least one lowermost object, which is an object located at the lowermost level among the plurality of similar objects (S3424b). As in the example of FIG. 7A , the legs of the chair and the lowermost end of the door are in contact with the floor, and the chair and the door are the lowermost objects.

After step S3424b, the second embodiment of the process S3420 further includes a step (S3426b) of detecting a floor contact point, which is a contact point at the lowest end of the object segmentation of each of the lowermost objects. Following step S3426b, it further includes a step (S3428b) of generating a three-dimensional reference plane by connecting the floor contact points to each other and generating a floor plane by expanding the three-dimensional reference plane.

As a modification of the above-described second embodiment, the process (S3420) may include: detecting a plurality of similar objects having a degree of similarity equal to or greater than a predetermined value (S3422b'); determining whether one object among the plurality of similar objects is located on the complete upper end of another object among the plurality of similar objects (S3424b'); including the one object and the other object in different floor sets based on whether the one object is located on the complete upper end (S3426b'); and, for each of the different floor sets, generating a three-dimensional reference plane by connecting to each other the floor contact points, which are contact points at the lowest ends of the object segmentations of the floor set objects belonging to that floor set, and generating a per-floor floor plane by expanding the three-dimensional reference plane, thereby generating two or more floor planes (S3428b').

Since step S3422b' is the same as step S3422b, the 'complete upper end' of steps S3424b' and S3426b' will be described with reference to FIG. 7B. First, objects having a similarity greater than or equal to a predetermined threshold can be assumed to have the same height h within an error range.

If the objects are located apart from each other along the vertical axis (y-axis) in the entire image, it can be inferred that (i) the objects are at different distances or (ii) the floors supporting the objects differ in height. If the lower-left corner of the entire image is taken as the origin of the coordinate system and the objects are at different distances, the object 700b, which has a lower y-axis coordinate value than the other object 700a, is closer to the imaging device, so its height hb is larger, and the height ha of the object 700a, which has a relatively higher y-axis coordinate value, should be reduced by a certain ratio compared to the height hb.

That is, objects whose heights vary at a certain rate according to the y-axis coordinate value (eg, 700b and 700a), and objects that differ in the horizontal-axis (x-axis) coordinate value but differ by less than a certain level in the y-axis coordinate value (eg, 700b and 700d), are determined to exist on the same floor, and thus the corresponding floor set may be generated.

If an object 700c has a greater y-axis coordinate value than the other object 700a, but its height hc does not decrease at the certain rate from the height ha of the other object 700a, or the height hc is equal to or greater than the height ha, the object 700c can be determined to exist on a different floor, so different floor sets can be created.

By repeating such an inference process, the floor sets for each floor can be generated.
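The floor-set inference can be sketched as a grouping pass over the detected similar objects. The following is a minimal sketch under stated assumptions: the linear shrink model (apparent height falling by a fixed fraction per pixel of y-coordinate), the parameters shrink_per_px, y_tol, and rel_tol, and all pixel values are hypothetical, since the present disclosure only states that the height changes 'at a certain rate'.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    name: str
    y: float       # y-coordinate of the lowest point, origin at the lower-left corner
    height: float  # apparent height in the image, in pixels


def same_floor(ref: Detection, other: Detection,
               shrink_per_px: float, y_tol: float, rel_tol: float) -> bool:
    """True if the two similar objects are consistent with standing on one floor."""
    lower, upper = sorted((ref, other), key=lambda d: d.y)
    dy = upper.y - lower.y
    if dy <= y_tol:
        return True  # nearly the same y-coordinate: treated as the same floor
    # Assumed linear model: apparent height shrinks with increasing y on one floor.
    expected = lower.height * (1.0 - shrink_per_px * dy)
    return abs(upper.height - expected) <= rel_tol * lower.height


def build_floor_sets(detections: List[Detection], shrink_per_px: float = 0.0025,
                     y_tol: float = 10.0, rel_tol: float = 0.1) -> List[List[Detection]]:
    """Greedily assign each detection to the first floor set it is consistent with."""
    floors: List[List[Detection]] = []
    for det in sorted(detections, key=lambda d: d.y):
        for floor in floors:
            if same_floor(floor[0], det, shrink_per_px, y_tol, rel_tol):
                floor.append(det)
                break
        else:
            floors.append([det])
    return floors


detections = [
    Detection("700a", y=200.0, height=150.0),  # farther away on the same floor as 700b
    Detection("700b", y=100.0, height=200.0),  # closest object
    Detection("700c", y=300.0, height=200.0),  # high in the image but not smaller
    Detection("700d", y=105.0, height=198.0),  # beside 700b at nearly the same y
]
floors = build_floor_sets(detections)
floor_names = [[d.name for d in floor] for floor in floors]
```

With these illustrative values, 700b, 700d, and 700a fall into one floor set and 700c into another, mirroring the reasoning described with reference to FIG. 7B.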

Alternatively, instead of the height of each object, the height or area of the three-dimensional reference plane formed by each object may be used.

On the other hand, a third embodiment, which uses the Manhattan space in the detection of the floor plane (S3420), starts with a step (S3422c) of detecting at least one length-like object corresponding to a length-like category. The length-like category is a category, among at least one category, for which the deviation in length of the objects belonging to it, or of at least one part of a partial object of such an object, is smaller than a predetermined criterion.

Such length-like objects may include desks with relatively small variations in height.

After step S3422c, the third embodiment of the process S3420 includes the steps of detecting a Manhattan space created by the set of length-like objects (S3424c), detecting the floor of the Manhattan space (S3426c), and expanding the floor of the Manhattan space in a horizontal direction to generate a floor plane (S3428c).

More specifically, the detection of the Manhattan space (S3424c) may consist of a first step (S3424c-1) of detecting a floor object, which is an object forming a floor among the objects, and generating a boundary of the floor object; a second step (S3424c-2) of detecting a wall object, which is an object perpendicular to the boundary of the floor object; and a third step (S3424c-3) of detecting a ceiling object, which is an object other than the floor object that is perpendicular to the wall object.

That is, the Manhattan space here refers to a space surrounded by a floor object, a wall object, and a ceiling object. Examples of wall objects perpendicular to the floor object include a glass window, a door, and the like.

Unlike the above-described embodiments, the fourth embodiment using objects having the same pattern in the detection of the floor plane ( S3420 ) starts with the step ( S3422d ) of detecting the same pattern objects, which are a plurality of objects having the same pattern.

After step S3422d, the fourth embodiment of the process S3420 further includes, when the same pattern objects are detected, detecting the lower ends of the same pattern objects (S3424d), and measuring a relative distance between the same pattern objects based on one of the occlusion between the same pattern objects and the length differences in the image between the same pattern objects (S3426d).

For example, if one object obscures the object segmentation of another object in step S3426d, the one object may be detected as being closer to the system than the other object. Also, if one of the objects having the same pattern is smaller than the other objects, the one object may be detected as being farther away from the system than the other objects.
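The two cues used in step S3426d can be sketched as follows. This is a minimal illustration, assuming that same pattern objects have the same real size so that, under a pinhole-camera approximation, apparent length is inversely proportional to distance. The function names and pixel values are hypothetical.

```python
from typing import Optional


def distance_ratio(len_a_px: float, len_b_px: float) -> float:
    """How many times farther object b is than object a, assuming equal real size."""
    if len_a_px <= 0 or len_b_px <= 0:
        raise ValueError("apparent lengths must be positive")
    return len_a_px / len_b_px


def nearer_object(name_a: str, len_a_px: float, name_b: str, len_b_px: float,
                  occluder: Optional[str] = None) -> str:
    """Pick the nearer object: occlusion decides if known, otherwise apparent size."""
    if occluder in (name_a, name_b):
        return occluder  # the object that hides part of the other is closer
    return name_a if len_a_px >= len_b_px else name_b


ratio = distance_ratio(120.0, 60.0)                 # b appears half as long as a
by_size = nearer_object("A", 120.0, "B", 60.0)
by_occlusion = nearer_object("A", 60.0, "B", 120.0, occluder="A")
```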

After step S3426d, the fourth embodiment of the process S3420 further includes generating a virtual plane (S3428d) from the floor contact points of two or more of the same pattern objects, or from three or more points included in one of the same pattern objects. When several virtual planes are generated, a person skilled in the art will easily understand that a virtual plane for each floor can be created by treating the plurality of virtual planes as separate floors according to differences in length, state, and position.

After step S3428d, the fourth embodiment of the process S3420 further includes generating a floor plane by expanding the virtual plane (S3429d).

Returning now to the three-dimensional measurement in step S3400, the three-dimensional measurement may further include, after the detection of the floor plane (S3420), a process (S3430) of setting a virtual length reference line on the detected floor plane; and a process (S3440) of measuring a distance between the object and a system position, which is the origin of at least one of the orientation and coordinates of the spatial object, or the position of the object based on the system position.

Meanwhile, the two-dimensional measurement in step S3400 may include posture and direction measurement for calculating the two-dimensional posture and two-dimensional direction of the object based on the relative positions between the partial objects of the object, and area measurement for calculating the two-dimensional area of the object. These may be performed by the detailed classification module 4300 using the measurement model M431 of the detailed classification model M430.

Due to the nature of the two-dimensional measurement, the area here means the area in the entire image or object image without considering the depth. Measurement of this two-dimensional area may be performed by classification and measurement of object segmentation.

In addition, in order to generate the characteristics of the object in step S3400, the detailed classification model M430 may further include a deep characteristic model (M432) for calculating at least one of: information on a partial object that forms a part of the object or is a component belonging to the object, deep classification information of the object, the main color of the object, the type of the object, the subject of the object, one or more materials of the object, the transparency of the object, and whether the object can be moved by magnetic force.

Here, the deep classification information of the object refers to information obtained by deep classifying the category of the object. For example, if the category of the object is a dog, the deep classification information may be the breed of the dog.

Posture and behavior

In addition, the detailed classification model M430 may further include a posture determination model M433 for calculating the posture of the object based on the processing results of the measurement model M431 and the deep characteristic model M432, and may further include a behavior determination model M434 for classifying the behavior of the object based on the temporally continuous posture calculated by the posture determination model M433.

For example, the processor 120 of the computing device 100 may obtain posture information on the object from the analysis information corresponding to the object by using the posture determination model M433 .

Also, the processor 120 of the computing device 100 may obtain behavior information on the object from the posture information on the object by using the behavior determination model M434 .

Here, 'action' is used broadly to cover both behavior that does not reflect context and action that does reflect context; behavior is described later.

Since a method for determining a posture may be different according to a category of an object, the posture determination model M433 may be a posture determination model for each category that is different for each category. For example, a dog posture determination model that calculates a dog's posture in a sitting state and a human posture determination model that calculates a human posture in a sitting state may be different from each other. Accordingly, a posture determination method to be applied to an object among a plurality of posture determination methods may be determined based on the analysis information corresponding to the object in the posture determination model M433 . The posture determination model M433 may apply different posture determination methods according to classification information indicating the category of the object.

Similarly, since a method for determining a behavior may be different depending on the category of an object, the behavior determination model M434 may be a category-specific behavior determination model that is different for each category. The behavior determination model M434 may apply a different behavior determination method according to classification information indicating the category of the object.

The posture determination model M433 may generate, from the analysis information, posture information about an object, including information about an N-dimensional (eg, one-dimensional, two-dimensional, three-dimensional, etc.) posture corresponding to the object. The posture information on the object may include information on the temporally continuous posture of the object.

The behavior determination model M434 may obtain, from the posture information on the object, behavior information about the object including the N-dimensional (eg, 1-dimensional, 2-dimensional, 3-dimensional, etc.) movement vector corresponding to the object. In addition, the behavior determination model M434 may generate, from the posture information on the object, behavior information that includes the N-dimensional (eg, 1-dimensional, 2-dimensional, 3-dimensional, etc.) behavior classification corresponding to the object and the context of the object. An object's context may include at least one of the object's state and the purpose of its behavior. The behavior classification may be calculated by identifying the position of the partial object included in the object, determining the posture of the object based on the position of the partial object, and determining the behavior classification of the object based on the posture of the object. A partial object may include at least one of a part of the object or an attribute belonging to the object.

Determination of posture and behavior may be performed two-dimensionally or three-dimensionally. Accordingly, the movement vector may include a two-dimensional movement vector including at least one of a two-dimensional direction and/or a velocity of the object. In addition, the movement vector may include a three-dimensional movement vector of the object calculated based on at least one of a position, a speed, and/or an acceleration of a photographing device that has captured an image.

In an embodiment in which the determination of posture and behavior is performed two-dimensionally, in step S3400, the detailed classification module 4300 may calculate, from temporally consecutive object images, two-dimensional motion information of the object including a two-dimensional motion vector of the object and a two-dimensional behavior classification of the object. The two-dimensional behavior classification may be calculated by performing a step (S3450a) of identifying the position of each partial object included in the object, a step (S3460a) of determining the two-dimensional posture of the object based on the relative position of each of the partial objects using the posture determination model M433, and a step (S3470a) of determining the two-dimensional behavior classification of the object based on the temporally continuous two-dimensional posture of the object using the behavior determination model M434.

Here, the two-dimensional motion vector represents the two-dimensional direction and speed of the object, and the two-dimensional behavior classification represents the type of action determined from the two-dimensional posture of the object.
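The two-dimensional motion vector can be illustrated with two object positions taken from temporally consecutive images. The following is a minimal sketch; the function name movement_vector_2d, the use of the object's image position in pixels, and the example values are assumptions made for illustration.

```python
import math
from typing import Tuple


def movement_vector_2d(prev_xy: Tuple[float, float], curr_xy: Tuple[float, float],
                       dt_seconds: float) -> Tuple[float, float, float, float]:
    """Return (vx, vy, speed, direction_deg) between two consecutive positions.

    Positions are in pixels, so speed is in pixels per second; direction is
    measured counterclockwise from the positive x-axis of the image.
    """
    if dt_seconds <= 0:
        raise ValueError("dt_seconds must be positive")
    vx = (curr_xy[0] - prev_xy[0]) / dt_seconds
    vy = (curr_xy[1] - prev_xy[1]) / dt_seconds
    speed = math.hypot(vx, vy)
    direction_deg = math.degrees(math.atan2(vy, vx))
    return vx, vy, speed, direction_deg


# The object moved 3 px right and 4 px up over one second.
vx, vy, speed, direction_deg = movement_vector_2d((0.0, 0.0), (3.0, 4.0), 1.0)
```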

On the other hand, in an embodiment in which the determination of posture and behavior is performed in three dimensions, in step S3400, the detailed classification module 4300 may calculate, from temporally consecutive object images, three-dimensional motion information of the object including a three-dimensional movement vector of the object and a three-dimensional behavior classification of the object. The three-dimensional behavior classification may be calculated by performing a step (S3450b) of identifying the position of each partial object included in the object, a step (S3460b) of determining the three-dimensional posture of the object based on the relative positions of each of the partial objects using the posture determination model M433, and a step (S3470b) of determining the three-dimensional behavior classification of the object based on the temporally continuous three-dimensional posture of the object using the behavior determination model M434.

Unlike the determination of the two-dimensional posture and behavior, the determination of the three-dimensional posture and behavior needs to reflect movement in the depth direction from the imaging device. The computing device 100 may therefore use the interlocking sensor 340 to calculate or estimate at least one of the position, speed, and acceleration of the photographing device, and may calculate the three-dimensional movement vector of the object by reflecting the result.

For example, the system of the present disclosure drives the gimbal motor 330 to rotate the axis of the gimbal to track a bicycle, which is an object running in the right direction based on the system, while calculating a three-dimensional movement vector of the bicycle. In this case, both the motion vector by the motor 330 and the motion vector in the image may be reflected in the 3D motion vector.

A standard focal length of the photographing apparatus 200 and/or a standard focal length of the entire image or the object image may be used to calculate the 3D position and 3D motion vector of the object. The relationship among the distance between the photographing device 200 and the object, the standard focal length, the three-dimensional height (actual height) of the object, the height of the image (ie, the vertical size of the image), the height of the object in the image, and the sensor height of the photographing device 200 is expressed by Equation 1 below.

Using Equation 1, the distance from the photographing apparatus 200 to the object or the three-dimensional height of the object may be calculated.
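Equation 1 itself does not survive in this text. The following is a minimal sketch, assuming the familiar pinhole-camera relation among the quantities listed above and assuming that the last listed quantity is the sensor height of the photographing device; it should not be read as the exact form of Equation 1. The function names and the numeric values are hypothetical.

```python
def distance_to_object_mm(focal_length_mm: float, real_height_mm: float,
                          image_height_px: float, object_height_px: float,
                          sensor_height_mm: float) -> float:
    """Pinhole relation (assumed):

    distance = f * real_height * image_height / (object_height * sensor_height)
    """
    if object_height_px <= 0 or sensor_height_mm <= 0:
        raise ValueError("object height and sensor height must be positive")
    return (focal_length_mm * real_height_mm * image_height_px) / (
        object_height_px * sensor_height_mm)


def object_real_height_mm(distance_mm: float, focal_length_mm: float,
                          image_height_px: float, object_height_px: float,
                          sensor_height_mm: float) -> float:
    """Same relation solved for the three-dimensional (actual) height of the object."""
    if focal_length_mm <= 0 or image_height_px <= 0:
        raise ValueError("focal length and image height must be positive")
    return (distance_mm * object_height_px * sensor_height_mm) / (
        focal_length_mm * image_height_px)


# A 1700 mm tall person spanning 500 of 3000 image rows, 4 mm lens, 4.8 mm sensor height.
distance_mm = distance_to_object_mm(4.0, 1700.0, 3000.0, 500.0, 4.8)
height_mm = object_real_height_mm(distance_mm, 4.0, 3000.0, 500.0, 4.8)
```

With these illustrative numbers the distance comes out at about 8.5 m. Either quantity can be recovered when the other is known, which is why a length reference object with a known actual height is useful for distance measurement.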

The distance reference line, which is one of the criteria used to measure the distance from the imaging device 200 to the object, may be, for example, a distance reference line set using a concentric sphere centered on the system of the present disclosure when two or more objects having the same length are detected. This line is composed of a reference plane circle and a measurement plane circle, which will be described later. The steps for setting it are as follows.

Referring to FIG. 7C, first, a reference plane circle 750 is generated in a plane perpendicular to gravity. To determine the direction perpendicular to gravity, at least one of an accelerometer, a gyroscope, and the gravity horizontal object may be used.

Here, the reference plane circle 750 corresponds to the cross-section obtained by cutting, in the xy plane, the surface of the concentric sphere 740 that has the system, that is, the imaging device 200, as its origin.

Next, same-length objects, which are objects having the same length among the length-like objects, are detected.

When the set of points on the concentric sphere of FIG. 7C separated by an angle θ from the z-axis is referred to as a measurement plane circle, the same-length objects are placed between the measurement plane circles for each angle θ, and the difference in angle (θ) coordinate values spanned by each object is measured.

Then, the distance from the photographing device 200 to each object may be measured using the difference in the angular coordinate values. In this case, at least one of the standard focal length and the lens and/or aperture size of the photographing device. (one or more) optical properties may be used as an auxiliary.
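One way to turn the angular difference into a distance is sketched below. It assumes that each object is seen roughly perpendicular to the line of sight and is centered on it, so that an object of length L spanning an angle Δθ lies at distance (L/2)/tan(Δθ/2). This geometric simplification and the function names are assumptions of the example, not details stated in the disclosure.

```python
import math


def distance_from_angular_size(length, delta_theta_rad):
    """Distance to an object of known length that spans delta_theta_rad."""
    return (length / 2.0) / math.tan(delta_theta_rad / 2.0)


def relative_distance(delta_theta_a_rad, delta_theta_b_rad):
    """Ratio d_a / d_b for two same-length objects; the length cancels out."""
    return math.tan(delta_theta_b_rad / 2.0) / math.tan(delta_theta_a_rad / 2.0)
```

Because the common length cancels in `relative_distance`, two same-length objects can be ordered by distance even when their actual length is unknown; an absolute distance additionally requires the length or an optical property such as the focal length.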

Meanwhile, the models listed in the present disclosure, including the behavior discrimination model M434, may be generated by supervised learning or reinforcement learning. It is known that an interpretation value (text or classification index) and video information corresponding to each behavior of an object can be used as training data for supervised learning. In addition, reinforcement learning may be used, in which the behavior information produced by the behavior discrimination model M434 is output to the user and the model is modified using the user's positive or negative feedback on that output.

Context

The detailed classification model M430 as a set may further include a context model M435 for inferring a context from the entire image. Context is a description of the object's state and the purpose of its behavior. As an example, if an image of a person peeling oranges with a knife in the kitchen is input, the context model M435 may output the sentence 'The person in the kitchen is peeling oranges with a knife' or a signal corresponding to that context.

Behavior refers to an action of an object that does not reflect context. For example, both a person running on a basketball court and a person running on a treadmill show the action of 'running'. Such behavior may be combined with a context to be described later, and an action in consideration of the context may be determined. According to this, the former corresponds to the act of 'playing basketball', and the latter corresponds to the act of 'using a treadmill'.

Specifically, the context may be an object context that includes at least one of an action and a state of each individual object displayed in the entire image, and a context interaction object, which is another object detected as interacting with the individual object through the action.

In addition, the context may be a spatial context that includes at least one of: a place, which is the type of a spatial object inferred from the spatial object and from the individual objects other than the spatial object displayed in the entire image; the action and state of each individual object; an actor, which is the individual object that is the subject of the action; and a context interaction object, which is an object detected as interacting with the individual object through the action.

In an embodiment, the synthesis of an action and a context for determining a context-considered action may be implemented as supervised learning using an artificial neural network model. For example, an artificial neural network model can be trained using training data in which a sequence of images, that is, one scene of a video, is the input data and a language interpretation (i.e., data expressed in language) including the behavior and spatial context of an object is labeled as the correct output data.

This is described in the non-patent literature article Wu, Z., Yao, T., Fu, Y., & Jiang, Y.-G. (2017). Deep learning for video classification and captioning. In Frontiers of Multimedia Research (pp. 3-29). ACM.

Meanwhile, a more specific method for determining the behavior of each object is described in Wu, C.-Y., Girshick, R., He, K., Feichtenhofer, C., & Krahenbuhl, P. (2020). A Multigrid Method for Efficiently Training Video Models. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. https://doi.org/10.1109/cvpr42600.2020.00023.

On the other hand, there may be cases where the action is heterogeneous with the spatial context, and in this case, there is room for the action to be interpreted out of context. For example, for an object that is a person running in a cafe, the action can be interpreted as 'moving in a hurry' rather than 'running'. This discrepancy arises because the behavior does not sufficiently reflect the context or the context does not sufficiently reflect the behavior. That is, since the action can be inferred from the context, and the context can also be inferred from the action, a cyclical process of inferring the context from the action and inferring the action back from the context may be required. This may be achieved, for example, by a neural network model such as a recurrent neural network (RNN) or a bidirectional LSTM, as disclosed in the non-patent literature article Ullah, A., Ahmad, J., Muhammad, K., Sajjad, M., & Baik, S. W. (2018). Action Recognition in Video Sequences using Deep Bi-Directional LSTM With CNN Features. IEEE Access, 6, 1155-1166. https://doi.org/10.1109/access.2017.2778011.

Priority

Now, the priority of the object, which is closely related to the importance of the aforementioned object, will be described in detail. Here, since the importance is a value used to assign priority, in general the importance of each object is calculated first, and the order based on the importance may then be used as the priority.

The priority of the object may include an authority-based priority that is designated for each of the objects based on the authority of the object. This assigns a priority according to the privileges of each classified object. For example, if an object that is a specific user has the highest privilege for the system of the present disclosure, a high priority can be given to that specific user.

The authority-based priority may be calculated by performing the steps of: determining, based on the characteristics of the object analyzed using the detailed classification model M430, whether the object is a predetermined authority (that is, a person authorized to handle the computing device); and, if the object is a predetermined authority, setting a priority predetermined for that authority as the authority-based priority of the object.

For example, since the authority is expected to be a person, it is possible to first detect an object that is a person, and set the authority-based priority only for the object that is the detected person.
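The two steps above (restrict to persons, then look up the predetermined priority) can be sketched as follows. The `DetectedObject` record, the `authorized` lookup table, and the convention that a larger number means a higher priority are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DetectedObject:
    category: str            # e.g. "person", "animal", "thing"
    identity: Optional[str]  # individual resolved from detailed classification, if any


def authority_priority(obj, authorized, default=0):
    """Return the authority-based priority of one object.

    `authorized` maps an identity to its predetermined priority
    (a hypothetical table; larger means higher priority here).
    """
    # Only persons are considered as candidates for an authority.
    if obj.category != "person":
        return default
    return authorized.get(obj.identity, default)
```

Objects that are not persons, and persons who are not in the table, fall back to the default priority.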

In addition, the computing device of the present disclosure can specify an individual from among a plurality of people by using the above-described deep classification information or an object record of the object.

On the other hand, the priority of the object may include a classification-based priority designated for each of the object sets divided according to at least one characteristic, among the characteristics of the object, including at least the category of the object or the deep classification information of the object.

The classification-based priority may be a priority given in advance to categories such as people, animals, and other objects, or to the deep classification information, and the authority of the system may set it manually. As another example, the computing device of the present disclosure may automatically and variably set a priority for each category or item of deep classification information based on the pattern in which the authority of the system uses the system of the present disclosure.

Next, the priority of the object may include a size-based priority designated for each of the object sets divided according to at least one characteristic including at least the size of the object among the characteristics of the object.

The size-based priority may be set by the detailed classification module 4300 performing the steps of: acquiring the object segmentation of each object; selecting, from among the objects, an object whose object segmentation occupies a ratio of the entire image equal to or greater than a predetermined ratio; and setting the size-based priority according to the size of the object segmentation of the selected object. This assigns a higher priority on the basis that the larger the object, or the shorter the distance between the system of the present disclosure and the object, the larger the portion of the entire image occupied by the object segmentation.
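A minimal sketch of this selection and ranking follows. The input format (a mapping from object name to segmentation area in pixels), the default threshold of 5%, and the convention that rank 1 is the highest priority are assumptions of the example.

```python
def size_based_priorities(segment_areas_px, image_area_px, min_ratio=0.05):
    """Rank objects by segmentation size; rank 1 is the highest priority.

    Objects whose segmentation covers less than `min_ratio` of the entire
    image are not selected and receive no size-based priority.
    """
    selected = {name: area for name, area in segment_areas_px.items()
                if area / image_area_px >= min_ratio}
    ranked = sorted(selected, key=selected.get, reverse=True)
    return {name: rank for rank, name in enumerate(ranked, start=1)}
```

For instance, with a 100,000-pixel image, a person covering 30,000 pixels and a dog covering 12,000 pixels are ranked first and second, while a cup covering 1,000 pixels falls below the threshold.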

In addition, the priority of the object may include an action-based priority designated for each of the object sets divided according to at least one state including the behavior of the object among the states of the object.

Specifically, the detailed classification module 4300 may search for an object performing a specific action and assign a priority corresponding to the action.

For example, when an act of a person falling is detected, the highest priority may be set for the person who has fallen down. As another example, if there is an object that is a person who gives a command with a specific gesture such as a swipe in the air, a high priority may be given to the object that is the commander.

As a final example, the priority of the object may include a context-based priority designated for each of the object sets divided according to at least one state, among the states of the object, including at least the context of the object.

For the context-based priority, the detailed classification module 4300 may infer an object context and a spatial context from the entire image by performing a state analysis or a spatial analysis on the entire image, and may set a relatively high priority for an object based on whether, according to at least one of the inferred object context and spatial context, the object is an actor or a context interaction object, that is, an object detected as interacting with the actor through an action of the actor.

For example, if the spatial context is 'people of the age of children are playing baseball on the playground', a relatively high priority may be given to a person detected as holding a baseball bat.

It will be understood by those skilled in the art that the steps and processes described in this disclosure need not be performed in the order described unless logically contradictory or otherwise indicated by context, and that each of the steps and processes may be performed concurrently or at different times.

In addition, the above steps may be performed once, but preferably the steps are performed in real time and/or iteratively so that temporally continuous images can be obtained as described above.

That is, the image processing method of the present disclosure may further include a step (S4000) of returning to the image acquisition step (S1000) in order to acquire a new entire image, in which case the image acquisition step (S1000) may be performed again with the tracking controller 4400 controlling the available resources.

Here, the tracking controller 4400 is a module like the image input module 4100, the object analysis module 4200, and the detailed classification module 4300. It implements the movement of the photographing device, i.e., tracking, so that the device can be directed toward a space; the available resources refer to the hardware and/or software that enables such tracking.

In an embodiment, the tracking controller 4400 may include a target position determination module 4420 that determines a target position and a magnification corresponding to a composition to be captured by the photographing device 200, and a tracking resource controller 4440 that controls the available resources, that is, hardware and software resources, to obtain the entire image at the target position.

Target position determination module

In the embodiment illustrated in FIG. 8A, the target position determination module 4420 may perform: a step (S4110a) of obtaining an object priority of at least one candidate object appearing in the entire image; a step (S4120a) of obtaining an object segmentation of the candidate object; a step (S4130a) of identifying a context interaction object, which is an object detected as interacting with the candidate object according to the context inferred from the entire image; a step (S4140a) of determining the composition to be photographed by the photographing apparatus 200 so as to include at least one target object, which is a candidate object having an object priority of a predetermined rank or higher, and the context interaction object of the target object; and an additional composition prediction step (S4150a) of determining the target position and magnification by expanding the composition according to the direction of the target object.

The context interaction object may be identified by the context model M435. That this context model M435 can be generated by supervised learning is disclosed in the non-patent literature article Shin, D., & Kim, I. (2018). Deep Image Understanding Using Multilayered Contexts. Mathematical Problems in Engineering, 2018, 1-11.

More specifically, the additional composition prediction step (S4150a) may include obtaining the direction of the target object (S4152a) and obtaining the speed of the target object (S4154a). As described above, the direction of the target object may be obtained based on relative positions between partial objects included in the target object, and the speed of the target object, that is, a 3D motion vector, may be calculated based on a 2D motion vector obtained from an image and the information of the sensor 340.

Next, the additional composition prediction step (S4150a) may further include calculating an influence range of the target object, which is a range in which interaction with the target object is possible, based on the direction and speed of the target object (S4156a), and determining the target position and magnification by expanding the composition to reflect the influence range (S4158a).

Here, the influence range is the range of space that the target object is likely to occupy at least temporarily over a predetermined time range starting from the present, or that is likely to be occupied by a context interaction object with which the target object is likely to interact. That is, it refers to a range of space that the target object can physically contact, for example, a range that a human body part can reach, or a range in which a context interaction object with which the target object can exchange signals through the eyes, ears, mouth, etc. can exist, for example, a field of view (FOV) in which signals are exchanged with the target object.

As an example, in step S4156a, the influence range of the eyes may be 20 m, and the influence range of the hand may be 1 m. This may be combined with the behavior information (action classification) of each object and used to adjust the composition. For example, if 'a child standing with a bat in the batter's box' is the target object, the composition may be adjusted in step S4158a to include the direction of the child's gaze and the bat object held in the child's hand.
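The expansion in step S4158a can be pictured with the following sketch, which grows an axis-aligned composition box in the direction the target object faces by the length of its influence range. Representing the composition as a box `(x0, y0, x1, y1)`, the direction as a unit vector, and the range in the same units as the box are all simplifying assumptions of this example.

```python
def expand_composition(box, direction, influence_range):
    """Grow `box` so it also covers the box shifted along `direction`.

    box:             (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1
    direction:       (dx, dy) unit vector of the target object's direction
    influence_range: reach along that direction, in the same units as box
    """
    x0, y0, x1, y1 = box
    dx, dy = direction
    shift_x, shift_y = dx * influence_range, dy * influence_range
    return (min(x0, x0 + shift_x), min(y0, y0 + shift_y),
            max(x1, x1 + shift_x), max(y1, y1 + shift_y))
```

The box only grows on the side the object faces, so an object looking to the right gains space on the right while the left edge stays where it was.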

Meanwhile, in another embodiment illustrated in FIG. 8B, the target position determination module 4420 may determine the composition to be photographed based on the priority of the object. Specifically, the target position determination module 4420 may perform: a step (S4120b) of obtaining an object priority of at least one candidate object appearing in the entire image; a step (S4140b) of determining at least one target object, which is a candidate object having an object priority of a predetermined priority or higher; and a step (S4160b) of determining the composition to be photographed by the photographing apparatus 200 so as to include the target object.

Referring to FIG. 8C, in another embodiment reflecting the priority of the object, the target position determination module 4420 may perform: a step (S4120c) of obtaining the object priority of at least one candidate object appearing in the entire image; a step (S4140c) of generating a virtual segmentation including the candidate object according to the object priority of the candidate object, wherein the virtual segmentation is generated so that its area becomes larger as the object priority is higher; a step of calculating an attractive force between the virtual segmentations that increases monotonically with the area of the virtual segmentation according to a predetermined increasing function, a first repulsive force between the candidate objects that decreases monotonically with the distance between the candidate objects according to a predetermined first decreasing function, and a second repulsive force between the candidate object and a frame boundary of the entire image according to a predetermined second decreasing function; and a step (S4180c) of determining the target position and magnification corresponding to the composition by calculating the target center, which is the center point of the at least one candidate object in the equilibrium state of these forces.
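The sketch below illustrates only the virtual segmentation of step S4140c and then, as a deliberately simpler stand-in for the force equilibrium, takes the area-weighted centroid of the virtual segmentations as the target center. It does not implement the attractive and repulsive forces described above. The box format `(cx, cy, w, h)`, the non-negative `priority_score` (larger means higher priority), and the `gain` parameter are hypothetical.

```python
def virtual_segmentation(box, priority_score, gain=1.0):
    """Scale a box about its centre; higher priority gives a larger area."""
    cx, cy, w, h = box
    scale = 1.0 + gain * priority_score  # assumes priority_score >= 0
    return (cx, cy, w * scale, h * scale)


def weighted_target_center(boxes_with_scores, gain=1.0):
    """Area-weighted centroid of the virtual segmentations.

    This is a simplified substitute for the equilibrium of forces:
    larger (higher-priority) virtual segmentations pull the centre harder.
    """
    total_area = 0.0
    sum_x = sum_y = 0.0
    for box, score in boxes_with_scores:
        cx, cy, w, h = virtual_segmentation(box, score, gain)
        area = w * h
        total_area += area
        sum_x += cx * area
        sum_y += cy * area
    return (sum_x / total_area, sum_y / total_area)
```

With two equally sized objects, the center moves toward the one with the higher priority score, which is the qualitative behavior the area-dependent attraction is meant to produce.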

Tracking resource controller

The tracking resource controller 4440 may include an image conversion controller 4442, which controls acquisition of the new image from the image already acquired from the photographing device 200 without controlling the posture of the gimbal 300 and the photographing device 200, and a frame change controller 4444, which, when the computing device 100 interworks with the gimbal 300 on which the photographing device 200 is mounted and which has one or more rotation axes for controlling the posture of the photographing device 200, controls the gimbal 300 and the photographing device 200 so that the photographing device 200 acquires the new image.

The frame change controller 4444 and the image conversion controller 4442 may operate to complement each other. The tracking resource controller 4440 may first attempt to acquire a new image of a desired resolution through the frame change controller 4444 and, if that fails, control the acquisition of a new image through image reconstruction by the image conversion controller 4442.

In a specific embodiment, the image conversion controller 4442 may perform: a step (S4310) of determining whether there is an idle resource capable of performing image reconstruction; a step (S4320) of, if the idle resource exists, loading the previously obtained entire image or a portion of the entire image into the memory as an original image; and a step (S4330) of acquiring the new image according to the target position and magnification by cropping the original image according to the target position and magnification, or by performing super-resolution when the resolution of the original image is less than a predetermined threshold.

It is known to those skilled in the art that super-resolution, for example, can be performed using a neural network such as an autoencoder.
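The fallback between the two controllers and the cropping of step S4330 can be sketched as follows. The image is represented as a plain list of pixel rows, the two acquisition paths are passed in as callables, and the resolution check is reduced to a row count; these representations and all names are assumptions of the example, and the super-resolution model itself is not shown.

```python
def crop_for_target(image, center_xy, magnification):
    """Crop `image` (list of rows) around `center_xy` for a magnification >= 1."""
    height, width = len(image), len(image[0])
    crop_w = max(1, round(width / magnification))
    crop_h = max(1, round(height / magnification))
    cx, cy = center_xy
    # Clamp the window so it stays inside the original image.
    left = min(max(cx - crop_w // 2, 0), width - crop_w)
    top = min(max(cy - crop_h // 2, 0), height - crop_h)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def needs_super_resolution(image, min_height_px):
    """True when the cropped image falls below the resolution threshold."""
    return len(image) < min_height_px


def acquire_new_image(try_frame_change, reconstruct_image):
    """Prefer the frame change path; fall back to image reconstruction."""
    image = try_frame_change()  # expected to return None on failure
    if image is not None:
        return image
    return reconstruct_image()
```

A caller would typically crop first and then pass the result to a super-resolution step only when `needs_super_resolution` reports that the crop is too small.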

Meanwhile, the frame change controller 4444 may include at least the gimbal controller 4444a among: a gimbal controller 4444a that controls the direction of the photographing device 200 by operating the one or more rotation axes of the gimbal 300 to reach the target position; a zoom controller 4444b that controls zoom-in and zoom-out of the photographing device 200 to achieve the magnification; and a front and rear photographing device controller 4444c that controls photographing of the environment while reducing the operation of hardware among the available resources.

The photographing apparatus 200 may be composed of two or more devices, for example, photographing apparatuses 200 mounted on the front and rear surfaces of the portable terminal 100, respectively. The front and rear photographing device controller 4444c can scan the entire space surrounding the system, that is, the surrounding space, using the front photographing device 200a (not shown) and the rear photographing device 200b, and the photographing apparatus 200 to be used may be selected from the front and rear photographing apparatuses according to the user's convenience.

For example, the front and rear photographing device controller 4444c may select the front photographing device 200a as a photographing device to be used to acquire an image of the user when the user is looking at the display of the portable terminal 100 .

As another example, after an object has been recognized with the front photographing device 200a, if an image of the object with a higher resolution than that of the front photographing device 200a is required, the front and rear photographing device controller 4444c may control the object to be recognized with the rear photographing device 200b, which has the higher resolution.

In addition, the front and rear photographing device controller 4444c can use the front photographing device 200a and the rear photographing device 200b simultaneously or sequentially to obtain images of the surrounding space while minimizing the rotation of the gimbal 300 axis.

Meanwhile, so that the composition includes all of at least one specific object or specific partial object, the frame change controller 4444 may control the orientation of the photographing apparatus 200 to one of a portrait direction and a landscape direction according to the width-to-height ratio (the horizontal length divided by the vertical length) of the specific object or the specific partial object. For example, this may be achieved by rotation of the roll axis R of the gimbal 300. In this ratio, the width may mean the extent perpendicular to the gravity vector, and the height may mean the extent parallel to the gravity vector. Accordingly, the width-to-height ratio may be a ratio in which the extent along one reference plane is taken as the horizontal length. The reference plane may be a plane perpendicular to the gravity vector and/or a plane in contact with the floor. However, the reference plane is not limited thereto, and various planes may be set as the reference plane. The tendency of the ratio (e.g., 1 or more, or 1 or less) may not change depending on the direction of the object.

Specifically, the frame change controller 4444 may control the orientation of the photographing device to one of a vertical direction and a horizontal direction so that the object segmentation of the at least one specific object or specific partial object, or an object box including it, does not contact the frame boundary of the entire image.

An embodiment of the frame change controller 4444 for orienting the photographing device is described in more detail with reference to FIG. 9A. The frame change controller 4444 may perform: a step (S4210a) of, when the object segmentation or the object box of the at least one specific object or specific partial object makes a first contact with the frame boundary of the entire image, resolving the first contact by controlling the gimbal to move in the direction opposite to the first contacted position, within the limit that the center point of the object segmentation or the object box remains in the composition; and a step (S4220a) of, when the object segmentation or the object box makes a second contact with the frame boundary line on the side opposite the first contacted position after the movement, resolving the second contact through at least one of (i) an act of switching the orientation of the photographing device from one of a portrait and a landscape orientation to the other, and (ii) an act of controlling the zoom-in and zoom-out of the photographing device.
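The decision logic of FIG. 9A can be sketched as below: detect which frame edges an object box touches and, after the gimbal has moved to resolve a first contact, check whether the box now touches the opposite edge. The box format, the edge names, and the returned action strings are hypothetical, and the gimbal movement itself is outside the sketch.

```python
OPPOSITE = {"left": "right", "right": "left", "top": "bottom", "bottom": "top"}


def touched_edges(box, frame_w, frame_h):
    """Return the set of frame edges that the box (x0, y0, x1, y1) touches."""
    x0, y0, x1, y1 = box
    edges = set()
    if x0 <= 0:
        edges.add("left")
    if x1 >= frame_w:
        edges.add("right")
    if y0 <= 0:
        edges.add("top")
    if y1 >= frame_h:
        edges.add("bottom")
    return edges


def plan_after_move(first_contact, box_after_move, frame_w, frame_h):
    """Decide the next action once the gimbal has moved (after S4210a).

    If the box now touches the edge opposite the first contact, the object
    does not fit in this orientation or zoom, so S4220a applies.
    """
    if OPPOSITE[first_contact] in touched_edges(box_after_move, frame_w, frame_h):
        return "switch_orientation_or_zoom"
    return "done"
```

A second contact on the opposite side signals that moving the gimbal alone cannot resolve the problem, which is why the fallback is a change of orientation or zoom rather than another movement.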

Another embodiment of the frame change controller 4444 will be described in detail with reference to FIG. 9B. The frame change controller 4444 may perform: a step (S4210b) of acquiring a ratio characteristic among the characteristics of the at least one specific object or specific partial object, wherein the ratio characteristic includes a per-category ratio characteristic, which is a width-to-height ratio (the horizontal length divided by the vertical length) predetermined for each category of the specific object or specific partial object or calculated for each category by measuring the specific object or specific partial object, a per-detailed-classification ratio characteristic, which is a width-to-height ratio predetermined for each detailed classification of the specific object or specific partial object or calculated for each detailed classification by measuring the specific object or specific partial object, and the width-to-height ratio of the composition determined by the target position determination module; and a step (S4220b) of determining, based on the ratio characteristic, whether to perform roll rotation, which is an operation for changing the orientation of the photographing device from one of a portrait and a landscape orientation to the other, and adjusting the orientation of the photographing device accordingly.

For example, since the width-to-height ratio (the horizontal length divided by the vertical length) of a television is greater than 1, the orientation of the photographing apparatus 200 may be adjusted to the horizontal direction, and since the width-to-height ratio of a standing person is 1 or less, the orientation of the photographing apparatus 200 may be adjusted to the vertical direction.

After step S4220b, the frame change controller 4444 of this embodiment may further perform a step (S4230b) of, when the object segmentation or object box of the at least one specific object or specific partial object comes into contact with the frame boundary line of the entire image, resolving the contact through at least one of (i) an act of switching the orientation of the photographing device from one of a portrait and a landscape orientation to the other, and (ii) an act of controlling the zoom-in and zoom-out of the photographing device, so that the at least one specific object or specific partial object is included in the composition. Here, if the object segmentation or the object box does not contact the frame boundary line, the process ends after step S4220b and step S4230b is not performed.
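The rule in this example reduces to a comparison of the width-to-height ratio with 1, as the following sketch shows. The function names and the orientation labels are chosen for the example only.

```python
def choose_orientation(width, height):
    """Landscape when width / height > 1, otherwise portrait."""
    return "landscape" if width / height > 1 else "portrait"


def needs_roll_rotation(current_orientation, width, height):
    """True when the roll axis must rotate to reach the chosen orientation."""
    return choose_orientation(width, height) != current_orientation
```

A 16:9 television yields landscape, while a standing person roughly 0.5 m wide and 1.7 m tall yields portrait, so a device currently held in landscape would need a roll rotation only in the second case.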

OCR (Optical Character Recognition)

Returning now to the detailed classification model M430, OCR performed by this model will be described. The detailed classification model M430 may further include an OCR model M436 for reading the text displayed on the surface of the object. Accordingly, the processor 120 of the computing device 100 may obtain a character included in the object from the analysis information corresponding to the object by using the OCR model M436.

Correspondingly, the above-described step S3400 may include a step (S3500) in which the detailed classification module 4300 performs OCR using the OCR model M436 of the detailed classification model M430 to calculate the text of the object as a part of the characteristics of the object.

Describing step S3500 specifically, the OCR model M436 may first perform a step (S3520) of determining whether characters are displayed on the surface of the object.

In step S3520, it is known that the object on which the character is displayed can be determined by detecting the character using deep learning or a conventional OCR technique.

Following step S3520, if characters are displayed on the surface of the object, the OCR model M436 may perform an OCR performing step (S3540) of performing OCR on the displayed characters (e.g., all of the characters and/or each character set) and inputting (saving) the resulting text as a characteristic of the object.

According to an embodiment of the present disclosure, the OCR model M436 may perform the steps of determining whether characters are displayed on the surface of the object and, when it is determined that characters are displayed on the surface of the object, performing OCR on all of the displayed characters.

According to another embodiment of the present disclosure, the OCR model M436 may perform the steps of determining whether characters are displayed on the surface of the object and, when it is determined that characters are displayed on the surface of the object, performing OCR for each character set.
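The two-stage flow of steps S3520 and S3540 can be outlined as follows. The text detector and the recognizer are passed in as callables because the disclosure does not fix a particular OCR engine; the dictionary layout of the object record and all names are hypothetical.

```python
def ocr_step(obj, has_text, read_text):
    """Run detection (S3520) and, only if text is present, recognition (S3540).

    obj:       dict with an "image" entry and a "characteristics" dict
    has_text:  callable(image) -> bool, a stand-in for text detection
    read_text: callable(image) -> str, a stand-in for the OCR engine
    """
    if has_text(obj["image"]):                                    # S3520
        obj["characteristics"]["text"] = read_text(obj["image"])  # S3540
    return obj
```

Skipping recognition when no text is detected keeps the more expensive OCR call off objects that carry no characters.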

How to distinguish character sets

At least one of the processor 120 or the OCR model M436 may classify a character set to which a character belongs based on at least one of a type, a shape, a size, and an arrangement of the character. In addition, the processor 120 may distinguish a character set with a high probability of being closely related in the interpretation of one context.

A method of distinguishing the character set to which a character belongs, or the character set that belongs to a context, may be as follows. For example, at least one of the processor 120 and the OCR model M436 may classify the character set based on the character set characteristics.

Character set characteristics may include one or more of: the language type of the characters, line spacing, letter spacing (kerning), length (e.g., aspect ratio), size, thickness, color, font, style, the characters at the beginning or end of a sentence, blank space (e.g., the space to the next sentence or to the boundary of at least one of an object or a partial object), or the position at which the characters are displayed on the object. Accordingly, at least one of the processor 120 or the OCR model M436 may distinguish a character set using one or more of these characteristics.

For example, at least one of the processor 120 or the OCR model M436 may classify a completed sentence or context according to the language type to which a character belongs. When the recognized characters are multilingual, as on the surface of a product manual, at least one of the processor 120 or the OCR model M436 may classify a completed sentence or context according to the language type to which each character belongs.

As another example, in the case of sentences having the same size and font, at least one of the processor 120 or the OCR model M436 may classify paragraphs based on at least one of line spacing, letter spacing, length, and color.

As another example, at least one of the processor 120 and the OCR model M436 may classify the character set based on the top or start position of one paragraph. At the top or starting position of a paragraph, the character or title of the paragraph may be displayed in a bold style or in another size.
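A minimal sketch of such grouping is given below, using only three of the listed characteristics: language type, size, and line spacing. The `TextLine` record, the choice of characteristics, and the spacing threshold expressed as a multiple of the font size are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    text: str
    language: str     # e.g. "en", "ko"
    font_size: float  # character height, in pixels
    top: float        # y coordinate of the line's top edge, in pixels


def group_character_sets(lines, max_gap_factor=1.5):
    """Group lines into character sets, scanning from top to bottom.

    A line joins the previous set when it has the same language and font
    size and its top edge lies within max_gap_factor * font_size of the
    previous line's top edge; otherwise it starts a new set.
    """
    groups = []
    for line in sorted(lines, key=lambda item: item.top):
        if groups:
            prev = groups[-1][-1]
            same_style = (prev.language == line.language
                          and prev.font_size == line.font_size)
            close = (line.top - prev.top) <= max_gap_factor * prev.font_size
            if same_style and close:
                groups[-1].append(line)
                continue
        groups.append([line])
    return groups
```

Other characteristics from the list, such as color or style, could be added to the `same_style` comparison in the same way.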

Objectification of character sets

The processor 120 and/or the OCR model M436 may classify the character set based on the above-described 'character set discrimination method'. The processor 120 and/or the OCR model M436 may recognize a character set object based on the character set. The character set object may be a character set set as one target. The character set object may include at least one of a character image, text information, or character set characteristics. For example, a character set object may contain a character image prior to OCR or natural language understanding (processing). For another example, the character set object may contain text information after OCR or natural language understanding (processing). Here, the character set object may be divided into a plurality of character set objects after OCR or natural language understanding (processing).

The processor 120 and/or the OCR model M436 may distinguish one or more character sets displayed on the surface of the object or partial object. The processor 120 and/or the OCR model M436 may perform context analysis based on at least one of characteristics or state information of an object in which a character set is displayed or a partial object. Properties and/or states of objects and/or partial objects may be included in character set properties.

The processor 120 and/or the OCR model M436 may use the type and/or location of the object or partial object on which the character set is indicated as additional information for natural language understanding (processing) and/or context interpretation.

For example, when the processor 120 and/or the OCR model M436 determines that the text displayed on a shirt (object or partial object) and the text displayed on a car (object or partial object) are the same, the text displayed on either the shirt or the car (e.g., a brand name) may be used as additional information. Here, the processor 120 and/or the OCR model M436 may determine that the text displayed on the car is relatively more likely to reflect the characteristics of the object and/or of the overall context, and may use the text displayed on the car as additional information.

When a plurality of characters are present in the object, the processor 120 and/or the OCR model M436 may acquire characteristics of the object based on the positions where the plurality of characters are displayed. The plurality of characters may have different meanings according to their positions. For example, when text displayed on the front of a shirt (object or partial object) and text displayed on a label of the shirt (e.g., a care label) exist at the same time, the processor 120 and/or the OCR model M436 may determine that the text displayed on the label is relatively more likely to reflect the characteristics of the object, and may use the text displayed on the label as additional information.
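The position-dependent choice described above can be sketched as a lookup of prior weights. The position names and the numeric weights in POSITION_PRIOR are invented for illustration only; the disclosure does not specify how the probability of reflecting the object's characteristics is computed.

```python
# Hypothetical priors: how likely text at a position describes the object itself.
POSITION_PRIOR = {"care_label": 0.9, "tag": 0.8, "front": 0.3, "back": 0.3}


def pick_descriptive_text(texts, prior=POSITION_PRIOR, default=0.1):
    """Return the text whose display position has the highest prior.

    texts: list of (position, text) pairs found on one object.
    """
    if not texts:
        return None
    _position, text = max(texts, key=lambda item: prior.get(item[0], default))
    return text
```

With these assumed weights, text on a care label is preferred over text printed on the front of the shirt.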

The processor 120 and/or the OCR model M436 may perform OCR and/or natural language understanding (processing) of the objectified character set for each character set object. The processor 120 and/or the OCR model M436 may perform OCR and/or natural language understanding (processing) of one or more character set objects based on character set characteristics. Here, the processor 120 and/or the OCR model M436 may sequentially or simultaneously perform OCR and/or natural language understanding (processing) of one or more character set objects based on character set characteristics.

The unit of OCR and/or natural language understanding (processing) is not limited to the entire character of the surface of an object or a partial object, and a character or character set classified by the above-described character set characteristic may be the minimum unit.

Meanwhile, the OCR performing step ( S3540 ) may be performed in various ways.

For example, the step (S3540) of performing OCR on characters (e.g., the entire character and/or a character set), performed by the processor 120 of the computing device 100, may include: extracting at least one image sample; determining, from the at least one image sample, a boundary line of a region in which characters are displayed; generating a text display image (e.g., an entire character display image and/or a character set display image) based on at least one of the boundary line or a boundary point, which is a point belonging to the boundary line, among the at least one image sample; and performing OCR on the text display image.

The image sample may be an image pattern existing at a boundary line and/or a boundary point. The image pattern may include at least one of some of the characters, a boundary part of a character, a part of a character, or a background. The background may mean an image pattern that does not constitute a character.

The generating of the text display image based on at least one of a boundary line or a boundary point, which is a point belonging to the boundary line, among the at least one image sample may include: acquiring, as boundary markers, image patterns located at at least one of the boundary line or the boundary point; and generating a text display image, which is a partial image of the image of the object on which the text is displayed and which includes the boundary markers. The boundary markers may include at least one of a start marker and/or an end marker. The start marker may be an image pattern corresponding to the start character of the text. The end marker may be an image pattern corresponding to the end character of the text.

The step of acquiring, as boundary markers, image patterns located at at least one of a boundary line or a boundary point among the at least one image sample may include: acquiring, using the tracking controller 4400, a text image that includes the start marker and has a character recognition rate equal to or greater than a threshold value; and determining whether an end marker is included in the text image.

Also, when the end marker is included in the text image, the processor 120 may determine the text image as the text display image.

When the end marker is not included in the text image, the processor 120 may use the tracking controller 4400 to obtain an additional text image that includes the marker following the last of the boundary markers included in the text image and that has a character recognition rate equal to or greater than the threshold. The processor 120 may generate a merged text image by merging the additional text image with the text image. When the end marker is included in the merged text image, the processor 120 may determine the merged text image as the text display image.

As another example, the step (S3540) of performing OCR on characters, performed by the processor 120 of the computing device 100, may include: obtaining, using a tracking controller, an image of an object that includes the starting point of the characters and has a character recognition rate equal to or greater than a threshold value; performing OCR on the first sentence region of the image of the object; and calculating a first semantic value, which is a numerical value quantifying the meaning of the first text, by performing natural language understanding (NLU) on the first text, which is the primary result of the OCR.

When the first semantic value is equal to or greater than the threshold, the processor 120 may determine the first text as the result text resulting from the OCR.

When the first semantic value is less than the threshold, the processor 120 may perform OCR on the sentence region following the first sentence region. The processor 120 may calculate a second semantic value, which is a numerical value quantifying the meaning of the second text, by performing natural language understanding on the second text, which is the result of that OCR. Then, when the second semantic value is equal to or greater than the threshold, the processor 120 may determine the second text as the result text of the OCR.

The first sentence region may be the region occupied by the sentence detected as being arranged first according to the character arrangement convention of the language used in the sentence.

The computing device 100 may interwork with the photographing device 200 and the gimbal 300. For example, the computing device 100 may include the photographing device 200 and may interwork with a gimbal 300 that controls the posture of the photographing device 200.

The processor 120 of the computing device 100 may control the direction of the photographing device 200 by operating one or more rotation axes of the gimbal 300 using the tracking controller 4400 .

The processor 120 may acquire an image through zoom-in and/or zoom-out of the photographing device 200 using the tracking controller 4400 .

FIGS. 10A to 10C are flowcharts illustrating methods that may be used to perform OCR in the image processing method of the present disclosure, and FIGS. 11A to 11D are drawings illustrating the methods of performing OCR in the image processing method of the present disclosure.

First, referring to FIGS. 10A and 11A, the OCR performing step (S3540a) in the embodiment using boundary markers starts with a step (S3542a) of extracting image samples from the object on which the characters are displayed (e.g., the paper document of reference numeral 1110), followed by a step (S3544a) of defining, from the image samples, a boundary line (e.g., the closed curve of reference numeral 1130) of the area in which the characters are displayed.

Here, the image sample is an image pattern existing at the position of the boundary line or boundary point, and the image pattern may be composed of some characters, a boundary part of the characters, a part of the characters, or a background. The boundary line refers to a boundary line of a group of characters displayed on the surface of an object, and the boundary point refers to a set of points that may constitute the boundary line. Here, the background refers to an image pattern that does not constitute a character by itself.

Following the step (S3544a), the OCR performing step (S3540a) of this embodiment further includes a boundary marker acquisition step (S3546a) of acquiring, as boundary markers, image patterns located at the boundary line or at a boundary point, which is a point belonging to the boundary line, among the image samples. In this step, among the boundary markers, the image pattern corresponding to the start character of the entire character may be referred to as a start marker, and the image pattern corresponding to the end character of the entire character may be referred to as an end marker.

The boundary marker functions as a marker for determining a coordinate region of a space occupied by each character belonging to the entire character in the object image.

The boundary part of a character may be used as one of the boundary markers, that is, as one of the image patterns located at the boundary line or at a boundary point belonging to the boundary line. The boundary part of a character refers to a portion of the image that forms part of a character lying at the extent of a sentence along its width or length. For example, in the Korean sentence romanized as "Kanadaramabashi", the stroke 'ㄱ' of the first letter and the stroke 'ㅣ' of the last letter are each part of a letter and can serve as boundary markers at the edge of the text.

Also, a part (or all) of the characters may be used as boundary markers. For example, in the sentence "ABCDEFG", A and G may be used as boundary markers, being characters located at a boundary point and/or on the boundary line.

Acquisition of the start marker or the end marker reflects the arrangement of the characters according to the language of the sentences constituting the entire character. For example, in OCR for books in Korean and English, as illustrated in FIG. 11, the image sample 1132 at the upper left of the area where the characters are displayed may be a start marker, and the image sample 1134 at the lower right of the area may be an end marker.
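For a left-to-right, top-to-bottom script, the choice of start and end markers described above can be sketched as picking the samples nearest the upper-left and lower-right corners. The function name pick_start_end and the use of the coordinate sum as the corner criterion are assumptions made for illustration.

```python
def pick_start_end(samples):
    """Pick start and end markers for left-to-right, top-to-bottom text.

    samples: list of (x, y) positions of boundary image samples, with y
    growing downward. The upper-left sample minimizes x + y and the
    lower-right sample maximizes it (an assumed, simple corner criterion).
    """
    if not samples:
        raise ValueError("at least one image sample is required")
    start = min(samples, key=lambda p: p[0] + p[1])
    end = max(samples, key=lambda p: p[0] + p[1])
    return start, end
```

A script with a different arrangement of characters would need a different corner criterion.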

More specifically, the boundary marker acquisition step (S3546a) may include a step (S3546a-1) in which the tracking controller 4400 controls the available resources to acquire a text image that includes the start marker and has a character recognition rate equal to or greater than a predetermined threshold.

Here, the character recognition rate refers to the rate at which a specific character in a character image is recognized as a certain text. For example, deep learning may estimate from a distant image that characters are present on an object, yet because the distant image is small or of low resolution, it may be difficult to determine which characters they are. In that case, the character recognition rate by OCR may be low.

In other words, a text image having a character recognition rate equal to or greater than the predetermined threshold is one with an OCR-capable resolution. To obtain an enlarged text image of sufficiently high resolution in this step (S3546a-1), the system of the present disclosure may control the zoom-in, zoom-out, and rotation of the photographing device 200 through the tracking controller 4400. For example, FIG. 11A illustrates, for a composition including both the paper document 1110 and the person 1120, the rotation control 1160 of the photographing device 200 that brings the paper document 1110 to the center of the image, and the zoom-in controls 1170 and 1180.
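The rotation and zoom-in controls can be illustrated with a short geometric sketch. It assumes a pinhole camera model with a known field of view, and it assumes that a minimum character height in pixels stands in for the OCR-capable resolution; the function names, parameters, and default values are hypothetical and are not part of the disclosure.

```python
import math


def centering_angles(bbox_center, image_size, hfov_deg, vfov_deg):
    """Pan/tilt angles in degrees that bring bbox_center to the image center.

    Positive pan turns right and positive tilt turns down (image y grows
    downward). Uses a pinhole model with the given fields of view.
    """
    cx, cy = bbox_center
    w, h = image_size
    fx = (w / 2) / math.tan(math.radians(hfov_deg) / 2)  # focal length, pixels
    fy = (h / 2) / math.tan(math.radians(vfov_deg) / 2)
    pan = math.degrees(math.atan((cx - w / 2) / fx))
    tilt = math.degrees(math.atan((cy - h / 2) / fy))
    return pan, tilt


def zoom_factor(char_height_px, min_char_height_px=20.0, max_zoom=10.0):
    """Zoom needed for characters to reach the assumed OCR-capable height."""
    if char_height_px <= 0:
        raise ValueError("character height must be positive")
    return min(max_zoom, max(1.0, min_char_height_px / char_height_px))
```

In this sketch, an object detected at the right edge of a frame with a 60 degree horizontal field of view calls for a pan of about 30 degrees, after which the zoom factor is chosen from the measured character height.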

Following step S3546a-1, the boundary marker acquisition step (S3546a) of this embodiment includes an end marker determination step (S3546a-2) of determining whether the acquired text image includes the end marker. If the end marker is included in the acquired text image, the text image is taken as the entire character display image. If the acquired text image does not include the end marker, the step further includes a step (S3546a-3) of searching for the marker following the last of the boundary markers displayed in the text image, having the tracking controller 4400 control the available resources to obtain an additional text image that includes that next marker and has a character recognition rate equal to or greater than the predetermined threshold, merging the additional text image into the text image, and then re-performing the end marker determination step (S3546a-2).
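The loop formed by steps S3546a-1 to S3546a-3 can be sketched as follows. Images are abstracted to the set of boundary markers they contain, and capture is a hypothetical callable standing in for the tracking controller 4400; neither the function names nor the max_steps safeguard come from the disclosure.

```python
def build_text_display_image(markers, capture, max_steps=50):
    """Collect images until the end marker is covered.

    markers: boundary markers in reading order (first = start, last = end).
    capture(marker): hypothetical stand-in for the tracking controller; it
        returns the markers visible in an image that contains `marker` at an
        OCR-capable resolution (at least `marker` itself).
    Returns the markers covered by the merged image, in reading order.
    """
    start, end = markers[0], markers[-1]
    covered = set(capture(start))                      # S3546a-1
    for _ in range(max_steps):
        if end in covered:                             # S3546a-2
            return [m for m in markers if m in covered]
        # S3546a-3: find the marker after the last covered one, capture an
        # additional image containing it, and merge it into the result.
        last_index = max(i for i, m in enumerate(markers) if m in covered)
        covered |= set(capture(markers[last_index + 1]))
    raise RuntimeError("end marker not reached within max_steps")
```

A real implementation would merge pixel data rather than marker sets; the sketch only shows the control flow of the repeated end marker determination.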

The control of available resources in step S3546a-3 includes control of the gimbal 300. For example, for books in Korean and English, the control of the gimbal 300 may be a control (e.g., rotation control 1190 in FIG. 11B) that helps the photographing device 200 scan the characters from left to right and from top to bottom of the region in which the characters are displayed.
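The left-to-right, top-to-bottom scan can be sketched as a generator of view centers. The function name raster_scan, the region tuple, and the step sizes are hypothetical; the units could be gimbal angles or image coordinates, which the disclosure does not specify.

```python
def raster_scan(region, step):
    """Yield view centers left to right within a row, rows top to bottom.

    region: (left, top, right, bottom) bounds of the text area.
    step: (dx, dy) distance between neighboring views; both must be positive.
    """
    left, top, right, bottom = region
    dx, dy = step
    if dx <= 0 or dy <= 0:
        raise ValueError("step sizes must be positive")
    y = top
    while y <= bottom:
        x = left
        while x <= right:
            yield (x, y)
            x += dx
        y += dy
```

Choosing steps smaller than the field of view leaves overlapping image patterns between neighboring views, which the merging step relies on.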

In addition, those skilled in the art will understand that various image stitching techniques may be used to merge text images in step S3546a-3. Referring to FIG. 11B, image merging may be performed using image patterns 1136a and 1136b that overlap each other between two or more images captured at temporally or spatially separated points.

Following the boundary marker acquisition step (S3546a), the OCR performing step (S3540a) of this embodiment may include a step (S3468a) of obtaining text by performing OCR on the entire character display image, which is a partial image of the image of the object on which the text is displayed and which includes both the start marker and the end marker.

Meanwhile, referring to FIGS. 10B and 11C, in the second embodiment of the OCR performing step (S3540), the OCR performing step (S3540b) starts with a step (S3542b) of extracting image samples from the object on which the characters are displayed, followed by a step (S3544b) of defining, from the image samples, a boundary line 1130 of the area in which the characters are displayed.

Next, the OCR performing step (S3540b) includes a step (S3546b) of acquiring, as a segmentation marker, at least one image pattern (1142a, 1142b) that is not located on the boundary line among the image samples, and performing OCR on the entire character display image, which is an image generated by merging the text images using the segmentation marker.

In step S3546b, the segmentation marker refers to image patterns 1142a and 1142b that overlap each other between two or more images captured at temporally or spatially separated points. By capturing the same segmentation marker in different images, the images can be merged using the segmentation marker, that is, image stitching can be performed.
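Merging by a shared segmentation marker can be illustrated with a one-dimensional sketch in which each image is reduced to a sequence of column patterns. The function name stitch and the exact-match criterion are assumptions; practical image stitching matches features approximately and in two dimensions.

```python
def stitch(left, right, min_overlap=1):
    """Merge two image strips that share an overlapping pattern.

    left, right: sequences of column patterns (any comparable values).
    The longest suffix of `left` that equals a prefix of `right` is treated
    as the shared segmentation marker and is kept only once.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if size > 0 and list(left[-size:]) == list(right[:size]):
            return list(left) + list(right[size:])
    raise ValueError("no overlapping marker found")
```

For example, strips "ABCDE" and "DEFGH" share the marker "DE" and merge into the columns of "ABCDEFGH".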

Finally, referring to FIGS. 10C and 11D, in the third embodiment of the OCR performing step (S3540), the OCR performing step (S3540c) includes a step (S3542c) in which the tracking controller 4400 controls the available resources to obtain an object image that includes the starting point of the entire character and has a character recognition rate equal to or greater than a predetermined threshold, that is, so that the object image has a predetermined OCR-capable resolution.

The policy of the tracking controller 4400 in performing step S3542c may impose not only the condition that the object image have a predetermined OCR-capable resolution, but also the condition that the object image include the maximum number of characters at that resolution.
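Such a policy can be sketched as a constrained selection among candidate views. The dictionary keys 'zoom', 'recognition_rate', and 'char_count', and the threshold value, are hypothetical; how the recognition rate and character count are measured is outside this sketch.

```python
def choose_view(candidates, min_recognition_rate=0.9):
    """Pick the OCR-capable view that shows the most characters.

    candidates: list of dicts with hypothetical keys 'zoom',
        'recognition_rate', and 'char_count'.
    Returns None when no candidate reaches the recognition threshold.
    """
    capable = [c for c in candidates
               if c["recognition_rate"] >= min_recognition_rate]
    if not capable:
        return None
    return max(capable, key=lambda c: c["char_count"])
```

Under this rule a moderately zoomed view that is still readable is preferred over a tighter view that shows fewer characters.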

Next, performing OCR ( S3540c ) further includes performing OCR on the first sentence region 1152 of the obtained object image ( S3544c ).

Here, the first sentence region refers to the region occupied by the sentence detected as being arranged first when considering the general character arrangement convention of the language used in the sentence.

For example, in the first execution of this step S3544c, the first sentence region may be determined by scanning for punctuation marks indicating the end point of a sentence, such as ",", ".", and "?". When such a punctuation mark is found, the first sentence region 1152 may be taken to extend from the start character to that punctuation mark.
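The punctuation scan can be sketched on a character sequence as follows. The name first_sentence is hypothetical, and the sketch works on already recognized characters, whereas the disclosure determines the region in the image.

```python
SENTENCE_END = {",", ".", "?"}  # the punctuation marks named in the text


def first_sentence(chars):
    """Return the characters from the start up to the first end-of-sentence mark.

    If no such mark is found, the whole sequence is returned.
    """
    end = next((i for i, c in enumerate(chars) if c in SENTENCE_END), None)
    return chars if end is None else chars[: end + 1]
```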

As another example, in the first execution of this step S3544c, the first sentence region 1152 may be determined by sequentially comparing the size of each subsequent character with the size of the start character. When the character size is found to change by more than a predetermined threshold, the region occupied by the sentence consisting of the consecutive characters from the start character up to the preceding character may be determined as the first sentence region.
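The size comparison can be sketched as a scan over character heights. The function name, the use of a relative change, and the default threshold of 0.3 are assumptions made for illustration.

```python
def first_region_by_size(heights, threshold=0.3):
    """Return how many leading characters belong to the first sentence region.

    heights: positive character heights in reading order. The region ends
    just before the first character whose height differs from the start
    character's height by more than `threshold` (relative change).
    """
    if not heights:
        return 0
    start = heights[0]
    for i, h in enumerate(heights[1:], start=1):
        if abs(h - start) / start > threshold:
            return i          # characters 0 .. i-1 form the first region
    return len(heights)
```

A title set in large type followed by body text in small type would therefore be separated at the first small character.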

As another example, in the first execution of this step S3544c, the first sentence region 1152 may be determined by sequentially comparing the spacing between each pair of adjacent characters, starting from the spacing between the start character and the next character. The region occupied by the sentence consisting of the consecutive characters from the start character until the spacing changes by more than a predetermined threshold may be determined as the first sentence region.

Next, the OCR performing step (S3540c) further includes a semantic value calculation step (S3546c) of calculating a semantic value, which is a numerical value quantifying the meaning of the first text, by performing natural language understanding (NLU) on the first text, which is the primary result of the OCR in step S3544c.

Next, the OCR performing step (S3540c) further includes a step (S3548c) in which, if the semantic value from step S3546c is greater than or equal to a predetermined threshold, the OCR is completed by taking the first text as the result text, and if the semantic value is less than the predetermined threshold, OCR is performed on the next sentence region 1154 following the first sentence region and the semantic value calculation step (S3546c) is re-performed.

The calculation of the semantic value may be cumulative. That is, natural language understanding need not be applied only to the single sentence on which OCR has just been performed; it may be applied again to the enlarged sentence region formed by combining the previously OCRed sentences with the new sentence.

To this end, in step S3548c, OCR may be performed sequentially and repeatedly on new sentence regions containing the sentences that follow the punctuation mark of the first sentence region.

Of course, if the semantic value does not reach the predetermined threshold even after OCR has been performed up to the last sentence, it may be determined that the entire character contains no text.
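The third embodiment as a whole can be sketched as a loop that accumulates OCR results until the semantic value reaches the threshold. The callables ocr and semantic_value are hypothetical stand-ins for the OCR model and the NLU scoring, and the threshold value is arbitrary; the disclosure does not define how the semantic value is computed.

```python
def ocr_until_meaningful(sentence_regions, ocr, semantic_value, threshold=0.5):
    """Accumulate OCR text region by region until it is meaningful enough.

    sentence_regions: sentence regions in reading order.
    ocr(region) -> str: hypothetical OCR call for one region.
    semantic_value(text) -> float: hypothetical NLU score of the text.
    Returns the accumulated text, or None if the threshold is never reached
    (treated as "no text" in the sense described above).
    """
    text = ""
    for region in sentence_regions:
        # Cumulative evaluation: score the enlarged text, not only the new part.
        text = (text + " " + ocr(region)).strip()
        if semantic_value(text) >= threshold:
            return text
    return None
```

Because the score is computed on the accumulated text, a fragment that is meaningless alone can still contribute once the following sentence is added.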

As such, by performing the above-described steps of the present disclosure, a computing device that processes an image obtained by photographing objects may grasp various information of objects detected in the image.

The components shown in the drawings are illustrated, for convenience of description, as being realized in one computing device, for example a portable terminal, but it will be understood that the computing device 100 performing the method of the present invention may be configured as a plurality of devices interworking with each other. Therefore, each step of the method of the present invention described above may be performed not only by the portable terminal but also by the gimbal 300 having a built-in communication unit and processor. It is also apparent that the steps may be performed with the support of other computing devices interworking with the computing device.

As described so far, throughout all embodiments and modifications thereof, the method and apparatus of the present disclosure can recognize and track one or more objects using an image and actively acquire information about objects and environments. In particular, it is possible to obtain the state information of an object from an image at a distance, to determine from an image the object with which the object interacts, and to obtain a higher-resolution detailed image of a part of the object from a distance. Because characters that are printed, or output by other means such as a display, can be read from a distance, there is the advantage that remote input using an image captured by a portable computing device is possible.

As described above, based on the description of various embodiments of the present disclosure, those skilled in the art will clearly understand that the methods and/or processes of the present invention, and the steps thereof, can be realized in hardware, software, or any combination of hardware and software suitable for a specific application. The hardware may include general-purpose computers and/or dedicated computing devices, or specific computing devices or special features or components of specific computing devices. The processes may be realized by one or more processors having internal and/or external memory, for example, microprocessors, controllers (e.g., microcontrollers, embedded microcontrollers), microcomputers, arithmetic logic units (ALUs), digital signal processors (e.g., programmable digital signal processors), or other programmable devices. Additionally or alternatively, the processes may be implemented using an application-specific integrated circuit (ASIC), a programmable gate array such as a field-programmable gate array (FPGA), a programmable logic unit (PLU), programmable array logic (PAL), any other device capable of executing and responding to instructions, or any other device or combination of devices that can be configured to process electronic signals. The processing device may execute an operating system (OS) and one or more software applications running on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For convenience of understanding, one processing device is sometimes described as being used, but one of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may comprise a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may command the processing device independently or collectively. The software and/or data may be embodied, permanently or temporarily, in any kind of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored in one or more machine-readable recording media.

Moreover, the objects of the technical solution of the present invention, or the parts contributing over the prior art, may be implemented in the form of program instructions that can be executed through various computer components and recorded in a machine-readable medium. The machine-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the machine-readable recording medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of the machine-readable recording medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs, DVDs, and Blu-ray discs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code and bytecode, which may be created using a structured programming language such as C, an object-oriented programming language such as C++, or a high-level or low-level programming language, and stored and compiled or interpreted for execution on any one of the devices described above, on heterogeneous combinations of processors or processor architectures, on combinations of different hardware and software, or on any other machine capable of executing program instructions, but also high-level language code that can be executed by a computer using an interpreter or the like.

Accordingly, in one aspect according to the present invention, when the methods and combinations thereof described above are performed by one or more computing devices, they may be implemented as executable code that performs the respective steps. In another aspect, the methods may be implemented as systems that perform the steps; the methods may be distributed in various ways across devices, or all functions may be integrated into one dedicated, standalone device or other hardware. In another aspect, the means for performing the steps associated with the processes described above may include any of the hardware and/or software described above. All such sequences and combinations are intended to fall within the scope of this disclosure.

For example, the hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa. The hardware device may include a processor, such as an MPU, CPU, GPU, or TPU, that is coupled with a memory such as ROM/RAM for storing program instructions and is configured to execute the instructions stored in the memory, and may include a communication unit capable of exchanging signals with an external device. In addition, the hardware device may include a keyboard, a mouse, and other external input devices for receiving instructions written by developers.

In the above, the present invention has been described with reference to specific matters such as specific components, limited embodiments, and drawings, but these are provided only to help a more general understanding of the present invention. The present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the present invention pertains can devise various modifications and variations from these descriptions.

Therefore, the spirit of the present invention should not be limited to the above-described embodiments, and not only the claims appended to the present disclosure but also all modifications equal or equivalent to those claims belong to the scope of the spirit of the present invention. For example, an appropriate result may be achieved even if the described techniques are performed in a different order from the described method, and/or the described components of the system, structure, apparatus, circuit, etc. are coupled or combined in a different form from the described method, or are replaced or substituted by other components or equivalents.

Such equal or equivalent modifications include, for example, logically equivalent methods capable of producing the same results as practicing the methods according to the present invention. The spirit and scope of the present invention should not be limited by the above examples, but should be understood in the broadest sense permitted by law.

As described above, the relevant content has been described in the best mode for carrying out the invention.

The present disclosure may be used in a system for processing an image.

Claims

1. A method of processing an image, performed by a computing device comprising a processor, the method comprising:

acquiring an image;

obtaining, from the image, analysis information corresponding to an object included in the image by using an object analysis model;

obtaining posture information about the object from the analysis information corresponding to the object by using a posture determination model; and

obtaining, from the posture information about the object, behavior information about the object by using a behavior discrimination model, the behavior information including an N-dimensional movement vector corresponding to the object.

2. The method of claim 1, wherein the analysis information includes at least one of classification information indicating a category of the object, location information indicating a location of the object, or importance information indicating a priority of the object in the image.

3. The method of claim 1, wherein the posture determination model determines, based on the analysis information corresponding to the object, a posture determination method to be applied to the object from among a plurality of posture determination methods.

4. The method of claim 1, wherein the posture determination model applies a different posture determination method according to classification information indicating a category of the object.

5. The method of claim 1, wherein the posture determination model generates, from the analysis information, posture information about the object that includes information about an N-dimensional posture corresponding to the object.

6. The method of claim 1, wherein the posture information about the object includes information about temporally continuous postures of the object.

7. The method of claim 1, wherein the movement vector includes a two-dimensional movement vector including at least one of a two-dimensional direction or velocity of the object.

8. The method of claim 1, wherein the movement vector includes a three-dimensional movement vector of the object calculated based on at least one of a position, speed, or acceleration of a photographing device that captured the image.

9. The method of claim 1, wherein the behavior discrimination model generates, from the posture information about the object, behavior information about the object with respect to a context of the object, the behavior information including an N-dimensional behavior classification corresponding to the object.

10. The method of claim 9, wherein the context of the object includes at least one of a state of the object or a purpose of an action of the object.

11. The method of claim 9, wherein the behavior classification is calculated by:

identifying a location of a partial object included in the object;

determining a posture of the object based on the location of the partial object; and

determining the behavior classification of the object based on the posture of the object.

12. The method of claim 11, wherein the partial object includes at least one of a part of the object or an attribute belonging to the object.

13. The method of claim 1, wherein the behavior discrimination model applies a different behavior determination method according to classification information indicating a category of the object.

14. A non-transitory computer-readable medium comprising a computer program, the computer program causing a computing device to perform a method for processing an image, the method comprising:

acquiring an image;

obtaining, from the image, analysis information corresponding to an object included in the image by using an object analysis model;

obtaining posture information about the object from the analysis information corresponding to the object by using a posture determination model; and

obtaining, from the posture information about the object, behavior information about the object by using a behavior discrimination model, the behavior information including an N-dimensional movement vector corresponding to the object.

15. A computing device comprising:

a processor; and

a communication unit,

wherein the processor is configured to:

acquire an image;

obtain, from the image, analysis information corresponding to an object included in the image by using an object analysis model;

obtain posture information about the object from the analysis information corresponding to the object by using a posture determination model; and

obtain, from the posture information about the object, behavior information about the object by using a behavior discrimination model, the behavior information including an N-dimensional movement vector corresponding to the object.