KR20200073967A - Method and apparatus for determining target object in image based on interactive input

Info

Publication number: KR20200073967A
Application number: KR1020190085650A
Authority: KR
Inventors: 이형욱; 왕 치앙; 장 차오
Original assignee: 삼성전자주식회사
Priority date: 2018-12-14
Filing date: 2019-07-16
Publication date: 2020-06-24
Also published as: CN111400523A

Abstract

Disclosed are a method for determining a target object in an image based on interactive input and an apparatus thereof. The method for determining a target object obtains first feature information corresponding to an image and second feature information corresponding to interactive input. The target object corresponding to the interactive input is determined from objects of the image in accordance with the first feature information and the second feature information.

Description

Method and apparatus for determining a target object from an image based on interactive input {METHOD AND APPARATUS FOR DETERMINING TARGET OBJECT IN IMAGE BASED ON INTERACTIVE INPUT}

The following embodiments relate to the field of human-machine interaction technology, and specifically to a method and apparatus for determining a target object in an image based on interactive input.

Determining a target object based on interactive input is an important research area in human-machine interaction technology. In one aspect, object detection technology is widely applied in the field of computer vision.

Target object detection is a computer vision technique that detects objects of a specific type in an image or a video (e.g., one frame of a video). Specifically, for an input image, it can provide a bounding box for each object included in the image, together with a corresponding object type tag.

In another aspect, in human-machine interaction, a computer can understand a user's interactive input in accordance with human communication habits. For example, for a voice input from the user, the computer first uses speech recognition technology to convert the user's voice command into text, which makes the command easier for the computer to understand. The computer can then extract the nouns of the user command through natural language processing techniques such as syntactic parsing. Human-machine interaction technology that determines a target object according to interactive input combines the two, so that the computer can, to a certain extent, understand which object the user is referring to (that is, determine what the target object is) and determine that object in a given image or video (one frame of the video).

FIGS. 1A to 1C are diagrams illustrating an example of determining a target object in an image according to the prior art.

Referring to FIGS. 1A to 1C, FIG. 1A shows a case in which the image contains only one "airplane". If the user says "airplane", the computer can understand which object the user is referring to and present a bounding box 110 corresponding to that object, as shown in FIG. 1B.

Such human-machine technology runs into difficulty when the scene contains multiple instances of the same type as the object the user refers to; object detection alone cannot distinguish the specific object the user means. For example, if the user says "the person riding a motorcycle" and there are many people in the scene, object detection cannot determine which instance (person) the user specifically refers to, and therefore cannot produce an accurate result.

For this type of problem, one prior-art solution is to display the detected instances (111, 112, 113) simultaneously, as shown in FIG. 1C, assign each a sequence number, and have the user select a specific number in order to determine the target object. However, this approach requires a separate confirmation and selection step, which reduces interaction efficiency. In addition, when the scene contains a relatively large number of instances (for example, a group photo of many people), the tags are packed too densely for the user to select from easily.

Another prior-art solution treats this type of problem as a fine-grained object detection problem and, when training the detection model, treats object attribute information as separate tags (for example, short man, person wearing glasses, red car). The drawback of this approach is that training the model requires a large amount of additional annotation. Moreover, in actual use, detection accuracy drops sharply for object types that never appeared in the training set.

A method of determining a target object in an image based on interactive input according to an embodiment includes: obtaining first feature information corresponding to the image and second feature information corresponding to the interactive input; and determining, according to the first feature information and the second feature information, a target object corresponding to the interactive input from among the objects included in the image.

Here, obtaining the first feature information corresponding to the image and the second feature information corresponding to the interactive input may include, when obtaining the first feature information corresponding to the image, obtaining semantic feature information between each object included in the image and at least one other object included in the image.

Here, obtaining the semantic feature information between each object included in the image and at least one other object included in the image may include obtaining that semantic feature information based on location information of each object included in the image.

Here, obtaining the semantic feature information between each object included in the image and at least one other object included in the image may include: determining at least one candidate region based on each object included in the image and at least one other object; obtaining classification feature information of the objects in the candidate region; obtaining region semantic feature information between the objects in the candidate region; and generating, based on the classification feature information and the region semantic feature information, the semantic feature information between each object included in the image and at least one other object.

Here, the method may further include, before generating the semantic feature information between each object included in the image and at least one other object, performing joint correction on the classification feature information and the region semantic feature information, based on the classification feature information and the region semantic feature information.

Here, the method may further include, before generating the semantic feature information between each object included in the image and at least one other object: determining a reference region according to the candidate region; obtaining region feature information of the reference region; and performing joint correction on the classification feature information, the region semantic feature information, and the region feature information, based on the classification feature information, the region semantic feature information, and the region feature information.

Here, the candidate region may include one of the objects included in the image and one of the at least one other object included in the image.

Here, the first feature information may include at least one of: global visual feature information corresponding to the image; visual feature information corresponding to each object included in the image; relative location information between the objects included in the image; relative size feature information between the objects included in the image; and semantic feature information between the objects included in the image.

Here, determining the target object corresponding to the interactive input from among the objects included in the image may further include performing fusion processing on the first feature information before determining the target object.

Here, the method of determining a target object in an image based on interactive input may further include: obtaining training data including a sample image; determining at least one candidate region based on each object included in the sample image and at least one other object included in the sample image; determining a reference region according to the candidate region and obtaining region feature information of the reference region; generating a region title according to the region feature information; and training a neural network model for obtaining semantic feature information between objects included in an image, using the training data with the region title as supervision.

Here, obtaining the first feature information corresponding to the image and the second feature information corresponding to the interactive input may include: performing word vector conversion on the interactive input; and obtaining the second feature information corresponding to the interactive input based on the word vectors.

Here, obtaining the first feature information corresponding to the image and the second feature information corresponding to the interactive input may further include, before performing the word vector conversion on the interactive input, determining whether a word of the interactive input belongs to a set of predefined first words. Performing the word vector conversion on the interactive input may then include, when the word of the interactive input belongs to the predefined first words, using the word vector of a second word that is highly similar to the word vector of the first word as the word vector corresponding to the first word.

Here, the first word may be a word whose frequency of use is lower than a first set value, and the second word may be a word whose frequency of use is higher than a second set value.
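The substitution of a low-frequency word by a similar high-frequency word can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`cosine`, `resolve_word_vector`), the use of cosine similarity as the similarity measure, and the toy vectors and frequency counts are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def resolve_word_vector(word, vectors, freq, first_value, second_value):
    """Return the word vector to use for `word`.

    A word whose frequency is below `first_value` is treated as a "first word"
    and is replaced by the vector of the most similar word whose frequency is
    above `second_value` (a "second word"). Other words keep their own vector.
    """
    own = vectors[word]
    if freq.get(word, 0) >= first_value:
        return own  # not a low-frequency word
    candidates = [w for w in vectors
                  if w != word and freq.get(w, 0) > second_value]
    if not candidates:
        return own  # no high-frequency word available; fall back
    best = max(candidates, key=lambda w: cosine(own, vectors[w]))
    return vectors[best]


# Hypothetical toy vocabulary: "moped" is rare, "motorcycle" is common.
vectors = {
    "motorcycle": [0.9, 0.1, 0.0],
    "moped": [0.8, 0.2, 0.1],
    "person": [0.0, 0.1, 0.9],
}
freq = {"motorcycle": 500, "moped": 3, "person": 900}
moped_vector = resolve_word_vector("moped", vectors, freq, 10, 100)
```

With these toy values, the rare word "moped" is represented by the vector of "motorcycle", while frequent words keep their own vectors.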

Here, the interactive input may include at least one of a voice input and a text input.

An apparatus for determining a target object in an image based on interactive input according to an embodiment includes: a feature acquiring unit that obtains first feature information corresponding to the image and second feature information corresponding to the interactive input; and a target determining unit that determines, according to the first feature information and the second feature information, a target object corresponding to the interactive input from among the objects included in the image.

Here, the feature acquiring unit may: determine at least one candidate region based on each object included in the image and at least one other object; obtain classification feature information of the objects in the candidate region; obtain region semantic feature information between the objects in the candidate region; determine a reference region according to the candidate region; obtain region feature information of the reference region; perform joint correction on the classification feature information, the region semantic feature information, and the region feature information, based on the classification feature information, the region semantic feature information, and the region feature information; and generate the semantic feature information between each object included in the image and at least one other object based on the corrected classification feature information, the corrected region semantic feature information, and the corrected region feature information.

Here, the feature acquiring unit may perform word vector conversion on the interactive input and obtain the second feature information corresponding to the interactive input based on the word vectors.

Here, when performing the word vector conversion on the interactive input, the feature acquiring unit may determine whether a word of the interactive input belongs to a set of predefined first words and, when it does, use the word vector of a second word that is highly similar to the word vector of the first word as the word vector corresponding to the first word. The first word may be a word whose frequency of use is lower than a first set value, and the second word may be a word whose frequency of use is higher than a second set value.

According to the technical solution of an embodiment, first feature information including semantic feature information between objects is obtained, and the first feature information is matched against second feature information corresponding to the interactive input in order to determine the target object in the image. This improves the ability of the human-machine interaction system to understand how the user describes the features of an object, and allows the system to determine the target object more accurately and more quickly. At the same time, mapping rarely seen (low-frequency) words to frequently seen (high-frequency) words with the same meaning improves the system's ability to handle low-frequency words, which also helps determine the target object more accurately and more quickly.

FIGS. 1A to 1C are diagrams illustrating an example of determining a target object in an image according to the prior art.
FIG. 2 is a flowchart illustrating a method of determining a target object in an image based on interactive input according to an embodiment.
FIG. 3 is a diagram illustrating an example of a process of determining a target object in an image based on interactive input according to an embodiment.
FIG. 4 is a diagram illustrating an example of a process for obtaining visual feature information of an object according to an embodiment.
FIG. 5 is a diagram illustrating an example of a process for obtaining semantic feature information between objects according to an embodiment.
FIG. 6 is a diagram illustrating another example of a method of determining a target object in an image based on interactive input according to an embodiment.
FIG. 7 is a diagram illustrating a distribution of word usage frequency.
FIG. 8 is a diagram illustrating an example application of the method of determining a target object in an image based on interactive input according to an embodiment.
FIG. 9 is a diagram illustrating the configuration of an apparatus for determining a target object in an image based on interactive input according to an embodiment.
FIG. 10 is a diagram illustrating an example of an apparatus for determining a target object in an image based on interactive input according to an embodiment.
In the accompanying drawings, the same or similar structures may be denoted by the same or similar reference numerals.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

A method of determining a target object in an image based on interactive input according to an embodiment mainly includes a step of obtaining feature information and a step of determining the target object. Specifically, first feature information corresponding to the image and second feature information corresponding to the interactive input are obtained, and the target object corresponding to the interactive input is determined from among the objects in the image according to the first feature information and the second feature information.

In a specific embodiment, the first feature information may include at least one of: visual feature information of the entire image (also called global visual feature information); visual feature information corresponding to each object in the image (also called visual feature information of a single object); relative location information and/or relative size feature information between each object in the image and at least one other object close to it; and semantic feature information between objects in the image.

Here, whether another object is close to a given object is defined based on the location information of each object in the image. For example, when the distance between an object and another object is smaller than a preset distance, the two objects may be defined as being close to each other.
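The proximity rule above can be sketched as follows. This is only an illustration: the text does not say how the distance between two objects is measured, so the sketch assumes the Euclidean distance between bounding-box centres, and the `(x, y, width, height)` box layout and the names `center` and `neighbours` are hypothetical.

```python
import math


def center(box):
    """Centre point of a box given as (x, y, width, height) (assumed layout)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def neighbours(boxes, preset_distance):
    """Map each object index to the indices of objects closer than `preset_distance`."""
    result = {i: [] for i in range(len(boxes))}
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            (xi, yi), (xj, yj) = center(boxes[i]), center(boxes[j])
            if math.hypot(xi - xj, yi - yj) < preset_distance:
                result[i].append(j)
                result[j].append(i)
    return result


# Hypothetical detections: two adjacent objects and one far away.
boxes = [(0, 0, 10, 10), (12, 0, 10, 10), (100, 100, 10, 10)]
near = neighbours(boxes, preset_distance=20.0)
```

Other distance definitions (for example, the gap between box edges) would fit the same rule; only the comparison against a preset distance comes from the text.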

In a specific embodiment, obtaining the first feature information corresponding to the image may include first obtaining, separately, the visual feature information of the complete image, the visual feature information of each single object, the relative location information and relative size feature information between each object in the image and at least one other object close to it, and the semantic feature information between objects in the image. Fusion processing is then performed on these pieces of information to obtain the first feature information corresponding to the image.
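As a rough sketch of the relative location and relative size features and of the fusion step, the following assumes a simple normalisation by the reference object's width and height and assumes that fusion is plain concatenation. Neither formula is given in the text; the names `relative_features` and `fuse` and the `(x, y, width, height)` box layout are hypothetical.

```python
def relative_features(box, other):
    """Relative location and size of `other` with respect to `box`.

    Boxes are (x, y, width, height). Offsets and sizes are normalised by the
    width and height of `box` (an assumed normalisation).
    """
    x, y, w, h = box
    ox, oy, ow, oh = other
    rel_location = [(ox - x) / w, (oy - y) / h]
    rel_size = [ow / w, oh / h]
    return rel_location, rel_size


def fuse(*parts):
    """Fuse several feature vectors by concatenation (assumed fusion operator)."""
    fused = []
    for part in parts:
        fused.extend(part)
    return fused


rel_location, rel_size = relative_features((0, 0, 10, 20), (5, 10, 20, 10))
# Hypothetical placeholder vectors for the visual and semantic features.
first_feature = fuse([0.3, 0.7], rel_location, rel_size, [0.1])
```

In a real system the fusion step would more likely be a learned layer over the concatenated vector; concatenation is used here only to keep the sketch self-contained.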

When the first feature information includes semantic feature information between objects in the image, the method of determining a target object in an image based on interactive input may follow the flowchart of FIG. 2.

FIG. 2 is a flowchart illustrating a method of determining a target object in an image based on interactive input according to an embodiment.

Referring to FIG. 2, in step 210, first feature information corresponding to the image and second feature information corresponding to the interactive input are obtained. Here, the first feature information may include semantic feature information between objects in the image.

Then, in step 220, the target object corresponding to the interactive input is determined from among the objects in the image according to the first feature information and the second feature information.
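Step 220 can be pictured as scoring each detected object against the interactive input and keeping the best match. The sketch below is an assumption-laden illustration: it supposes that each object's first feature information and the second feature information have already been projected into vectors of the same length, and it uses cosine similarity as the matching score. The names `cosine` and `pick_target` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pick_target(object_features, query_feature):
    """Return (index of best-matching object, list of scores).

    `object_features` holds one vector per detected object (first feature
    information); `query_feature` is the vector for the interactive input
    (second feature information).
    """
    scores = [cosine(f, query_feature) for f in object_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical feature vectors for three detected objects and one query.
object_features = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
target_index, scores = pick_target(object_features, [0.1, 0.9])
```

The bounding box of the object at `target_index` would then be reported as the target object.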

In a specific embodiment, when the first feature information includes semantic feature information between objects in the image, obtaining the first feature information corresponding to the image in step 210 may include obtaining semantic feature information between each object in the image and at least one other object close to it.

In a specific embodiment, the semantic feature information between each object in the image and at least one other object close to it, which is included in the first feature information corresponding to the image, is obtained as follows. First, at least one candidate region is determined based on each object in the image and at least one other object close to it. Next, classification feature information of the objects in the candidate region and region semantic feature information between the objects in the candidate region are obtained. Finally, the semantic feature information between each object in the image and at least one other object close to it is generated based on the classification feature information and the region semantic feature information.
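The first of these steps, forming a candidate region from an object and a nearby object, can be sketched as below. The text only requires that the candidate region be determined from the two objects; taking the smallest box that encloses both is an assumption, as are the `(x, y, width, height)` layout and the name `union_box`.

```python
def union_box(a, b):
    """Smallest axis-aligned box enclosing boxes `a` and `b`.

    Boxes are (x, y, width, height). Using the union as the candidate region
    for an object pair is an assumption, not a detail stated in the text.
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = min(ax, bx), min(ay, by)
    x2 = max(ax + aw, bx + bw)
    y2 = max(ay + ah, by + bh)
    return (x1, y1, x2 - x1, y2 - y1)


# Hypothetical pair: a person and the motorcycle next to them.
person = (0, 0, 10, 10)
motorcycle = (12, 5, 10, 10)
candidate_region = union_box(person, motorcycle)
```

The classification and region semantic features would then be computed on the image crop given by `candidate_region`.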

In another specific embodiment, to make the obtained semantic feature information between each object in the image and at least one other object close to it more accurate, joint correction may be performed on the classification feature information and the region semantic feature information, based on the obtained classification feature information and region semantic feature information, before the semantic feature information is generated. Here, joint correction means correcting a given piece of feature information by using other, related kinds of feature information.

Additionally, in another specific embodiment, before the semantic feature information between each object in the image and at least one other object is obtained, a reference region that contains the candidate region may be determined according to the candidate region, and region feature information of the reference region may be obtained. Joint correction may then be performed on the classification feature information, the region semantic feature information, and the region feature information, based on those three kinds of feature information.
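One simple way to obtain a reference region that contains the candidate region is to enlarge the candidate box by a margin and clip it to the image. The text states only the containment requirement; the margin-based enlargement, the `margin` parameter, and the name `reference_region` are hypothetical.

```python
def reference_region(candidate, image_size, margin=0.5):
    """Enlarge `candidate` by `margin` times its width/height on each side.

    `candidate` is (x, y, width, height); `image_size` is (width, height).
    The result is clipped to the image so it always contains the candidate
    region (assuming the candidate itself lies inside the image).
    """
    x, y, w, h = candidate
    img_w, img_h = image_size
    dx, dy = margin * w, margin * h
    x1, y1 = max(0, x - dx), max(0, y - dy)
    x2, y2 = min(img_w, x + w + dx), min(img_h, y + h + dy)
    return (x1, y1, x2 - x1, y2 - y1)


# Hypothetical candidate region inside a 100 x 100 image.
region = reference_region((10, 10, 20, 20), (100, 100), margin=0.5)
```

A larger margin gives the region feature more surrounding context at the cost of including unrelated objects.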

In addition, when training the neural network model that obtains the semantic feature information between each object in the image and at least one other object, a region title may be obtained using the reference region, and the model may be trained under supervision using the region title. This helps improve the quality of the model. A more detailed description follows, using specific examples.

In an embodiment, a basic network is used to extract the visual feature information of the complete image, the visual feature information of each single object, and the relative location information and relative size feature information between each object in the image and the other objects close to it. The semantic feature information may be extracted using a pair-wise visual relationship network (VRN). This network may be obtained through a dedicated structure and training, using the convolutional neural network of a basic network (for example, VGG-Net) as its basis. The name VRN is used only to distinguish this network from other basic networks and does not limit the embodiment.

With the method of determining a target object in an image according to an embodiment, a specific instance (for example, an object such as a person, an animal, or a thing) described in an interactive input (for example, a natural language input or a voice input) can be determined in an image or scene. Moreover, even when the image or scene contains multiple instances of the same type, the method improves the interaction capability of the human-machine interaction system so that the specific instance being referred to can be determined through semantic features related to the interactive input, such as attributes, location, and actions. Below, a specific example of determining a target object in an image based on interactive input is described, taking a human-machine interaction system with natural language input or voice input as an example.

FIG. 3 is a diagram illustrating an example process of determining a target object in an image based on interactive input, according to an embodiment.

Referring to FIG. 3, object detection 312 is first performed on the input image 310 (or one frame of a video). Object detection 312 may be used to detect all objects in the image and provides a bounding box for each object, including the object's position information and size information.

In this example, object detection may be performed using a Faster R-CNN network. However, other networks may be used to realize this function, and the embodiment is not limited to Faster R-CNN. Faster R-CNN is a known network that detects objects in an image and classifies them.

For each detected object in the object-detected image 320, visual feature extraction 322 is performed on the region of that object using a basic network. The same visual feature extraction 322 is also performed on the regions of other adjacent objects and on the entire image. As a result, the feature-extracted image 330 may include visual feature information 331 of a single object (each object), visual feature information 332 of the complete image, relative position information 333 between objects in the image, and/or relative size feature information 334 between objects in the image.

Here, the visual feature information of the complete image represents the overall visual characteristics of the image. Taking the input image 310 as an example, characteristics such as "park" or "daytime" may be visual feature information of the complete image.

The visual feature information of a single object represents the visual characteristics of each object included in the image. Taking the input image 310 as an example, it may be a bicycle, a person, a skateboard, a hat, a box, and so on, and the color, position, and the like of each object may also be included in the visual feature information.

The semantic feature information represents a meaningful relationship between an object and another object adjacent to it. Taking the input image 310 as an example, for a person riding a bicycle, the "riding" relationship between the bicycle object and the person object may be semantic feature information. In other words, the semantic feature information indicates by what relationship one object is connected to another.

In the above embodiment, the last layer of the third and fourth groups of convolutional layers of Faster R-CNN is used to extract the visual feature information of the object regions and of the entire image. Likewise, feature extraction may be performed using other layers of other networks (for example, VGG-16 or ResNet-101), and the embodiment is not limited to the above.

FIG. 4 is a diagram illustrating an example process of obtaining the visual feature information of an object, according to an embodiment.

Referring to FIG. 4, object detection 412 is performed on the input image 410 as described above to obtain the bounding box of each object in the object-detected image 420. Then, a feature layer of a basic network (for example, VGG-16 or ResNet-101) is used to extract (430) visual feature information 431, 432, and 433 from the current bounding box (corresponding to the region of the current object), the adjacent bounding boxes (corresponding to the regions of other adjacent objects), and the entire image. At the same time, the relative position information and relative size feature information of the current bounding box and the adjacent bounding boxes are treated as separate features and concatenated to the visual feature information. By performing this visual feature extraction on each object in the image in turn, visual feature information corresponding to each object (each bounding box) is obtained.

In this example, for the bounding boxes of the current object and of the adjacent objects (and/or the bounding box of a single object), the normalized top-left coordinates (x/W, y/H), the normalized width and height (w/W, h/H), and the normalized area ((w*h)/(W*H)) may also be treated as part of the visual feature information. Specifically, the top-left coordinates (x, y) and the width (w) and height (h) of the bounding box are divided by the width (W) and height (H) of the entire image, respectively, and the area of the bounding box is divided by the area of the entire image, yielding a 5-dimensional feature vector.
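The 5-dimensional vector described above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment; the names `Box` and `normalized_box_feature` and the sample numbers are assumptions introduced here.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Bounding box: top-left corner (x, y), width w, height h, in pixels."""
    x: float
    y: float
    w: float
    h: float


def normalized_box_feature(box: Box, image_w: float, image_h: float) -> Tuple[float, ...]:
    """Return (x/W, y/H, w/W, h/H, (w*h)/(W*H)) for one bounding box."""
    if image_w <= 0 or image_h <= 0:
        raise ValueError("image dimensions must be positive")
    return (
        box.x / image_w,                            # normalized left edge
        box.y / image_h,                            # normalized top edge
        box.w / image_w,                            # normalized width
        box.h / image_h,                            # normalized height
        (box.w * box.h) / (image_w * image_h),      # normalized area
    )


# Hypothetical 640x480 image with one 320x240 box.
feature = normalized_box_feature(Box(x=160, y=120, w=320, h=240), image_w=640, image_h=480)
print(feature)  # (0.25, 0.25, 0.5, 0.5, 0.25)
```

Because every component is a ratio to the image size, the vector is comparable across images of different resolutions.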

The position and size information of an object constructed in this way, together with the relative position and relative size information between the object and other adjacent objects, allows the target object to be determined when the input phrase contains descriptions such as left/right/top/bottom or largest/smallest/highest/lowest.

Additionally, at least one object adjacent to the current object is selected from its surroundings to form at least one object pair. Specifically, for each detected object, adjacent objects are selected in turn and paired with it. Here, "adjacent" has the meaning given above. When forming object pairs, typically the several closest objects are selected according to the positional relationship between objects. Adjacent objects tend to stand in interactive relationships such as carrying, placing, looking at, wearing, riding, or leaning on. The number of adjacent objects selected may be limited to a preset number (for example, 5).
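One possible way to form such pairs is sketched below. The text does not fix a distance measure, so the distance between box centers is an assumption made here, as are the function names `center` and `form_pairs` and the sample boxes.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h): top-left corner plus size


def center(box: Box) -> Tuple[float, float]:
    """Center point of a box."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def form_pairs(boxes: List[Box], max_neighbors: int = 5) -> Dict[int, List[int]]:
    """Map each object index to the indices of its closest neighbors.

    Closeness is measured between box centers (an assumption of this sketch),
    and at most max_neighbors partners are kept per object.
    """
    pairs: Dict[int, List[int]] = {}
    for i, box_i in enumerate(boxes):
        ci = center(box_i)
        others = [j for j in range(len(boxes)) if j != i]
        others.sort(key=lambda j: math.dist(ci, center(boxes[j])))
        pairs[i] = others[:max_neighbors]
    return pairs


# Three hypothetical boxes; keep only the single nearest partner for each.
boxes = [(0, 0, 10, 10), (12, 0, 10, 10), (100, 100, 10, 10)]
pairs = form_pairs(boxes, max_neighbors=1)
print(pairs)  # {0: [1], 1: [0], 2: [1]}
```

Each (object, neighbor) entry would then be one object pair handed to the relationship network.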

For each object pair thus formed, the visual relationship network (VRN) is used to extract the semantic feature information of the pair (that is, between the two objects).

FIG. 5 is a diagram illustrating an example process of obtaining the semantic feature information between objects, according to an embodiment.

Referring to FIG. 5, in this example a VGG-Net convolutional network may be used as the basic network 520 of the VRN. That is, VGG-Net (for example, VGG-16) performs feature extraction on the input image 510 to obtain shared features 530 of the image. Next, object pairs are selected from the object detection result, and candidate regions (region proposals) are generated based on the selected pairs. Each candidate region contains one object pair, that is, the current object and one of the other objects adjacent to it, which yields the combinations of objects. Three parallel branches 541, 542, and 543 then handle three different computer vision tasks. Specifically, based on the shared features, feature extraction is performed in three branches 541, 542, and 543 on the two object bounding boxes of the candidate region and on the bounding box corresponding to the candidate region, as follows.

(1) A region bounding box whose extent is larger than the bounding box corresponding to the candidate region (corresponding to the reference region) is selected, and region feature information is extracted for it. The purpose is the subsequent generation of a region caption.

(2) Classification feature information is extracted for the bounding box of each of the two objects. The purpose is the subsequent object classification.

(3) Region semantic feature information is extracted for the bounding box corresponding to the candidate region. The purpose is the subsequent identification of semantic features between the objects, for example inter-object relationships such as actions.
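The boxes fed to the three branches can be illustrated with a small sketch. The text does not say how the candidate region or the larger reference region is computed, so two assumptions are made here: the candidate region is the union box of the two objects, and the reference region is that box enlarged by a fixed margin and clipped to the image. The helper names and the numbers are hypothetical.

```python
from typing import Tuple

Corners = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corner format


def union_box(a: Corners, b: Corners) -> Corners:
    """Smallest box enclosing both a and b (assumed candidate region)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def expand_box(box: Corners, margin: float, image_w: float, image_h: float) -> Corners:
    """Grow a box by margin on every side, clipped to the image (assumed reference region)."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(image_w, x2 + margin), min(image_h, y2 + margin))


person = (100, 50, 200, 300)
hat = (120, 30, 180, 70)

candidate = union_box(person, hat)                               # input to branch (3)
reference = expand_box(candidate, 40, image_w=640, image_h=480)  # input to branch (1)
object_boxes = (person, hat)                                     # inputs to branch (2)

print(candidate)  # (100, 30, 200, 300)
print(reference)  # (60, 0, 240, 340)
```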

As shown in FIG. 5, before the semantic feature information between each object and the other objects adjacent to it is generated from the classification feature information, the region semantic feature information, and the region feature information, the method further includes constructing a dynamic graph based on these three kinds of feature information and refining them according to the dynamic graph (which may be called joint refinement).

The dynamic graph links the regions of interest of the different branches (which can be understood as different bounding boxes) through semantic and spatial relationships, and its content changes as the refinement proceeds. By passing messages between the branches, the features of the different branches are refined jointly so that they become mutually related, which allows the semantic feature information to be obtained more accurately.

Here, the region feature information corresponds to the feature information of the reference region that contains the candidate region. The region feature information can be used to refine the classification feature information and the region semantic feature information, which can improve the output accuracy of the neural network model. After refinement, the refined features are used to classify the objects, identify the semantic feature information between objects, and generate the region caption 570.

Specifically, the region caption 570 may be generated from the feature information of the refined branch 1 (561) through a long short-term memory (LSTM) network included in the basic network 520. For example, the caption describes the corresponding image region as "a person wearing a hat flies a kite in the park."

For image regions that are easy to understand, a region caption need not be generated. A scene graph 580 is generated from the features of the refined branch 2 (562) and the refined branch 3 (563). The scene graph 580 can be understood as a matrix of size N*N, where N is the number of objects detected in the image. Each row and each column of the matrix corresponds to one object, and each element of the matrix corresponds to the semantic feature information between two objects.

As shown in FIG. 5, each row of the scene graph 580 corresponds to one object; for example, the first row may correspond to the object "person". Each column of the scene graph 580 likewise corresponds to one object; for example, the first column may correspond to the object "hat", the second column to the object "kite", and the third column to the object "park".

The position where a row and a column of the scene graph 580 intersect (that is, the position of each matrix element) corresponds to the semantic feature information between the two objects. For example, the element at the intersection of the first row and the first column is "wears", expressing the semantic feature information "the person wears a hat" between the object "person" and the object "hat". Likewise, the element at the intersection of the first row and the second column is "flies", expressing "the person flies a kite" between "person" and "kite", and the element at the intersection of the first row and the third column is "is in", expressing "the person is in the park" between "person" and "park". The generated scene graph can thus clearly express the semantic feature information between the objects of the image.
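The matrix view of the scene graph can be sketched as a small data structure. This is only an illustration: it assumes that rows and columns share one object ordering and stores the relation as a plain string, whereas the network itself works with feature vectors. The helper names are hypothetical.

```python
from typing import List, Optional, Tuple

objects = ["person", "hat", "kite", "park"]
index = {name: i for i, name in enumerate(objects)}
n = len(objects)

# N x N matrix; entry [i][j] holds the relation from object i to object j, if any.
scene_graph: List[List[Optional[str]]] = [[None] * n for _ in range(n)]


def add_relation(subject: str, predicate: str, obj: str) -> None:
    """Store predicate at the intersection of the subject row and object column."""
    scene_graph[index[subject]][index[obj]] = predicate


def relations_of(subject: str) -> List[Tuple[str, str, str]]:
    """List all (subject, predicate, object) triples found in the subject's row."""
    row = scene_graph[index[subject]]
    return [(subject, pred, objects[j]) for j, pred in enumerate(row) if pred is not None]


add_relation("person", "wears", "hat")
add_relation("person", "flies", "kite")
add_relation("person", "is in", "park")

print(relations_of("person"))
# [('person', 'wears', 'hat'), ('person', 'flies', 'kite'), ('person', 'is in', 'park')]
```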

In addition, during online testing of the VRN, the output of the last fully connected layer before the scene graph 580 is generated from the features of the refined branch 2 (562) and the refined branch 3 (563) (that is, before the semantic feature information between each object and at least one other object is generated from the classification feature information, the region semantic feature information, and the region feature information) is extracted and used to describe the semantic feature information between two objects. Likewise, depending on requirements and test results, the outputs of different layers of the network may be used to represent the semantic feature information between objects.

In another embodiment, a training method for the VRN may be provided in which a region caption is obtained using a reference region caption and is used for supervised training of the model. In this training method, training data including sample images is first obtained, and at least one candidate region is determined based on each object of a sample image and at least one other object. A reference region is then determined from the candidate region, the region feature information of the reference region is obtained, and a region caption is generated from the region feature information. When the neural network model is trained, in addition to supervised training on the classification feature information and region semantic feature information of branch 1 and branch 3, the region captions are also used as supervised training data. During backpropagation, this helps update the network weights of branch 1 and branch 2, yielding a better object classification and relationship identification network, so that better classification and feature information can be extracted at the test stage.

Returning to FIG. 3, the method of determining the target object obtains one or more of the visual feature information 331 of a single object, the visual feature information 332 of the complete image, the relative position information 333 between objects, the relative size feature information 334 between objects, and the semantic feature information 335 between objects, and then performs feature fusion on the obtained information to obtain the first feature information 350.

In a specific embodiment, the fusion may concatenate the pieces of input information and then reduce the dimensionality through a fully connected (FC) layer 340; alternatively, each piece may first be reduced in dimension through the FC layer 340 and then concatenated. It is also possible to process part of the information first and then concatenate it with the remaining information before dimension reduction. Different fusion schemes may be used depending on design requirements and actual needs.
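The first fusion variant (concatenate, then reduce through an FC layer) can be sketched in plain Python. The feature values, the output dimension of 4, and the random weights standing in for trained FC parameters are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def concatenate(parts: Sequence[Sequence[float]]) -> List[float]:
    """Join several feature vectors end to end."""
    return [value for part in parts for value in part]


def fully_connected(x: Sequence[float],
                    weights: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> List[float]:
    """Apply y = W x + b; len(weights) is the reduced output dimension."""
    if len(weights) != len(bias) or any(len(row) != len(x) for row in weights):
        raise ValueError("shape mismatch between input, weights, and bias")
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


# Hypothetical per-object inputs (real ones would come from the networks).
object_visual = [0.2, 0.7, 0.1, 0.9]
image_visual = [0.5, 0.5, 0.3]
box_geometry = [0.25, 0.25, 0.5, 0.5, 0.25]
pair_semantic = [0.8, 0.1]

joined = concatenate([object_visual, image_visual, box_geometry, pair_semantic])

# Random weights stand in for trained FC parameters (assumption).
rng = random.Random(0)
out_dim = 4
weights = [[rng.uniform(-0.1, 0.1) for _ in joined] for _ in range(out_dim)]
bias = [0.0] * out_dim

first_feature = fully_connected(joined, weights, bias)
print(len(joined), "->", len(first_feature))  # 14 -> 4
```

The second variant would instead call `fully_connected` on each part separately and concatenate the reduced outputs.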

Here, dimension reduction means linking many objects with their adjacent objects and extracting them as a meaningful visual feature. For example, when the visual feature information of the single objects "bicycle" and "person" is linked with the semantic feature information "riding" to generate the first feature information "person riding a bicycle", the bicycle, the person, and the riding state are combined into one, and this may be called dimension reduction.

On the other side, the input speech is processed by converting it into text with a speech recognition tool and then encoding the entire sentence with an LSTM to obtain the language feature of the phrase, that is, the second feature information.

The process of matching the first feature information with the second feature information (visual-language matching) includes mapping each of them into a preset feature space (that is, an embedding space). For the first feature information, the first feature information of each object (that is, each bounding box) is mapped through an FC layer, yielding first feature information for each bounding box in the new feature space. The first feature information of any one bounding box together with the second feature information forms a feature pair, and the pair formed from the bounding box of the object designated by the user's language (that is, the target object) is a correlated language-visual feature pair. A correlated language-visual feature pair has a higher similarity in the mapping space than an uncorrelated pair. Accordingly, based on the similarity, the object with the highest similarity (or a group of objects whose similarity falls within a certain range) may be determined as the target object indicated by the user's speech.

Based on the similarity between the first feature information of each object (bounding box) and the second feature information of the input language, the single object with the highest similarity (the bounding box with the highest score) or a group of objects whose similarity falls within a certain range (a plurality of bounding boxes with relatively high scores) may be taken as the final output.
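The scoring and selection step can be sketched as follows. The text does not name the similarity measure used in the embedding space, so cosine similarity is an assumption here, as are the function names, the two-dimensional embeddings, and the threshold value.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_objects(visual: Sequence[Sequence[float]],
                 language: Sequence[float]) -> List[Tuple[int, float]]:
    """Score every bounding box against the sentence; best match first."""
    scores = [(i, cosine_similarity(v, language)) for i, v in enumerate(visual)]
    return sorted(scores, key=lambda item: item[1], reverse=True)


def select_targets(ranked: List[Tuple[int, float]],
                   min_score: Optional[float] = None) -> List[int]:
    """Return the top box, or every box scoring at least min_score."""
    if not ranked:
        return []
    if min_score is None:
        return [ranked[0][0]]
    return [i for i, score in ranked if score >= min_score]


# Hypothetical embeddings: three boxes and one sentence in a shared 2-D space.
visual_embeddings = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
language_embedding = [0.0, 1.0]

ranked = rank_objects(visual_embeddings, language_embedding)
print(select_targets(ranked))                 # [2]
print(select_targets(ranked, min_score=0.5))  # [2, 1]
```

Returning several candidates with `min_score` corresponds to the case where a group of high-scoring bounding boxes is output.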

In another embodiment, several objects with the highest similarity may be selected as needed and presented to the user for selection.

In the prior art, visual feature extraction for objects in an image extracted visual features for a bounding box or for the entire image, plus only the position and/or size information of each object. Features of this type contain only the actual positions and/or sizes of objects and do not contain relatively high-level semantic information such as relationships between objects. Such features can handle expressions like "the tallest room" or "the second plate from the left". However, consider an image with two people, one holding a box and one sitting on a box, or a person riding a bicycle versus a person hit by a bicycle, or a person wearing a hat versus a person holding a hat. If the system cannot understand the high-level semantic information between objects, then when the user inputs "the person sitting on the box" or "the person stepping on the box", it cannot determine which person the user is referring to.

According to the embodiment, image features corresponding to relatively high-level language information can be obtained, for example actions or relationships between different objects such as riding, holding, facing, and kicking. The embodiment can understand relationships between objects, for example person-rides-car, person-hits-car, and person-gets out of-car. By identifying the relationships between objects, visual information and language information can be matched more accurately.

The embodiment discloses a system that fuses the visual feature information of objects with the semantic feature information between objects. This solves the problem that the prior art, relying on position and size information, cannot distinguish the object the user is referring to, and it helps improve the interactive performance of the human-machine interactive system.

According to the embodiment, when objects are distinguished using the semantic feature information between two objects (for example, the relationship between the objects), the target object can be determined more accurately.

Existing human-machine interactive systems may also face the following problem. When the interactive input is speech, different people (users) do not describe the same object in exactly the same way. For example, suppose an image contains several plates, one of which has a pineapple on it, and most users the system has encountered so far call it a "pineapple". A new user with different language habits may use a different word for the pineapple. If that user asks the system for "the plate with the pineapple on it" using that word, the system will not understand which object the word refers to.

In practical applications of human-machine interactive systems, different people have different language habits, so the probabilities with which different words occur differ greatly. As a result, the LSTM model cannot learn good feature representations for uncommon words and fails to understand the user's sentence.

The following embodiment proposes a solution to the above problem.

FIG. 6 is a diagram illustrating another example of a method of determining a target object in an image based on interactive input, according to an embodiment.

Referring to FIG. 6, the method of determining a target object in an image based on interactive input can be broadly divided into obtaining first feature information (610), obtaining second feature information (620), and matching the first feature information with the second feature information (630).

Looking at step 610 of FIG. 6 in more detail, the method receives an image (or one frame of a video) in step 611, detects the objects in the received image in step 612, and performs visual feature extraction on the bounding boxes of the detected objects in step 613. In step 614 it obtains the visual feature information of a single object (each object), the visual feature information of the complete image, the relative position information between objects in the image, and/or the size feature information between objects in the image, and in step 615 it obtains the relative size feature information between objects.

Then, in step 616, the method obtains one or more of the visual feature information of a single object, the visual feature information of the complete image, the relative position information between objects, the relative size feature information between objects, and the semantic feature information between objects, and performs feature fusion on the obtained information to obtain the first feature information. A more detailed description of obtaining the first feature information is given above with reference to FIGS. 3 to 5.

Looking at step 620 of FIG. 6 in more detail, the method converts the input voice information into text in step 621 and converts each word of the input phrase into a word vector in step 622. In step 623, the method determines whether an input word is a word whose usage frequency is lower than a preset value and, if so, replaces it with a high-frequency word of similar meaning. In step 624, the method obtains high-frequency word vectors for each word of the entire input voice information, and in step 625 it processes the entire input sentence to obtain the second feature information (that is, the language feature).

Here, replacing a low-frequency word with a high-frequency word of similar meaning in step 623 may mean searching for the word vector of a high-frequency word whose cosine similarity to the word vector of the low-frequency word is equal to or greater than a preset value, and replacing the word vector of the low-frequency word with the found word vector of the high-frequency word.
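As a non-limiting illustration, the replacement of step 623 can be sketched as follows. The function names, the toy three-dimensional vectors, and the threshold of 0.8 are assumptions made only for this sketch; the description states only that a high-frequency word whose cosine similarity is at or above a preset value is searched for and its word vector is used instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def replace_low_frequency_vector(word, vectors, high_frequency_words,
                                 min_similarity=0.8):
    """Return (word, vector) to use in place of a low-frequency word.

    The most similar high-frequency word is chosen, provided its cosine
    similarity reaches min_similarity (a hypothetical preset value).
    If no candidate qualifies, the original word and vector are kept.
    """
    query = vectors[word]
    best_word, best_score = None, min_similarity
    for candidate in high_frequency_words:
        if candidate == word:
            continue
        score = cosine_similarity(query, vectors[candidate])
        if score >= best_score:
            best_word, best_score = candidate, score
    if best_word is None:
        return word, query
    return best_word, vectors[best_word]


# Toy word vectors, invented for illustration only.
TOY_VECTORS = {
    "canoe": [0.9, 0.1, 0.0],
    "boat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
}
```

With these toy vectors, "canoe" is mapped to "boat" at the default threshold, while a stricter threshold leaves the word unchanged.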

FIG. 7 is a diagram showing the distribution of word usage frequency.

Referring to FIG. 7, the words people use when describing objects show a very pronounced long-tail effect. Taking the RefCOCO+ data set as an example, the data set contains 2627 distinct words in total (words are plotted along the horizontal axis, but the coordinate of each individual word is omitted). The 10 most frequent words appear 13000 times on average (they are concentrated near the origin, as indicated by the dotted frame in FIG. 7), whereas more than half of the words (1381) appear fewer than 20 times. In the embodiments below, a word whose usage frequency is lower than a first set value is defined as a first word, and a word whose usage frequency is higher than a second set value is defined as a second word.
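As a non-limiting illustration, first words and second words can be derived from a corpus of referring phrases by counting word occurrences and comparing the counts with the two set values. The function name, the whitespace tokenization, and the toy phrases are assumptions made only for this sketch.

```python
from collections import Counter


def split_by_frequency(phrases, first_set_value, second_set_value):
    """Count word usage and split the vocabulary into first and second words.

    first words:  usage frequency lower than first_set_value
    second words: usage frequency higher than second_set_value
    Tokenization by lowercasing and whitespace splitting is a simplification.
    """
    counts = Counter(
        word for phrase in phrases for word in phrase.lower().split()
    )
    first_words = {w for w, c in counts.items() if c < first_set_value}
    second_words = {w for w, c in counts.items() if c > second_set_value}
    return counts, first_words, second_words


# Toy phrases, invented for illustration only.
TOY_PHRASES = [
    "small boat on the left",
    "the boat in the middle",
    "canoe in the middle",
]
```

On a real data set the two set values would be chosen from the observed distribution, for example so that words in the long tail fall into the first set.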

Because first words are used relatively infrequently, the number of samples becomes highly unbalanced. Consequently, when the model is trained, the LSTM module cannot learn a good feature representation for words that appear very few times, which affects extraction of the second feature information (that is, the semantic feature) of the sentence. For example, for a scene in which the image contains a dugout canoe (a small boat made by hollowing out a single log), if the user says "the canoe in the middle" to the system, the model cannot properly associate the semantic feature of the word "canoe" with the visual feature of the corresponding image region, because the word appears relatively rarely in the training set. The system therefore cannot understand which object the user refers to. If "canoe" is replaced with a word that appears relatively often and has a similar meaning (for example, "small boat"), the system can "understand" the object the user refers to and output the correct result. Table 1 presents a series of word replacement examples.

Table 1
Existing word | Mapping word
refrigerator | fridge
totally | completely
puppy | dog
grin | smile
blackest | darkest
loaf | bread
wallet | purse

In this embodiment, extraction of the second feature information (that is, the language feature) can be realized through functional modules such as a language identification unit (or identification unit), a word vector unit, a word determination unit, a word replacement unit, and a feature extraction unit.

First, the voice identification unit converts the input voice information into text. Next, the word vector unit converts each word of the input phrase into a word vector. The word determination unit then determines whether each input word is a first word; if it is, the word replacement unit selects a second word of similar meaning to replace that first word. The feature extraction unit is an LSTM language encoder that takes the words one by one in order and encodes the entire sentence, thereby obtaining the second feature information (that is, the language feature).

Note that after the word vector unit converts each word of the input phrase into a word vector, the system stores each word of the input phrase together with its word vector, and when determining whether a word is a first word, the word determination unit uses the stored word, not the word vector.
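As a non-limiting illustration, the cooperation of the word vector unit, the word determination unit, and the word replacement unit can be sketched as follows. Each token keeps its word together with its word vector, and the first-word check is made on the stored word. The `Token` class, the function name, and the use of a fixed lookup table in place of a similarity search are assumptions made only for this sketch; the word pairs are taken from Table 1. The LSTM encoder itself is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Token:
    word: str            # stored word, used for the first-word check
    vector: List[float]  # word vector, used later by the encoder


# Replacement pairs taken from Table 1 (existing word -> mapping word).
TABLE_1_MAPPING: Dict[str, str] = {
    "refrigerator": "fridge",
    "totally": "completely",
    "puppy": "dog",
    "grin": "smile",
    "blackest": "darkest",
    "loaf": "bread",
    "wallet": "purse",
}


def prepare_tokens(sentence: str,
                   vectors: Dict[str, List[float]],
                   first_words: Set[str],
                   mapping: Dict[str, str]) -> List[Token]:
    """Convert a sentence to tokens, replacing first words with second words.

    Every word is assumed to have an entry in `vectors`; handling of
    unknown words is outside the scope of this sketch.
    """
    tokens = [Token(w, vectors[w]) for w in sentence.lower().split()]
    prepared = []
    for token in tokens:
        # The check uses the stored word, not the word vector.
        if token.word in first_words and token.word in mapping:
            replacement = mapping[token.word]
            prepared.append(Token(replacement, vectors[replacement]))
        else:
            prepared.append(token)
    return prepared
```

The resulting token vectors would then be fed, one by one and in order, to the language encoder to obtain the second feature information.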

With the replacement method proposed in the embodiment, which is based on words whose word vectors are close in meaning, a low-frequency sample (one that uses a relatively infrequent first word) from which good features cannot be extracted can be replaced with a high-frequency sample (one that uses a relatively frequent second word) from which good features can be extracted, and the replacement leaves the meaning of the original sentence essentially unchanged. For example, when the user's description is "black shirt with pinkish emblem", word replacement can convert it into "black shirt with reddish logo", which has essentially the same meaning as the original input.

Here, the interactive input of each of the above embodiments is not limited to natural-language input; the interactive input may also be text that the user enters directly in written form.

Accordingly, as shown in FIG. 6, when the first feature information is extracted, the complete information and/or the size feature information and the language feature information between objects are each obtained. This embodiment can therefore address both the lack of high-level semantic features that capture relationships between objects, mentioned above, and the inability to extract good language features when the input uses uncommon, low-frequency words or words that never appeared in the training set.

Applying the VRN proposed in the embodiment to a pair of objects in the image identifies the semantic feature information between the two objects (the relationship between them), which yields better features, and replacing relatively infrequent words with relatively frequent words of similar meaning yields better language features. Together, these allow the human-machine interactive system to find the object that the user describes in language more accurately and more quickly.

In addition, in the embodiment, replacing relatively infrequent first words with relatively frequent second words can by itself realize a human-machine interactive scheme, and may also be used on its own.

In a specific embodiment, on the visual side, object detection is performed on the input image to obtain the bounding box of each object. A base network (VGG-16, ResNet-101, or the like) is used to extract, from a specific layer, visual feature information for the bounding box of the current object and for the entire image. The position and size information of the current object's bounding box, and the relative position and relative size information between the current object's bounding box and the bounding boxes of neighboring objects, are treated as additional features and concatenated to the visual feature information, yielding the visual feature information corresponding to each bounding box. VGG-16 and ResNet-101 are existing networks for large-scale image recognition.
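As a non-limiting illustration, appending bounding-box geometry to a visual feature vector can be sketched as follows. The description does not specify how position, size, relative position, and relative size are encoded, so the normalizations below (corner coordinates and area normalized by the image size, center offsets and size ratios normalized by the current box) are assumptions made only for this sketch, as are the function names. Boxes are given as (x1, y1, x2, y2) with non-zero width and height.

```python
def box_geometry(box, image_width, image_height):
    """Position and size of one box, normalized by the image size (assumed encoding)."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    return [
        x1 / image_width,
        y1 / image_height,
        x2 / image_width,
        y2 / image_height,
        area / (image_width * image_height),
    ]


def relative_geometry(box, other):
    """Relative position and relative size of `other` with respect to `box`."""
    x1, y1, x2, y2 = box
    ox1, oy1, ox2, oy2 = other
    w, h = x2 - x1, y2 - y1
    ow, oh = ox2 - ox1, oy2 - oy1
    cx, cy = x1 + w / 2, y1 + h / 2
    ocx, ocy = ox1 + ow / 2, oy1 + oh / 2
    return [(ocx - cx) / w, (ocy - cy) / h, ow / w, oh / h]


def box_feature(visual, box, neighbors, image_size):
    """Concatenate a visual feature vector with geometric features of the box."""
    image_width, image_height = image_size
    feature = list(visual) + box_geometry(box, image_width, image_height)
    for other in neighbors:
        feature += relative_geometry(box, other)
    return feature
```

In practice the `visual` vector would come from a layer of the base network, and the number of neighboring boxes would be fixed so that every object yields a feature of the same length.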

Furthermore, extraction of the semantic feature information between objects may be omitted, and only the processing that replaces any first word in the interactive input with a second word may be performed. In this case, for simple human-machine interactive scenes where the interactive input does not call for high-level semantic understanding, removing this part of the system lowers system cost and balances processing speed against processing accuracy. More detailed operations can be understood by referring to the embodiments described above and are not described again here.

Each embodiment can be applied to a variety of human-machine interactive scenes, and it can be understood that object information in a scene that people describe using language plays a very important role in a human-machine interactive system. The user can select any object in the scene without using hands. If the image contains other instances of the same type as that object, a method based on image classification cannot accurately determine the object the user describes, or it requires a separate confirmation/selection step, which degrades human-machine interactive performance. The embodiments can solve this type of problem and complete the process quickly and accurately. When the target object is determined based on similarity, objects of a particular type can be detected as needed.

The method of determining a target object in an image provided in the embodiments can be widely applied to cases in which the user designates an object through interactive input (for example, voice input or text input), including but not limited to price, evaluation, translation, encyclopedia, and navigation. It can be widely used in systems such as augmented-reality head-up display (AR HUD) systems, augmented-reality glasses (AR glasses), and smart home systems.

For example, given an image containing many billboards, the user may ask the system to translate into English the text on the billboard to the right of the XX sign. Although the scene contains many billboards, the system can use the first feature information of the image to determine the specific target object the user refers to, and once the target object is determined, it can translate the billboard the user indicated using character recognition and machine translation.

As another example, the user asks the system who the person with the shortest hair in the image is. Using the first feature information of the image, the system can determine the specific target object the user refers to, and can then identify the face of that target object with a face recognition system to answer the user's question.

As another example, the user asks the system how much the shoes worn by the person jumping on the right cost. The system uses the first feature information of the image (including semantic features between objects, for example, jumping) to determine the specific target object the user refers to, and can then combine this with techniques such as image search using the image of the target object's shoes to obtain price information for the product.

FIG. 8 is a diagram illustrating an example application of the method of determining a target object in an image based on interactive input, according to an embodiment.

Referring to FIG. 8, information such as a man wearing jeans, a white laptop computer on a desk, and a woman wearing a black shirt can be identified. This significantly improves the inputs that the system can identify.

In terms of performance, the trained target-object determination system was run and tested on the public RefCOCO+ data set. The data set contains more than 17000 images, 42000 referred objects, and 120000 phrases describing objects. Performance tests were performed on the validation set, test set A, and test set B of the data set. Compared with existing methods, the embodiment showed a clear improvement on the different test sets, with performance improving by 1.5% or more on test set A. Here, RefCOCO+ is an existing (prior-art) data set generated from collected images.

In another embodiment, an apparatus for determining a target object in an image based on interactive input is further provided.

FIG. 9 is a diagram illustrating the configuration of an apparatus for determining a target object in an image based on interactive input, according to an embodiment.

Referring to FIG. 9, an apparatus 900 for determining a target object in an image based on interactive input includes a feature acquisition unit 910 and a target determination unit 920. The feature acquisition unit 910 obtains first feature information corresponding to the image and second feature information corresponding to the interactive input. The target determination unit 920 determines, from among the objects of the image, the target object corresponding to the interactive input according to the first feature information and the second feature information. More detailed operations of the feature acquisition unit 910 and the target determination unit 920 can be understood by referring to the embodiments described above and are not described again here.
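As a non-limiting illustration, the division of work between a feature acquisition unit and a target determination unit can be sketched as follows. The class name, the callable interface, and the use of cosine similarity between per-object first feature vectors and a single second feature vector are assumptions made only for this sketch; it further assumes that both kinds of feature have already been projected into a common space of equal dimension.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


class TargetObjectApparatus:
    """Hypothetical stand-in for an apparatus with units 910 and 920."""

    def __init__(
        self,
        acquire_features: Callable[[object, str], Tuple[List[Vector], Vector]],
    ) -> None:
        # Plays the role of the feature acquisition unit: returns one first
        # feature vector per detected object and one second feature vector.
        self.acquire_features = acquire_features

    def determine_target(self, image: object, interactive_input: str) -> int:
        """Play the role of the target determination unit.

        Returns the index of the object whose first feature is most
        similar to the second feature.
        """
        first_features, second_feature = self.acquire_features(
            image, interactive_input
        )
        if not first_features:
            raise ValueError("no objects were detected in the image")
        scores = [cosine_similarity(f, second_feature) for f in first_features]
        return max(range(len(scores)), key=scores.__getitem__)
```

A real feature acquisition unit would wrap the object detector, the visual feature extractor, and the language encoder described above; here it is passed in as a callable so that the sketch stays self-contained.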

FIG. 10 is a diagram illustrating an example of an apparatus for determining a target object in an image based on interactive input, according to an embodiment.

Referring to FIG. 10, an apparatus 1000 according to an embodiment includes a processor 1010, for example a digital signal processor (DSP). The processor 1010 may be a single device or a plurality of devices for executing the different operations according to the embodiments. The apparatus 1000 may further include an input/output (I/O) device 1030, which is used to receive signals from other devices (or persons) or to send signals to other devices (or persons).

In addition, the apparatus 1000 includes a memory 1020, which may take the form of non-volatile or volatile memory, for example electrically erasable programmable read-only memory (EEPROM) or flash memory. The memory 1020 stores computer-readable instructions, and when the processor 1010 executes those instructions, they cause the processor to carry out the invention according to the embodiments.

The method according to the embodiments may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known to and usable by those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art can apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or if components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

A method of determining a target object in an image based on interactive input, the method comprising:
obtaining first feature information corresponding to an image and second feature information corresponding to an interactive input; and
determining, according to the first feature information and the second feature information, a target object corresponding to the interactive input from among objects included in the image.

The method of claim 1,
wherein the obtaining of the first feature information corresponding to the image and the second feature information corresponding to the interactive input comprises,
when obtaining the first feature information corresponding to the image,
obtaining semantic feature information between each object included in the image and at least one other object included in the image.

The method of claim 2,
wherein the obtaining of the semantic feature information between each object included in the image and the at least one other object included in the image comprises
obtaining the semantic feature information between each object included in the image and the at least one other object based on position information of each object included in the image.

The method of claim 2,
wherein the obtaining of the semantic feature information between each object included in the image and the at least one other object included in the image comprises:
determining at least one candidate region based on each object included in the image and the at least one other object;
obtaining classification feature information of objects in the candidate region;
obtaining region semantic feature information between the objects in the candidate region; and
generating the semantic feature information between each object included in the image and the at least one other object based on the classification feature information and the region semantic feature information.

The method of claim 4, further comprising,
before the generating of the semantic feature information between each object included in the image and the at least one other object,
performing joint correction on the classification feature information and the region semantic feature information based on the classification feature information and the region semantic feature information.

The method of claim 4, further comprising,
before the generating of the semantic feature information between each object included in the image and the at least one other object:
determining a reference region according to the candidate region;
obtaining region feature information of the reference region; and
performing joint correction on the classification feature information, the region semantic feature information, and the region feature information based on the classification feature information, the region semantic feature information, and the region feature information.

The method of claim 4,
wherein the candidate region includes
one of the objects included in the image and one of the at least one other object included in the image.

The method of claim 1,
wherein the first feature information includes at least one of:
global visual feature information corresponding to the image;
visual feature information corresponding to each object included in the image;
relative position information between objects included in the image;
relative size feature information between objects included in the image; and
semantic feature information between objects included in the image.

The method of claim 8,
wherein the determining of the target object corresponding to the interactive input from among the objects included in the image further comprises
performing fusion processing on the first feature information before determining the target object.

The method of claim 1, further comprising:
obtaining training data including a sample image;
determining at least one candidate region based on each object included in the sample image and at least one other object included in the sample image;
determining a reference region according to the candidate region, and obtaining region feature information of the reference region;
generating a region title according to the region feature information; and
performing training, using the region title as supervised training data, on a neural network model for obtaining semantic feature information between objects included in an image.

The method of claim 1,
wherein the obtaining of the first feature information corresponding to the image and the second feature information corresponding to the interactive input comprises:
performing word vector conversion on the interactive input; and
obtaining the second feature information corresponding to the interactive input based on the word vectors.

The method of claim 11,
wherein the obtaining of the first feature information corresponding to the image and the second feature information corresponding to the interactive input further comprises,
before the performing of the word vector conversion on the interactive input,
determining whether a word of the interactive input belongs to preset first words, and
wherein the performing of the word vector conversion on the interactive input comprises,
when the word of the interactive input belongs to the preset first words, using the word vector of a second word having high similarity to the word vector of the first word as the word vector corresponding to the first word.

The method of claim 12,
wherein the first word is
a word whose usage frequency is lower than a first set value, and
the second word is
a word whose usage frequency is higher than a second set value.

The method of claim 1,
wherein the interactive input includes
at least one of voice input and text input.

An apparatus for determining a target object in an image based on interactive input, the apparatus comprising:
a feature acquisition unit configured to obtain first feature information corresponding to an image and second feature information corresponding to an interactive input; and
a target determination unit configured to determine, according to the first feature information and the second feature information, a target object corresponding to the interactive input from among objects included in the image.

The device of claim 15,
The first feature information includes at least one of
Global visual feature information corresponding to the image,
Visual feature information corresponding to each object included in the image,
Relative location information between objects included in the image,
Relative size feature information between objects included in the image, and
Semantic feature information between objects included in the image,
A device for determining a target object in an image based on interactive input.
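
Two of the listed items, relative location and relative size between objects, can be illustrated from bounding boxes alone. The sketch below assumes boxes given as `(x_min, y_min, x_max, y_max)` and one possible normalization (center offsets and size ratios relative to the first box). The function name `relative_features` and this particular encoding are illustrative assumptions, not the encoding defined by the claims.

```python
def relative_features(box_a, box_b):
    """Describe box_b relative to box_a.

    Boxes are (x_min, y_min, x_max, y_max). Offsets are center-to-center
    distances divided by box_a's width and height; ratios compare sizes.
    """
    a_w, a_h = box_a[2] - box_a[0], box_a[3] - box_a[1]
    b_w, b_h = box_b[2] - box_b[0], box_b[3] - box_b[1]
    a_cx, a_cy = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    b_cx, b_cy = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return {
        "dx": (b_cx - a_cx) / a_w,      # relative horizontal offset
        "dy": (b_cy - a_cy) / a_h,      # relative vertical offset
        "width_ratio": b_w / a_w,       # relative width
        "height_ratio": b_h / a_h,      # relative height
    }

# Toy boxes: box_b sits to the right of box_a and is half as tall.
feats = relative_features((0, 0, 10, 10), (10, 0, 20, 5))
```

For these toy boxes the result is `dx = 1.0`, `dy = -0.25`, `width_ratio = 1.0`, and `height_ratio = 0.5`.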

The device of claim 16,
The feature acquisition unit
Determines at least one candidate region based on each object included in the image and at least one other object,
Acquires classification feature information of an object in the candidate region,
Acquires region semantic feature information between objects in the candidate region,
Determines a reference region according to the candidate region,
Acquires region feature information of the reference region,
Performs a joint correction on the classification feature information, the region semantic feature information, and the region feature information based on the classification feature information, the region semantic feature information, and the region feature information, and
Generates the semantic feature information between each object included in the image and at least one other object based on the corrected classification feature information, the corrected region semantic feature information, and the corrected region feature information,
A device for determining a target object in an image based on interactive input.
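
The first step, forming candidate regions from pairs of objects, can be sketched as follows. Taking the candidate region to be the smallest box enclosing both objects of a pair is an assumption made here for illustration, as are the names `union_box` and `candidate_regions`. The claim states only that each candidate region is based on an object and at least one other object, and the later steps (classification, region semantics, joint correction) are not modeled.

```python
from itertools import combinations

def union_box(a, b):
    """Smallest (x_min, y_min, x_max, y_max) box enclosing boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def candidate_regions(boxes):
    """Map each unordered pair of object indices to an assumed candidate region."""
    return {
        (i, j): union_box(boxes[i], boxes[j])
        for i, j in combinations(range(len(boxes)), 2)
    }

# Toy object boxes from a hypothetical detector.
boxes = [(0, 0, 2, 2), (1, 1, 4, 3), (5, 5, 6, 6)]
regions = candidate_regions(boxes)
```

With three objects this yields three candidate regions, one per pair; the pair `(0, 1)` maps to the enclosing box `(0, 0, 4, 3)`.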

The device of claim 15,
The feature acquisition unit
Performs word vector transformation on the interactive input, and
Acquires the second feature information corresponding to the interactive input based on the word vector,
A device for determining a target object in an image based on interactive input.

The device of claim 15,
The feature acquisition unit,
When performing the word vector transformation on the interactive input,
Determines whether a word of the interactive input is one of the preset first words and, when it is, uses the word vector of a second word having high similarity to the word vector of the first word as the word vector corresponding to the first word,
The first word is a word whose frequency of use is lower than a first set value, and
The second word is a word whose frequency of use is higher than a second set value,
A device for determining a target object in an image based on interactive input.

A computer-readable recording medium on which a program for executing the method of any one of claims 1 to 14 is recorded.