KR20210076110A

KR20210076110A - Methods for finding image regions, model training methods and related devices

Info

Publication number: KR20210076110A
Application number: KR1020217014824A
Authority: KR
Inventors: Lin Ma
Original assignee: Tencent Technology (Shenzhen) Company Limited
Priority date: 2019-03-13
Filing date: 2020-03-10
Publication date: 2021-06-23
Also published as: EP3940638A1; EP3940638B1; CN109903314A; EP3940638A4; US20210264227A1; JP7096444B2; KR102646667B1; JP2022508790A; WO2020182112A1

Abstract

The present disclosure provides a method for finding an image region. The method includes: generating (102) a region semantic information set according to an image candidate region set of a to-be-located image; obtaining (103), by using a graph convolutional network (GCN), an enhanced semantic information set corresponding to the region semantic information set, the GCN being configured to build association relationships between the pieces of region semantic information; obtaining (105), by using an image region finding network model, a matching degree between a text feature set corresponding to a to-be-located text and each piece of enhanced semantic information; and determining (106) a target image candidate region from the image candidate region set according to the matching degree between the text feature set and each piece of enhanced semantic information. The present disclosure further discloses a model training method and a related apparatus. In the present disclosure, the semantic representations of the image candidate regions are enhanced by using the GCN, which improves the accuracy of locating an image candidate region and thereby improves image understanding.

Description

Methods for finding image regions, model training methods and related devices

This application claims priority to Chinese Patent Application No. 201910190207.2, entitled 'Method for Finding Image Region, Model Training Method, and Related Apparatus' and filed on March 13, 2019, which is incorporated herein by reference in its entirety.

Embodiments of the present disclosure relate to a method for finding an image region, a model training method, and a related apparatus.

As artificial intelligence advances, locating the region of an image that corresponds to a natural sentence has become an important task in machine learning. When there are many images, manually extracting the regions associated with natural sentences is time-consuming and error-prone. It is therefore necessary to locate image regions by machine.

Currently, in a method for finding an image region, a plurality of candidate regions are first extracted from an image by means of object proposals. A matching model is then used to determine the matching relationship between each object proposal and a natural sentence, so that the local region that best matches the natural language is selected as the target image region, which completes the corresponding natural sentence image localization task.

According to a first aspect of the present disclosure, a method for finding an image region is provided, including:

generating a region semantic information set according to an image candidate region set of a to-be-located image, wherein each piece of region semantic information in the region semantic information set corresponds to one image candidate region in the image candidate region set;

obtaining, by using a graph convolutional network (GCN), an enhanced semantic information set corresponding to the region semantic information set, wherein each piece of enhanced semantic information in the enhanced semantic information set corresponds to one piece of region semantic information in the region semantic information set, and the GCN is configured to build association relationships between the pieces of region semantic information;

obtaining, by using an image region finding network model, a matching degree between a text feature set corresponding to a to-be-located text and each piece of enhanced semantic information, wherein the image region finding network model is configured to determine a matching relationship between the image candidate region and the to-be-located text, and each word in the to-be-located text corresponds to one word feature in the text feature set; and

determining a target image candidate region from the image candidate region set according to the matching degree between the text feature set and each piece of enhanced semantic information.
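
The last two steps of the method can be illustrated with a minimal sketch. It assumes, purely for illustration, that the T word features are mean-pooled into one sentence vector and that the matching degree is a cosine similarity; the disclosure leaves the actual scoring to the image region finding network model. The names `mean_pool`, `cosine`, and `select_target_region` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def mean_pool(features: Sequence[Sequence[float]]) -> List[float]:
    """Average T word features into one sentence vector (assumed pooling)."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used here as a stand-in matching degree."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_target_region(
    enhanced: Sequence[Sequence[float]],
    text_features: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Score every enhanced semantic vector against the text and pick the best.

    Returns the index of the target image candidate region and all scores.
    """
    sentence = mean_pool(text_features)
    degrees = [cosine(n_i, sentence) for n_i in enhanced]
    best = max(range(len(degrees)), key=degrees.__getitem__)
    return best, degrees


# Three candidate regions (N = 3) and two word features (T = 2).
enhanced = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_features = [[0.0, 2.0], [0.0, 4.0]]
best, degrees = select_target_region(enhanced, text_features)
```

In this toy input the second region points in the same direction as the pooled text vector, so it is the one selected.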

According to a second aspect of the present disclosure, a model training method is provided, including:

obtaining a to-be-trained text set and a to-be-trained image candidate region set, wherein the to-be-trained text set includes a first to-be-trained text and a second to-be-trained text, the to-be-trained image candidate region set includes a first to-be-trained image candidate region and a second to-be-trained image candidate region, the first to-be-trained text has a matching relationship with the first to-be-trained image candidate region and has no matching relationship with the second to-be-trained image candidate region, and the second to-be-trained text has a matching relationship with the second to-be-trained image candidate region and has no matching relationship with the first to-be-trained image candidate region;

determining a target loss function according to the first to-be-trained text, the second to-be-trained text, the first to-be-trained image candidate region, and the second to-be-trained image candidate region; and

training a to-be-trained image region finding network model by using the target loss function to obtain an image region finding network model, wherein the image region finding network model is configured to determine a matching relationship between an image candidate region and a to-be-located text according to a text feature set and enhanced semantic information, the enhanced semantic information corresponds to the image candidate region, and the text feature set corresponds to the to-be-located text.

According to a third aspect of the present disclosure, an apparatus for finding an image region is provided, including:

a generation module, configured to generate a region semantic information set according to an image candidate region set of a to-be-located image, wherein each piece of region semantic information in the region semantic information set corresponds to one image candidate region in the image candidate region set;

an obtaining module, configured to obtain, by using a GCN, an enhanced semantic information set corresponding to the region semantic information set generated by the generation module, wherein each piece of enhanced semantic information in the enhanced semantic information set corresponds to one piece of region semantic information in the region semantic information set, and the GCN is configured to build association relationships between the pieces of region semantic information,

the obtaining module being further configured to obtain, by using an image region finding network model, a matching degree between a text feature set corresponding to a to-be-located text and each piece of enhanced semantic information, wherein the image region finding network model is configured to determine a matching relationship between the image candidate region and the to-be-located text, and each word in the to-be-located text corresponds to one word feature in the text feature set; and

a determining module, configured to determine a target image candidate region from the image candidate region set according to the matching degree between the text feature set and each piece of enhanced semantic information obtained by the obtaining module.

According to a fourth aspect of the present disclosure, a model training apparatus is provided, including:

an obtaining module, configured to obtain a to-be-trained text set and a to-be-trained image candidate region set, wherein the to-be-trained text set includes a first to-be-trained text and a second to-be-trained text, the to-be-trained image candidate region set includes a first to-be-trained image candidate region and a second to-be-trained image candidate region, the first to-be-trained text has a matching relationship with the first to-be-trained image candidate region and has no matching relationship with the second to-be-trained image candidate region, and the second to-be-trained text has a matching relationship with the second to-be-trained image candidate region and has no matching relationship with the first to-be-trained image candidate region;

a determining module, configured to determine a target loss function according to the first to-be-trained text, the second to-be-trained text, the first to-be-trained image candidate region, and the second to-be-trained image candidate region obtained by the obtaining module; and

a training module, configured to train a to-be-trained image region finding network model by using the target loss function determined by the determining module to obtain an image region finding network model, wherein the image region finding network model is configured to determine a matching relationship between an image candidate region and a to-be-located text according to a text feature set and enhanced semantic information, the enhanced semantic information corresponds to the image candidate region, and the text feature set corresponds to the to-be-located text.

According to a fifth aspect of the present disclosure, a terminal device is provided, including a memory, a transceiver, a processor, and a bus system.

The memory is configured to store a program.

The processor is configured to execute the program in the memory to perform the following operations:

generating a region semantic information set according to an image candidate region set of a to-be-located image, wherein each piece of region semantic information in the region semantic information set corresponds to one image candidate region in the image candidate region set;

obtaining, by using a GCN, an enhanced semantic information set corresponding to the region semantic information set, wherein each piece of enhanced semantic information in the enhanced semantic information set corresponds to one piece of region semantic information in the region semantic information set, and the GCN is configured to build association relationships between the pieces of region semantic information;

obtaining, by using an image region finding network model, a matching degree between a text feature set corresponding to a to-be-located text and each piece of enhanced semantic information, wherein the image region finding network model is configured to determine a matching relationship between the image candidate region and the to-be-located text, and each word in the to-be-located text corresponds to one word feature in the text feature set; and

determining a target image candidate region from the image candidate region set according to the matching degree between the text feature set and each piece of enhanced semantic information.

The bus system is configured to connect the memory and the processor so that the memory and the processor can communicate with each other.

In a possible design, in a possible implementation of the fifth aspect of the embodiments of the present disclosure, the processor is further configured to execute the program in the memory to perform the following operations:

obtaining, by using a convolutional neural network (CNN), region semantic information corresponding to each image candidate region, wherein the image candidate region includes region information, and the region information includes position information of the image candidate region in the to-be-located image and size information of the image candidate region; and

generating the region semantic information set according to N pieces of region semantic information when the region semantic information corresponding to N image candidate regions has been obtained, wherein N is an integer greater than or equal to 1.
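
The construction of the region semantic information set can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the CNN is passed in as a hypothetical callable `cnn_features`, the position and size are normalized by the image dimensions, and the two parts are simply concatenated. None of these choices is specified in the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CandidateRegion:
    """An image candidate region described by a rectangular box (pixels)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def region_info(r: CandidateRegion, image_w: float, image_h: float) -> List[float]:
    """Position and size information, normalized to [0, 1] (assumed encoding)."""
    return [r.x / image_w, r.y / image_h, r.width / image_w, r.height / image_h]


def build_region_semantic_set(
    regions: Sequence[CandidateRegion],
    image_w: float,
    image_h: float,
    cnn_features: Callable[[CandidateRegion], List[float]],
) -> List[List[float]]:
    """Return one region semantic vector per candidate region (N >= 1)."""
    if len(regions) < 1:
        raise ValueError("at least one image candidate region is required")
    # Assumed fusion: CNN appearance features followed by region information.
    return [cnn_features(r) + region_info(r, image_w, image_h) for r in regions]


# Usage with a stand-in "CNN" that returns a constant one-dimensional feature.
regions = [CandidateRegion(0, 0, 50, 100), CandidateRegion(50, 0, 50, 100)]
semantic_set = build_region_semantic_set(regions, 100, 100, lambda r: [1.0])
```

Each output vector corresponds to exactly one candidate region, which mirrors the one-to-one correspondence stated above.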

obtaining first region semantic information and second region semantic information from the region semantic information set, wherein the first region semantic information is any piece of region semantic information in the region semantic information set, and the second region semantic information is any piece of region semantic information in the region semantic information set;

obtaining the strength of a connecting edge between the first region semantic information and the second region semantic information;

normalizing the strength of the connecting edge between the first region semantic information and the second region semantic information to obtain a normalized strength;

determining a target connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set; and

determining, by using the GCN, the enhanced semantic information set corresponding to the target connection matrix.

generating a connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set; and

generating the target connection matrix according to the connection matrix and an identity matrix.
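
The operations above (edge strength, normalization, connection matrix, target connection matrix) can be sketched in a few lines. The text does not state how the strength is computed, how it is normalized, or how the identity matrix is combined, so this sketch assumes a dot product for the strength, a row-wise softmax for the normalization, and a simple sum with the identity matrix. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def edge_strength(a: Sequence[float], b: Sequence[float]) -> float:
    """Strength of the connecting edge between two region semantic vectors
    (assumed here to be a dot product)."""
    return sum(x * y for x, y in zip(a, b))


def softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable softmax, used as the assumed normalization."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def target_connection_matrix(region_semantics: Sequence[Sequence[float]]) -> Matrix:
    """Build the connection matrix from normalized strengths, then add the
    identity matrix so that every node also keeps its own information."""
    n = len(region_semantics)
    strengths = [
        [edge_strength(region_semantics[i], region_semantics[j]) for j in range(n)]
        for i in range(n)
    ]
    connection = [softmax(row) for row in strengths]
    return [
        [connection[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


E = target_connection_matrix([[1.0, 0.0], [0.0, 1.0]])
```

With a softmax normalization each row of the connection matrix sums to 1, so each row of the target connection matrix sums to 2 after the identity is added.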

The processor is further configured to execute the program in the memory to perform the operation of calculating the enhanced semantic information in the following manner:

$$n_i^{k} = \sum_{j \in \mathrm{neighboring}(i)} \left( w_j^{k}\, n_j^{k-1} + b_j^{k} \right) E_{ij}$$

where

$n_i^{k}$ represents the i-th piece of enhanced semantic information corresponding to the k-th layer of the GCN,

$n_j^{k-1}$ represents the j-th piece of enhanced semantic information corresponding to the (k-1)-th layer of the GCN,

$w_j^{k}$ represents a first network parameter of the k-th layer of the GCN,

$b_j^{k}$ represents a second network parameter of the k-th layer of the GCN,

$j \in \mathrm{neighboring}(i)$ indicates that the j-th node is a neighboring node of the i-th node, and

$E_{ij}$ represents an element of the target connection matrix.
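
One layer of this update can be sketched as below: each node sums the affine-transformed features of its neighbors from the previous layer, weighted by the corresponding element of the target connection matrix. For brevity the sketch assumes a single weight matrix `W` and bias `b` shared by all nodes in the layer, and treats node j as a neighbor of node i whenever the matrix element is nonzero; both are assumptions, and `gcn_layer` is a hypothetical name.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Sequence[Sequence[float]], v: Sequence[float]) -> Vector:
    """Multiply matrix W (rows x cols) by vector v (cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gcn_layer(
    prev: Sequence[Sequence[float]],  # enhanced semantics from layer k-1
    E: Sequence[Sequence[float]],     # target connection matrix
    W: Sequence[Sequence[float]],     # first network parameter (weights)
    b: Sequence[float],               # second network parameter (bias)
) -> Matrix:
    """Compute the layer-k enhanced semantic information for every node."""
    # Affine transform of every node's previous-layer feature.
    transformed = [
        [t + bias for t, bias in zip(matvec(W, n_j), b)] for n_j in prev
    ]
    out: Matrix = []
    for i in range(len(prev)):
        acc = [0.0] * len(b)
        for j, t_j in enumerate(transformed):
            if E[i][j] != 0.0:  # j is treated as a neighbor of i
                acc = [a + E[i][j] * x for a, x in zip(acc, t_j)]
        out.append(acc)
    return out


prev = [[1.0, 2.0], [3.0, 4.0]]
E = [[1.0, 0.5], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
enhanced = gcn_layer(prev, E, identity, [0.0, 0.0])
```

Stacking several such layers lets information propagate between regions that are not directly connected.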

The processor is configured to execute the program in the memory to perform the following operations:

obtaining the to-be-located text;

obtaining a text vector sequence according to the to-be-located text, wherein the text vector sequence includes T word vectors, each word vector corresponds to one word, and T is an integer greater than or equal to 1;

encoding each word vector in the text vector sequence to obtain a text feature; and

generating the text feature set according to T text features when the text features corresponding to the T word vectors have been obtained.

In a possible implementation of the fifth aspect of the embodiments of the present disclosure, the processor is configured to execute the program in the memory to perform the operation of obtaining the text feature in the following manner:

$$h_t = \mathrm{LSTM}(w_t, h_{t-1})$$

where

$h_t$ represents the t-th text feature in the text feature set,

$\mathrm{LSTM}(\cdot)$ represents encoding by using a Long Short-Term Memory (LSTM) network,

$w_t$ represents the t-th word vector in the text vector sequence, and

$h_{t-1}$ represents the (t-1)-th text feature in the text feature set.
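
The recurrence can be made concrete with a small pure-Python LSTM cell that encodes each word vector together with the previous text feature. This is a sketch of a standard LSTM, not the disclosed network: the internal cell state, the gate layout, and the parameter dictionary `params` are assumptions, and the weights in the usage example are arbitrary constants rather than trained values.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Gate = Tuple[Matrix, Vector]  # (weights over [w_t; h_prev], bias)


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _affine(z: Sequence[float], W: Matrix, b: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]


def lstm_step(
    w_t: Sequence[float], h_prev: Vector, c_prev: Vector, params: Dict[str, Gate]
) -> Tuple[Vector, Vector]:
    """One step: returns the new text feature h_t and the new cell state."""
    z = list(w_t) + list(h_prev)  # concatenate word vector and previous feature
    i = [_sigmoid(v) for v in _affine(z, *params["input"])]
    f = [_sigmoid(v) for v in _affine(z, *params["forget"])]
    o = [_sigmoid(v) for v in _affine(z, *params["output"])]
    g = [math.tanh(v) for v in _affine(z, *params["cell"])]
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c


def encode_text(
    word_vectors: Sequence[Sequence[float]], params: Dict[str, Gate], hidden: int
) -> List[Vector]:
    """Encode T word vectors into a text feature set of T text features."""
    h, c = [0.0] * hidden, [0.0] * hidden
    features = []
    for w_t in word_vectors:
        h, c = lstm_step(w_t, h, c, params)
        features.append(h)
    return features


# Usage: 3-dimensional word vectors, 2-dimensional text features.
hidden, dim = 2, 3
gate = ([[0.1] * (dim + hidden) for _ in range(hidden)], [0.0] * hidden)
params = {name: gate for name in ("input", "forget", "output", "cell")}
words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
text_features = encode_text(words, params, hidden)
```

Because each text feature depends on the previous one, the t-th feature summarizes the sentence up to and including the t-th word.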

According to a sixth aspect of the present disclosure, a server is provided.

The processor is configured to execute the program in the memory to perform operations including:

determining a target image candidate region from the image candidate region set according to the matching degree between the text feature set and each piece of enhanced semantic information.

According to a seventh aspect of the present disclosure, a server is provided.

The processor is configured to perform the following operations:

obtaining a to-be-trained text set and a to-be-trained image candidate region set, wherein the to-be-trained text set includes a first to-be-trained text and a second to-be-trained text, the to-be-trained image candidate region set includes a first to-be-trained image candidate region and a second to-be-trained image candidate region, the first to-be-trained text has a matching relationship with the first to-be-trained image candidate region and has no matching relationship with the second to-be-trained image candidate region, and the second to-be-trained text has a matching relationship with the second to-be-trained image candidate region and has no matching relationship with the first to-be-trained image candidate region;

determining a target loss function according to the first to-be-trained text, the second to-be-trained text, the first to-be-trained image candidate region, and the second to-be-trained image candidate region; and

training a to-be-trained image region finding network model by using the target loss function to obtain an image region finding network model, wherein the image region finding network model is configured to determine a matching relationship between an image candidate region and a to-be-located text according to a text feature set and enhanced semantic information, the enhanced semantic information corresponds to the image candidate region, and the text feature set corresponds to the to-be-located text.

In a possible design, in a first implementation of the seventh aspect of the embodiments of the present disclosure, the processor is configured to execute the program in the memory to perform the operation of determining the target loss function in the following manner:

$$L(n_i^{+}, h_i^{+}, n_j^{-}, h_k^{-}) = \lambda_1 \max\!\left(0,\; u_1 + d(n_i^{+}, h_k^{-}) - d(n_i^{+}, h_i^{+})\right) + \lambda_2 \max\!\left(0,\; u_2 + d(n_j^{-}, h_i^{+}) - d(n_i^{+}, h_i^{+})\right)$$

where

$L$ represents the target loss function,

$n_i^{+}$ represents the first to-be-trained image candidate region,

$h_i^{+}$ represents the first to-be-trained text,

$n_j^{-}$ represents the second to-be-trained image candidate region,

$h_k^{-}$ represents the second to-be-trained text,

$d(\cdot)$ represents a to-be-trained data pair, $\max(\cdot)$ represents taking the maximum value,

$\lambda_1$ represents a first parameter control weight,

$\lambda_2$ represents a second parameter control weight,

$u_1$ represents a first preset threshold, and

$u_2$ represents a second preset threshold.
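
A loss built from one matched pair, two mismatched pairs, two weights, and two thresholds is commonly realized as a two-term margin ranking loss. The sketch below takes that form as an assumption: it treats the score of a data pair as a matching degree where higher means a better match, and penalizes a mismatched pair whenever its score comes within the threshold of the matched pair's score. The function name, argument names, and default values are hypothetical.

```python
def target_loss(
    d_matched: float,             # score of (first region, first text)
    d_region_wrong_text: float,   # score of (first region, second text)
    d_wrong_region_text: float,   # score of (second region, first text)
    lambda1: float = 1.0,         # first parameter control weight
    lambda2: float = 1.0,         # second parameter control weight
    u1: float = 0.1,              # first preset threshold (margin)
    u2: float = 0.1,              # second preset threshold (margin)
) -> float:
    """Two-term margin ranking loss over one matched and two mismatched pairs."""
    # Term 1: the matched region should score the wrong text lower.
    text_term = max(0.0, u1 + d_region_wrong_text - d_matched)
    # Term 2: the matched text should score the wrong region lower.
    region_term = max(0.0, u2 + d_wrong_region_text - d_matched)
    return lambda1 * text_term + lambda2 * region_term


well_separated = target_loss(0.9, 0.2, 0.3)
violating = target_loss(0.5, 0.6, 0.3)
```

The loss is zero once both mismatched pairs score below the matched pair by at least the corresponding threshold, so training effort concentrates on pairs that are still confusable.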

According to an eighth aspect of the present disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the method according to any one of the foregoing aspects.

According to a ninth aspect of the present disclosure, a method for finding an image region is provided, including:

receiving an image locating instruction;

obtaining, in response to the image locating instruction, an image candidate region set of a to-be-located image according to the image locating instruction, wherein the image candidate region set includes N image candidate regions, and N is an integer greater than or equal to 1;

generating a region semantic information set according to the image candidate region set, wherein the region semantic information set includes N pieces of region semantic information, and each piece of region semantic information corresponds to one image candidate region;

obtaining, by using a GCN, an enhanced semantic information set corresponding to the region semantic information set, wherein the enhanced semantic information set includes N pieces of enhanced semantic information, each piece of enhanced semantic information corresponds to one piece of region semantic information, and the GCN is configured to build association relationships between the pieces of region semantic information;

obtaining a text feature set corresponding to a to-be-located text, wherein the to-be-located text includes T words, the text feature set includes T word features, each word corresponds to one word feature, and T is an integer greater than or equal to 1;

obtaining, by using an image region finding network model, a matching degree between the text feature set and each piece of enhanced semantic information, wherein the image region finding network model is configured to determine a matching relationship between the image candidate region and the to-be-located text;

determining a target image candidate region from the image candidate region set according to the matching degree between the text feature set and each piece of enhanced semantic information; and

sending an image generation instruction to a client, so that the client displays the target image candidate region according to the image generation instruction.

FIG. 1 is a schematic architectural diagram of a system for finding an image region according to an embodiment of the present disclosure.
FIG. 2 is a schematic diagram of an overall framework for finding an image region according to an embodiment of the present disclosure.
FIG. 3 is a schematic diagram of an embodiment of a method for finding an image region according to an embodiment of the present disclosure.
FIG. 4 is a schematic diagram of an embodiment of a model training method according to an embodiment of the present disclosure.
FIG. 5 is a schematic diagram of an embodiment of an apparatus for finding an image region according to an embodiment of the present disclosure.
FIG. 6 is a schematic diagram of an embodiment of a model training apparatus according to an embodiment of the present disclosure.
FIG. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present disclosure.
FIG. 8 is a schematic structural diagram of a server according to an embodiment of the present disclosure.

In the related art, the image region that best matches a natural sentence can be located in an image, but the spatial relationships between local regions are not considered and the semantic information between local regions is ignored. As a result, the target image region cannot be located accurately, which degrades image understanding.

Embodiments of the present disclosure provide a method for finding an image region, a model training method, and a related apparatus. The semantic representations of image candidate regions can be effectively enhanced by using a GCN, and the spatial relationships between image candidate regions are taken into account, which improves the accuracy of locating an image region and thereby improves image understanding.

In the specification, the claims, and the accompanying drawings of the present disclosure, the terms "first", "second", "third", "fourth", and the like (if any) are used to distinguish between similar objects and do not necessarily describe a specific sequence or order. Data used in this way is interchangeable where appropriate, so that the embodiments of the present disclosure described herein can be implemented in an order other than the order illustrated or described herein. In addition, the terms "include", "corresponding to", and any variants thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.

The method for finding an image region provided in the present disclosure is applicable to fields such as image processing and pattern recognition. It finds a target of interest in an image, so that the specific type of the target can be determined and a bounding box of the target can be provided. Methods for finding image regions are widely applied in fields such as face recognition, medical imaging, intelligent video surveillance, robot navigation, content-based image retrieval, image-based drawing technology, image editing, and augmented reality. For example, in a content-based image retrieval scenario, assume that there is an image A from which a plurality of candidate regions are extracted, and that a user enters the sentence "a boy is holding an apple". The sentence is matched against each candidate region, and a target candidate region is selected from the plurality of candidate regions according to the matching result. The present disclosure mainly uses a GCN to accomplish natural sentence image finding. The natural sentence may be a word, a phrase, or a sentence, and the target candidate region of the image that corresponds to the natural sentence is found. The target candidate region may be defined as a rectangular box.

In actual application, finding an image region may involve three levels. The first level is the image level: it is determined whether a related target object exists in the image. This corresponds to image classification or image annotation. For example, for the word "apple", the object "apple" may be circled in the image.

The second level is the region level: it is determined whether a region of the image contains a target of a given type. This corresponds to target detection in an image. For example, for "a boy is holding an apple", a region that contains the boy and the apple may be framed in the image.

The third level is the pixel level: the type of target object to which each pixel of the image belongs is determined. Pixel-level segmentation includes type-level target segmentation and semantic segmentation. The main difference between the two is that semantic segmentation must segment every target in the image, including the background, and determine the type of each target, whereas type-level target segmentation only needs to segment and classify the targets of interest.

For ease of description, the present disclosure proposes a method for finding an image region. The method is applicable to the system for finding an image region shown in FIG. 1, which is a schematic architectural diagram of a system for finding an image region according to an embodiment of the present disclosure. As shown in the figure, the method provided in the present disclosure may be applied to a server or to a client. When the method is applied to the server, the server determines the finding result, sends it to the client, and the client displays the corresponding target image candidate region. When the method is applied to the client, the client determines the finding result and directly displays the corresponding target image candidate region.

Specifically, for one image, an image detection method is first used to obtain a plurality of image candidate regions (that is, local regions of the image). The spatial relationships between these image candidate regions are used to build a graph. A CNN may then be used to extract the semantic feature of each image candidate region. Based on the obtained semantic features and the constructed graph, a GCN is used to further learn the representations of the image candidate regions. Based on the representations obtained through the GCN, a semantic matching method is used to measure the semantic relevance between each image candidate region and the given natural sentence, and the most relevant image candidate region is determined as the final result of natural sentence image finding.

The client is deployed on a terminal device. Terminal devices include, but are not limited to, tablet computers, notebook computers, palmtop computers, mobile phones, voice interaction devices, and personal computers (PCs). Voice interaction devices include, but are not limited to, smart speakers and smart home appliances.

The method for finding an image region proposed in the present disclosure can provide a natural sentence image finding service. The service may be deployed on the server side or on the terminal device side. It may be understood that applying the method on the server side enables a deeper understanding of images, so that more fine-grained annotation can be performed on them. This helps users search and match quickly and precisely, and can also be applied to personalized recommendation of image and text information. The method may also be deployed on a terminal device, for example a mobile phone or a robot. The camera of the robot acquires the image signal, and the user interacts with the robot in natural language. For example, after natural language text is obtained from the user through voice or keyboard input, the image region finding network model is used to find the local region of the image that corresponds to that text. In this way, the terminal device can interact better with the user.

In an exemplary scenario, a user can conveniently perform a precise search. The user enters natural language text into the terminal device by voice or keyboard. The terminal device uses the method for finding an image region in the embodiments of the present disclosure to determine the region of the image to be found that best matches the natural language text. This has practical significance in criminal investigation and in education. For example, in criminal investigation, a suspect with particular characteristics can be found accurately in surveillance video images. In education, any student can be found accurately in classroom video images. No complicated manual screening is required; the user only needs to enter natural language text.

In another exemplary scenario, a service terminal can conveniently make personalized recommendations to the terminal device of a user. The service terminal collects natural language text that the user has entered and fully authorized for use. It then uses the method for finding an image region in the embodiments of the present disclosure to determine the region of the image to be found that best matches the natural language text, and pushes similar image resources, video resources, web page resources, and the like for the selected region. More accurate personalized recommendation is thereby achieved, which improves the accuracy of the resource recommendation process.

For ease of understanding, refer to FIG. 2, which is a schematic diagram of an overall framework for finding an image region according to an embodiment of the present disclosure. As shown in the figure, for a natural image, an object proposal method is used to obtain the corresponding image candidate regions. After the image candidate regions are extracted, a CNN is used to extract the semantic representation of each image candidate region, so that each candidate region is represented by one feature vector. The semantic representations of the n candidate regions are thereby obtained, where n is the total number of image candidate regions extracted from the image. A GCN is then used to enhance the extracted semantic representations, yielding one enhanced semantic representation for each image candidate region. In constructing the GCN, the semantic similarity between image candidate regions is taken into account to build the corresponding graph and to define the corresponding connecting edge information, which is used to enhance the semantic representations of the image candidate regions.

For an input natural sentence (for example, "white man playing baseball on the left"), an RNN is used to encode the sentence and obtain its semantic representation. Given the semantic representation of the natural sentence and the enhanced semantic representations of the image candidate regions, a matching learning method is used to learn the semantic relationship between the natural sentence and each image candidate region. Finally, the semantic similarity between the natural sentence and the image candidate regions is used to select the most semantically relevant image candidate region as the target image candidate region.

With reference to the foregoing description, the method for finding an image region in the present disclosure is described below. Referring to FIG. 3, the description uses an example in which the method is applied to an apparatus for finding an image region. The apparatus may be deployed on a server or on a terminal device. An embodiment of the method for finding an image region in the embodiments of the present disclosure includes the following steps.

Step 101. The apparatus for finding an image region obtains an image candidate region set of an image to be found, where the image candidate region set includes N image candidate regions and N is an integer greater than or equal to 1.

In this embodiment, the apparatus for finding an image region first obtains the image to be found. The image to be found may be an image stored on a backend server, an image uploaded by a client, or a local image of the client. It may be understood that the apparatus for finding an image region may be deployed on a server or on a terminal device. This is not limited herein.

After the apparatus for finding an image region obtains the image to be found, an object proposal method may be used to extract the image candidate region set from the image. The image candidate region set includes N image candidate regions, where N is an integer greater than or equal to 1. When N is equal to 1, the image to be found has only one image candidate region, and that region is used directly as the target image candidate region.

Image candidate regions can be extracted from an image by an object proposal method. Specifically, the image candidate regions, which are the positions in the image at which a target may appear, are found in advance. By using information such as the texture, edges, and color of the image, a relatively high recall (with respect to Intersection-over-Union, IoU) can be maintained while a relatively small number of windows are selected. Object proposal methods include, but are not limited to, region-based CNN (R-CNN), Fast R-CNN, and Faster R-CNN. This is not limited herein.
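The IoU measure mentioned above can be illustrated with a minimal sketch. It assumes each box is given as (x, y, w, h) with (x, y) the top-left corner; that box format and the function name `iou` are illustrative assumptions, not taken from the disclosure.

```python
from typing import Tuple

# Assumed box format: (x, y, w, h), top-left corner plus width and height.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Return the Intersection-over-Union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

A proposal is typically counted as covering a ground-truth target when its IoU with that target exceeds a chosen threshold; the threshold value is a design choice and is not specified here.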

Step 102. The apparatus for finding an image region generates a region semantic information set according to the image candidate region set, where the region semantic information set includes N pieces of region semantic information and each piece of region semantic information corresponds to one image candidate region.

In this embodiment, after obtaining the image candidate region set, the apparatus for finding an image region uses a neural network to generate the semantic representation of each image candidate region, thereby obtaining the region semantic information set. The region semantic information set includes N pieces of region semantic information, each of which corresponds to one image candidate region.

The neural network may specifically be a CNN. In actual application, the neural network may be another type of neural network. The CNN is only an example here and should not be understood as a limitation on the present disclosure.

In other words, in the foregoing process, the apparatus for finding an image region generates the region semantic information set according to the image candidate region set of the image to be found, and each piece of region semantic information in the region semantic information set corresponds to one image candidate region in the image candidate region set.

Step 103. The apparatus for finding an image region obtains, by using a GCN, an enhanced semantic information set corresponding to the region semantic information set, where the enhanced semantic information set includes N pieces of enhanced semantic information, each piece of enhanced semantic information corresponds to one piece of region semantic information, and the GCN is configured to build association relationships between the pieces of region semantic information.

That is, each piece of enhanced semantic information in the enhanced semantic information set corresponds to one piece of region semantic information in the region semantic information set.

In this embodiment, the apparatus for finding an image region uses the GCN to obtain the enhanced semantic information set corresponding to the region semantic information set. That is, the semantic representations of the image candidate regions can be enhanced by using the GCN. The enhanced semantic information set includes N pieces of enhanced semantic information, so each image candidate region corresponds to one piece of region semantic information and to one piece of enhanced semantic information. A GCN can be used to build association relationships between nodes. In the present disclosure, association relationships are built between the pieces of region semantic information.

A GCN is a convolutional network model that operates on graphs. Its goal is to learn a mapping of signals or features on a graph G = (V, E), where V is the set of nodes and E is the set of edges. The graph is built after the image candidate regions are obtained, according to the spatial information between the image candidate regions. The information contained in the data, together with the relationships between the data, can then be used to enhance the semantic representations of the image candidate regions, so that the enhanced semantic information is obtained.
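As an illustration of how a graph convolution lets each region's representation absorb information from connected regions, the following is a minimal sketch of one propagation step. It assumes the widely used symmetric-normalized form ReLU(D^-1/2 (A + I) D^-1/2 H W), which this passage does not specify. The adjacency matrix `adj`, the features `feats`, and the weight matrix `weight` are hypothetical inputs; in practice the weights would be learned.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalize_adjacency(adj: Matrix) -> Matrix:
    """Add self-loops, then apply D^-1/2 (A + I) D^-1/2 (assumes non-negative edges)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj: Matrix, feats: Matrix, weight: Matrix) -> Matrix:
    """One propagation step: ReLU(A_norm @ feats @ weight)."""
    out = matmul(matmul(normalize_adjacency(adj), feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Hypothetical example: regions 0 and 1 are connected, region 2 is isolated.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # one feature vector per region
weight = [[1.0, 0.0], [0.0, 1.0]]              # identity, to isolate the graph effect
enhanced = gcn_layer(adj, feats, weight)
```

With the identity weight, the two connected regions end up with the average of their features, while the isolated region keeps its own feature unchanged.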

Step 104. The apparatus for finding an image region obtains a text feature set corresponding to a text to be found, where the text to be found includes T words, the text feature set includes T word features, each word corresponds to one word feature, and T is an integer greater than or equal to 1.

In this embodiment, the apparatus for finding an image region obtains the text to be found. It may be understood that step 104 may be performed before step 101, after step 103, or at the same time as step 101. The execution order of step 104 is not limited herein. The text to be found may specifically be text entered by a user, or text obtained by recognizing speech entered by the user. The text to be found takes the form of a word, a phrase, a sentence, a paragraph, or the like, and may be in Chinese, English, Japanese, French, German, Russian, or another language. This is not limited herein.

After the text to be found is obtained, feature extraction and encoding are performed on each word of the text to finally obtain the text feature set. For example, the text to be found "boy holds an apple" contains four words: "boy", "holds", "an", and "apple". The features of the four words are extracted and then encoded to obtain the text feature set. In general, the text to be found includes T words, the text feature set includes T word features, each word corresponds to one word feature, and T is an integer greater than or equal to 1.
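The one-word-to-one-feature correspondence can be sketched as follows. The function `word_feature` is a hypothetical, deterministic stand-in for a learned word embedding and encoder, and whitespace tokenization is an assumption that would not suit every language listed above.

```python
from typing import List


def word_feature(word: str, dim: int = 4) -> List[float]:
    """Toy stand-in for a learned word embedding: folds character codes into dim slots."""
    codes = [ord(c) for c in word.lower()]
    return [sum(codes[i::dim]) / 1000.0 for i in range(dim)]


def text_feature_set(text: str, dim: int = 4) -> List[List[float]]:
    """Split the text into T words and return T word features, in word order."""
    words = text.split()  # assumption: words are separated by whitespace
    return [word_feature(w, dim) for w in words]


features = text_feature_set("boy holds an apple")  # T = 4 word features
```

The sketch only demonstrates the shape of the result: T words in, T fixed-length feature vectors out.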

It may be understood that the text to be found "boy holds an apple" is a specific expression. Therefore, an image candidate region that contains both the "boy" and the "apple" can be obtained from the image to be found.

Step 105. The apparatus for finding an image region obtains, by using an image region finding network model, the matching degree between the text feature set and each piece of enhanced semantic information, where the image region finding network model is configured to determine the matching relationship between an image candidate region and the text to be found.

In this embodiment, the apparatus for finding an image region may input each piece of enhanced semantic information, together with the text feature set, into the image region finding network model separately, and the model outputs the corresponding matching degree. The image region finding network model is configured to determine the matching relationship between an image candidate region and the text to be found. A higher matching degree indicates a stronger matching relationship.
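To make the idea of one matching degree per region concrete, the following minimal sketch scores each piece of enhanced semantic information against the text. Mean pooling of the word features and cosine similarity are stand-in assumptions; they are not the learned image region finding network model of the disclosure. The sketch also assumes that text features and region features share the same dimension.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(word_features: List[Vector]) -> Vector:
    """Average the T word features into one sentence-level vector."""
    t = len(word_features)
    return [sum(col) / t for col in zip(*word_features)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def matching_degrees(word_features: List[Vector],
                     enhanced: List[Vector]) -> List[float]:
    """Return one matching degree per piece of enhanced semantic information."""
    sentence = mean_pool(word_features)
    return [cosine(sentence, region) for region in enhanced]
```

A learned model would replace both the pooling and the similarity function, but the interface stays the same: N pieces of enhanced semantic information in, N matching degrees out.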

It may be understood that the matching degree may be expressed as a matching score, as a matching identifier, or as another type of matching relationship.

In other words, in the foregoing process, the apparatus for finding an image region uses the image region finding network model to obtain the matching degree between the text feature set corresponding to the text to be found and each piece of enhanced semantic information. Each word of the text to be found corresponds to one word feature in the text feature set.

Step 106. The apparatus for finding an image region determines a target image candidate region from the image candidate region set according to the matching degree between the text feature set and each piece of enhanced semantic information.

In this embodiment, the apparatus for finding an image region may select, according to the matching degree between the text feature set and each piece of enhanced semantic information, the image candidate region with the highest matching degree as the target image candidate region. For ease of description, Table 1 gives schematic matching degrees between the text feature set and the enhanced semantic information.

As can be seen from Table 1, the matching degree corresponding to "text feature set + enhanced semantic information D" is the largest. Therefore, the apparatus for finding an image region uses image candidate region D as the target image candidate region.
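The selection rule in step 106 amounts to taking the candidate with the largest matching degree. A minimal sketch follows; the scores are hypothetical values chosen only so that region D is the maximum, and they are not the values of Table 1.

```python
from typing import Dict


def select_target_region(matching: Dict[str, float]) -> str:
    """Return the identifier of the candidate region with the highest matching degree."""
    if not matching:
        raise ValueError("at least one image candidate region is required")
    return max(matching, key=matching.get)


# Hypothetical matching degrees for candidate regions A to D.
scores = {"A": 0.62, "B": 0.35, "C": 0.18, "D": 0.91}
target = select_target_region(scores)
```

When the set contains a single candidate (N equal to 1), the function returns that candidate, which agrees with the treatment of N = 1 in step 101.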

In the embodiments of the present disclosure, a method for finding an image region is provided. First, an image candidate region set is obtained from the image to be found, where the image candidate region set includes N image candidate regions. Next, a region semantic information set is generated according to the image candidate region set, where each piece of region semantic information corresponds to one image candidate region. An enhanced semantic information set corresponding to the region semantic information set is then obtained by using a GCN, where each piece of enhanced semantic information corresponds to one piece of region semantic information and the GCN is configured to build association relationships between the pieces of region semantic information. In addition, a text feature set corresponding to the text to be found is obtained. The matching degree between the text feature set and each piece of enhanced semantic information is then obtained by using an image region finding network model. Finally, the target image candidate region is determined from the image candidate region set according to these matching degrees. In this manner, the semantic representation between image candidate regions is effectively enhanced by the GCN and the spatial relationship between image candidate regions is taken into account, which increases the accuracy of finding image regions and thereby improves image understanding.

Optionally, based on the foregoing embodiment corresponding to FIG. 3, in a first optional embodiment of the method for finding an image region provided in the embodiments of the present disclosure, the generating, by the apparatus for finding an image region, a region semantic information set according to the image candidate region set may include:

obtaining, by the apparatus for finding an image region, region semantic information corresponding to each image candidate region by using a CNN, where each image candidate region includes region information, and the region information includes position information of the image candidate region in the image to be found and size information of the image candidate region; and

generating, by the apparatus for finding an image region, the region semantic information set according to the N pieces of region semantic information once the region semantic information corresponding to the N image candidate regions has been obtained.

In this embodiment, after obtaining the set of image candidate regions, the apparatus for finding an image region may use a CNN to generate region semantic information for each image candidate region, where the region semantic information is the semantic representation of that image candidate region. Specifically, assume that the set of image candidate regions is defined as $\{b_1, b_2, \ldots, b_n\}$. Each image candidate region includes region information $b_i = \{x_1, y_1, w_1, h_1\}$, where $b_i$ denotes one image candidate region in the set. $x_1$ and $y_1$ denote the position information of the image candidate region in the image to be found: $x_1$ is the horizontal coordinate of the highest point of the image candidate region, and $y_1$ is the vertical coordinate of that point. $w_1$ and $h_1$ denote the size information of the image candidate region, expressed as a proportion of the size of the image to be found: $w_1$ is the width of the image candidate region, and $h_1$ is its height.
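As an illustration of the region information described above, the following minimal sketch converts a pixel bounding box into position and proportional size values. The names `RegionInfo` and `region_info`, the `(left, top, width, height)` pixel layout, and the choice to express the position as a fraction of the image as well are assumptions made for the example and are not specified in the disclosure.

```python
from typing import NamedTuple


class RegionInfo(NamedTuple):
    """Hypothetical container for the region information of one candidate."""
    x: float  # horizontal position of the reference point
    y: float  # vertical position of the reference point
    w: float  # width relative to the image width
    h: float  # height relative to the image height


def region_info(box, image_width, image_height):
    """Convert a pixel box (left, top, width, height) to relative values.

    Assumption: the position is normalized the same way as the size.
    """
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    left, top, width, height = box
    return RegionInfo(
        x=left / image_width,
        y=top / image_height,
        w=width / image_width,
        h=height / image_height,
    )


# A 100x40 pixel box whose top-left corner is at (50, 20) in a 200x100 image.
info = region_info((50, 20, 100, 40), 200, 100)
print(info)
```

Relative sizes keep the region information comparable across images with different resolutions.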

The image candidate region $b_i$ is input to the CNN to obtain the representation $I_i = \mathrm{CNN}(b_i)$.

In this way, the corresponding region semantic information $I_i$ is obtained. Proceeding in the same manner yields the region semantic information set $\{I_1, I_2, \ldots, I_n\}$ corresponding to the image candidate region set $\{b_1, b_2, \ldots, b_n\}$, where n is an integer greater than or equal to 1 and less than or equal to N.

For ease of understanding, a CNN generally includes several kinds of layers: a convolutional layer, a rectified linear unit (ReLU) layer, a pooling layer, and a fully connected layer.

Convolutional layer: each convolutional layer in a CNN is formed by a plurality of convolutional units. The parameters of each convolutional unit are obtained through optimization using a backpropagation algorithm. The purpose of the convolution operation is to extract different features from the input. The first convolutional layer may extract only low-level features such as edges, lines, and angles, whereas a network with more layers can iteratively extract more complex features from the low-level features.

ReLU layer: linear rectification (ReLU) is used as the activation function in this layer of the neural network.

Pooling layer: a feature with a very large dimension is usually obtained after a convolutional layer. The feature is divided into a plurality of regions, and the maximum or average value of each region is taken to obtain a new feature with a relatively small dimension.

Fully connected layer: all local features are combined to form a global feature, which is used to calculate the final score of each class.
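The ReLU, pooling, and fully connected layers described above can be sketched on a one-dimensional feature as follows. This is a simplified illustration in plain Python, not the network used in the disclosure; the function names, the non-overlapping pooling windows, and the toy weights are assumptions.

```python
def relu(values):
    """Linear rectification: negative activations become zero."""
    return [max(0.0, v) for v in values]


def max_pool_1d(values, window):
    """Split the feature into consecutive windows and keep each maximum."""
    return [max(values[i:i + window]) for i in range(0, len(values), window)]


def fully_connected(values, weights, biases):
    """Combine all local features into one score per output unit."""
    return [
        sum(w * v for w, v in zip(row, values)) + b
        for row, b in zip(weights, biases)
    ]


feature_map = [-1.0, 2.0, 0.5, -3.0, 4.0, 1.0]
activated = relu(feature_map)           # [0.0, 2.0, 0.5, 0.0, 4.0, 1.0]
pooled = max_pool_1d(activated, 2)      # [2.0, 0.5, 4.0]
scores = fully_connected(
    pooled,
    weights=[[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]],
    biases=[0.0, 0.5],
)                                       # [6.0, 1.5]
print(scores)
```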

In this embodiment of the present disclosure, a manner of generating the region semantic information set is provided. First, the region semantic information corresponding to each image candidate region is obtained by using a CNN. The image candidate region includes region information, and the region information includes position information of the image candidate region in the image to be found and size information of the image candidate region. Once the region semantic information corresponding to the N image candidate regions has been obtained, the region semantic information set is generated according to the N pieces of region semantic information. In this manner, the region semantic information of each image candidate region can be extracted by using a CNN. A CNN is a feed-forward neural network whose artificial neurons respond to surrounding units within a partial coverage area (a local receptive field), which gives it excellent performance in large-scale image processing. The accuracy of information extraction is thereby improved.

Optionally, based on the foregoing embodiment corresponding to FIG. 3, in a second optional embodiment of the method for finding an image region provided in the embodiments of the present disclosure, the obtaining, by the apparatus for finding an image region, the enhanced semantic information set corresponding to the region semantic information set by using the GCN may include:

obtaining, by the apparatus for finding an image region, first region semantic information and second region semantic information from the region semantic information set, wherein the first region semantic information is any one piece of region semantic information in the region semantic information set, and the second region semantic information is any one piece of region semantic information in the region semantic information set;

obtaining, by the apparatus for finding an image region, a strength of a connecting edge between the first region semantic information and the second region semantic information;

normalizing, by the apparatus for finding an image region, the strength of the connecting edge between the first region semantic information and the second region semantic information to obtain a normalized strength;

determining, by the apparatus for finding an image region, a target connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set; and

determining, by the apparatus for finding an image region, the enhanced semantic information set corresponding to the target connection matrix by using the GCN.

In this embodiment, the apparatus for finding an image region uses a GCN to enhance the semantic representations of the image candidate regions. First, a graph needs to be built. Each node of the graph corresponds to the region semantic information of one image candidate region, and every pair of nodes is joined by a connecting edge. The strength of each connecting edge is predicted by a deep network:

$$e_{ij} = f_{edge}(n_i^k, n_j^k)$$

where $n_i^k$ denotes the first region semantic information, $n_j^k$ denotes the second region semantic information, and $f_{edge}(\cdot)$ denotes the deep network, which may specifically be implemented as a multi-layer perceptron, a vector inner product, or a cosine similarity. $e_{ij}$ denotes the strength of the connecting edge between the first region semantic information and the second region semantic information. Next, $e_{ij}$ is normalized to obtain the normalized strength. The target connection matrix is then determined according to the normalized strengths between the pieces of region semantic information in the region semantic information set. Finally, the GCN is used to generate the enhanced semantic information set corresponding to the target connection matrix.
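Of the three options named for the edge-strength network, cosine similarity is the simplest to show. The sketch below computes the strength of every connecting edge in a fully connected graph of region features; the function names and the toy two-dimensional features are assumptions for illustration, and a multi-layer perceptron or inner product could be passed in place of `cosine_similarity`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def edge_strengths(nodes, f_edge=cosine_similarity):
    """Return the matrix of strengths e[i][j] for every ordered pair of nodes."""
    return [[f_edge(a, b) for b in nodes] for a in nodes]


# Three toy region features; a real feature would come from the CNN.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
strengths = edge_strengths(nodes)
for row in strengths:
    print([round(v, 3) for v in row])
```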

A graph is a data format that can be used to represent social networks, communication networks, protein molecule networks, and the like. The nodes of the graph represent the individuals in the network, and the connecting edges represent the connection relationships between individuals. Many machine learning tasks need to work with graph-structured data, and the emergence of the GCN provides a new approach to such problems. A convolution slice can be built in three steps: step 1, select a node sequence of fixed length from the graph; step 2, collect a neighborhood set of fixed size for each node in the sequence; and step 3, normalize the subgraph formed by the current node and its corresponding neighborhood so that it can serve as the input to a convolutional structure. After all convolution slices have been built using these three steps, the convolutional structure is applied to each slice separately.
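The three steps for building convolution slices can be sketched as follows. The passage does not say how the node sequence is chosen or how a subgraph is normalized, so this example assumes nodes are ranked by degree, neighborhoods are collected breadth-first, and normalization means imposing a fixed order and padding to a fixed size. All function names are hypothetical.

```python
from collections import deque


def select_node_sequence(adjacency, length):
    """Step 1: pick a fixed-length node sequence (assumed: highest degree first)."""
    ranked = sorted(adjacency, key=lambda node: (-len(adjacency[node]), node))
    return ranked[:length]


def collect_neighborhood(adjacency, start, size):
    """Step 2: gather up to `size` nodes around `start` in breadth-first order."""
    seen = [start]
    queue = deque([start])
    while queue and len(seen) < size:
        node = queue.popleft()
        for neighbor in sorted(adjacency[node]):
            if neighbor not in seen:
                seen.append(neighbor)
                queue.append(neighbor)
                if len(seen) == size:
                    break
    return seen


def build_slices(adjacency, length, size, pad=None):
    """Step 3: emit one fixed-size, consistently ordered slice per selected node."""
    slices = []
    for node in select_node_sequence(adjacency, length):
        hood = collect_neighborhood(adjacency, node, size)
        slices.append(hood + [pad] * (size - len(hood)))
    return slices


adjacency = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
print(build_slices(adjacency, length=2, size=3))
```

Each slice has the same length, so a single convolutional structure can be applied to every slice in turn.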

In this embodiment of the present disclosure, a manner of obtaining the enhanced semantic information set by using a GCN is provided. First, the first region semantic information and the second region semantic information are obtained from the region semantic information set. The strength of the connecting edge between the first region semantic information and the second region semantic information is then obtained and normalized to produce the normalized strength. Next, the target connection matrix is determined according to the normalized strengths between the pieces of region semantic information in the region semantic information set. Finally, the enhanced semantic information set corresponding to the target connection matrix is determined by using the GCN. In this manner, the GCN establishes the semantic relationships between image candidate regions, so that both spatial information and semantic relationships are fully taken into account, which improves image-based finding performance.

Optionally, based on the foregoing second embodiment corresponding to FIG. 3, in a third optional embodiment of the method for finding an image region provided in the embodiments of the present disclosure, the determining, by the apparatus for finding an image region, the target connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set may include:

generating, by the apparatus for finding an image region, a connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set; and

generating, by the apparatus for finding an image region, the target connection matrix according to the connection matrix and an identity matrix.

In this embodiment, the apparatus for finding an image region may first normalize the strength of the connecting edge between the first region semantic information and the second region semantic information to obtain the normalized strength. Building on the foregoing embodiment, the strength of the connecting edge is given by a specific expression in the two pieces of region semantic information, where $n_i^k$ denotes the first region semantic information, $n_j^k$ denotes the second region semantic information, the remaining coefficients in the expression are all model parameters of the GCN, and $e_{ij}$ denotes the strength of the connecting edge between the first region semantic information and the second region semantic information.

A normalization operation may then be performed on the corresponding edge information to obtain the normalized strength between the first region semantic information and the second region semantic information. All pieces of region semantic information are traversed in this way, so that the pairwise connection information yields a complete connection matrix.

To further enhance the information, an identity matrix, which gives every node a connection to itself, is added to the complete connection matrix to obtain the target connection matrix.
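The two steps above, normalizing the edge strengths and then adding the identity matrix, can be sketched as follows. The passage does not show which normalization is used; this example assumes a row-wise softmax, which is one common choice and should not be read as the method of the disclosure. The function names are hypothetical.

```python
import math


def normalize_rows(strengths):
    """Assumed normalization: softmax over each row of edge strengths."""
    normalized = []
    for row in strengths:
        peak = max(row)  # subtracting the maximum keeps exp() numerically stable
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        normalized.append([v / total for v in exps])
    return normalized


def target_connection_matrix(strengths):
    """Connection matrix of normalized strengths plus the identity matrix."""
    connection = normalize_rows(strengths)
    size = len(connection)
    return [
        [connection[i][j] + (1.0 if i == j else 0.0) for j in range(size)]
        for i in range(size)
    ]


strengths = [[0.0, 0.0], [0.0, 0.0]]
print(target_connection_matrix(strengths))  # [[1.5, 0.5], [0.5, 1.5]]
```

With the softmax assumption each row of the connection matrix sums to 1, so each row of the target connection matrix sums to 2 once the identity is added.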

In this embodiment of the present disclosure, a manner of determining the target connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set is provided. A connection matrix is first generated according to the normalized strengths between the pieces of region semantic information in the region semantic information set, and the target connection matrix is then generated according to the connection matrix and an identity matrix. Normalization converts the absolute values in a physical system into relative values, which simplifies the calculation and reduces the magnitude of the values. In addition, to further enhance the information, the identity matrix is added to the connection matrix to form the target connection matrix.

Optionally, based on the foregoing second or third embodiment corresponding to FIG. 3, in a fourth optional embodiment of the method for finding an image region provided in the embodiments of the present disclosure, the determining, by the apparatus for finding an image region, the enhanced semantic information set corresponding to the target connection matrix by using the GCN may include:

calculating, by the apparatus for finding an image region, the enhanced semantic information set in the following manner:

$$n_i^k = \sum_{j \in \mathrm{neighboring}(i)} \left( w_j^k n_j^{k-1} + b_j^k \right) E_{ij}$$

where $w_j^k$ denotes a first network parameter of the k-th layer of the GCN, $b_j^k$ denotes a second network parameter of the k-th layer of the GCN, $j \in \mathrm{neighboring}(i)$ indicates that the j-th node is a neighboring node of the i-th node, and $E_{ij}$ denotes an element of the target connection matrix.

In this embodiment, the apparatus for finding an image region may enhance the semantic representations of the image candidate regions by using the GCN, based on the target connection matrix, according to the following formula:

$$n_i^k = \sum_{j \in \mathrm{neighboring}(i)} \left( w_j^k n_j^{k-1} + b_j^k \right) E_{ij}$$

where $w_j^k$ denotes the first network parameter of the k-th layer of the GCN, and $b_j^k$ denotes the second network parameter of the k-th layer of the GCN. It can be understood that the network parameters of the GCN are not shared between graph convolutional layers. Within a given convolutional layer, however, the network parameters of the GCN may or may not be shared. The neighboring nodes of node i are selected as the nodes j. The similarity between nodes may be measured by their semantic similarity. A fully connected graph structure is therefore built, in which each node is connected to every other node. Finally, the semantic representation of each node is updated based on the target connection matrix that has been built.

Multi-layer graph convolution may be performed in the GCN by applying the foregoing formula several times. The layers may share one set of network parameters, or the network parameters may not be shared.
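A graph convolution layer of the kind described above can be sketched as follows: each node's new representation is a sum over all nodes of an affine transform of their features, weighted by the corresponding element of the target connection matrix. The sketch assumes that one weight matrix and one bias vector are shared by all nodes within a layer, that every node is a neighbor of every other node, and that no activation function is applied; the disclosure leaves these points open, and the function names are hypothetical.

```python
def gcn_layer(nodes, target, weight, bias):
    """One graph convolution step over a fully connected graph.

    nodes:  list of feature vectors from the previous layer
    target: target connection matrix, target[i][j] weights node j for node i
    weight, bias: layer parameters (assumed shared by all nodes in the layer)
    """
    transformed = [
        [sum(w * x for w, x in zip(row, node)) + b for row, b in zip(weight, bias)]
        for node in nodes
    ]
    count, dim = len(nodes), len(bias)
    return [
        [sum(target[i][j] * transformed[j][d] for j in range(count)) for d in range(dim)]
        for i in range(count)
    ]


def gcn(nodes, target, layers):
    """Apply several layers in turn; `layers` is a list of (weight, bias) pairs."""
    for weight, bias in layers:
        nodes = gcn_layer(nodes, target, weight, bias)
    return nodes


nodes = [[1.0, 0.0], [0.0, 1.0]]
target = [[1.5, 0.5], [0.5, 1.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]

print(gcn(nodes, target, [(identity, zero_bias)]))      # [[1.5, 0.5], [0.5, 1.5]]
print(gcn(nodes, target, [(identity, zero_bias)] * 2))  # [[2.5, 1.5], [1.5, 2.5]]
```

Passing the same `(weight, bias)` pair for every layer corresponds to sharing parameters across layers, while passing different pairs corresponds to the unshared case.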

Further, in this embodiment of the present disclosure, a specific manner of determining the enhanced semantic information set corresponding to the target connection matrix by using the GCN is provided. Because a concrete calculation is given for the GCN-based computation, the feasibility and operability of the solution are improved.

Optionally, based on the foregoing embodiment corresponding to FIG. 3, in a fifth optional embodiment of the method for finding an image region provided in the embodiments of the present disclosure, the obtaining, by the apparatus for finding an image region, the text feature set corresponding to the text to be found may include:

obtaining, by the apparatus for finding an image region, the text to be found;

obtaining, by the apparatus for finding an image region, a text vector sequence according to the text to be found, wherein the text vector sequence includes T word vectors, and each word vector corresponds to one word;

encoding, by the apparatus for finding an image region, each word vector in the text vector sequence to obtain a text feature; and

generating, by the apparatus for finding an image region, the text feature set according to the T text features once the text features corresponding to the T word vectors have been obtained.

In this embodiment, the apparatus for finding an image region first obtains the text to be found. The text to be found may be text entered by a user, text converted from speech entered by a user, or text extracted by a backend. After the text to be found is obtained, each word in the text is extracted, and a word vector is built for each word. Assuming that the text to be found includes T words, T word vectors are obtained, and these T word vectors form the text vector sequence. The apparatus for finding an image region encodes the text vector sequence by using an LSTM network structure. Specifically, each word vector is encoded by using the LSTM structure to obtain T text features, from which the text feature set is generated.

In a natural language processing task, the first question is how to represent a word in a computer. There are generally two kinds of representation: a discrete representation (one-hot representation) and a distributed representation. In a one-hot representation, each word is represented by one long vector whose dimension equals the size of the vocabulary. Exactly one dimension of the vector has the value 1, and all other dimensions are 0; the dimension that is 1 identifies the current word. In word embedding, a word is instead converted into a distributed representation, also called a word vector. In the present disclosure, the word vector may have 300 dimensions. There are many ways to generate word vectors, and they all follow the same idea: the meaning of a word can be expressed by the words that surround it. The ways of generating word vectors include statistics-based methods and language-model-based methods.
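The difference between the two representations can be illustrated with a toy vocabulary. In the sketch below, the vocabulary, the whitespace tokenization, and the two-dimensional embedding values are all made up for the example; real word vectors would have on the order of 300 dimensions and would be learned from a corpus.

```python
def one_hot(word, vocabulary):
    """Discrete representation: a vocabulary-sized vector with a single 1."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


def text_vector_sequence(text, embeddings):
    """Distributed representation: look up one dense word vector per word."""
    return [embeddings[word] for word in text.lower().split()]


vocabulary = ["the", "dog", "runs"]
# Hypothetical 2-dimensional word vectors, chosen by hand for illustration.
embeddings = {
    "the": [0.1, 0.3],
    "dog": [0.8, -0.2],
    "runs": [-0.4, 0.6],
}

print(one_hot("dog", vocabulary))                         # [0, 1, 0]
print(text_vector_sequence("The dog runs", embeddings))   # T = 3 word vectors
```

The one-hot vector grows with the vocabulary, whereas the length of a word vector is fixed regardless of how many words the vocabulary contains.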

In this embodiment of the present disclosure, a method for obtaining the text feature set is provided. The text to be found is obtained first. A text vector sequence is then obtained according to the text to be found; the text vector sequence includes T word vectors, and each word vector corresponds to one word. Next, each word vector in the text vector sequence is encoded to obtain a text feature. Once the text features corresponding to the T word vectors have been obtained, the text feature set is generated according to the T text features. In this manner, the text to be found is expressed in feature form, which facilitates subsequent model prediction and improves the feasibility and operability of the solution.

Optionally, based on the foregoing fifth embodiment corresponding to FIG. 3, in a sixth optional embodiment of the method for finding an image region provided in the embodiments of the present disclosure, the encoding, by the apparatus for finding an image region, each word vector in the text vector sequence to obtain a text feature may include:

obtaining, by the apparatus for finding an image region, the text feature in the following manner:

$$h_t = \mathrm{LSTM}(w_t, h_{t-1})$$

where $h_t$ denotes the t-th text feature in the text feature set, $\mathrm{LSTM}(\cdot)$ denotes encoding by using an LSTM network, $w_t$ denotes the t-th word vector in the text vector sequence, and $h_{t-1}$ denotes the (t-1)-th text feature in the text feature set.

In this embodiment, the apparatus for finding an image region may encode each word vector by using an LSTM structure to obtain the text features. For an input text to be found $E = \{e_1, e_2, \ldots, e_T\}$, T indicates that the text to be found contains T words, and $e_t$ denotes the t-th word in the text to be found. First, the word vector representation of the text to be found is obtained from the word vector representation of each word, which gives the text vector sequence $\{w_1, w_2, \ldots, w_T\}$. Each word vector may have 300 dimensions. An RNN with an LSTM structure is then used to encode the text to be found, that is, $h_t = \mathrm{LSTM}(w_t, h_{t-1})$.

The number of dimensions of the hidden state in the LSTM may be set to 512. After this processing, the feature representation of the text to be found is obtained, that is, the text feature set $h = \{h_1, h_2, \ldots, h_T\}$. The LSTM processing is specifically as follows:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T \begin{pmatrix} w_t \\ h_{t-1} \end{pmatrix},$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \text{ and}$$

$$h_t = o_t \odot \tanh(c_t)$$

where $w_t$ denotes the t-th word vector in the text vector sequence, $h_{t-1}$ denotes the (t-1)-th text feature in the text feature set, $i_t$ denotes the input gate, $f_t$ denotes the forget gate, $o_t$ denotes the output gate, $h_t$ denotes the hidden state, $\sigma$ is the sigmoid function, $\tanh(\cdot)$ is the hyperbolic tangent function, $g_t$ denotes the memory information, $c_t$ denotes the LSTM parameter (the cell state), $\odot$ denotes element-wise multiplication, and $T$ denotes a transformation or mapping matrix.
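The gate computations described above can be sketched as a single LSTM step in plain Python. This follows the standard LSTM formulation; the function names, the layout of the transformation matrix (four stacked blocks for the input gate, forget gate, output gate, and memory information, with no bias term), and the zero initial state are assumptions made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(w_t, h_prev, c_prev, transform, hidden):
    """One LSTM step.

    transform has 4 * hidden rows and len(w_t) + len(h_prev) columns; its row
    blocks produce the input gate, forget gate, output gate and memory
    information, in that order (an assumed layout).
    """
    joined = list(w_t) + list(h_prev)
    pre = [sum(a * b for a, b in zip(row, joined)) for row in transform]
    i_t = [sigmoid(v) for v in pre[0:hidden]]
    f_t = [sigmoid(v) for v in pre[hidden:2 * hidden]]
    o_t = [sigmoid(v) for v in pre[2 * hidden:3 * hidden]]
    g_t = [math.tanh(v) for v in pre[3 * hidden:4 * hidden]]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def encode(word_vectors, transform, hidden):
    """Encode T word vectors into T text features (the hidden states)."""
    h, c = [0.0] * hidden, [0.0] * hidden
    features = []
    for w_t in word_vectors:
        h, c = lstm_step(w_t, h, c, transform, hidden)
        features.append(h)
    return features


# Tiny example: 1-dimensional word vectors and a 1-dimensional hidden state.
transform = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
features = encode([[1.0], [-1.0]], transform, hidden=1)
print(features)
```

With a hidden size of 512 and 300-dimensional word vectors, `transform` would have 2048 rows and 812 columns under this layout.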

An LSTM stores its inputs over long periods. A special unit called a memory cell behaves like an accumulator or a gated neuron: it has a weighted connection to itself at the next time step, so it copies the real value of its own state and accumulates the external signal. This self-connection, however, is controlled by a multiplicative gate that is learned by another unit and that decides when to clear the content of the memory.

In this embodiment of the present disclosure, a manner of obtaining the text features is provided, namely encoding the word vectors by using an RNN with an LSTM structure. A network with an LSTM structure alleviates the vanishing gradient problem, in which gradients shrink progressively during backpropagation. In language processing tasks specifically, an LSTM is well suited to problems that depend strongly on temporal order, such as machine translation, dialog generation, and encoding and decoding.

With reference to the foregoing description, the model training method of the present disclosure is described below. Referring to FIG. 4, the method is described by using an example in which it is applied to a model training apparatus. The model training apparatus may be deployed on a server. An embodiment of the model training method in the embodiments of the present disclosure includes the following steps.

Step 201. The model training apparatus obtains a set of texts to be trained and a set of image candidate regions to be trained, where the set of texts to be trained includes a first text to be trained and a second text to be trained, the set of image candidate regions to be trained includes a first image candidate region to be trained and a second image candidate region to be trained, the first text to be trained and the first image candidate region to be trained have a matching relationship, the first text to be trained and the second image candidate region to be trained have no matching relationship, the second text to be trained and the second image candidate region to be trained have a matching relationship, and the second text to be trained and the first image candidate region to be trained have no matching relationship.

In this embodiment, the model training apparatus first obtains the set of texts to be trained and the set of image candidate regions to be trained. The set of texts to be trained includes the first text to be trained and the second text to be trained. The set of image candidate regions to be trained includes the first image candidate region to be trained and the second image candidate region to be trained. The first text to be trained and the first image candidate region to be trained, which have a matching relationship, are used as a positive sample, and the second text to be trained and the second image candidate region to be trained, which have a matching relationship, are also used as a positive sample. The first text to be trained and the second image candidate region to be trained, which have no matching relationship, are used as a negative sample, and the second text to be trained and the first image candidate region to be trained, which have no matching relationship, are also used as a negative sample.

It can be understood that the model training apparatus is deployed on a server.

단계 202. 모델 훈련 장치는 제1 훈련될 텍스트, 제2 훈련될 텍스트, 제1 훈련될 이미지 후보 영역 및 제2 훈련될 이미지 후보 영역에 따라 타깃 손실 함수를 결정한다.Step 202. The model training apparatus determines a target loss function according to the first to-be-trained text, the second to-be-trained text, the first to-be-trained image candidate region, and the second to-be-trained image candidate region.

본 실시예에서, 모델 훈련 장치는 타깃 손실 함수를 구축하기 위해 양성 샘플과 음성 샘플에 따라 자연 문장과 이미지 후보 영역 사이의 매칭 관계를 학습한다. 타깃 손실 함수는 주로 후보 이미지 영역과 자연 문장 사이의 유사성을 측정하도록 구성된다.In this embodiment, the model training apparatus learns a matching relationship between natural sentences and image candidate regions according to positive and negative samples to build a target loss function. The target loss function is mainly constructed to measure the similarity between the candidate image region and the natural sentence.

단계 203. 모델 훈련 장치는 이미지 영역 찾기 네트워크 모델을 획득하기 위해 타깃 손실 함수를 사용하여 훈련될 이미지 영역 찾기 네트워크 모델을 훈련하고, 이미지 영역 찾기 네트워크 모델은 텍스트 특징 세트와 향상된 시맨틱 정보에 따라 이미지 후보 영역과 찾아질 텍스트 사이의 매칭 관계를 결정하도록 구성되며, 향상된 시맨틱 정보와 이미지 후보 영역은 대응관계를 갖고, 텍스트 특징 세트와 찾아질 텍스트는 대응관계를 갖는다.Step 203. The model training device trains an image region finding network model to be trained using a target loss function to obtain an image region finding network model, and the image region finding network model is an image candidate according to the text feature set and the enhanced semantic information. and determine a matching relationship between the region and the text to be found, wherein the enhanced semantic information and the image candidate region have a correspondence, and the text feature set and the text to be found have a correspondence.

본 실시예에서, 모델 훈련 장치는 이미지 영역 찾기 네트워크 모델을 추가로 획득하기 위해 훈련될 이미지 영역의 위치 찾기 네트워크 모델을 훈련시키기 위해 구축된 타깃 손실 함수를 사용한다. 이미지 후보 영역과 찾아질 텍스트 사이의 매칭 정도는 이미지 영역 찾기 네트워크 모델을 사용하여 예측될 수 있다. 매칭 정도가 높은 경우, 표현 연관 정도가 높다.In this embodiment, the model training apparatus uses the target loss function constructed to train the localization network model of the image region to be trained to further obtain the image region finding network model. The degree of matching between the image candidate region and the text to be found can be predicted using an image region finding network model. When the degree of matching is high, the degree of expression association is high.

In the embodiments of the present disclosure, a model training method is provided. The method includes: first obtaining a to-be-trained text set and a to-be-trained image candidate region set, where the to-be-trained text set includes a first to-be-trained text and a second to-be-trained text, and the to-be-trained image candidate region set includes a first to-be-trained image candidate region and a second to-be-trained image candidate region; then determining a target loss function according to the first to-be-trained text, the second to-be-trained text, the first to-be-trained image candidate region, and the second to-be-trained image candidate region; and finally training a to-be-trained image region finding network model by using the target loss function to obtain an image region finding network model. In this manner, an image region finding network model configured to determine the matching relationship between an image candidate region and a text can be obtained through training. The target loss function used measures the similarity between an image candidate region and a text, so that the matching relationship between texts and image candidate regions is learned, which improves the feasibility and operability of the solution.

Optionally, based on the foregoing embodiment corresponding to FIG. 4, in a first optional embodiment of the model training method provided in the embodiments of the present disclosure, the determining, by the model training apparatus, of a target loss function according to the first to-be-trained text, the second to-be-trained text, the first to-be-trained image candidate region, and the second to-be-trained image candidate region may include:

determining, by the model training apparatus, the target loss function in the following manner:

$$L(n_i^+, h_i^+, n_j^-, h_k^-) = \lambda_1 \max\left(0,\; u_1 + d(n_i^+, h_i^+) - d(n_i^+, h_k^-)\right) + \lambda_2 \max\left(0,\; u_2 + d(n_i^+, h_i^+) - d(n_j^-, h_i^+)\right)$$

where $L$ represents the target loss function, $n_i^+$ represents the first to-be-trained image candidate region, $h_i^+$ represents the first to-be-trained text, $n_j^-$ represents the second to-be-trained image candidate region, $h_k^-$ represents the second to-be-trained text, $d(\cdot,\cdot)$ represents a to-be-trained data pair, $\max(\cdot)$ represents taking the maximum value, $\lambda_1$ represents a first parameter control weight, $\lambda_2$ represents a second parameter control weight, $u_1$ represents a first preset threshold, and $u_2$ represents a second preset threshold.

In this embodiment, the target loss function constructed by the model training apparatus is described. The target loss function constructed based on the positive samples and the negative samples is expressed as:

$$L(n_i^+, h_i^+, n_j^-, h_k^-) = \lambda_1 \max\left(0,\; u_1 + d(n_i^+, h_i^+) - d(n_i^+, h_k^-)\right) + \lambda_2 \max\left(0,\; u_2 + d(n_i^+, h_i^+) - d(n_j^-, h_i^+)\right)$$

where $(n_i^+, h_i^+)$ represents a positive sample, that is, a pair of an image candidate region and a natural-language text that have a semantic relationship, and $(n_i^+, h_k^-)$ and $(n_j^-, h_i^+)$ represent negative samples, that is, pairs of an image candidate region and a natural-language text that are unrelated.

For the positive sample $(n_i^+, h_i^+)$, a corresponding negative sample $(n_i^+, h_k^-)$ is taken for $n_i^+$. Learning this matching function makes the matching relationship of the positive sample higher than that of the negative sample.

For the positive sample $(n_i^+, h_i^+)$, a corresponding negative sample $(n_j^-, h_i^+)$ is taken for $h_i^+$. Learning this matching function likewise makes the matching relationship of the positive sample higher than that of the negative sample.

Next, in the embodiments of the present disclosure, a manner of determining the target loss function according to the first to-be-trained text, the second to-be-trained text, the first to-be-trained image candidate region, and the second to-be-trained image candidate region is provided. In this manner, the defined target loss function describes the matching relationship between images and natural language in two different directions. One direction associates an image candidate region with natural language, and the other direction associates natural language with an image candidate region. The main purpose of designing the target loss function in this way is to make the similarity of a semantically related pair of an image candidate region and natural language higher than the similarity of a semantically unrelated pair, thereby improving the accuracy of model training.
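The two-direction design can be illustrated with a short sketch. It assumes a margin-based (hinge) form and assumes that the pair score is a distance, so that a smaller value means a better match. The Euclidean distance, the default weights and thresholds, and the names `distance` and `target_loss` are all illustrative assumptions, not values taken from the disclosure.

```python
import math

def distance(region_vec, text_vec):
    """Hypothetical pair score: Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(region_vec, text_vec)))

def target_loss(pos_region, pos_text, neg_region, neg_text,
                lambda1=1.0, lambda2=1.0, u1=0.1, u2=0.1):
    """Two-direction margin loss over one positive pair and two negative pairs."""
    d_pos = distance(pos_region, pos_text)
    # Direction 1: keep the positive region, swap in the non-matching text.
    region_to_text = max(0.0, u1 + d_pos - distance(pos_region, neg_text))
    # Direction 2: keep the positive text, swap in the non-matching region.
    text_to_region = max(0.0, u2 + d_pos - distance(neg_region, pos_text))
    return lambda1 * region_to_text + lambda2 * text_to_region
```

Under these assumptions the loss is zero once the matched pair is closer than each unmatched pair by at least the corresponding threshold, and positive otherwise, which is the ordering the paragraph above describes.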

An apparatus for finding an image region in the present disclosure is described in detail below. Referring to FIG. 5, FIG. 5 is a schematic diagram of an embodiment of an apparatus for finding an image region according to an embodiment of the present disclosure. The apparatus 30 for finding an image region includes:

an obtaining module 301, configured to obtain an image candidate region set from an image to be found, wherein the image candidate region set includes N image candidate regions, and N is an integer greater than or equal to 1;

a generating module 302, configured to generate a region semantic information set according to the image candidate region set obtained by the obtaining module 301 (that is, the image candidate region set of the image to be found), wherein the region semantic information set includes N pieces of region semantic information, and each piece of region semantic information corresponds to one image candidate region (that is, each piece of region semantic information in the region semantic information set corresponds to one image candidate region in the image candidate region set),

wherein the obtaining module 301 is further configured to obtain, by using a GCN, an enhanced semantic information set corresponding to the region semantic information set generated by the generating module 302, the enhanced semantic information set includes N pieces of enhanced semantic information, each piece of enhanced semantic information corresponds to one piece of region semantic information (that is, each piece of enhanced semantic information in the enhanced semantic information set corresponds to one piece of region semantic information in the region semantic information set), and the GCN is configured to establish association relationships between the pieces of region semantic information,

the obtaining module 301 is further configured to obtain a text feature set corresponding to a text to be found, the text to be found includes T words, the text feature set includes T word features, each word corresponds to one word feature (that is, each word in the text to be found corresponds to one word feature in the text feature set), and T is an integer greater than or equal to 1, and

the obtaining module 301 is further configured to obtain, by using an image region finding network model, a matching degree between the text feature set (that is, the text feature set corresponding to the text to be found) and each piece of enhanced semantic information, the image region finding network model being configured to determine a matching relationship between an image candidate region and the text to be found; and

a determining module 303, configured to determine a target image candidate region from the image candidate region set according to the matching degree, obtained by the obtaining module 301, between the text feature set and each piece of enhanced semantic information.

In this embodiment, the obtaining module 301 obtains an image candidate region set from an image to be found, where the image candidate region set includes N image candidate regions and N is an integer greater than or equal to 1. The generating module 302 generates a region semantic information set according to the image candidate region set obtained by the obtaining module 301, where the region semantic information set includes N pieces of region semantic information and each piece corresponds to one image candidate region. The obtaining module 301 obtains, by using a GCN, an enhanced semantic information set corresponding to the region semantic information set generated by the generating module 302, where the enhanced semantic information set includes N pieces of enhanced semantic information, each piece corresponds to one piece of region semantic information, and the GCN is configured to establish association relationships between the pieces of region semantic information. The obtaining module 301 obtains a text feature set corresponding to a text to be found, where the text to be found includes T words, the text feature set includes T word features, each word corresponds to one word feature, and T is an integer greater than or equal to 1. The obtaining module 301 obtains, by using an image region finding network model, the matching degree between the text feature set and each piece of enhanced semantic information, where the image region finding network model is configured to determine the matching relationship between an image candidate region and the text to be found. The determining module 303 determines a target image candidate region from the image candidate region set according to the matching degree, obtained by the obtaining module 301, between the text feature set and each piece of enhanced semantic information.

In the embodiments of the present disclosure, an apparatus for finding an image region is provided. The apparatus first obtains an image candidate region set from an image to be found, where the image candidate region set includes N image candidate regions. The apparatus then generates a region semantic information set according to the image candidate region set, where each piece of region semantic information corresponds to one image candidate region. The apparatus then obtains, by using a GCN, an enhanced semantic information set corresponding to the region semantic information set, where each piece of enhanced semantic information corresponds to one piece of region semantic information, and the GCN is configured to establish association relationships between the pieces of region semantic information. In addition, the apparatus obtains a text feature set corresponding to a text to be found. Next, the apparatus obtains, by using an image region finding network model, the matching degree between the text feature set and each piece of enhanced semantic information. Finally, the apparatus determines a target image candidate region from the image candidate region set according to these matching degrees. In this manner, the semantic representation between image candidate regions can be effectively enhanced by using the GCN, and the spatial relationships between image candidate regions are taken into account, which improves the accuracy of finding the image region and thereby improves the image understanding capability.
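The final selection step, choosing the candidate region whose enhanced semantic information best matches the text, can be sketched as below. The trained image region finding network model is not reproduced here. As a stand-in, the sketch averages the T word features into one vector and scores each region by cosine similarity. Both choices, and the names `mean_pool`, `cosine`, and `find_target_region`, are assumptions made for illustration only.

```python
import math

def mean_pool(word_features):
    """Hypothetical pooling: average the T word features into one vector."""
    count = len(word_features)
    return [sum(column) / count for column in zip(*word_features)]

def cosine(u, v):
    """Cosine similarity, used here as a stand-in matching degree."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def find_target_region(word_features, enhanced_features):
    """Return the index of the best-matching region and all matching degrees."""
    text_vector = mean_pool(word_features)
    scores = [cosine(text_vector, feature) for feature in enhanced_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

The returned index identifies the target image candidate region within the image candidate region set. The scores list holds one matching degree per candidate.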

Optionally, based on the foregoing embodiment corresponding to FIG. 5, in another embodiment of the apparatus 30 for finding an image region provided in the embodiments of the present disclosure,

the generating module 302 is specifically configured to obtain, by using a CNN, the region semantic information corresponding to each image candidate region, wherein the image candidate region includes region information, and the region information includes position information of the image candidate region in the image to be found and size information of the image candidate region, and

to generate, when the region semantic information corresponding to the N image candidate regions has been obtained, the region semantic information set according to the N pieces of region semantic information.

Next, in the embodiments of the present disclosure, a manner of generating the region semantic information set is provided. The region semantic information corresponding to each image candidate region is obtained by using a CNN. The image candidate region includes region information, and the region information includes position information of the image candidate region in the image to be found and size information of the image candidate region. When the region semantic information corresponding to the N image candidate regions has been obtained, the region semantic information set is generated according to the N pieces of region semantic information. In this manner, the region semantic information of each image candidate region can be extracted by using the CNN. A CNN is a feed-forward neural network whose artificial neurons respond to surrounding units within a partial coverage area, so it performs well in large-scale image processing. In this way, the accuracy of information extraction is improved.

The obtaining module 301 is specifically configured to: obtain first region semantic information and second region semantic information from the region semantic information set, wherein the first region semantic information is any one piece of region semantic information in the region semantic information set, and the second region semantic information is any one piece of region semantic information in the region semantic information set;

obtain the strength of a connecting edge between the first region semantic information and the second region semantic information;

normalize the strength of the connecting edge between the first region semantic information and the second region semantic information to obtain a normalized strength;

determine a target connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set; and

determine, by using the GCN, the enhanced semantic information set corresponding to the target connection matrix.

Next, in the embodiments of the present disclosure, a manner of obtaining the enhanced semantic information set by using the GCN is provided. First, first region semantic information and second region semantic information are obtained from the region semantic information set. Then, the strength of the connecting edge between the first region semantic information and the second region semantic information is obtained. Next, this strength is normalized to obtain a normalized strength. A target connection matrix is then determined according to the normalized strengths between the pieces of region semantic information in the region semantic information set. Finally, the enhanced semantic information set corresponding to the target connection matrix is determined by using the GCN. In this manner, the semantic relationships between image candidate regions are established by using the GCN, so that spatial information and semantic relationships are fully considered, which improves image-based finding performance.
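The edge-strength and normalization steps can be sketched as follows. The passage above does not say how the strength of a connecting edge is computed or which normalization is used, so the dot product in `edge_strength` and the row-wise softmax are assumptions chosen only to make the sketch concrete.

```python
import math

def edge_strength(first, second):
    """Hypothetical edge strength: dot product of two region semantic vectors."""
    return sum(x * y for x, y in zip(first, second))

def normalized_connection_matrix(region_semantics):
    """Build an N x N matrix of normalized edge strengths (row-wise softmax)."""
    matrix = []
    for first in region_semantics:
        strengths = [edge_strength(first, second) for second in region_semantics]
        peak = max(strengths)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in strengths]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix
```

Each row then sums to 1, so every entry expresses the relative strength of one edge compared with the other edges leaving the same node.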

The obtaining module 301 is specifically configured to: generate a connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set; and

generate the target connection matrix according to the connection matrix and an identity matrix.

Next, in the embodiments of the present disclosure, a manner of determining the target connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set is provided. That is, a connection matrix is first generated according to the normalized strengths between the pieces of region semantic information in the region semantic information set. The target connection matrix is then generated according to the connection matrix and an identity matrix. In this manner, normalization converts absolute values in a physical system into relative values, which simplifies calculation and reduces the magnitude of the values. In addition, to further enhance the information, the identity matrix is added to the corresponding connection matrix to form the target connection matrix.
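Adding the identity matrix to the connection matrix amounts to adding 1 to each diagonal entry, as in this minimal sketch. The function name `target_connection_matrix` and the nested-list representation are illustrative assumptions.

```python
def target_connection_matrix(connection_matrix):
    """Return the connection matrix plus the identity matrix of the same size."""
    size = len(connection_matrix)
    return [
        [connection_matrix[i][j] + (1.0 if i == j else 0.0) for j in range(size)]
        for i in range(size)
    ]
```

The added diagonal term gives each node a connection to itself, so its own region semantic information is carried into the aggregated result alongside that of its neighbors.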

The obtaining module 301 is specifically configured to calculate the enhanced semantic information set in the following manner:

$$n_i^k = \sum_{j \in \mathrm{neighboring}(i)} \left(w_j^k\, n_j^{k-1} + b_j^k\right) E_{ij}$$

where $n_i^k$ represents the $i$-th piece of enhanced semantic information at the $k$-th layer of the GCN, $n_j^{k-1}$ represents the $j$-th piece of enhanced semantic information at the $(k-1)$-th layer of the GCN, $w_j^k$ represents a first network parameter of the $k$-th layer of the GCN, $b_j^k$ represents a second network parameter of the $k$-th layer of the GCN, $j \in \mathrm{neighboring}(i)$ indicates that the $j$-th node is a neighboring node of the $i$-th node, and $E_{ij}$ represents an element of the target connection matrix.
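A single GCN layer of the kind described, in which each node aggregates transformed features from its neighboring nodes weighted by the elements of the target connection matrix, might look like the sketch below. It assumes that one weight matrix and one bias vector are shared by all nodes of a layer and that a zero matrix entry means "not a neighbor". Neither assumption is stated in the text, and `gcn_layer` is a hypothetical name.

```python
def gcn_layer(features, target_matrix, weight, bias):
    """Compute enhanced features: node i sums E[i][j] * (W @ n_j + b) over neighbors j."""
    def affine(vec):
        # Apply the layer's first (weight) and second (bias) network parameters.
        return [sum(w * x for w, x in zip(row, vec)) + b
                for row, b in zip(weight, bias)]

    transformed = [affine(vec) for vec in features]
    out_dim = len(bias)
    enhanced = []
    for i in range(len(features)):
        node = [0.0] * out_dim
        for j, neighbor in enumerate(transformed):
            e_ij = target_matrix[i][j]
            if e_ij == 0.0:
                continue  # assumed: a zero entry means j is not a neighbor of i
            for d in range(out_dim):
                node[d] += e_ij * neighbor[d]
        enhanced.append(node)
    return enhanced
```

Stacking several such layers, each fed with the output of the previous one, lets information from more distant candidate regions reach each node.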

The obtaining module 301 is specifically configured to: obtain the text to be found;

obtain a text vector sequence according to the text to be found, wherein the text vector sequence includes T word vectors, and each word vector corresponds to one word;

encode each word vector in the text vector sequence to obtain a text feature; and

generate, when the text features corresponding to the T word vectors have been obtained, the text feature set according to the T text features.

Next, in the embodiments of the present disclosure, a method for obtaining the text feature set is provided. That is, the text to be found is obtained first. A text vector sequence is then obtained according to the text to be found. The text vector sequence includes T word vectors, and each word vector corresponds to one word. Next, each word vector in the text vector sequence is encoded to obtain a text feature. When the text features corresponding to the T word vectors have been obtained, the text feature set is generated according to the T text features. In this manner, the text to be found is represented in feature form, which facilitates subsequent model prediction and improves the feasibility and operability of the solution.

The obtaining module 301 is specifically configured to obtain the text feature in the following manner:

$$h_t = \mathrm{LSTM}(w_t, h_{t-1})$$

where $h_t$ represents the $t$-th text feature in the text feature set, $\mathrm{LSTM}(\cdot)$ represents encoding performed by using an LSTM network, $w_t$ represents the $t$-th word vector in the text vector sequence, and $h_{t-1}$ represents the $(t-1)$-th text feature in the text feature set.

Next, in the embodiments of the present disclosure, a manner of obtaining the text feature is provided. That is, an RNN with an LSTM structure is used to encode the word vectors. With a network of the LSTM structure, the vanishing gradient problem caused by gradual attenuation during backpropagation can be addressed. Specifically, in language processing tasks, an LSTM is suitable for problems that are highly related to time sequences, for example, machine translation, dialog generation, and encoding and decoding.
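The recurrent encoding of word vectors into text features can be sketched with a small, untrained LSTM cell written in plain Python. The class name `TinyLSTM`, the random initialization, and the layer sizes are illustrative assumptions. The cell state `c` is part of the standard LSTM formulation and is carried alongside each text feature `h`.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyLSTM:
    """Minimal LSTM cell with input (i), forget (f), output (o) and candidate (g) gates."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        cols = input_size + hidden_size
        self.hidden_size = hidden_size
        # One weight matrix and one bias vector per gate (untrained, random).
        self.weights = {gate: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                               for _ in range(hidden_size)] for gate in "ifog"}
        self.biases = {gate: [0.0] * hidden_size for gate in "ifog"}

    def _gate(self, name, z, activation):
        return [activation(sum(w * x for w, x in zip(row, z)) + b)
                for row, b in zip(self.weights[name], self.biases[name])]

    def step(self, w_t, h_prev, c_prev):
        """One time step: previous feature and word vector in, new feature out."""
        z = list(w_t) + list(h_prev)
        i = self._gate("i", z, sigmoid)
        f = self._gate("f", z, sigmoid)
        o = self._gate("o", z, sigmoid)
        g = self._gate("g", z, math.tanh)
        c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
        h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
        return h_t, c_t

    def encode(self, word_vectors):
        """Encode T word vectors into T text features, one per word."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        features = []
        for w_t in word_vectors:
            h, c = self.step(w_t, h, c)
            features.append(h)
        return features
```

Calling `TinyLSTM(3, 4).encode(vectors)` on a list of T three-dimensional word vectors returns T four-dimensional text features, which together form the text feature set. In practice a trained LSTM from a deep learning framework would take the place of this toy cell.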

The model training apparatus in the present disclosure is described in detail below. Referring to FIG. 6, FIG. 6 is a schematic diagram of an embodiment of a model training apparatus according to an embodiment of the present disclosure. The model training apparatus 40 includes:

an obtaining module 401, configured to obtain a to-be-trained text set and a to-be-trained image candidate region set, wherein the to-be-trained text set includes a first to-be-trained text and a second to-be-trained text, the to-be-trained image candidate region set includes a first to-be-trained image candidate region and a second to-be-trained image candidate region, the first to-be-trained text has a matching relationship with the first to-be-trained image candidate region and has no matching relationship with the second to-be-trained image candidate region, and the second to-be-trained text has a matching relationship with the second to-be-trained image candidate region and has no matching relationship with the first to-be-trained image candidate region;

a determining module 402, configured to determine a target loss function according to the first to-be-trained text, the second to-be-trained text, the first to-be-trained image candidate region, and the second to-be-trained image candidate region obtained by the obtaining module 401; and

a training module 403, configured to train a to-be-trained image region finding network model by using the target loss function determined by the determining module 402 to obtain an image region finding network model, wherein the image region finding network model is configured to determine a matching relationship between an image candidate region and a text to be found according to a text feature set and enhanced semantic information, the enhanced semantic information corresponds to the image candidate region, and the text feature set corresponds to the text to be found.

In this embodiment, the obtaining module 401 obtains a to-be-trained text set and a to-be-trained image candidate region set, where the to-be-trained text set includes a first to-be-trained text and a second to-be-trained text, and the to-be-trained image candidate region set includes a first to-be-trained image candidate region and a second to-be-trained image candidate region. The first to-be-trained text has a matching relationship with the first to-be-trained image candidate region and has no matching relationship with the second to-be-trained image candidate region. The second to-be-trained text has a matching relationship with the second to-be-trained image candidate region and has no matching relationship with the first to-be-trained image candidate region. The determining module 402 determines a target loss function according to the first to-be-trained text, the second to-be-trained text, the first to-be-trained image candidate region, and the second to-be-trained image candidate region obtained by the obtaining module 401. The training module 403 trains a to-be-trained image region finding network model by using the target loss function determined by the determining module 402 to obtain an image region finding network model. The image region finding network model is configured to determine a matching relationship between an image candidate region and a text to be found according to a text feature set and enhanced semantic information, where the enhanced semantic information corresponds to the image candidate region, and the text feature set corresponds to the text to be found.

In this embodiment of the present disclosure, a model training apparatus is provided. The model training apparatus first obtains a to-be-trained text set and a to-be-trained image candidate region set, where the to-be-trained text set includes a first to-be-trained text and a second to-be-trained text, and the to-be-trained image candidate region set includes a first to-be-trained image candidate region and a second to-be-trained image candidate region. It then determines a target loss function according to the first to-be-trained text, the second to-be-trained text, the first to-be-trained image candidate region, and the second to-be-trained image candidate region. Finally, it trains a to-be-trained image region finding network model by using the target loss function, to obtain an image region finding network model. In this manner, an image region finding network model configured to determine the matching relationship between an image candidate region and a text can be obtained through training. The target function that is used measures the similarity between an image candidate region and a text, so that the matching relationship between the text and the image candidate region is learned, which improves the feasibility and operability of the solution.

Optionally, based on the foregoing embodiment corresponding to FIG. 6, in another embodiment of the model training apparatus 40 provided in the embodiments of the present disclosure,

the determining module 402 is specifically configured to determine the target loss function in the following manner,

where the terms of the formula denote, in order:

the target loss function;

the first to-be-trained image candidate region;

the first to-be-trained text;

the second to-be-trained image candidate region;

the second to-be-trained text;

the to-be-trained data pair, with max() denoting taking the maximum value;

the first parameter control weight;

the second parameter control weight;

the first preset threshold; and

the second preset threshold.

Next, in this embodiment of the present disclosure, a manner of determining the target loss function according to the first to-be-trained text, the second to-be-trained text, the first to-be-trained image candidate region, and the second to-be-trained image candidate region is provided. The target loss function defined in this manner describes the matching relationship between an image and natural language in two directions: one direction associates an image candidate region with natural language, and the other associates natural language with an image candidate region. The main purpose of designing the target loss function in this way is to make the similarity of a semantically related pair of an image candidate region and natural language higher than the similarity of a semantically unrelated pair, which improves the accuracy of model training.
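The two-direction design described above can be illustrated with a minimal sketch of a max-margin ranking loss. The formula itself is not reproduced in this text, so the exact form below is an assumption: the function name `bidirectional_ranking_loss`, the dot-product `score`, the default weights and thresholds, and the sign convention (a higher score means a better match) are all hypothetical choices made for illustration only.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Hypothetical similarity score: plain dot product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def bidirectional_ranking_loss(
    region_pos: Vector,    # first to-be-trained image candidate region
    text_pos: Vector,      # first to-be-trained text (matches region_pos)
    region_neg: Vector,    # second to-be-trained image candidate region
    text_neg: Vector,      # second to-be-trained text (matches region_neg)
    lambda1: float = 1.0,  # first parameter control weight (assumed value)
    lambda2: float = 1.0,  # second parameter control weight (assumed value)
    u1: float = 0.1,       # first preset threshold, used as a margin (assumed value)
    u2: float = 0.1,       # second preset threshold, used as a margin (assumed value)
    score: Callable[[Vector, Vector], float] = dot,
) -> float:
    """Assumed hinge form: matched pairs must outscore unmatched pairs by a margin."""
    matched = score(region_pos, text_pos)
    # Direction 1: for a fixed region, the matched text should beat the unmatched text.
    region_to_text = max(0.0, u1 - matched + score(region_pos, text_neg))
    # Direction 2: for a fixed text, the matched region should beat the unmatched region.
    text_to_region = max(0.0, u2 - matched + score(region_neg, text_pos))
    return lambda1 * region_to_text + lambda2 * text_to_region
```

With this sign convention the loss is zero once each matched pair outscores both of its unmatched counterparts by at least the corresponding threshold. If the pair function is a distance rather than a similarity, the signs inside each max() term flip.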

An embodiment of the present disclosure further provides another apparatus for finding an image region, as shown in FIG. 7. For ease of description, only the parts related to this embodiment of the present disclosure are shown. For specific technical details that are not disclosed, refer to the method part of the embodiments of the present disclosure. The apparatus may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point of sales (POS) terminal, and an on-board computer. A mobile phone is used as an example of the terminal device.

FIG. 7 is a block diagram of a partial structure of a mobile phone related to the terminal device according to an embodiment of the present disclosure. Referring to FIG. 7, the mobile phone includes components such as a radio frequency (RF) circuit 510, a memory 520, an input unit 530, a display unit 540, a sensor 550, an audio circuit 560, a wireless fidelity (Wi-Fi) module 570, a processor 580, and a power supply 590. A person skilled in the art will understand that the structure shown in FIG. 7 does not limit the mobile phone: the mobile phone may include more or fewer components than those shown in the figure, some components may be combined, or a different component arrangement may be used.

The components of the mobile phone are described in detail below with reference to FIG. 7.

The RF circuit 510 may be configured to receive and send signals during an information receiving and sending process or during a call. Specifically, the RF circuit receives downlink information from a base station, delivers the downlink information to the processor 580 for processing, and sends related uplink data to the base station. Generally, the RF circuit 510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), and a duplexer. In addition, the RF circuit 510 may communicate with a network and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, and Short Messaging Service (SMS).

The memory 520 may be configured to store software programs and modules. The processor 580 runs the software programs and modules stored in the memory 520 to implement the various functional applications and data processing of the mobile phone. The memory 520 may mainly include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image display function), and the like. The data storage area may store data created through use of the mobile phone (such as audio data and an address book). In addition, the memory 520 may include a high-speed random access memory (RAM), and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory, or another solid-state storage device.

The input unit 530 may be configured to receive input digit or character information, and to generate keyboard signal input related to user settings and function control of the mobile phone. Specifically, the input unit 530 may include a touch panel 531 and another input device 532. The touch panel 531, which may also be referred to as a touch screen, may collect a touch operation of a user on or near the touch panel (for example, an operation performed by the user on or near the touch panel 531 with any suitable object or accessory, such as a finger or a stylus), and drive a corresponding connection apparatus according to a preset program. Optionally, the touch panel 531 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the touch position of the user, detects the signal generated by the touch operation, and delivers the signal to the touch controller. The touch controller receives touch information from the touch detection apparatus, converts the touch information into touch point coordinates, and sends the touch point coordinates to the processor 580. The touch controller can also receive and execute commands sent by the processor 580. The touch panel 531 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave types. In addition to the touch panel 531, the input unit 530 may further include the other input device 532. Specifically, the other input device 532 may include, but is not limited to, one or more of a physical keyboard, a function key (for example, a volume control key or a switch key), a trackball, a mouse, and a joystick.

The display unit 540 may be configured to display information entered by the user, information provided to the user, and the various menus of the mobile phone. The display unit 540 may include a display panel 541. Optionally, the display panel 541 may be configured by using a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 531 may cover the display panel 541. After detecting a touch operation on or near the touch panel 531, the touch panel delivers the touch operation to the processor 580 to determine the type of the touch event. The processor 580 then provides a corresponding visual output on the display panel 541 according to the type of the touch event. Although in FIG. 7 the touch panel 531 and the display panel 541 are used as two separate parts to implement the input and output functions of the mobile phone, in some embodiments the touch panel 531 and the display panel 541 may be integrated to implement the input and output functions of the mobile phone.

The mobile phone may further include at least one sensor 550, such as an optical sensor, a motion sensor, and other sensors. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor. The ambient light sensor may adjust the luminance of the display panel 541 according to the brightness of the ambient light. The proximity sensor may switch off the display panel 541 and/or the backlight when the mobile phone is moved to the ear. As one type of motion sensor, an acceleration sensor can detect the magnitude of acceleration in various directions (generally on three axes), and can detect the magnitude and direction of gravity when stationary. It can be applied to applications that recognize the posture of the mobile phone (for example, switching between landscape and portrait orientation, related games, and magnetometer posture calibration), to functions related to vibration recognition (such as a pedometer and knock detection), and the like. Other sensors that may be configured in the mobile phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described further here.

The audio circuit 560, a speaker 561, and a microphone 562 may provide an audio interface between the user and the mobile phone. The audio circuit 560 may convert received audio data into an electrical signal and transmit the electrical signal to the speaker 561, and the speaker 561 converts the electrical signal into a sound signal for output. In the other direction, the microphone 562 converts a collected sound signal into an electrical signal. The audio circuit 560 receives the electrical signal, converts it into audio data, and outputs the audio data to the processor 580 for processing. The processor then sends the audio data to, for example, another mobile phone by using the RF circuit 510, or outputs the audio data to the memory 520 for further processing.

Wi-Fi is a short-range wireless transmission technology. By using the Wi-Fi module 570, the mobile phone can help the user receive and send emails, browse web pages, access streaming media, and so on, which provides the user with wireless broadband Internet access. Although FIG. 7 shows the Wi-Fi module 570, it can be understood that the Wi-Fi module is not an essential component of the mobile phone and may be omitted as required, provided that the essence of the present disclosure is not changed.

The processor 580 is the control center of the mobile phone and is connected to the various parts of the entire mobile phone through various interfaces and lines. By running or executing the software programs and/or modules stored in the memory 520 and invoking the data stored in the memory 520, the processor executes the various functions of the mobile phone and performs data processing, thereby monitoring the mobile phone as a whole. Optionally, the processor 580 may include one or more processing units. Optionally, the processor 580 may integrate an application processor and a modem processor. The application processor mainly handles the operating system, the user interface, application programs, and the like. The modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 580.

The mobile phone further includes the power supply 590 (for example, a battery) for supplying power to the components. Optionally, the power supply may be logically connected to the processor 580 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.

Although not shown in the figure, the mobile phone may further include a camera, a Bluetooth module, and the like, which are not described further here.

In this embodiment of the present disclosure, the processor 580 included in the terminal device further has the following functions:

obtaining an image candidate region set from an image to be found, the image candidate region set including N image candidate regions, N being an integer greater than or equal to 1;

generating a region semantic information set according to the image candidate region set (that is, the image candidate region set of the image to be found), the region semantic information set including N pieces of region semantic information, each piece of region semantic information corresponding to one image candidate region in the image candidate region set;

obtaining, by using a GCN, an enhanced semantic information set corresponding to the region semantic information set, the enhanced semantic information set including N pieces of enhanced semantic information, each piece of enhanced semantic information corresponding to one piece of region semantic information in the region semantic information set, and the GCN being configured to establish association relationships between the pieces of region semantic information;

obtaining a text feature set corresponding to a text to be found, the text to be found including T words, the text feature set including T word features, each word in the text to be found corresponding to one word feature in the text feature set, and T being greater than or equal to 1;

obtaining, by using an image region finding network model, a matching degree between the text feature set and each piece of enhanced semantic information, the image region finding network model being configured to determine a matching relationship between an image candidate region and the text to be found; and

determining a target image candidate region from the image candidate region set according to the matching degree between the text feature set and each piece of enhanced semantic information.
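The last two functions (obtaining a matching degree for every candidate and then choosing the target region) can be sketched as follows. This is only an illustration: the trained image region finding network model is replaced here by an assumed stand-in, namely mean pooling of the T word features followed by cosine similarity, and the names `mean_pool`, `cosine`, and `find_target_region` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean_pool(features: Sequence[Vector]) -> List[float]:
    """Average the T word features into one text vector (assumed pooling)."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, used here as a stand-in for the learned matching degree."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def find_target_region(
    enhanced: Sequence[Vector],       # N pieces of enhanced semantic information
    text_features: Sequence[Vector],  # T word features of the text to be found
) -> Tuple[int, List[float]]:
    """Return the index of the best-matching candidate and all matching degrees."""
    text_vector = mean_pool(text_features)
    degrees = [cosine(region, text_vector) for region in enhanced]
    best_index = max(range(len(degrees)), key=lambda i: degrees[i])
    return best_index, degrees
```

The returned index refers back to the image candidate region set, because each piece of enhanced semantic information corresponds to exactly one candidate region.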

Optionally, the processor 580 is specifically configured to perform the following steps:

obtaining, by using a CNN, the region semantic information corresponding to each image candidate region, the image candidate region including region information, and the region information including position information of the image candidate region in the image to be found and size information of the image candidate region; and

generating the region semantic information set according to the N pieces of region semantic information when the region semantic information corresponding to the N image candidate regions has been obtained.

Optionally, the processor 580 is specifically configured to perform the following steps:

obtaining first region semantic information and second region semantic information from the region semantic information set, the first region semantic information being any piece of region semantic information in the region semantic information set, and the second region semantic information being any piece of region semantic information in the region semantic information set;

obtaining the strength of the connecting edge between the first region semantic information and the second region semantic information;

normalizing the strength of the connecting edge between the first region semantic information and the second region semantic information, to obtain a normalized strength;

determining a target connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set; and

determining, by using the GCN, the enhanced semantic information set corresponding to the target connection matrix.

Optionally, the processor 580 is specifically configured to perform the following steps:

generating a connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set; and

generating the target connection matrix according to the connection matrix and an identity matrix.

Optionally, the processor 580 is specifically configured to calculate the enhanced semantic information set in the following manner,

where the terms of the formula denote, in order:

the first network parameter of the k-th layer of the GCN;

the second network parameter of the k-th layer of the GCN;

that the j-th node is a neighbor node of the i-th node; and

an element of the target connection matrix.
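The steps above (edge strength, normalization, connection matrix plus identity matrix, then a GCN layer) can be sketched in a few lines. Several details are assumptions rather than the method of the disclosure: the edge strength is taken to be a dot product, the normalization is a row-wise softmax, every node is treated as a neighbor of every other node, and the layer update sums `target[i][j] * (weight @ node_j + bias)` over j, with `weight` and `bias` standing in for the first and second network parameters. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def edge_strengths(nodes: Matrix) -> Matrix:
    """Assumed edge strength: dot product between every pair of region vectors."""
    return [[sum(a * b for a, b in zip(ni, nj)) for nj in nodes] for ni in nodes]


def normalize_rows(strengths: Matrix) -> Matrix:
    """Assumed normalization: softmax over each row, so each row sums to 1."""
    normalized = []
    for row in strengths:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        normalized.append([e / total for e in exps])
    return normalized


def target_connection_matrix(nodes: Matrix) -> Matrix:
    """Connection matrix of normalized strengths plus the identity matrix."""
    connection = normalize_rows(edge_strengths(nodes))
    size = len(nodes)
    return [
        [connection[i][j] + (1.0 if i == j else 0.0) for j in range(size)]
        for i in range(size)
    ]


def gcn_layer(nodes: Matrix, target: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """One layer: node i aggregates the transformed nodes j weighted by target[i][j]."""
    transformed = [
        [sum(w * x for w, x in zip(w_row, node)) + b for w_row, b in zip(weight, bias)]
        for node in nodes
    ]
    size, dim = len(nodes), len(bias)
    return [
        [sum(target[i][j] * transformed[j][d] for j in range(size)) for d in range(dim)]
        for i in range(size)
    ]
```

Adding the identity matrix keeps each region's own semantic information in its updated representation, while the normalized strengths control how much each of the other regions contributes.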

Optionally, the processor 580 is specifically configured to perform the following steps:

obtaining the text to be found;

obtaining a text vector sequence according to the text to be found, the text vector sequence including T word vectors, each word vector corresponding to one word;

encoding each word vector in the text vector sequence, to obtain a text feature; and

generating the text feature set according to the T text features when the text features corresponding to the T word vectors have been obtained.

Optionally, the processor 580 is specifically configured to obtain the text feature in the following manner,

where the terms of the formula denote, in order:

the t-th text feature in the text feature set;

encoding performed by using the LSTM network;

the t-th word vector in the text vector sequence; and

the (t-1)-th text feature in the text feature set.
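The recurrence described by these definitions, in which the t-th text feature is obtained by encoding the t-th word vector together with the (t-1)-th text feature, can be sketched with a small LSTM cell. This is an illustrative sketch only: the class name `TinyLSTM`, the random initialization, the gate ordering, and the zero initial state are assumptions, and a trained model would use learned parameters.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Minimal single-layer LSTM with randomly initialized (untrained) weights."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        rows, cols = 4 * hidden_size, input_size + hidden_size
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
        self.bias = [0.0] * rows

    def step(self, w_t: Sequence[float], h_prev: Vector, c_prev: Vector) -> Tuple[Vector, Vector]:
        """Compute the feature h_t from the word vector w_t and the previous feature."""
        joined = list(w_t) + list(h_prev)
        gates = [sum(w * x for w, x in zip(row, joined)) + b
                 for row, b in zip(self.weight, self.bias)]
        n = self.hidden_size
        input_gate = [sigmoid(g) for g in gates[0:n]]
        forget_gate = [sigmoid(g) for g in gates[n:2 * n]]
        output_gate = [sigmoid(g) for g in gates[2 * n:3 * n]]
        candidate = [math.tanh(g) for g in gates[3 * n:4 * n]]
        c_t = [f * c + i * g for f, c, i, g in zip(forget_gate, c_prev, input_gate, candidate)]
        h_t = [o * math.tanh(c) for o, c in zip(output_gate, c_t)]
        return h_t, c_t

    def encode(self, word_vectors: Sequence[Sequence[float]]) -> List[Vector]:
        """Encode T word vectors into T text features, one per word."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        features = []
        for w_t in word_vectors:
            h, c = self.step(w_t, h, c)
            features.append(h)
        return features
```

Calling `TinyLSTM(3, 2).encode(...)` on a sequence of T three-dimensional word vectors returns T two-dimensional text features, which together form the text feature set.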

FIG. 8 is a schematic structural diagram of a server according to an embodiment of the present disclosure. The server 600 may vary greatly depending on its configuration or performance, and may include one or more central processing units (CPUs) 622 (for example, one or more processors), a memory 632, and one or more storage media 630 (for example, one or more mass storage devices) that store application programs 642 or data 644. The memory 632 and the storage medium 630 may be transient or persistent storage. A program stored in the storage medium 630 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Further, the CPU 622 may be configured to communicate with the storage medium 630 and to perform, on the server 600, the series of instruction operations in the storage medium 630.

The server 600 may further include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.

The steps performed by the server in the foregoing embodiments may be based on the server structure shown in FIG. 8.

In this embodiment of the present disclosure, the CPU 622 included in the server further has the image region finding functions described above for the processor 580. Optionally, the CPU 622 is specifically configured to perform the corresponding steps described above, including calculating the enhanced semantic information set by using the GCN (with the first and second network parameters of the k-th layer of the GCN, the neighbor nodes of the i-th node, and the elements of the target connection matrix as defined above), obtaining the text to be found, and obtaining the t-th text feature by encoding the t-th word vector in the text vector sequence together with the (t-1)-th text feature by using the LSTM network.

In this embodiment of the present disclosure, the CPU 622 included in the server further has the following functions:

obtaining a to-be-trained text set and a to-be-trained image candidate region set, where the to-be-trained text set includes a first to-be-trained text and a second to-be-trained text, the to-be-trained image candidate region set includes a first to-be-trained image candidate region and a second to-be-trained image candidate region, the first to-be-trained text matches the first to-be-trained image candidate region and does not match the second to-be-trained image candidate region, and the second to-be-trained text matches the second to-be-trained image candidate region and does not match the first to-be-trained image candidate region;

determining a target loss function according to the first to-be-trained text, the second to-be-trained text, the first to-be-trained image candidate region, and the second to-be-trained image candidate region; and

training a to-be-trained image region finding network model by using the target loss function, to obtain an image region finding network model, where the image region finding network model is configured to determine a matching relationship between an image candidate region and a text to be found according to a text feature set and enhanced semantic information, the enhanced semantic information corresponds to the image candidate region, and the text feature set corresponds to the text to be found.

The target loss function is determined in the following manner,

where the symbols denote, in order: the target loss function; the first image candidate region to be trained; the first text to be trained; the second image candidate region to be trained; the second text to be trained; the data pair to be trained, with max() taking the maximum value; the first parameter control weight; the second parameter control weight; the first preset threshold; and the second preset threshold.
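The quantities listed above (a matched pair, two mismatched pairs, two control weights, two preset thresholds, and a max() operation) are consistent with a two-term margin ranking loss. The exact formula is not reproduced in this text, so the sketch below is an assumption rather than the disclosed loss: the function `target_loss`, the dot-product `score`, and the default parameter values are all hypothetical.

```python
def score(region, text):
    """Hypothetical matching score: dot product of two equal-length vectors."""
    return sum(r * t for r, t in zip(region, text))

def target_loss(region_1, text_1, region_2, text_2,
                lam1=1.0, lam2=1.0, u1=0.1, u2=0.1):
    """Assumed two-term hinge loss for the matched pair (region_1, text_1).

    lam1, lam2: first and second parameter control weights.
    u1, u2:     first and second preset thresholds (margins).
    """
    matched = score(region_1, text_1)
    # Mismatched text: first region paired with the second text.
    text_term = max(0.0, u1 + score(region_1, text_2) - matched)
    # Mismatched region: second region paired with the first text.
    region_term = max(0.0, u2 + score(region_2, text_1) - matched)
    return lam1 * text_term + lam2 * region_term

# Well-separated pairs give zero loss; swapping the texts is penalized.
r1, t1 = [1.0, 0.0], [1.0, 0.0]
r2, t2 = [0.0, 1.0], [0.0, 1.0]
loss_good = target_loss(r1, t1, r2, t2)
loss_bad = target_loss(r1, t2, r2, t1)
print(loss_good, loss_bad)
```

Under this assumed form, the loss is zero only when the matched pair outscores each mismatched pair by at least the corresponding threshold.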

A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the foregoing systems, apparatuses, and units, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described herein again.

In the embodiments provided in the present disclosure, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the described apparatus embodiments are merely exemplary. For example, the unit division is merely a logical function division, and there may be other divisions in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings, direct couplings, or communication connections may be implemented through some interfaces. The indirect couplings or communication connections between apparatuses or units may be implemented in electrical, mechanical, or other forms.

Units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software functional unit.

When the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, the integrated unit may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present disclosure essentially, or the part contributing to the prior art, or all or some of the technical solutions, may be implemented in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for instructing a computer device (which may be a PC, a server, or a network device) to perform all or some of the method steps described in the embodiments of the present disclosure. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a RAM, a magnetic disk, or an optical disc.

An embodiment of the present disclosure further provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform any possible implementation of the method for finding an image region provided in the foregoing embodiments.

Optionally, the instructions stored in the computer-readable storage medium are configured to perform the following steps:

generating a region semantic information set according to an image candidate region set of an image to be found, wherein each piece of region semantic information in the region semantic information set corresponds to one image candidate region in the image candidate region set;

obtaining, by using a GCN, an enhanced semantic information set corresponding to the region semantic information set, wherein each piece of enhanced semantic information in the enhanced semantic information set corresponds to one piece of region semantic information in the region semantic information set, and the GCN is configured to establish an association relationship between the pieces of region semantic information;

obtaining, by using an image region finding network model, a degree of matching between a text feature set corresponding to text to be found and each piece of enhanced semantic information, wherein the image region finding network model is configured to determine a matching relationship between an image candidate region and the text to be found, and each word in the text to be found corresponds to one word feature in the text feature set; and

determining a target image candidate region from the image candidate region set according to the degree of matching between the text feature set and each piece of enhanced semantic information.
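The last two steps above, scoring each piece of enhanced semantic information against the text feature set and then picking the best candidate region, can be sketched as follows. The disclosure uses a trained image region finding network model for the matching degree; the mean pooling and cosine similarity below are stand-in assumptions, and the names `matching_degrees` and `select_target_region` and the box format are hypothetical.

```python
import numpy as np

def matching_degrees(text_features, enhanced_info):
    """Assumed matching degree: cosine similarity between the mean-pooled
    text features and each piece of enhanced semantic information."""
    text_vec = text_features.mean(axis=0)
    text_vec = text_vec / np.linalg.norm(text_vec)
    regions = enhanced_info / np.linalg.norm(enhanced_info, axis=1, keepdims=True)
    return regions @ text_vec              # one score per candidate region

def select_target_region(candidate_boxes, text_features, enhanced_info):
    """Return the candidate region with the highest matching degree."""
    scores = matching_degrees(text_features, enhanced_info)
    best = int(np.argmax(scores))
    return candidate_boxes[best], scores

# Hypothetical data: three candidate boxes as (x, y, width, height).
boxes = [(0, 0, 10, 10), (50, 60, 30, 30), (20, 20, 40, 40)]
enhanced = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
text_feats = np.array([[0.0, 2.0], [0.0, 4.0]])
best_box, scores = select_target_region(boxes, text_feats, enhanced)
print(best_box, scores)
```

Because each piece of enhanced semantic information corresponds to exactly one image candidate region, the index of the best score identifies the target image candidate region directly.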

where the symbols denote, in order: the first network parameter of the k-th layer of the GCN; the second network parameter of the k-th layer of the GCN; that the j-th node is a neighbor node of the i-th node; and an element of the target connection matrix.
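The definitions above describe a per-layer update in which each node aggregates its neighbors' features, transformed by two layer parameters and weighted by elements of the target connection matrix. The formula itself is not reproduced in this text, so the sketch below is an assumption: it treats the first parameter as a weight matrix and the second as a bias, uses a row-wise softmax as the normalization of edge strengths, and forms the target connection matrix by adding an identity matrix. All function names are hypothetical.

```python
import numpy as np

def target_connection_matrix(strengths):
    """Assumed construction: softmax-normalize edge strengths per row,
    then add the identity matrix so each node keeps its own information."""
    shifted = strengths - strengths.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    normalized = exp / exp.sum(axis=1, keepdims=True)
    return normalized + np.eye(strengths.shape[0])

def gcn_layer(nodes, E, W, b):
    """One assumed GCN layer: out_i = sum_j E[i, j] * (nodes_j @ W + b).

    nodes: (N, d_in) region semantic information from the previous layer.
    E:     (N, N) target connection matrix; E[i, j] != 0 marks j as a neighbor of i.
    W, b:  first and second network parameters of this layer.
    """
    return E @ (nodes @ W + b)

# Two regions with equal edge strengths, identity weights, and zero bias.
nodes = np.array([[1.0, 0.0], [0.0, 1.0]])
E = target_connection_matrix(np.zeros((2, 2)))
enhanced = gcn_layer(nodes, E, np.eye(2), np.zeros(2))
print(E)
print(enhanced)
```

In this toy case each region keeps most of its own feature and mixes in part of the other region's, which is the kind of association between regions that the enhanced semantic information is meant to carry.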

obtaining the text to be found;

obtaining a text vector sequence according to the text to be found, wherein the text vector sequence includes T word vectors, each word vector corresponds to one word, and T is an integer greater than or equal to 1;

The instructions are further configured to perform the foregoing steps,

where the symbols denote, in order: the t-th text feature in the text feature set; encoding performed using the LSTM network; the t-th word vector in the text vector sequence; and the (t-1)-th text feature in the text feature set.

An embodiment of the present disclosure further provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform any possible implementation of the model training method provided in the foregoing embodiments.

Optionally, the instructions stored in the computer-readable storage medium are configured to perform the following steps:

obtaining a set of text to be trained and a set of image candidate regions to be trained, wherein the set of text to be trained includes a first text to be trained and a second text to be trained, the set of image candidate regions to be trained includes a first image candidate region to be trained and a second image candidate region to be trained, the first text to be trained and the first image candidate region to be trained have a matching relationship, the first text to be trained and the second image candidate region to be trained do not have a matching relationship, the second text to be trained and the second image candidate region to be trained have a matching relationship, and the second text to be trained and the first image candidate region to be trained do not have a matching relationship;

determining a target loss function according to the first text to be trained, the second text to be trained, the first image candidate region to be trained, and the second image candidate region to be trained; and

training an image region finding network model to be trained by using the target loss function, to obtain an image region finding network model, wherein the image region finding network model is configured to determine a matching relationship between an image candidate region and text to be found according to a text feature set and enhanced semantic information, the enhanced semantic information corresponds to the image candidate region, and the text feature set corresponds to the text to be found.

where the symbols denote, in order: the target loss function; the first image candidate region to be trained; the first text to be trained; the second image candidate region to be trained; the second text to be trained; the data pair to be trained, with max() taking the maximum value; the first parameter control weight; the second parameter control weight; the first preset threshold; and the second preset threshold.

An embodiment of the present disclosure further provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to perform any possible implementation of the method for finding an image region provided in the foregoing embodiments, or any possible implementation of the model training method provided in the foregoing embodiments.

The foregoing embodiments are merely intended to describe the technical solutions of the present disclosure, not to limit the present disclosure. Although the present disclosure is described in detail with reference to the foregoing embodiments, a person skilled in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of their technical features, without departing from the spirit and scope of the technical solutions of the embodiments of the present disclosure.

Claims

1. A method for finding an image region, comprising:
generating a region semantic information set according to an image candidate region set of an image to be found, wherein each piece of region semantic information in the region semantic information set corresponds to one image candidate region in the image candidate region set;
obtaining, by using a graph convolutional network (GCN), an enhanced semantic information set corresponding to the region semantic information set, wherein each piece of enhanced semantic information in the enhanced semantic information set corresponds to one piece of region semantic information in the region semantic information set, and the GCN is configured to establish an association relationship between the pieces of region semantic information;
obtaining, by using an image region finding network model, a degree of matching between a text feature set corresponding to text to be found and each piece of enhanced semantic information, wherein the image region finding network model is configured to determine a matching relationship between an image candidate region and the text to be found, and each word in the text to be found corresponds to one word feature in the text feature set; and
determining a target image candidate region from the image candidate region set according to the degree of matching between the text feature set and each piece of enhanced semantic information.

2. The method according to claim 1, wherein the generating a region semantic information set according to an image candidate region set of an image to be found comprises:
obtaining, by using a convolutional neural network (CNN), region semantic information corresponding to each image candidate region, wherein the image candidate region includes region information, and the region information includes location information of the image candidate region in the image to be found and size information of the image candidate region; and
generating the region semantic information set according to N pieces of region semantic information when the region semantic information corresponding to N image candidate regions is obtained, N being an integer greater than or equal to 1.

3. The method according to claim 1 or 2, wherein the obtaining, by using a GCN, an enhanced semantic information set corresponding to the region semantic information set comprises:
obtaining first region semantic information and second region semantic information from the region semantic information set, wherein the first region semantic information is any piece of region semantic information in the region semantic information set, and the second region semantic information is any piece of region semantic information in the region semantic information set;
obtaining a strength of a connecting edge between the first region semantic information and the second region semantic information;
normalizing the strength of the connecting edge between the first region semantic information and the second region semantic information, to obtain a normalized strength;
determining a target connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set; and
determining, by using the GCN, the enhanced semantic information set corresponding to the target connection matrix.

4. The method according to claim 3, wherein the determining a target connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set comprises:
generating a connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set; and
generating the target connection matrix according to the connection matrix and an identity matrix.

5. The method according to claim 3 or 4, wherein the determining, by using the GCN, the enhanced semantic information set corresponding to the target connection matrix comprises:
calculating the enhanced semantic information in the following manner,

where the symbols denote, in order: the first network parameter of the k-th layer of the GCN; the second network parameter of the k-th layer of the GCN; that the j-th node is a neighbor node of the i-th node; and an element of the target connection matrix.

6. The method according to any one of claims 1 to 5, wherein before the obtaining, by using an image region finding network model, a degree of matching between a text feature set corresponding to text to be found and each piece of enhanced semantic information, the method further comprises:
obtaining the text to be found;
obtaining a text vector sequence according to the text to be found, wherein the text vector sequence includes T word vectors, each word vector corresponds to one word, and T is an integer greater than or equal to 1;
encoding each word vector in the text vector sequence, to obtain a text feature; and
generating the text feature set according to T text features when the text features corresponding to the T word vectors are obtained.

7. The method according to claim 6, wherein the encoding each word vector in the text vector sequence, to obtain a text feature comprises:
obtaining the text feature in the following manner,

where the symbols denote, in order: the t-th text feature in the text feature set; the t-th word vector in the text vector sequence; and the (t-1)-th text feature in the text feature set.

8. A model training method, comprising:
obtaining a set of text to be trained and a set of image candidate regions to be trained, wherein the set of text to be trained includes a first text to be trained and a second text to be trained, the set of image candidate regions to be trained includes a first image candidate region to be trained and a second image candidate region to be trained, the first text to be trained and the first image candidate region to be trained have a matching relationship, the first text to be trained and the second image candidate region to be trained do not have a matching relationship, the second text to be trained and the second image candidate region to be trained have a matching relationship, and the second text to be trained and the first image candidate region to be trained do not have a matching relationship;
determining a target loss function according to the first text to be trained, the second text to be trained, the first image candidate region to be trained, and the second image candidate region to be trained; and
training an image region finding network model to be trained by using the target loss function, to obtain an image region finding network model, wherein the image region finding network model is configured to determine a matching relationship between an image candidate region and text to be found according to a text feature set and enhanced semantic information, the enhanced semantic information corresponds to the image candidate region, and the text feature set corresponds to the text to be found.

9. The method according to claim 8, wherein the determining a target loss function according to the first text to be trained, the second text to be trained, the first image candidate region to be trained, and the second image candidate region to be trained comprises:
determining the target loss function in the following manner,

where the symbols denote, in order: the target loss function; the first image candidate region to be trained; the first text to be trained; the second image candidate region to be trained; the second text to be trained; the data pair to be trained, with max() taking the maximum value; the first parameter control weight; the second parameter control weight; the first preset threshold; and the second preset threshold.

10. An apparatus for finding an image region, comprising:
a generating module, configured to generate a region semantic information set according to an image candidate region set of an image to be found, wherein each piece of region semantic information in the region semantic information set corresponds to one image candidate region in the image candidate region set;
an obtaining module, configured to obtain, by using a graph convolutional network (GCN), an enhanced semantic information set corresponding to the region semantic information set generated by the generating module, wherein each piece of enhanced semantic information in the enhanced semantic information set corresponds to one piece of region semantic information in the region semantic information set, and the GCN is configured to establish an association relationship between the pieces of region semantic information,
the obtaining module being further configured to obtain, by using an image region finding network model, a degree of matching between a text feature set corresponding to text to be found and each piece of enhanced semantic information, wherein the image region finding network model is configured to determine a matching relationship between an image candidate region and the text to be found, and each word in the text to be found corresponds to one word feature in the text feature set; and
a determining module, configured to determine a target image candidate region from the image candidate region set according to the degree of matching, obtained by the obtaining module, between the text feature set and each piece of enhanced semantic information.

11. A model training apparatus, comprising:
an obtaining module, configured to obtain a set of text to be trained and a set of image candidate regions to be trained, wherein the set of text to be trained includes a first text to be trained and a second text to be trained, the set of image candidate regions to be trained includes a first image candidate region to be trained and a second image candidate region to be trained, the first text to be trained and the first image candidate region to be trained have a matching relationship, the first text to be trained and the second image candidate region to be trained do not have a matching relationship, the second text to be trained and the second image candidate region to be trained have a matching relationship, and the second text to be trained and the first image candidate region to be trained do not have a matching relationship;
a determining module, configured to determine a target loss function according to the first text to be trained, the second text to be trained, the first image candidate region to be trained, and the second image candidate region to be trained that are obtained by the obtaining module; and
a training module, configured to train an image region finding network model to be trained by using the target loss function determined by the determining module, to obtain an image region finding network model, wherein the image region finding network model is configured to determine a matching relationship between an image candidate region and text to be found according to a text feature set and enhanced semantic information, the enhanced semantic information corresponds to the image candidate region, and the text feature set corresponds to the text to be found.

12. A terminal device, comprising a memory, a transceiver, a processor, and a bus system, wherein
the memory is configured to store a program;
the processor is configured to execute the program in the memory to perform the following operations:
generating a region semantic information set according to an image candidate region set of an image to be found, wherein each piece of region semantic information in the region semantic information set corresponds to one image candidate region in the image candidate region set;
obtaining, by using a graph convolutional network (GCN), an enhanced semantic information set corresponding to the region semantic information set, wherein each piece of enhanced semantic information in the enhanced semantic information set corresponds to one piece of region semantic information in the region semantic information set, and the GCN is configured to establish an association relationship between the pieces of region semantic information;
obtaining, by using an image region finding network model, a degree of matching between a text feature set corresponding to text to be found and each piece of enhanced semantic information, wherein the image region finding network model is configured to determine a matching relationship between an image candidate region and the text to be found, and each word in the text to be found corresponds to one word feature in the text feature set; and
determining a target image candidate region from the image candidate region set according to the degree of matching between the text feature set and each piece of enhanced semantic information; and
the bus system is configured to connect the memory and the processor, so that the memory and the processor communicate with each other.

13. The terminal device according to claim 12, wherein the processor is further configured to execute the program in the memory to perform the following operations:
obtaining, by using a convolutional neural network (CNN), region semantic information corresponding to each image candidate region, wherein the image candidate region includes region information, and the region information includes location information of the image candidate region in the image to be found and size information of the image candidate region; and
generating the region semantic information set according to N pieces of region semantic information when the region semantic information corresponding to N image candidate regions is obtained, N being an integer greater than or equal to 1.

14. The terminal device according to claim 12 or 13, wherein the processor is further configured to execute the program in the memory to perform the following operations:
obtaining first region semantic information and second region semantic information from the region semantic information set, wherein the first region semantic information is any piece of region semantic information in the region semantic information set, and the second region semantic information is any piece of region semantic information in the region semantic information set;
obtaining a strength of a connecting edge between the first region semantic information and the second region semantic information;
normalizing the strength of the connecting edge between the first region semantic information and the second region semantic information, to obtain a normalized strength;
determining a target connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set; and
determining, by using the GCN, the enhanced semantic information set corresponding to the target connection matrix.

15. The terminal device according to claim 14, wherein the processor is further configured to execute the program in the memory to perform the following operations:
generating a connection matrix according to the normalized strengths between the pieces of region semantic information in the region semantic information set; and
generating the target connection matrix according to the connection matrix and an identity matrix.

16. The terminal device according to claim 14 or 15, wherein the processor is further configured to execute the program in the memory to perform the following operation:
calculating the enhanced semantic information in the following manner,

where the symbols denote, in order: the first network parameter of the k-th layer of the GCN; the second network parameter of the k-th layer of the GCN; that the j-th node is a neighbor node of the i-th node; and an element of the target connection matrix.

17. A server, comprising a memory, a transceiver, a processor, and a bus system, wherein
the memory is configured to store a program;
the processor is configured to execute the program in the memory to perform the following operations:
generating a region semantic information set according to an image candidate region set of an image to be found, wherein each piece of region semantic information in the region semantic information set corresponds to one image candidate region in the image candidate region set;
obtaining, by using a graph convolutional network (GCN), an enhanced semantic information set corresponding to the region semantic information set, wherein each piece of enhanced semantic information in the enhanced semantic information set corresponds to one piece of region semantic information in the region semantic information set, and the GCN is configured to establish an association relationship between the pieces of region semantic information;
obtaining, by using an image region finding network model, a degree of matching between a text feature set corresponding to text to be found and each piece of enhanced semantic information, wherein the image region finding network model is configured to determine a matching relationship between an image candidate region and the text to be found, and each word in the text to be found corresponds to one word feature in the text feature set; and
determining a target image candidate region from the image candidate region set according to the degree of matching between the text feature set and each piece of enhanced semantic information; and
the bus system is configured to connect the memory and the processor, so that the memory and the processor communicate with each other.

A server, comprising a memory, a transceiver, a processor, and a bus system,
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory to perform the following operations:
obtaining a to-be-trained text set and a to-be-trained image candidate region set, wherein the to-be-trained text set includes a first to-be-trained text and a second to-be-trained text, the to-be-trained image candidate region set includes a first to-be-trained image candidate region and a second to-be-trained image candidate region, the first to-be-trained text and the first to-be-trained image candidate region have a matching relationship, the first to-be-trained text and the second to-be-trained image candidate region do not have a matching relationship, the second to-be-trained text and the second to-be-trained image candidate region have a matching relationship, and the second to-be-trained text and the first to-be-trained image candidate region do not have a matching relationship;
determining a target loss function according to the first to-be-trained text, the second to-be-trained text, the first to-be-trained image candidate region, and the second to-be-trained image candidate region; and
training a to-be-trained image region finding network model using the target loss function to obtain an image region finding network model, wherein the image region finding network model is configured to determine a matching relationship between an image candidate region and a text to be found according to a text feature set and enhanced semantic information, the enhanced semantic information corresponds to the image candidate region, and the text feature set corresponds to the text to be found; and
the bus system is configured to couple the memory and the processor so that the memory and the processor can communicate.
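
By way of illustration only, the training operations recited above might be sketched as follows. The claim states only that the target loss function is determined from the two to-be-trained texts and the two to-be-trained image candidate regions; the bidirectional hinge (margin ranking) form, the dot-product score, the margin value, and every name in this sketch (`hinge_ranking_loss`, `dot`, `margin`) are assumptions made for the example, not the claimed loss.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product, used here as a hypothetical matching score."""
    return sum(x * y for x, y in zip(a, b))


def hinge_ranking_loss(
    text1: Vector,
    text2: Vector,
    region1: Vector,
    region2: Vector,
    margin: float = 0.1,  # assumed hyperparameter, not taken from the claims
) -> float:
    """Illustrative loss over two matched (text, region) pairs.

    (text1, region1) and (text2, region2) are the matching pairs;
    (text1, region2) and (text2, region1) are the non-matching pairs.
    The loss is zero once each matching pair outscores its
    non-matching counterpart by at least `margin`.
    """
    pos1 = dot(text1, region1)   # first text vs. its matching region
    pos2 = dot(text2, region2)   # second text vs. its matching region
    neg12 = dot(text1, region2)  # first text vs. non-matching region
    neg21 = dot(text2, region1)  # second text vs. non-matching region
    return max(0.0, margin + neg12 - pos1) + max(0.0, margin + neg21 - pos2)
```

In such a sketch the loss falls to zero when the matching pairs are already well separated from the non-matching pairs, and grows when a text scores higher against the region it does not match.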

A method for finding an image region, comprising:
receiving an image finding command;
obtaining, in response to the image finding command, an image candidate region set of an image to be found, wherein the image candidate region set includes N image candidate regions, N being an integer greater than or equal to 1;
generating a set of region semantic information according to the set of image candidate regions, wherein the set of region semantic information includes N pieces of region semantic information, each region semantic information corresponding to one image candidate region;
obtaining an enhanced semantic information set corresponding to the region semantic information set using a graph convolutional network (GCN), wherein the enhanced semantic information set includes N pieces of enhanced semantic information, each piece of enhanced semantic information corresponds to one piece of region semantic information, and the GCN is configured to establish an association relationship between the pieces of region semantic information;
obtaining a text feature set corresponding to a text to be found, wherein the text to be found includes T words, the text feature set includes T word features, each word corresponds to one word feature, and T is an integer greater than or equal to 1;
obtaining a degree of matching between the text feature set and each piece of enhanced semantic information using an image region finding network model, wherein the image region finding network model is configured to determine a matching relationship between the image candidate region and the text to be found;
determining a target image candidate region from the image candidate region set according to a matching degree between the text feature set and each of the enhanced semantic information; and
sending an image creation command to a client so that the client displays the target image candidate region according to the image creation command.
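
By way of illustration only, the data flow of the method recited above (N pieces of region semantic information, graph-based enhancement, T word features, matching degrees, target selection) might be sketched as follows. The claimed method uses a trained GCN and a trained image region finding network model; this sketch substitutes a single parameter-free aggregation step with softmax-normalized similarity weights, mean pooling of the word features, and cosine similarity as the matching degree. All of these choices, and all function names, are assumptions made for the example.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def enhance_regions(regions: List[Vector]) -> List[Vector]:
    """Simplified graph-convolution-style step (no learned weights).

    Every region is a node; edge weights are a softmax over pairwise
    dot products, so each enhanced vector is a weighted mix of all
    N region vectors. Returns N enhanced vectors, one per region.
    """
    enhanced = []
    for xi in regions:
        logits = [dot(xi, xj) for xj in regions]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(v - top) for v in logits]
        total = sum(weights)
        enhanced.append([
            sum(w / total * xj[d] for w, xj in zip(weights, regions))
            for d in range(len(xi))
        ])
    return enhanced


def pool_words(word_features: List[Vector]) -> Vector:
    """Mean-pool the T word features into one text vector (assumed)."""
    t = len(word_features)
    return [sum(w[d] for w in word_features) / t
            for d in range(len(word_features[0]))]


def find_target_region(
    regions: List[Vector], word_features: List[Vector]
) -> Tuple[int, List[float]]:
    """Return the index of the best-matching region and all N scores."""
    enhanced = enhance_regions(regions)
    text = pool_words(word_features)
    scores = [cosine(text, e) for e in enhanced]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

For example, `find_target_region(regions, word_features)` takes N region vectors and T word vectors of equal dimension and returns the index of the candidate region with the highest matching degree, which would correspond to the target image candidate region.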

A computer-readable storage medium storing instructions,
wherein the instructions, when executed on a computer, cause the computer to perform the method for finding an image region according to any one of claims 1 to 7 or the model training method according to claim 8 or 9.