KR20230165683A

KR20230165683A - An apparatus for processing video and image search of natural languages based on caption data and a method for operating it

Info

Publication number: KR20230165683A
Application number: KR1020220164524A
Authority: KR
Inventors: 이동식; 장성익; 이원경
Original assignee: 주식회사 자비스넷
Priority date: 2022-05-27
Filing date: 2022-11-30
Publication date: 2023-12-05
Also published as: KR102474436B1

Abstract

A device according to an embodiment of the present invention is a video image search device comprising: an image information collection unit that collects image information; a caption generation unit that inputs visual data analysis information of the collected image information into a pre-built natural language processing model to generate caption sentence data describing the image information; an object classification search information configuration unit that compares the visual data analysis information of the collected image information with the caption sentence data to configure object classification search information corresponding to the caption sentence data; and a search database unit that configures a search database for natural language search of images or videos using the caption sentence data and the object classification search information, wherein the object classification search information configuration unit includes a per-information weight setting unit that compares object information extracted from the caption sentence data with object identification information included in the visual data analysis information and assigns a different search weight to the object classification search information for each object.

Description

An apparatus for processing video and image search of natural languages based on caption data and a method for operating it

The present invention relates to a device and a method of operating the same. More specifically, the present invention relates to a video image search device that performs natural language search based on caption data, and a method of operating the same.

Recently, with the development of electronic devices such as smartphones, which offer a wide range of functions and easy Internet access, individuals have become able to create and distribute not only fragmentary text information but also visual content such as images and videos simply and easily.

As a result, video content continues to be created and distributed in large quantities, and as the volume of related data grows, user demand for searching video content is also increasing.

Accordingly, users want to find information about the images or videos they are looking for not only through ordinary text-based search, but also by searching the content itself, including the video images.

However, most current general information retrieval systems are text-based: when a user enters a keyword, they return documents or images containing text that matches the keyword. This approach is limited in its ability to support the precise video and image search that users want.

To address this, image search using AI-based automated search technology has recently been proposed. However, because it relies on learning from image feature information, only specific objects in an image can be retrieved, and only in a partial and generic form. As a result, users' ordinary natural-language searches do not work naturally, and accuracy is very low.

The present invention was devised to solve the problems described above. Its object is to provide a video image search device and an operating method thereof that build a search database using caption sentence data describing image or video information, perform a natural-language-based search against the caption sentence data in the search database, and provide the search results as image or video information, thereby enabling users to retrieve the image and video information they want more accurately and quickly.

A device according to an embodiment of the present invention for solving the problems described above is a video image search device comprising: an image information collection unit that collects image information; a caption generation unit that inputs visual data analysis information of the collected image information into a pre-built natural language processing model to generate caption sentence data describing the image information; an object classification search information configuration unit that compares the visual data analysis information of the collected image information with the caption sentence data to configure object classification search information corresponding to the caption sentence data; and a search database unit that configures a search database for natural language search of images or videos using the caption sentence data and the object classification search information, wherein the object classification search information configuration unit includes a per-information weight setting unit that compares object information extracted from the caption sentence data with object identification information included in the visual data analysis information and assigns a different search weight to the object classification search information for each object.

In addition, a method of operating a device according to an embodiment of the present invention for solving the problems described above is a method of operating a video image search device, comprising: collecting image information; inputting visual data analysis information of the collected image information into a pre-built natural language processing model to generate caption sentence data describing the image information; comparing the visual data analysis information of the collected image information with the caption sentence data to configure object classification search information corresponding to the caption sentence data; and configuring a search database for natural language search of images or videos using the caption sentence data and the object classification search information, wherein the step of configuring the object classification search information includes comparing object information extracted from the caption sentence data with object identification information included in the visual data analysis information and assigning a different search weight to the object classification search information for each object.
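The per-object weighting step can be illustrated with a minimal sketch. The text states only that objects extracted from the caption are compared with objects identified by the visual analysis and that the resulting search weights differ per object; the concrete rule used here (an object found in both sources gets the highest weight) and the names `assign_search_weights`, `both`, `caption_only`, and `detected_only` are assumptions made for illustration.

```python
def assign_search_weights(caption_objects, detected_objects,
                          both=1.0, caption_only=0.5, detected_only=0.25):
    """Return {object_name: weight}.

    The three weight values are illustrative assumptions, not values
    taken from the patent text.
    """
    caption_set = {name.lower() for name in caption_objects}
    detected_set = {name.lower() for name in detected_objects}

    weights = {}
    for name in caption_set | detected_set:
        if name in caption_set and name in detected_set:
            # Confirmed by both the caption and the visual analysis.
            weights[name] = both
        elif name in caption_set:
            # Mentioned in the caption only.
            weights[name] = caption_only
        else:
            # Detected visually but absent from the caption.
            weights[name] = detected_only
    return weights


if __name__ == "__main__":
    print(assign_search_weights(["child", "shirt"], ["Child", "dog"]))
```

A real system would more likely derive such weights from detector confidence scores or caption-model attention than from fixed constants; the fixed values only keep the comparison logic visible.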

According to an embodiment of the present invention, visual data analysis information of collected image information is input into a pre-built natural language processing model to generate caption sentence data describing the image information, and the visual data analysis information of the collected image information is compared with the caption sentence data to configure object classification search information corresponding to the caption sentence data. A search database for natural language search of images or videos can then be configured using the caption sentence data and the object classification search information.

Accordingly, an embodiment of the present invention can provide a video image search device and an operating method thereof that perform a natural-language-based search against the caption sentence data in the search database and provide the search results as image or video information, thereby enabling users to retrieve the image and video information they want more accurately and quickly.

FIG. 1 is a conceptual diagram schematically showing the entire system according to an embodiment of the present invention.
FIG. 2 is a block diagram for explaining in more detail a video image search device according to an embodiment of the present invention.
FIG. 3 is a block diagram for explaining in more detail the processing of visual data analysis information according to an embodiment of the present invention.
FIG. 4 is a flowchart showing a video image search process according to an embodiment of the present invention.
FIG. 5 is a diagram showing the output of search results according to an embodiment of the present invention.

The following merely illustrates the principles of the invention. Those skilled in the art will therefore be able to devise various devices that, although not explicitly described or shown herein, embody the principles of the invention and fall within its concept and scope. In addition, all conditional terms and embodiments listed herein are, in principle, expressly intended only to aid understanding of the concept of the invention, and should be understood as not limiting it to the specifically listed embodiments and conditions.

In addition, every detailed description that recites the principles, aspects, and embodiments of the invention, as well as specific embodiments, should be understood as intended to encompass their structural and functional equivalents. These equivalents should be understood to include not only currently known equivalents but also equivalents developed in the future, that is, all elements invented to perform the same function regardless of structure.

Accordingly, for example, the block diagrams herein should be understood as representing conceptual views of exemplary circuits embodying the principles of the invention. Similarly, all flowcharts, state transition diagrams, pseudocode, and the like should be understood as representing various processes that can be substantially represented on a computer-readable medium and performed by a computer or processor, whether or not the computer or processor is explicitly shown.

In addition, the explicit use of terms presented as a processor, control, or similar concepts should not be construed as referring exclusively to hardware capable of executing software, and should be understood to implicitly include, without limitation, digital signal processor (DSP) hardware and ROM, RAM, and non-volatile memory for storing software. Other well-known, conventional hardware may also be included.

The above objects, features, and advantages will become clearer through the following detailed description taken in conjunction with the accompanying drawings, so that those of ordinary skill in the art to which the present invention pertains will be able to readily practice its technical idea. In describing the present invention, detailed descriptions of related known technologies are omitted where they are judged likely to obscure the gist of the invention unnecessarily.

Hereinafter, a preferred embodiment according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram schematically showing the entire system according to an embodiment of the present invention.

Referring to FIG. 1, the entire system according to an embodiment of the present invention includes a video image search device 100, a user terminal 200, a relay server 300, and a web server 400.

Here, the video image search device 100 collects image information; inputs visual data analysis information of the collected image information into a pre-built natural language processing model to generate caption sentence data describing the image information; compares the visual data analysis information of the collected image information with the caption sentence data to configure object classification search information corresponding to the caption sentence data; and configures a search database for natural language search of images or videos using the caption sentence data and the object classification search information.

The video image search device 100 may provide an AI-based search service using the search database. To this end, it may build a search service database using an image processing model for image analysis and a natural language processing model for generating caption sentence data from the image analysis results.

In addition, the video image search device 100 may calculate the object classification search information corresponding to the caption sentence data, together with a sentence classification accuracy, according to the object classification search information of the image analysis result and the importance and weight of each item of classification information in the caption sentence data. Using the configured object classification search information and the accuracy, it may select, from among the caption sentence data preconfigured to describe the collected image or video information, the optimal caption sentence data matching the search, and provide the user with the image or video data stored in the database in correspondence with that caption sentence data as the search result.

The output data of the video image search device 100 may be transmitted to the user terminal 200, the relay server 300, the web server 400, and the like as the device performs its service process.

In addition, the user terminal 200 may transmit user input entered through an interface to the video image search device 100. The video image search device 100 looks up the caption sentence data appropriate to the user input and provides the user terminal 200 with response information that includes the image or video information corresponding to the retrieved caption sentence data, so that the user can view the image or video information provided as the search result.

To this end, the video image search device 100 may be connected to the user terminal 200, the relay server 300, and the web server 400 through a wired or wireless network, and each component may include one or more wired or wireless communication modules for communicating over that network. Any of various widely known communication methods may be applied to the wired or wireless network.

Meanwhile, the relay server 300 may be a server that collects, through the web server 400, caption sentence data of image or video information from the video image search device 100, stores and manages the data, and relays it to the user terminal 200.

The user terminal 200 may include one or more communication means and input/output means capable of receiving and outputting the caption-sentence-data-based search results provided by the video image search device 100, and of entering user query information into the video image search device 100.

In addition, the user terminal 200 may enter a natural-language search phrase into the video image search device 100 as query information. An example of such a phrase is "Find a picture of my child wearing light blue clothes." The video image search device 100 may construct search keywords from the phrase through natural language analysis, look up the caption sentence data in the search database using the constructed search keywords, identify the image or video information corresponding to the retrieved caption sentence data, and deliver the identified image or video information to the user terminal 200.
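The query flow described above (phrase, then keywords, then caption lookup, then image reference) can be sketched as follows. This is a toy illustration under stated assumptions: the stop-word list, the in-memory `DATABASE` records, the fallback weight of 0.1, and the functions `extract_keywords` and `search` are all hypothetical, and simple token overlap stands in for the natural language analysis that the text leaves unspecified.

```python
import re

# Hypothetical stop words removed before matching (an assumption).
STOPWORDS = {"a", "an", "the", "of", "my", "me", "find", "for", "in", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(query):
    """Reduce a natural-language query to its content-bearing tokens."""
    return [t for t in tokenize(query) if t not in STOPWORDS]


def search(query, database, top_k=3):
    """Rank stored captions by weighted keyword overlap; return image paths."""
    keywords = extract_keywords(query)
    scored = []
    for entry in database:
        caption_tokens = set(tokenize(entry["caption"]))
        # Keywords with a stored per-object weight count more than others.
        score = sum(entry["weights"].get(k, 0.1)
                    for k in keywords if k in caption_tokens)
        if score > 0:
            scored.append((score, entry["path"]))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _, path in scored[:top_k]]


# Hypothetical search database: caption, per-object weights, image reference.
DATABASE = [
    {"path": "img/001.jpg",
     "caption": "A child wearing a light blue shirt in a park",
     "weights": {"child": 1.0, "shirt": 0.5}},
    {"path": "img/002.jpg",
     "caption": "A dog running on the beach",
     "weights": {"dog": 1.0}},
]

if __name__ == "__main__":
    print(search("Find a picture of my child wearing light blue clothes",
                 DATABASE))
```

Storing only a path or address per caption mirrors the description's point that the result may be a file, a web address, or an access path rather than the image data itself.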

Here, the image or video information may include the file data of image or video information uploaded to the web or the Internet, or the address information of such uploaded image or video information. According to an embodiment of the present invention, it may also include access path information for image or video information stored on the user terminal 200 or another external device.

Meanwhile, for accurate lookup of the caption sentence data, the video image search device 100 according to an embodiment of the present invention may compare the visual data analysis information of the image information with the caption sentence data to configure object classification search information corresponding to the caption sentence data, and may configure a search database for natural language search of images or videos using the caption sentence data and the object classification search information. This is described in more detail below.

FIG. 2 is a block diagram for explaining in more detail a video image search device according to an embodiment of the present invention, and FIG. 3 is a block diagram for explaining in more detail the processing of visual data analysis information according to an embodiment of the present invention.

Referring to FIG. 2, the video image search device 100 according to an embodiment of the present invention may include an image information collection unit 110, a visual data analysis information processing unit 120, a caption generation unit 130, an object classification search information configuration unit 140, a search word input unit 150, a search database unit 160, and a search result output unit 170.

First, the image information collection unit 110 collects image information at the request of the user terminal 200. Here, the image information may be image or video data stored on the user terminal 200, or image or video data stored on another remotely connected external device. The image or video data may be the image or video file itself, or may include image or video data obtained from file path information or file address information specified by the user.

The caption generation unit 130 inputs object identification information, extracted from the visual data analysis information processed by the visual data analysis information processing unit 120, into a pre-built natural language processing model for generating caption sentence data, and thereby generates caption sentence data describing the image information.

Here, referring to FIG. 3, the visual data analysis information processing unit 120 may include a background modeling unit 121, a shape operation processing unit 122, an element connection processing unit 123, an object tracking unit 124, a 3D viewpoint transformation unit 125, an object classification processing unit 126, a visual data analysis result output unit 127, and a deep learning analysis processing unit 128.

First, the background modeling unit 121 may model the background region corresponding to the image information collected by the image information collection unit 110 and, based on the background modeling, compute background region data that allows objects corresponding to the foreground to be identified and tracked.

So that objects can be detected accurately in complex environments, the background modeling unit 121 may generate from the video data a background probability model based on the well-known Gaussian Mixture Model. This allows the background region data to reflect many variables, such as changes in lighting, objects added to or removed from the background, backgrounds with motion such as swaying branches or fountains, and areas with heavy traffic.

More specifically, when consecutive video frames are input from the image information, the background modeling unit 121 may construct a per-pixel background model based on statistical probability over time t through background subtraction. The background modeling unit 121 may perform background subtraction according to the Mixture of Gaussians (MOG) method, in which the background information constructed according to the Gaussian mixture model is subtracted from the image information of the current frame and the background information is cumulatively updated into the per-pixel background model.
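The subtract-then-update loop can be sketched in a simplified form. Note that the sketch below keeps a single running Gaussian per pixel rather than a full mixture, so it illustrates the idea and is not an implementation of MOG itself; the class name `RunningGaussianBackground` and the parameters `learning_rate`, `threshold`, `initial_var`, and `min_var` are assumptions. Frames are plain nested lists of grayscale values.

```python
class RunningGaussianBackground:
    """Per-pixel background model with one Gaussian per pixel.

    A deliberate simplification of the Mixture of Gaussians method:
    each pixel keeps one mean and one variance instead of a mixture.
    """

    def __init__(self, first_frame, learning_rate=0.05, threshold=2.5,
                 initial_var=25.0, min_var=4.0):
        self.alpha = learning_rate
        self.k = threshold          # foreground if |diff| > k * sigma
        self.min_var = min_var      # floor so the variance never collapses
        self.mean = [[float(v) for v in row] for row in first_frame]
        self.var = [[initial_var for _ in row] for row in first_frame]

    def apply(self, frame):
        """Return a binary foreground map and update the background model."""
        mask = []
        for y, row in enumerate(frame):
            mask_row = []
            for x, value in enumerate(row):
                diff = value - self.mean[y][x]
                is_foreground = diff * diff > (self.k ** 2) * self.var[y][x]
                mask_row.append(1 if is_foreground else 0)
                if not is_foreground:
                    # Cumulatively fold background pixels into the model.
                    self.mean[y][x] += self.alpha * diff
                    new_var = self.var[y][x] + self.alpha * (
                        diff * diff - self.var[y][x])
                    self.var[y][x] = max(self.min_var, new_var)
            mask.append(mask_row)
        return mask


if __name__ == "__main__":
    background = [[10, 10, 10] for _ in range(3)]
    model = RunningGaussianBackground(background)
    model.apply(background)
    frame = [[10, 10, 10], [10, 200, 10], [10, 10, 10]]
    for row in model.apply(frame):
        print(row)
```

A full mixture model keeps several weighted Gaussians per pixel, which is what lets it absorb repetitive background motion such as swaying branches; the single-Gaussian version above only tolerates gradual change.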

After the background subtraction by the background modeling unit 121, foreground pixel information remains in the background-subtracted image information. This may be called a blob image, and the blob image may also be regarded as binary map data corresponding to the foreground pixels.

The shape operation processing unit 122 may perform shape operation processing that determines shape region information from the binary map data of the foreground pixel information output after subtraction according to the background modeling in the background modeling unit 121. The shape region information may be grouping information that groups a plurality of pixels into one or more shape structuring elements.

For example, the shape operation processing unit 122 may determine the shape region information by applying one or more morphological operation filters to the binary map data of the foreground pixel information.

Here, the one or more morphological operation filters may include a binary erosion filter and a binary dilation filter. Each filter expands or shrinks the bright regions in the binary map data in proportion to the size of a predetermined shape structuring element. Applying the dilation filter removes dark regions smaller than the structuring element; conversely, applying the erosion filter removes bright regions smaller than the structuring element. At the same time, the larger regions that are not removed also shrink or grow in size.

To detect objects of a pre-specified size, the shape operation processing unit 122 may also use binary opening and closing filters to remove only the small regions while leaving the size of regions larger than the structuring element unchanged.
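The four morphological operations can be sketched in pure Python on a nested-list binary map. The square structuring element, its default size of 3, and the choice to treat pixels outside the image as background are assumptions made for this illustration; the helper name `_filter` is hypothetical.

```python
def _filter(binary, size, combine):
    """Slide a size x size square window over the map and combine its values.

    Pixels outside the image are treated as background (0), an assumption.
    """
    height, width = len(binary), len(binary[0])
    r = size // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                binary[ny][nx] if 0 <= ny < height and 0 <= nx < width else 0
                for ny in range(y - r, y + r + 1)
                for nx in range(x - r, x + r + 1)
            ]
            out[y][x] = combine(window)
    return out


def erode(binary, size=3):
    """Keep a pixel only if every pixel under the element is foreground."""
    return _filter(binary, size, min)


def dilate(binary, size=3):
    """Set a pixel if any pixel under the element is foreground."""
    return _filter(binary, size, max)


def opening(binary, size=3):
    """Erosion then dilation: drops bright specks smaller than the element."""
    return dilate(erode(binary, size), size)


def closing(binary, size=3):
    """Dilation then erosion: fills dark holes smaller than the element."""
    return erode(dilate(binary, size), size)


if __name__ == "__main__":
    image = [[0] * 6 for _ in range(6)]
    for y in range(1, 4):
        for x in range(1, 4):
            image[y][x] = 1      # a 3x3 blob that should survive
    image[5][5] = 1              # an isolated speck that should vanish
    for row in opening(image):
        print(row)
```

Opening and closing are useful here precisely because erosion or dilation alone changes the size of every region, whereas the paired operations restore the surviving regions to roughly their original extent.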

The element connection processing unit 123 may obtain the image information corresponding to the shape structuring elements specified by the shape operation processing unit 122, connect the shape structuring elements to classify them into independent connected object regions, and perform connected component labeling, which assigns a unique label value to each classified connected object region. Connected component labeling is particularly effective on binary map data.

In addition, the element connection processing unit 123 may distinguish the classified connected object regions by their unique label values, and determine and output region feature values such as the size, position, orientation, and perimeter of each object region. The element connection processing unit 123 may apply a recursive algorithm or a sequential algorithm to perform the connected component labeling and classify the connected object regions.
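As a rough illustration of the region feature values, the sketch below derives size (pixel area), position (centroid), and extent (bounding box) from an already labeled map in which 0 marks the background. Orientation and perimeter are left out for brevity, and the function name `region_features` and the output layout are assumptions.

```python
def region_features(labels):
    """Compute area, centroid, and bounding box for each nonzero label."""
    acc = {}
    for y, row in enumerate(labels):
        for x, label in enumerate(row):
            if label == 0:
                continue  # background pixel
            s = acc.setdefault(label, {
                "area": 0, "sum_x": 0, "sum_y": 0,
                "min_x": x, "max_x": x, "min_y": y, "max_y": y,
            })
            s["area"] += 1
            s["sum_x"] += x
            s["sum_y"] += y
            s["min_x"] = min(s["min_x"], x)
            s["max_x"] = max(s["max_x"], x)
            s["min_y"] = min(s["min_y"], y)
            s["max_y"] = max(s["max_y"], y)

    return {
        label: {
            "area": s["area"],
            # Centroid as (x, y) in pixel coordinates.
            "centroid": (s["sum_x"] / s["area"], s["sum_y"] / s["area"]),
            # Bounding box as (min_x, min_y, max_x, max_y), inclusive.
            "bbox": (s["min_x"], s["min_y"], s["max_x"], s["max_y"]),
        }
        for label, s in acc.items()
    }


if __name__ == "__main__":
    print(region_features([[1, 1, 0], [0, 0, 2]]))
```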

For example, the sequential algorithm may determine the label of each foreground pixel by looking up the labels of the pixel above it and the pixel to its left. These neighboring pixels, the upper pixel and the left pixel, have already been processed by the time the current pixel is labeled.

Here, if neither neighboring pixel is a foreground pixel, the element connection processing unit 123 may assign a new label value to the current pixel. If only one of the two neighboring pixels is a foreground pixel, the element connection processing unit 123 may assign that pixel's label value to the current pixel.

If both neighboring pixels are foreground pixels and have the same label value, the element connection processing unit 123 may assign that same label value to the current pixel.

However, if both neighboring pixels are foreground pixels but have different label values, the two regions are connected to each other through the current pixel and must therefore be merged under a single label value.

Accordingly, the element connection processing unit 123 may assign the smaller of the two label values to the current pixel and register the two labels as equivalent in an equivalence table. When this first labeling pass is complete, the equivalence table holds the information about which labels must be merged into the same region.

Using this equivalence table, the element connection processing unit 123 may then assign the same label to every pixel of each connected object region in a second labeling pass.
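The two-pass sequential labeling described above can be sketched as follows. This is a minimal illustration under stated assumptions: only the upper and left neighbors are examined (4-connectivity), and the equivalence table is kept as a simple union-find dictionary. The function name `label_components` is hypothetical.

```python
def label_components(img):
    """Two-pass connected component labeling on a binary image."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    parent = {}  # equivalence table: label -> representative label

    def find(a):
        # Follow the table until the representative label is reached.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    next_label = 1
    # First pass: assign provisional labels and record equivalences.
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            up = labels[y - 1][x] if y > 0 else 0
            left = labels[y][x - 1] if x > 0 else 0
            if not up and not left:
                # No labeled neighbor: start a new label.
                parent[next_label] = next_label
                labels[y][x] = next_label
                next_label += 1
            elif up and left:
                # Both neighbors labeled: take the smaller label and
                # register the two labels as equivalent if they differ.
                labels[y][x] = min(up, left)
                ra, rb = find(up), find(left)
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)
            else:
                # Exactly one labeled neighbor: copy its label.
                labels[y][x] = up or left
    # Second pass: replace every label by its representative.
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                labels[y][x] = find(labels[y][x])
    return labels


# Example: a U-shaped region (two arms joined at the bottom) and a lone pixel.
grid = [
    [1, 0, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
]
labeled = label_components(grid)
```

The two arms of the U receive different provisional labels in the first pass and are merged when the scan reaches the bottom-right corner of the U, which is the case the equivalence table exists to handle.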

Accordingly, the element connection processing unit 123 may output blob image data containing the foreground pixel data, together with connected object region information obtained from the connected component labeling. The connected object region information may include characteristic information for each connected object region (position, size, orientation, perimeter, and the like).

The object tracking unit 124 may obtain tracking data for tracking objects in the image information, based on the blob image data and the connected object region information. The tracking data may include a list of the objects found and tracked in the video.

More specifically, the object tracking unit 124 may cumulatively match the blob image at time t against the blob image at time t-1 to identify the object corresponding to each connected object region and add it to the list; tracking with a Kalman filter is one example. A Kalman filter recursively accumulates measurements taken over time to estimate the state of a linear dynamical system.
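As an illustration of the recursive estimation mentioned above, the following sketch tracks one coordinate of a blob centroid with a constant-velocity Kalman filter. The motion model, the noise parameters `q` and `r`, the initial covariance `p0`, and the class name `Kalman1D` are all assumptions made for the example; the source does not specify them.

```python
class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate of a blob centroid."""

    def __init__(self, q=1e-3, r=1.0, p0=100.0):
        self.pos, self.vel = 0.0, 0.0
        self.P = [[p0, 0.0], [0.0, p0]]  # state covariance
        self.q, self.r = q, r            # process / measurement noise (assumed)

    def step(self, z, dt=1.0):
        # Predict: x' = F x, P' = F P F^T + Q with F = [[1, dt], [0, 1]].
        pos = self.pos + dt * self.vel
        vel = self.vel
        (a, b), (c, d) = self.P
        p00 = a + dt * (b + c) + dt * dt * d + self.q
        p01 = b + dt * d
        p10 = c + dt * d
        p11 = d + self.q
        # Update with a position-only measurement z (H = [1, 0]).
        s = p00 + self.r
        k0, k1 = p00 / s, p10 / s
        residual = z - pos
        self.pos = pos + k0 * residual
        self.vel = vel + k1 * residual
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
        return self.pos, self.vel


# Example: a blob moving 2 pixels per frame, observed for 30 frames.
kf = Kalman1D()
for t in range(1, 31):
    est_pos, est_vel = kf.step(2.0 * t)
```

After a few frames the estimated velocity approaches the true 2 pixels per frame, which is the per-object movement speed that the tracking data is later said to carry.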

Accordingly, the object tracking unit 124 may output object tracking data for the objects identified in the image information. The object tracking data may include the object list together with the size and movement speed of each identified object.

The 3D viewpoint conversion unit 125 converts the size and movement speed of each object in the object tracking data into real-world values according to the 3D viewpoint conditions of each camera device. That is, the tracking data from the object tracking unit 124 contains object size and speed, but only as relative values in pixels. The 3D viewpoint conversion unit 125 may therefore convert image pixel coordinates into coordinates in real space, using camera characteristics such as angle of view and height that are specified in advance through calibration.

Accordingly, the 3D viewpoint conversion unit 125 may convert the size and speed of the objects listed in the object tracking data into size and speed in real space.
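A heavily simplified sketch of such a conversion is shown below. It assumes a camera pointing straight down at a flat ground plane, so that the ground width covered by the image is 2·h·tan(FOV/2); a real installation with a tilted camera would need a full calibration model. The function names and parameters are hypothetical.

```python
import math


def metres_per_pixel(height_m, fov_deg, image_width_px):
    """Ground distance covered by one pixel for a downward-facing camera."""
    ground_width_m = 2.0 * height_m * math.tan(math.radians(fov_deg) / 2.0)
    return ground_width_m / image_width_px


def to_real_world(size_px, speed_px_per_frame, fps, scale):
    """Convert pixel size and per-frame pixel speed to metres and metres/second."""
    return size_px * scale, speed_px_per_frame * fps * scale


# Example: camera 5 m above the ground, 90 degree angle of view, 1000 px wide image.
scale = metres_per_pixel(5.0, 90.0, 1000)
size_m, speed_mps = to_real_world(150, 4, 25, scale)
```

With these assumed values one pixel covers about 1 cm, so a 150-pixel object moving 4 pixels per frame at 25 frames per second is about 1.5 m long and moves at about 1 m/s.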

Meanwhile, the object classification processing unit 126 may assign classification information to the objects identified in the object tracking data, based on preset classification criteria, and the visual data analysis result output unit 127 may output the visual data analysis information of the image information as a visual data analysis result according to the assigned classification information.

However, setting the classification criteria themselves, or accurately determining whether an object meets them, may be handled more efficiently by the deep learning analysis processing unit 128.

That is, so that objects can be classified and detected accurately through cumulative learning from a variety of external image information, the object classification processing unit 126 may have the deep learning analysis processing unit 128 analyze the tracking data and the image information, and may perform object classification using the result of that deep learning analysis.

The deep learning analysis processing unit 128 may generate deep learning distributed processing data based on the tracking data from the initial analysis, obtained through the 3D viewpoint conversion unit 125, and on the image information; transmit the deep learning distributed processing data to one or more deep learning distributed processing devices; and receive the deep-learning-based image analysis results from those devices and pass them to the object classification processing unit 126.

In addition, the deep learning analysis processing unit 128 may build neural network data from image information learned in advance by a deep learning method, use the neural network data to detect object information corresponding to the distributed processing request data, and determine classification information for the detected object information.

For example, the deep learning analysis processing unit 128 may be configured with a neural network according to a DNN (Deep Neural Network) based deep learning algorithm, and the neural network may consist of an input layer, one or more hidden layers, and an output layer. Neural networks other than a DNN may also be used for the deep learning algorithm, for example a CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network).

Here, the deep learning analysis processing unit 128 may use the deep-learning-based neural network to identify a target object in one or more specific images obtained from the image information and, when a specific object other than the target object is identified, may calculate the similarity between the specific image and one or more of the images.

If the similarity is equal to or greater than a preset threshold, the unit may generate error information describing the error between the output value produced by processing the specific image through the neural network and the object information, and may adjust the parameters of the neural network based on that error information using a preset backpropagation algorithm.

Here, examples of methods for comparing the similarity between images include histogram matching, template matching against a template assigned to an object identified in the image, and comparison of feature points extracted from the image by the neural network.
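Of the comparison methods listed above, histogram matching is the simplest to illustrate. The sketch below compares two grayscale images by histogram intersection; the choice of intersection as the similarity measure, the bin count, and the function names are assumptions for the example, since the source does not specify a particular measure.

```python
def histogram(pixels, bins=8, max_value=256):
    """Normalized intensity histogram of a flat list of grayscale pixels."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    return [c / len(pixels) for c in counts]


def histogram_similarity(a, b, bins=8):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    ha, hb = histogram(a, bins), histogram(b, bins)
    return sum(min(x, y) for x, y in zip(ha, hb))


# Example: two images with similar intensity distributions and one that differs.
img_a = [0, 10, 200, 250]
img_b = [5, 20, 210, 240]
img_c = [100, 110, 120, 130]
sim_ab = histogram_similarity(img_a, img_b)
sim_ac = histogram_similarity(img_a, img_c)
```

A similarity computed this way could then be compared with the preset threshold mentioned in the text to decide whether the parameter adjustment step is triggered.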

Based on the error information, the deep learning analysis processing unit 128 may use the backpropagation algorithm to vary the weights of the connections between the input layer, the one or more hidden layers, and the output layer of the neural network, or the biases of the units in those layers, so that training minimizes identification errors for the target object.

In addition, the deep learning analysis processing unit 128 may update the object information by repeatedly training the neural network on images continuously received from the video image search device 100 or from various other imaging devices, and can thereby accurately identify, in videos and images, the objects that correspond to the per-feature parameters included in the object information.

Through an object detection unit, the deep learning analysis processing unit 128 may then detect one or more objects identified from the tracking data and image information obtained from the deep learning distributed processing request data, based on the trained neural network data.

Referring again to FIG. 2, the caption generator 130 may generate caption sentence data using the object identification information extracted from the visual data analysis information that the visual data analysis information processing unit 120 outputs through the processing described above.

To this end, the caption generator 130 may include a caption sentence data constructor 133, a morpheme correction processing unit 135, a synonym DB constructor 137, and a face recognition DB constructor 139.

More specifically, the caption sentence data constructor 133 constructs caption sentence data corresponding to the visual data analysis information output by the visual data analysis information processing unit 120.

Here, the caption sentence data may be sentence data that describes, in natural language, the image or video detected for the image information, and may take the form of a character string containing at least one of object information, motion information, and time information.

To generate such sentence data, the caption sentence data constructor 133 may build in advance a visual-data-based sentence generation model that produces natural language sentences from visual data analysis information. The visual data analysis information may include, for example, feature information and object classification information for an image, and the model may be built in advance by association learning on training data that pairs the visual data analysis information with caption data as the output.

The morpheme correction processing unit 135 may analyze the morphemes of the caption sentence data generated by the caption generator 130 and, for verbs and adjectives, correct them to their base forms.

For example, the morpheme correction processing unit 135 may analyze the morphemes of the caption sentence data into nouns, verbs, adjectives, and adverbs, and may correct the verbs and adjectives among them to their base forms.

For example, if the caption data is “손을 든 아이” (“a child raising a hand”), the morpheme correction processing unit 135 may correct it to the morphemes “손, 들다, 아이” (“hand, raise, child”). This processing is intended to improve search performance and accuracy. When a natural language search query is entered at the user's request, the morpheme correction processing unit 135 may likewise identify the verbs and adjectives in the query and correct them to their base forms.
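The normalization step can be sketched as below. This is a toy illustration only: it assumes the caption has already been split into (surface form, part-of-speech tag) pairs by some morphological analyzer, and it uses a tiny hand-written lemma table. The tag names, the `LEMMAS` table, and the function `normalize` are hypothetical and do not correspond to any particular analyzer's output.

```python
# Hypothetical lemma table; a real system would obtain base forms from a
# morphological analyzer rather than a hand-written dictionary.
LEMMAS = {"든": "들다", "타는": "타다"}

# Parts of speech kept as search terms (tag names are assumptions).
CONTENT_TAGS = {"Noun", "Verb", "Adjective", "Adverb"}


def normalize(morphemes):
    """Keep content words and replace verbs/adjectives with their base form."""
    terms = []
    for surface, tag in morphemes:
        if tag not in CONTENT_TAGS:
            continue  # drop particles and other function morphemes
        if tag in ("Verb", "Adjective"):
            surface = LEMMAS.get(surface, surface)
        terms.append(surface)
    return terms


# Example: "손을 든 아이" ("a child raising a hand"), pre-tagged by assumption.
caption = [("손", "Noun"), ("을", "Josa"), ("든", "Verb"), ("아이", "Noun")]
terms = normalize(caption)
```

Applying the same function to both captions and incoming queries keeps the two sides of the search in the same normalized form, which is the stated purpose of the correction.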

Meanwhile, the synonym DB constructor 137 identifies one or more words in the caption sentence data, indexes synonyms of the identified words, and stores the indexed synonyms in a database.

To this end, the synonym DB constructor 137 may access a pre-built thesaurus database, or may store and manage a thesaurus database itself in advance. The scope of synonyms to index may be set by a preset similarity range, and the similarity may be calculated, for example, as cosine similarity over Word2Vec embeddings.

For each item of caption sentence data, the synonym DB constructor 137 may set the synonyms of the words extracted from it as search keywords and include those search keywords as additional data of the caption sentence data. Search accuracy can thus be tuned to the characteristics of each data set by choosing whether synonyms are indexed and by setting the similarity range.
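The similarity-range selection of synonyms can be sketched as follows. The two-dimensional vectors stand in for Word2Vec embeddings and are made up for the example; the threshold value and the function names are likewise assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synonyms(word, vectors, threshold=0.8):
    """Words whose embedding is within the similarity range of `word`."""
    base = vectors[word]
    return sorted(w for w, v in vectors.items()
                  if w != word and cosine(base, v) >= threshold)


# Toy embeddings (made up for illustration, not real Word2Vec output).
vectors = {"child": (1.0, 0.0), "kid": (0.9, 0.1), "road": (0.0, 1.0)}
keywords = synonyms("child", vectors)
```

Raising or lowering `threshold` widens or narrows the set of synonyms attached to a caption, which corresponds to the similarity range setting described in the text.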

Meanwhile, the face recognition DB constructor 139 may identify one or more object face recognition words from the visual data analysis information, and may build, store, and manage a DB using the identified object face recognition words.

To this end, the face recognition DB constructor 139 may collect preset object face recognition words and the corresponding visual data analysis information from the user terminal 200 or the like. For example, common visual data analysis information may be extracted from various images of a particular user's child; in this case, the face recognition DB constructor 139 may set the child's name, entered from the user terminal 200, as the object face recognition word.

Accordingly, for each item of caption sentence data, the face recognition DB constructor 139 may set the object face recognition word identified from the corresponding visual data analysis information as a search keyword and include that search keyword as additional data of the caption sentence data. Search accuracy can thus be tuned to the characteristics of each data set and to the natural language conditions through the object face recognition settings.

The object classification search information constructor 140 may compare the object identification information identified in the visual data analysis information with the caption sentence data, and may configure the object classification search information according to whether the caption sentence data contains that object identification information.

Here, the object classification search information is added to improve the efficiency and accuracy of natural language search over the caption sentence data that indexes the image or video information, and may be configured conditionally according to the comparison between the visual data analysis information and the caption sentence data. Search group data for searching may be formed by mapping the object classification search information, the caption sentence data, and the corresponding image or video information to one another. The object classification search information may also be used to calculate the classification accuracy of the caption sentence data.

To this end, the object classification search information constructor 140 may compare the caption sentence data with the object identification information classified from the visual data analysis information, and may perform exclusion or per-keyword weighting accordingly.

More specifically, according to a first embodiment, the object classification search information constructor 140 may exclude from the object classification search information any object identification information that does not appear in the caption sentence data, and add to it the object identification information that does appear in the caption sentence data.

For example, if the caption sentence data contains “a child riding a bicycle” and the object identification information classified from the visual data analysis information is “road, bicycle, child”, then under the first embodiment “road” is excluded and “bicycle, child” is configured as the object classification search information.
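The first embodiment amounts to keeping only the detected objects that also occur in the caption. A minimal sketch, assuming the caption has already been reduced to normalized terms and using the hypothetical function name `build_search_info`:

```python
def build_search_info(caption_terms, detected_objects):
    """Keep detected objects that also appear in the caption; drop the rest."""
    caption = set(caption_terms)
    return [obj for obj in detected_objects if obj in caption]


# Example from the text: "a child riding a bicycle" vs. detected objects.
caption_terms = ["bicycle", "ride", "child"]
detected_objects = ["road", "bicycle", "child"]
search_info = build_search_info(caption_terms, detected_objects)
```

Here `search_info` contains "bicycle" and "child" while "road" is excluded, matching the example in the paragraph above.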

According to a second embodiment, the object classification search information constructor 140 may include a per-information weight setting unit 145 that assigns different search weights to each item of object classification search information, depending on whether the same object is present in both the caption sentence data and the object identification information classified from the visual data analysis information.

According to the second embodiment, if first object information extracted from the caption sentence data is not present in the object identification information extracted from the visual data analysis information, the per-information weight setting unit 145 may set the search keyword corresponding to the first object information as object classification search information, but give that keyword a lower weight than other keywords.

Conversely, if the first object information is present in both the caption sentence data and the object identification information extracted from the visual data analysis information, the per-information weight setting unit 145 may assign a relatively high search weight to that object identification information.

The search weight depends on whether the same object is present in both the caption sentence data and the object identification information classified from the visual data analysis information, but its value may be determined differently by, for example, a preset mapping table based on the distribution of each search keyword in the data or the similarity between keywords.

For example, under the second embodiment, if the caption sentence data contains “a child riding a bicycle” and the object identification information classified from the visual data analysis information is “road, bicycle, child”, then “road”, which does not appear in the caption sentence data, may be given a weight of 50% based on the pre-configured table, while “bicycle” and “child” may be given a relatively high weight of 80% based on the pre-configured table.

As a further example, suppose the caption generator 130 generates the caption sentence data “The baby is eating baby food at home.” and the visual data analysis information processing unit 120 outputs the object identification information “home”, “baby”, “mom”, and “baby food”. Because “mom” does not appear in the caption sentence data, the per-information weight setting unit 145 may still configure object classification search information for “mom” but assign the “mom” keyword a search weight relatively lower than that of other keywords, such as 50%.

Conversely, because “home”, “baby”, and “baby food” in the object identification information also appear in the caption sentence data, the per-information weight setting unit 145 may assign them a high search weight, such as 80%.
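The weighting of the second embodiment can be sketched as follows. The values 0.8 and 0.5 are taken from the examples in the text; in the source they come from a pre-configured table, which is replaced here by two default arguments. The function name `assign_weights` is hypothetical.

```python
def assign_weights(caption_objects, detected_objects, high=0.8, low=0.5):
    """Weight each object by whether it appears in both sources or only one."""
    caption, detected = set(caption_objects), set(detected_objects)
    weights = {}
    for term in list(caption_objects) + list(detected_objects):
        # Present in both the caption and the visual analysis: high weight.
        # Present in only one of them: kept, but with a lower weight.
        weights[term] = high if term in caption and term in detected else low
    return weights


# Example from the text: "The baby is eating baby food at home."
caption_objects = ["home", "baby", "baby food"]
detected_objects = ["home", "baby", "mom", "baby food"]
weights = assign_weights(caption_objects, detected_objects)
```

Unlike the first embodiment, nothing is discarded: "mom" remains searchable but ranks below the objects confirmed by both the caption and the visual analysis.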

The search database unit 160 uses the caption sentence data and the object classification search information to calculate a sentence classification accuracy and to configure a search database for natural language search of images or videos. To this end, the search database unit 160 includes a sentence classification accuracy calculation unit 163 and a mapping table constructor 165.

More specifically, the sentence classification accuracy calculation unit 163 may calculate the accuracy of the caption sentence data based on, for example, the proportion of objects shared between the caption sentence data and the object classification search information, or the weights of the individual items of object classification search information.

For example, the sentence classification accuracy may be the sum of the importance and weight of each item of object classification search information for the caption sentence data, or a value representing their average or a match rate; embodiments of the present invention are not limited to any particular calculation method.
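Two of the candidate measures mentioned above, a match rate and an average weight, can be sketched as follows. Since the source leaves the calculation method open, these are only illustrative choices, and the function names are hypothetical.

```python
def overlap_ratio(caption_objects, search_terms):
    """Share of search terms that also appear among the caption's objects."""
    caption, terms = set(caption_objects), set(search_terms)
    return len(caption & terms) / len(terms) if terms else 0.0


def mean_weight(term_weights):
    """Average search weight over all object classification search terms."""
    if not term_weights:
        return 0.0
    return sum(term_weights.values()) / len(term_weights)


# Example using the weights from the "baby food" example in the text.
term_weights = {"home": 0.8, "baby": 0.8, "baby food": 0.8, "mom": 0.5}
ratio = overlap_ratio(["home", "baby", "baby food"], term_weights)
average = mean_weight(term_weights)
```

For this example three of the four search terms occur in the caption, and the average weight is a little below the high weight because of the unconfirmed "mom" term.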

In addition, the search database unit 160 may adjust the search accuracy range of the video and image search results based on the calculated accuracy of the caption sentence data.

For more accurate sentence generation and sentence indexing, the mapping table constructor 165 may map, store, and manage as a single item of search group data: the image information of an image or video, the caption sentence data that the caption generator 130 generates for that image or video, the object classification search information configured by the object classification search information constructor 140, and the sentence classification accuracy.

Once the search database unit 160 is configured in this way, a natural language search query from the user terminal 200 is received through the search query input unit 150. In response to the query, the search database unit 160 looks up the combination of caption sentence data and object classification search information mapped by the mapping table constructor 165 to identify the matching search group data, extracts the image or video information included in the identified search group data, and outputs it to the search result output unit 170.
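The lookup over search group data can be sketched as below. The record layout, the scoring rule (one point per caption match plus the object weight), and the names `SearchGroup` and `search` are assumptions made for illustration; the source does not define how the combined lookup is scored.

```python
from dataclasses import dataclass


@dataclass
class SearchGroup:
    """One mapped record: media, caption terms, weighted objects, accuracy."""
    media_id: str
    caption_terms: list
    object_weights: dict
    accuracy: float


def search(query_terms, groups, min_accuracy=0.0):
    """Rank media by matches against caption terms and weighted object terms."""
    scored = []
    for group in groups:
        if group.accuracy < min_accuracy:
            continue  # outside the configured accuracy range
        score = 0.0
        for term in query_terms:
            if term in group.caption_terms:
                score += 1.0
            score += group.object_weights.get(term, 0.0)
        if score > 0:
            scored.append((score, group.media_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [media_id for _, media_id in scored]


# Example database built from the two captions used in the text.
groups = [
    SearchGroup("img_bike", ["bicycle", "ride", "child"],
                {"bicycle": 0.8, "child": 0.8, "road": 0.5}, 0.7),
    SearchGroup("img_baby", ["home", "baby", "eat", "baby food"],
                {"home": 0.8, "baby": 0.8, "baby food": 0.8, "mom": 0.5}, 0.725),
]
hits = search(["child", "bicycle"], groups)
```

Because the lookup runs entirely on text terms already attached to each record, no image features need to be recomputed at query time.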

Here, the search database unit 160 may index the caption sentence data and the object classification search information according to a preset accuracy range, in response to a natural language search sentence entered as voice or text through the search query input unit 150. Because this approach performs a text-based lookup over caption sentence data that has already been constructed, the learning and indexing processes are simpler and faster than in existing methods that index features of the image itself as keywords, and because the search is suited to natural language, it yields more accurate search results.

Accordingly, the search result output unit 170 can compare the search query input to the search word input unit 150 with the data held in the search database unit 160 and output the matching video or image to the user terminal 200.
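The text-based lookup described above can be sketched as follows. This is an illustrative assumption, not the disclosed implementation: the function names `tokenize` and `search`, the dictionary layout of each search group, and the scoring rule (caption token overlap plus matched object-term weights, scaled by the stored accuracy) are all hypothetical. Whitespace-level tokenization is also a simplification; Korean queries would normally need morphological analysis.

```python
import re
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def search(query: str, groups: List[Dict], top_k: int = 3) -> List[Tuple[str, float]]:
    """Score each search group against the query and return the best matches."""
    query_tokens = tokenize(query)
    results = []
    for group in groups:
        # Words shared by the query and the caption sentence data.
        caption_hits = len(query_tokens & tokenize(group["caption"]))
        # Weights of object terms that literally appear in the query.
        object_score = sum(
            weight for term, weight in group["object_terms"].items()
            if term in query_tokens
        )
        score = (caption_hits + object_score) * group["accuracy"]
        if score > 0:
            results.append((group["media_ref"], score))
    results.sort(key=lambda item: item[1], reverse=True)
    return results[:top_k]


# Hypothetical search groups with English captions for illustration.
groups = [
    {"media_ref": "photo1.jpg", "caption": "two children looking at a cell phone",
     "object_terms": {"children": 1.0, "phone": 1.0}, "accuracy": 1.0},
    {"media_ref": "photo2.jpg", "caption": "a grandfather taking medicine",
     "object_terms": {"grandfather": 1.0, "medicine": 1.0}, "accuracy": 1.0},
]
results = search("Find a photo of my two children looking at their cell phones", groups)
```

With these sample groups, `photo1.jpg` ranks first because its caption shares the most words with the query.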

FIG. 4 is a flowchart illustrating a video image search process according to an embodiment of the present invention.

First, at the request of the user terminal, the video image providing device 100 collects image information for generating caption sentence data (S101).

Here, at the request of the user terminal 200, the video image providing device 100 may provide an interface through which image information can be collected from the user terminal 200.

Next, the video image providing device 100 inputs visual data analysis information of the collected image information into a pre-built natural language processing model to generate caption sentence data describing the image information (S103).

Here, the caption generator 130 of the video image providing device 100 may input object identification information, extracted from the visual data analysis information processed by the visual data analysis information processing unit 120, into a natural language processing model pre-built for caption generation, thereby generating caption sentence data describing the image information.

Next, the video image providing device 100 compares the visual data analysis information of the collected image information with the caption sentence data and configures object classification search information corresponding to the caption sentence data (S105).

Here, the object classification search information configuration unit 140 of the video image providing device 100 may compare the object identification information identified in the visual data analysis information with the caption sentence data and configure the object classification search information according to whether the caption sentence data contains that object identification information.
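One way to picture this comparison is a sketch that gives each detected object a search weight depending on whether it also appears in the caption. The function name `build_object_search_info` and the two weight values are assumptions chosen for illustration. This document states only that weights differ per object, not what the values are.

```python
import re
from typing import Dict, Iterable


def build_object_search_info(
    detected_objects: Iterable[str],
    caption: str,
    in_caption_weight: float = 1.0,     # assumed weight: object also named in caption
    detected_only_weight: float = 0.5,  # assumed weight: object detected but not captioned
) -> Dict[str, float]:
    """Map each detected object label to a search weight."""
    caption_tokens = set(re.findall(r"\w+", caption.lower()))
    info: Dict[str, float] = {}
    for label in detected_objects:
        key = label.lower()
        # An object confirmed by the caption gets the higher weight.
        info[key] = in_caption_weight if key in caption_tokens else detected_only_weight
    return info


# Hypothetical detector output and caption.
info = build_object_search_info(
    ["child", "phone", "sofa"], "A child is looking at a phone."
)
```

Here `child` and `phone` receive the higher weight because the caption mentions them, while `sofa` is kept as a searchable term with the lower weight.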

Next, the video image providing device 100 uses the caption sentence data and the object classification search information to calculate a sentence classification accuracy and to configure a search database for natural language search of images or videos (S107).

Here, the search database unit 160 may calculate the sentence classification accuracy using the caption sentence data and the object classification search information.
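The claims describe this accuracy as based on an overlap ratio of objects between the caption sentence data and the object classification search information, without giving a formula. The sketch below assumes one plausible reading: the fraction of object terms that also appear in the caption. The function name and this exact ratio are assumptions.

```python
import re
from typing import Iterable


def sentence_classification_accuracy(caption: str, object_terms: Iterable[str]) -> float:
    """Return the share of object terms that also occur in the caption (0.0 to 1.0)."""
    terms = {term.lower() for term in object_terms}
    if not terms:
        # No objects to compare against, so no overlap can be measured.
        return 0.0
    caption_tokens = set(re.findall(r"\w+", caption.lower()))
    return len(terms & caption_tokens) / len(terms)


# Two of the four hypothetical object terms appear in the caption.
accuracy = sentence_classification_accuracy(
    "A grandfather is taking medicine.",
    ["grandfather", "medicine", "cup", "table"],
)
```

Under this assumed definition, a caption that names every detected object scores 1.0, and a caption that names none scores 0.0.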

In addition, the search database unit 160 may configure a search database for natural language search of images or videos, and may include a sentence classification accuracy calculation unit 163 and a mapping table configuration unit 165.

Next, the user terminal 200 receives a search sentence as natural-language voice or text input and transmits it to the video image providing device 100 (S109).

Here, the search database unit 160 may further include a search word input unit 150 that receives the natural-language voice or text search sentence from the user terminal 200.

Next, the video image providing device 100 indexes the caption sentence data corresponding to the input natural-language search sentence, using the accuracy based on the object classification information (S111).

Here, the sentence classification accuracy calculated by the sentence classification accuracy calculation unit 163 may be used to set the search range and priority so that the results correspond to what the user wants.
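As a rough illustration of setting a search range and priority from the accuracy, the sketch below drops candidates under a threshold and orders the rest from most to least accurate. The function name `rank_candidates`, the tuple layout, and the default threshold of 0.5 are assumptions; this document does not specify how the range is chosen.

```python
from typing import List, Tuple


def rank_candidates(
    candidates: List[Tuple[str, float]],
    min_accuracy: float = 0.5,  # assumed lower bound of the search range
) -> List[Tuple[str, float]]:
    """Filter (media_ref, accuracy) pairs by a threshold and sort best-first."""
    in_range = [item for item in candidates if item[1] >= min_accuracy]
    # Higher accuracy means higher priority in the result list.
    return sorted(in_range, key=lambda item: item[1], reverse=True)


# Hypothetical candidates with their stored accuracies.
ranked = rank_candidates([("a.jpg", 0.4), ("b.jpg", 0.9), ("c.jpg", 0.6)])
```

Lowering `min_accuracy` widens the search range at the cost of including captions that agree less with the detected objects.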

Thereafter, the video image providing device 100 outputs the indexed caption sentence data and the image information mapped to that caption sentence data as the search result (S113).

Based on the output image information, one or more images or videos may be displayed as search results on the user terminal 200.

FIG. 5 is a diagram illustrating the output of search results according to an embodiment of the present invention.

Referring to FIG. 5, the user terminal 200 according to an embodiment of the present invention may receive, from a separate device, an alarm service that uses the caption data configuration. For the alarm service, caption data and the corresponding image or video information may be configured, and the video image providing device 100 may build the search database 160 in advance from the caption data and the corresponding image or video information.

As shown in FIG. 5, when the user enters a query such as “Find a photo of my two children looking at their cell phones” or “Find a photo of my grandfather taking his medicine,” the natural-language search data may be input to the search word input unit 150 of the video image providing device 100.

The video image providing device 100 then indexes the caption sentence data corresponding to each query, using the accuracy based on the object classification search information, and may output Photo 1 and Photo 2, respectively, to the user terminal 200 as search results, based on the image information mapped to the indexed caption sentence data.

The method according to the present invention described above may be implemented as a program to be executed on a computer and stored on a computer-readable recording medium. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices.

The computer-readable recording medium may also be distributed over network-connected computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the method can be readily inferred by programmers skilled in the art to which the present invention pertains.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications can of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or scope of the present invention.

100: video image search device 200: user terminal
300: relay server 400: web server

Claims

A video image search device comprising:
an image information collection unit that collects image information;
a caption generator that inputs visual data analysis information of the collected image information into a pre-built natural language processing model to generate caption sentence data describing the image information;
an object classification search information configuration unit that compares the visual data analysis information of the collected image information with the caption sentence data and configures object classification search information corresponding to the caption sentence data; and
a search database unit that configures a search database for natural language search of images or videos using the caption sentence data and the object classification search information,
wherein the object classification search information configuration unit comprises
an information-specific weight setting unit that compares object information extracted from the caption sentence data with object identification information included in the visual data analysis information and assigns a different search weight to the object classification search information for each object.

The video image search device of claim 1,
wherein the caption generator comprises:
a synonym DB component that extracts synonyms for each word included in the caption sentence data and adds them to the object classification search information; and
a face recognition DB component that extracts a face recognition word for a specific object from the visual data analysis information and adds it to the object classification search information.

The video image search device of claim 1,
wherein the search database unit comprises
a sentence classification accuracy calculation unit that calculates an accuracy for each item of caption sentence data based on an overlap ratio of objects between the caption sentence data and the object classification search information.

The video image search device of claim 3,
wherein the search database unit
adjusts the accuracy range of video image search results based on the calculated accuracy of each item of caption sentence data.

The video image search device of claim 1,
further comprising a search result output unit that outputs a natural language search result of an image or video using the caption sentence data and the object classification search information.

A method of operating a video image search device, the method comprising:
collecting image information;
inputting visual data analysis information of the collected image information into a pre-built natural language processing model to generate caption sentence data describing the image information;
comparing the visual data analysis information of the collected image information with the caption sentence data, and configuring object classification search information corresponding to the caption sentence data; and
configuring a search database for natural language search of images or videos using the caption sentence data and the object classification search information,
wherein configuring the object classification search information comprises
comparing object information extracted from the caption sentence data with object identification information included in the visual data analysis information, and assigning a different search weight to the object classification search information for each object.

The method of claim 6,
wherein generating the caption sentence data comprises:
extracting synonyms for each word included in the caption sentence data and adding them to the object classification search information; and
extracting a face recognition word for a specific object from the visual data analysis information and adding it to the object classification search information.

The method of claim 6,
wherein configuring the search database comprises
calculating an accuracy for each item of caption sentence data based on an overlap ratio of objects between the caption sentence data and the object classification search information.

The method of claim 8,
further comprising adjusting the accuracy range of video image search results based on the calculated accuracy of each item of caption sentence data.