WO2021040101A1

WO2021040101A1 - Real-time distributed indexing system and method for high-performance query and response

Info

Publication number: WO2021040101A1
Application number: PCT/KR2019/011163
Authority: WO
Inventors: 박진영; 최병은
Original assignee: 주식회사 나눔기술
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2021-03-04

Abstract

Disclosed are a real-time distributed indexing system and method for high-performance query and response, the system comprising: a conversion unit which analyzes unstructured data including any one of text, an image, audio, and video, and converts the unstructured data into structured data; an index unit which, when pre-stored index data for the converted structured data exists, generates search data for the pre-stored index data; a parallel processing unit which classifies the generated search data as relevant data having relevancy, sorts and merges the classified relevant data, and processes the classified relevant data as correct answer candidate data in a distributed manner; and an extraction unit which filters the correct answer candidate data, that has been processed in a distributed manner, on the basis of any one of a user preference, a relevancy factor, and a search engine, and extracts correct answer data.

Description

Real-time distributed indexing system and method for high-performance query and response

The present invention relates to a real-time distributed indexing system and method for high-performance query and response, and more particularly, to a real-time distributed indexing system and method for high-performance query and response based on one or more analysis modules and MapReduce modules configured in parallel.

As social network services (SNS) and mobile Internet services have become widespread, a large amount of data is created and distributed on the Internet, and companies that operate search engines and web portals collect and process this data to provide Internet users with query-and-response services.

However, existing services that handle data queries and responses have difficulty processing explosively increasing data in real time.

Recently, considerable research has been conducted on distributed parallel processing of large-capacity data, which can process explosively increasing data in real time, and this technology is attracting attention.

However, distributed parallel processing technology using the MapReduce model is designed for one-time processing of data. Because it reads and processes the data allocated to the Map function from start to finish, it scans the entire input data every time, which can cause performance degradation.

Therefore, there is a need for a system and method that can effectively process a large amount of data without performance degradation and process query responses in real time.

In addition, Internet users share and exchange unstructured data including not only text but also images, audio, and video through social network services, so unstructured data can play a central role. Therefore, there is also a need for a system and method capable of effectively analyzing queries on unstructured data related to image, audio, and video.

An embodiment of the present invention provides a real-time distributed indexing system and method for high-performance query and response that process query responses at real-time speed by performing indexing when processing queries using a MapReduce module.

An embodiment of the present invention provides a real-time distributed indexing system and method for high-performance query and response that effectively analyzes queries on unstructured data using an analysis module that processes unstructured data.

An embodiment of the present invention provides a real-time distributed indexing system and method for high-performance queries and responses using filtering means to improve reliability of correct answer data extracted in response to a query.

A real-time distributed indexing system for high-performance query and response according to an embodiment of the present invention includes: a conversion unit that analyzes unstructured data including any one of text, image, audio, and video and converts it into structured data; an index unit that, when pre-stored index data for the converted structured data exists, generates search data for the pre-stored index data; a parallel processing unit that classifies the generated search data as related data having relevance, sorts and merges the classified related data, and processes it in a distributed manner as correct answer candidate data; and an extraction unit that extracts correct answer data by filtering the distributed-processed correct answer candidate data based on any one of a user preference, a relevance factor, and a search engine.

The conversion unit may include: a determination unit that determines the type of the unstructured data as any one of the text, the image, the audio, and the video; a language pattern analysis unit that, when the type of the unstructured data is determined to be the text, analyzes the text using language pattern recognition; an image pattern analysis unit that, when the type of the unstructured data is determined to be the image, analyzes the image using image pattern recognition; a voice pattern analysis unit that, when the type of the unstructured data is determined to be the audio, analyzes the audio using voice pattern recognition; and a video pattern analysis unit that, when the type of the unstructured data is determined to be the video, analyzes the video using video pattern recognition including the image pattern recognition and the voice pattern recognition.

When there is no pre-stored index data for the converted structured data, the index unit may generate index data for the converted structured data and may generate search data for the generated index data.

The real-time distributed index system for high-performance query and response according to an embodiment of the present invention may further include a storage unit for storing the generated index data and the pre-stored index data.

The parallel processing unit may include a map processing unit that classifies the search data as the related data having relevance, and a reduce processing unit that sorts and merges the related data and processes it in a distributed manner as the correct answer candidate data.

The real-time distributed indexing method for high-performance query and response according to an embodiment of the present invention includes the steps of: analyzing unstructured data including any one of text, image, audio, and video and converting it into structured data; when pre-stored index data for the converted structured data exists, generating search data for the pre-stored index data; classifying the generated search data as related data having relevance, sorting and merging the classified related data, and processing it in a distributed manner as correct answer candidate data; and extracting correct answer data by filtering the distributed-processed correct answer candidate data based on any one of a user preference, a relevance factor, and a search engine.

According to an embodiment of the present invention, when processing a query using the MapReduce module, an index operation may be performed to provide a speed capable of processing a query response in real time.

According to an embodiment of the present invention, a query for unstructured data can be effectively analyzed using an analysis module that processes unstructured data.

According to an embodiment of the present invention, a filtering means may be used to improve the reliability of correct answer data extracted in response to a query.

FIG. 1 is a block diagram of a real-time distributed indexing system for high-performance query and response according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of a conversion unit.

FIG. 3 is a block diagram showing the configuration of a parallel processing unit.

FIG. 4 is an example of a configuration of a parallel processing unit.

FIG. 5 is a flowchart illustrating a real-time distributed indexing method for high-performance query and response according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings and the contents described in them, but the present invention is not restricted or limited by the embodiments.

Meanwhile, in describing the present invention, when it is determined that a detailed description of a related known function or configuration may unnecessarily obscure the subject matter of the present invention, a detailed description thereof will be omitted. In addition, terms used in the present specification are terms used to properly express an embodiment of the present invention, which may vary depending on the intention of users or operators, or customs in the field to which the present invention belongs. Therefore, definitions of these terms should be made based on the contents throughout the present specification.

Referring to FIG. 1, a real-time distributed indexing system 100 for high performance query and response according to the present invention includes a conversion unit 110, an index unit 120, a parallel processing unit 130, and an extraction unit 140.

The conversion unit 110 analyzes unstructured data including any one of text, image, audio, and video and converts it into structured data. Hereinafter, the conversion unit 110 for analyzing unstructured data and converting it into structured data will be described in detail with reference to FIG. 2.

FIG. 2 is a block diagram showing the configuration of the conversion unit. Referring to FIG. 2, the conversion unit 110 may include a determination unit 210, a language pattern analysis unit 220, an image pattern analysis unit 230, a voice pattern analysis unit 240, and a video pattern analysis unit 250.

According to one aspect of the present invention, the real-time distributed indexing system 100 for high-performance query and response may configure one or more conversion units 110 in parallel to analyze and convert unstructured data in real time and thereby obtain structured data.

The determination unit 210 may determine the type of unstructured data for any one of text, image, audio, and video.

For example, unlike numeric data, which has a fixed standard or form, the unstructured data may be data such as text, image, audio, and video. For example, unstructured data may be books, magazines, documents, audio information, video information, and data generated from social network services including e-mail, Twitter, and blogs.

When the type of unstructured data is determined as text, the language pattern analysis unit 220 may analyze the text using language pattern recognition.

In more detail, language pattern recognition can detect the national language in which the text is written and analyze the text by decomposing it into keyword units. For example, when the text is "When did Korea president meet the mayor of Seoul?", the language pattern analysis unit 220 may detect that the text is written in English, decompose the text into keyword units, and analyze it in a form such as "Time, Korea president, Meet, Mayor of Seoul".
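The keyword decomposition described above can be illustrated with a small rule-based sketch. The question-word table, the stopword list, and the function name `decompose_keywords` are assumptions made for illustration; the patent does not disclose the actual language pattern recognition, and this toy version emits single-word keywords rather than phrases such as "Mayor of Seoul".

```python
import re

# Hypothetical tables: the patent does not specify how keywords are derived.
QUESTION_WORDS = {"when": "time", "where": "place", "who": "person"}
STOPWORDS = {"did", "do", "does", "the", "of", "a", "an", "is", "was"}


def decompose_keywords(text):
    """Split an English question into keyword units.

    Interrogatives are mapped to the kind of answer they ask for
    (for example "when" becomes "time") and stopwords are dropped.
    """
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    keywords = []
    for token in tokens:
        if token in QUESTION_WORDS:
            keywords.append(QUESTION_WORDS[token])
        elif token not in STOPWORDS:
            keywords.append(token)
    return keywords


if __name__ == "__main__":
    print(decompose_keywords("When did Korea president meet the mayor of Seoul?"))
```

A production analyzer would also need language detection and phrase chunking, which this sketch leaves out.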

When the type of unstructured data is determined as an image, the image pattern analysis unit 230 may analyze the image using image pattern recognition.

In more detail, the image pattern analysis unit 230 may analyze characteristics of an image based on statistical information and a priori knowledge extracted from identified patterns in the image. For example, image pattern recognition may be pattern recognition capable of discriminating shades, color relationships, shapes, and the like displayed on an image.

When the type of unstructured data is determined to be audio, the voice pattern analysis unit 240 may analyze the audio using voice pattern recognition.

In more detail, the voice pattern analysis unit 240 may encode the voice, compare the encoded voice with selected standard voice patterns, and match the standard pattern closest to the encoded voice. For example, the voice pattern recognition may operate on any one of keyword units, phoneme units, and sentence units of the voice.

When the type of unstructured data is determined to be video, the video pattern analysis unit 250 may analyze the video using video pattern recognition including image pattern recognition and voice pattern recognition.
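The division of labor inside the conversion unit 110 can be sketched as a type check followed by a dispatch to one analyzer per type. Everything concrete here is an assumption for illustration: the extension table, the placeholder analyzers, and the payload shapes are hypothetical, since the patent does not say how the type is determined or what the analyzers output.

```python
from pathlib import Path

# Hypothetical extension table standing in for the determination unit's logic.
EXTENSION_TYPES = {
    ".txt": "text", ".png": "image", ".jpg": "image",
    ".wav": "audio", ".mp3": "audio", ".mp4": "video",
}


def determine_type(filename):
    """Determination unit 210: classify input as text, image, audio, or video."""
    return EXTENSION_TYPES.get(Path(filename).suffix.lower(), "unknown")


def analyze_text(payload):
    # Placeholder for language pattern recognition (unit 220).
    return {"type": "text", "features": payload.split()}


def analyze_image(payload):
    # Placeholder for image pattern recognition (unit 230).
    return {"type": "image", "features": ["image-pattern:%d-bytes" % len(payload)]}


def analyze_audio(payload):
    # Placeholder for voice pattern recognition (unit 240).
    return {"type": "audio", "features": ["voice-pattern:%d-bytes" % len(payload)]}


def analyze_video(payload):
    # Unit 250: video recognition combines image and voice recognition.
    frames = analyze_image(payload["frames"])
    track = analyze_audio(payload["audio"])
    return {"type": "video", "features": frames["features"] + track["features"]}


ANALYZERS = {
    "text": analyze_text, "image": analyze_image,
    "audio": analyze_audio, "video": analyze_video,
}


def convert(filename, payload):
    """Conversion unit 110: route the payload to the analyzer for its type."""
    kind = determine_type(filename)
    if kind not in ANALYZERS:
        raise ValueError("unsupported unstructured data type: " + filename)
    return ANALYZERS[kind](payload)
```

For example, `convert("question.txt", "hello world")` returns a record whose `features` list holds the two words, and a video payload yields the image and voice features concatenated.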

Referring back to FIG. 1, when pre-stored index data for the converted structured data exists, the index unit 120 generates search data for pre-stored index data.

In general, index data for structured data may be generated in a binary format. In more detail, the index data may be expressed as a number of binary bits rather than text or characters to express the contents of the unstructured data, and the index data may be data including an index generated from a specific record of the structured data.
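One possible reading of index data "expressed as a number of binary bits" is a bitmap index, in which each value maps to a bit string with one bit per record. The sketch below follows that reading purely as an assumption; the patent does not specify the encoding, and the function names are hypothetical.

```python
def build_bitmap_index(records):
    """Map each value to an int whose bit i is set when record i contains it."""
    bitmaps = {}
    for position, record in enumerate(records):
        for value in record:
            bitmaps[value] = bitmaps.get(value, 0) | (1 << position)
    return bitmaps


def matching_positions(bitmap):
    """Decode a bitmap back into the list of record positions it marks."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


if __name__ == "__main__":
    records = [["korea", "president"], ["seoul", "mayor"], ["korea", "seoul"]]
    index = build_bitmap_index(records)
    # Records containing both "korea" and "seoul": bitwise AND of the two bitmaps.
    print(matching_positions(index["korea"] & index["seoul"]))
```

With this encoding, combining conditions reduces to bitwise operations on integers rather than comparisons of text.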

When there is no pre-stored index data for the converted structured data, the index unit 120 may generate index data for the converted structured data and may generate search data for the generated index data.

According to one aspect of the present invention, when pre-stored index data does not exist, the records in the converted structured data must be read once in order to generate the index data, so the structured data itself may become the search data.
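The branch between reusing pre-stored index data and generating new index data can be sketched as a lookup with a build-on-miss fallback. The inverted-index layout, the `keywords` field, and the names `build_index` and `generate_search_data` are assumptions for illustration, not details given in the patent.

```python
def build_index(records):
    """Read every record once and build keyword -> record positions."""
    index = {}
    for position, record in enumerate(records):
        for keyword in record["keywords"]:
            index.setdefault(keyword, []).append(position)
    return index


def generate_search_data(dataset_id, records, index_store, query_keywords):
    """Index unit 120: reuse a stored index if present, otherwise build and store one."""
    index = index_store.get(dataset_id)
    if index is None:
        # No pre-stored index data: generate it, which requires one full read.
        index = build_index(records)
        index_store[dataset_id] = index
    positions = set()
    for keyword in query_keywords:
        positions.update(index.get(keyword, []))
    # Search data: only the records reachable through the index.
    return [records[p] for p in sorted(positions)]
```

After the first call for a given `dataset_id`, later queries consult the stored index and no longer scan all records, which is the speed-up the description attributes to indexing.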

The real-time distributed indexing system 100 for high-performance query and response according to the present invention may further include a storage unit 150 for storing generated index data and pre-stored index data.

The storage unit 150 may store search data in block units and may be implemented with a Hadoop Distributed File System (HDFS) structure.

The Hadoop Distributed File System distributes and stores the generated index data, the pre-stored index data, and the search data, and thereby provides quick access to them.
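Block-unit storage can be pictured with a minimal sketch that cuts a payload into fixed-size blocks and spreads them over storage nodes. This is not the HDFS API: the round-robin placement, the node names, and both function names are hypothetical, and real HDFS placement is replica-aware and uses far larger blocks.

```python
def split_into_blocks(payload, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [payload[i:i + block_size] for i in range(0, len(payload), block_size)]


def assign_blocks(blocks, nodes):
    """Place block numbers on nodes in round-robin order (illustrative only)."""
    placement = {node: [] for node in nodes}
    for number, _block in enumerate(blocks):
        placement[nodes[number % len(nodes)]].append(number)
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefgh", 3)
    print(blocks)
    print(assign_blocks(blocks, ["node-1", "node-2"]))
```

Because each node holds only some of the blocks, several blocks can be read at the same time, which is the access benefit the description refers to.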

The parallel processing unit 130 classifies the generated search data as related data having relevance, sorts and merges the classified related data, and processes it in a distributed manner as correct answer candidate data. Hereinafter, the parallel processing unit 130 will be described in detail with reference to FIGS. 3 and 4.

FIG. 3 is a block diagram showing the configuration of the parallel processing unit. Referring to FIG. 3, the parallel processing unit 130 may include a map processing unit 310 and a reduce processing unit 320.

According to one aspect of the present invention, the real-time distributed indexing system 100 for high-performance query and response may configure one or more parallel processing units 130 in a distributed manner to classify the generated search data as related data in real time and, by sorting and merging the related data, obtain correct answer candidate data at high processing speed.

The map processing unit 310 may classify the search data as related data having relevance. More specifically, the map processing unit 310 classifies or divides the search data into related data based on an intermediate key and value for the search data, which can improve processing speed and reduce the load on the system.

The reduce processing unit 320 may sort and merge the related data and process it in a distributed manner as correct answer candidate data. In more detail, the reduce processing unit 320 may sort and merge the related data based on an intermediate key and a value for the related data to produce the correct answer candidate data.
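The roles of the map processing unit 310 and the reduce processing unit 320 can be sketched in a few lines of single-process code. The record fields `topic`, `score`, and `text` are hypothetical stand-ins for the intermediate key and value, which the patent does not define in detail.

```python
from collections import defaultdict


def map_phase(search_data):
    """Map processing unit 310: emit (intermediate key, value) pairs."""
    for item in search_data:
        yield item["topic"], item


def shuffle(pairs):
    """Group values that share an intermediate key, i.e. the related data."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce processing unit 320: sort each group and merge it into candidates."""
    candidates = []
    for key in sorted(groups):
        ranked = sorted(groups[key], key=lambda v: v["score"], reverse=True)
        candidates.append({"key": key, "answers": [v["text"] for v in ranked]})
    return candidates


if __name__ == "__main__":
    search_data = [
        {"topic": "seoul", "score": 0.4, "text": "a"},
        {"topic": "korea", "score": 0.9, "text": "b"},
        {"topic": "seoul", "score": 0.7, "text": "c"},
    ]
    print(reduce_phase(shuffle(map_phase(search_data))))
```

Grouping by key before reducing means each group can be handed to a separate reduce unit, which is what makes the distributed variant possible.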

FIG. 4 is an example of a configuration of the parallel processing unit. Referring to FIG. 4, the parallel processing unit 130 may include one or more map processing units 310 configured in parallel and one or more reduce processing units 320 configured in parallel.

According to one aspect of the present invention, the parallel processing unit 130 may include one or more map processing units 310 and one or more reduce processing units 320, and may be implemented in highly portable programming languages such as Java, Ruby, Python, and C++.
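Running several map processing units side by side can be sketched with the standard library's `concurrent.futures`. Keyword counting stands in for the classification step, and the block layout and function names are assumptions; a thread pool is used only to keep the sketch self-contained, whereas the described system would distribute the work across a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_block(block):
    """One map processing unit: count keyword occurrences in its own block."""
    return Counter(word for record in block for word in record.split())


def reduce_counts(partials):
    """Reduce step: merge partial counts and sort by count, then by keyword."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


def run_parallel(blocks, workers=4):
    """Apply map_block to every block concurrently, then reduce the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_block, blocks))
    return reduce_counts(partials)


if __name__ == "__main__":
    blocks = [["seoul mayor", "korea president"], ["seoul korea seoul"]]
    print(run_parallel(blocks))
```

Each block is processed independently, so adding map units shortens the map phase until the merge in the reduce step becomes the limiting factor.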

Referring back to FIG. 1, the extraction unit 140 extracts correct answer data by filtering the distributed-processed correct answer candidate data based on any one of a user preference, a relevance factor, and a search engine.

More specifically, the extraction unit 140 may extract correct answer data by filtering based on a user preference derived from historical data related to user feedback on previous correct answer candidate data; by filtering based on a relevance factor according to which the correct answer candidate data is sequentially ordered by relevance; or by filtering based on a search engine built in advance on the web (e.g., a portal site such as Google, Yahoo, Naver, or Daum).
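Two of the three filtering criteria can be sketched without external services. The candidate fields `source` and `relevance`, the feedback format, and the threshold are hypothetical; filtering against a web search engine is left out because it would depend on an external API that the patent does not describe.

```python
def filter_by_user_preference(candidates, feedback_history):
    """Keep candidates from sources the user rated positively in the past."""
    liked = {source for source, rating in feedback_history.items() if rating > 0}
    return [c for c in candidates if c["source"] in liked]


def filter_by_relevance(candidates, threshold=0.5):
    """Order candidates by relevance and keep those at or above the threshold."""
    ranked = sorted(candidates, key=lambda c: c["relevance"], reverse=True)
    return [c for c in ranked if c["relevance"] >= threshold]


FILTERS = {
    "user_preference": filter_by_user_preference,
    "relevance": filter_by_relevance,
}


def extract_answers(candidates, criterion, **options):
    """Extraction unit 140: apply one filtering criterion to the candidates."""
    if criterion not in FILTERS:
        raise ValueError("unknown filtering criterion: " + criterion)
    return FILTERS[criterion](candidates, **options)
```

For instance, `extract_answers(candidates, "relevance", threshold=0.5)` returns the candidates in descending order of relevance with low-scoring ones removed.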

Referring to FIG. 5, in the real-time distributed indexing method for high-performance query and response according to the present invention, in step 510, unstructured data including any one of text, image, voice, and video is analyzed and converted into structured data.

According to one aspect of the present invention, step 510 may include: determining the type of unstructured data as any one of text, image, audio, and video; when the type of unstructured data is determined to be text, analyzing the text using language pattern recognition; when the type of unstructured data is determined to be image, analyzing the image using image pattern recognition; when the type of unstructured data is determined to be audio, analyzing the audio using voice pattern recognition; and when the type of unstructured data is determined to be video, analyzing the video using video pattern recognition including image pattern recognition and voice pattern recognition.

In step 520 of the real-time distributed indexing method for high-performance query and response of the present invention, when pre-stored index data for the converted structured data exists, search data for the pre-stored index data is generated.

According to one aspect of the present invention, when there is no pre-stored index data for the converted structured data, the method may generate index data for the converted structured data and generate search data for the generated index data.

In step 530, the generated search data is classified as related data having relevance, the classified related data is sorted and merged, and it is processed in a distributed manner as correct answer candidate data. In step 540, correct answer data is extracted by filtering the distributed-processed correct answer candidate data based on any one of a user preference, a relevance factor, and a search engine.

The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine language code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware device described above may be configured to operate as one or more software modules to perform the operation of the embodiment, and vice versa.

Although the embodiments have been described with reference to the limited embodiments and drawings as described above, various modifications and variations can be made from the above description by those of ordinary skill in the art. For example, an appropriate result can be achieved even if the described techniques are performed in an order different from the described method, and/or components such as the described systems, structures, devices, and circuits are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and those equivalent to the claims also fall within the scope of the claims to be described later.

Claims

1. A real-time distributed indexing system for high-performance query and response, comprising:

a conversion unit that analyzes unstructured data including any one of text, image, audio, and video and converts it into structured data;

an index unit that, when pre-stored index data for the converted structured data exists, generates search data for the pre-stored index data;

a parallel processing unit that classifies the generated search data as related data having relevance, sorts and merges the classified related data, and processes it in a distributed manner as correct answer candidate data; and

an extraction unit that extracts correct answer data by filtering the distributed-processed correct answer candidate data based on any one of a user preference, a relevance factor, and a search engine.

2. The real-time distributed indexing system for high-performance query and response of claim 1, wherein the conversion unit comprises:

a determination unit that determines the type of the unstructured data as any one of the text, the image, the audio, and the video;

a language pattern analysis unit that analyzes the text using language pattern recognition when the type of the unstructured data is determined to be the text;

an image pattern analysis unit that analyzes the image using image pattern recognition when the type of the unstructured data is determined to be the image;

a voice pattern analysis unit that analyzes the audio using voice pattern recognition when the type of the unstructured data is determined to be the audio; and

a video pattern analysis unit that, when the type of the unstructured data is determined to be the video, analyzes the video using video pattern recognition including the image pattern recognition and the voice pattern recognition.

3. The real-time distributed indexing system for high-performance query and response of claim 1, wherein the index unit, when pre-stored index data for the converted structured data does not exist, generates index data for the converted structured data and generates search data for the generated index data.

4. The real-time distributed indexing system for high-performance query and response of claim 3, further comprising a storage unit that stores the generated index data and the pre-stored index data.

5. The real-time distributed indexing system for high-performance query and response of claim 1, wherein the parallel processing unit comprises:

a map processing unit that classifies the search data as the related data having relevance; and

a reduce processing unit that sorts and merges the related data and processes it in a distributed manner as the correct answer candidate data.

6. A real-time distributed indexing method for high-performance query and response, comprising:

analyzing unstructured data including any one of text, image, audio, and video and converting it into structured data;

when pre-stored index data for the converted structured data exists, generating search data for the pre-stored index data;

classifying the generated search data as related data having relevance, sorting and merging the classified related data, and processing it in a distributed manner as correct answer candidate data; and

extracting correct answer data by filtering the distributed-processed correct answer candidate data based on any one of a user preference, a relevance factor, and a search engine.