JP7377915B2

JP7377915B2 - Method, computer device, and computer program for providing personalized data retrieval service

Info

Publication number: JP7377915B2
Application number: JP2022088502A
Authority: JP
Inventors: ジョンホパン; チャンヒョンイ
Original assignee: Naver Corp
Current assignee: Naver Corp
Priority date: 2021-06-02
Filing date: 2022-05-31
Publication date: 2023-11-10
Anticipated expiration: 2042-05-31
Also published as: JP2022185581A; KR20220162963A; KR102592785B1

Description

以下の説明は、個別データ検索サービスを提供する技術に関する。 The following description relates to techniques for providing personalized data search services.

テキスト文書データに対する検索は、極めて基本的で重要な演算であり、情報検索分野において広く使用されている。 Searching for text document data is an extremely basic and important operation, and is widely used in the information retrieval field.

検索エンジンとは、広義ではインターネット上で情報を収集して探索するシステムを意味し、主に、インターネット上のウェブページをクローリング（ｃｒａｗｌｉｎｇ）し、特定の検索語（ｑｕｅｒｙ）が入力されれば、該当の検索語と関連するウェブページを結果値で示すシステムを指す。 A search engine, in a broad sense, is a system that collects and searches information on the Internet. It mainly refers to a system that crawls web pages on the Internet and, when a specific search term (query) is entered, returns the web pages related to that term as results.

例えば、特許文献１（登録日２０１１年３月２日）は、クライアントに対するカスタム検索エンジンを提供する技術を開示している。 For example, Patent Document 1 (registration date: March 2, 2011) discloses a technique for providing a custom search engine to a client.

一般的に、検索には、ターム（ｔｅｒｍ）を索引する転置索引（ｉｎｖｅｒｔｅｄｉｎｄｅｘ）資料構造が使用される。既存の資料構造では１つの主キー（ｐｒｉｍａｒｙｋｅｙ）が複数のフィールドを指定しているとすれば、転置索引では１つの値（ｔｅｒｍ）で該当の値が含まれた文書番号を指定する。 Generally, an inverted index data structure that indexes terms is used for searching. Whereas in a conventional data structure one primary key designates multiple fields, in an inverted index one value (term) designates the numbers of the documents that contain that value.
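To make the term-to-document mapping concrete, the following is a minimal sketch of an inverted index. The whitespace tokenizer, the sample documents, and the name `build_inverted_index` are illustrative assumptions and do not come from the patent.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace tokenizer (assumption); real engines use analyzers.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample documents keyed by document number.
docs = {1: "quarterly report draft", 2: "meeting notes", 3: "report on meeting"}
index = build_inverted_index(docs)
print(index["report"])   # [1, 3]
print(index["meeting"])  # [2, 3]
```

A lookup is a single dictionary access, which is why the structure is fast at query time; the cost is paid when every incoming document has to be tokenized and merged into the index.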

一方、近年は、個人メール、個人ファイル、メッセンジャーチャットルームなどの個別データ内で検索を行うサービスが提供されている。 On the other hand, in recent years, services have been provided that allow searches within individual data such as personal emails, personal files, and messenger chat rooms.

転置索引は検索の応答速度に最適な資料構造ではあるが、個別データ検索サービスでは検索対象が全体文書のうちの極一部であるため、費用と資源を考慮すると転置索引資料構造は相応しくない。 Although an inverted index is the data structure best suited to search response speed, in an individual data search service the search target is only a very small part of the entire document set, so an inverted index data structure is not suitable when cost and resources are considered.

韓国登録特許第１０－１０２１０２２号公報 Patent Document 1: Korean Registered Patent No. 10-1021022

個別データ検索サービスに特化したエンジンとして、転置索引のない検索エンジンを提供する。 As an engine specializing in individual data search services, we provide a search engine without inverted indexes.

個別データ検索サービスの基本要求事項となる部分一致検索のためにフルスキャン（ｆｕｌｌｓｃａｎ）方式を適用するのと同時に、個別データ検索サービスの応答速度を満たすことのできる検索エンジンを提供する。 Provided is a search engine that applies a full scan method for partial match search, which is a basic requirement of individual data search services, while still satisfying the response speed required of such services.

コンピュータ装置で実行される個別データ検索方法であって、前記コンピュータ装置は、メモリに含まれるコンピュータ読み取り可能な命令を実行するように構成された少なくとも１つのプロセッサを含み、前記個別データ検索方法は、前記少なくとも１つのプロセッサにより、ユーザと関連する個別データに該当する検索対象文書をブロック単位のボリュームに圧縮して保存する段階、および前記少なくとも１つのプロセッサにより、検索要請に対応する複数のボリュームを並列にフルスキャン（ｆｕｌｌｓｃａｎ）検索する段階を含む、個別データ検索方法を提供する。 Provided is an individual data search method executed on a computer device, the computer device including at least one processor configured to execute computer-readable instructions contained in a memory, the individual data search method including: compressing, by the at least one processor, search target documents corresponding to individual data associated with a user into block-unit volumes and storing them; and performing, by the at least one processor, a full scan search in parallel on a plurality of volumes corresponding to a search request.

一側面によると、前記保存する段階は、前記検索対象文書を一定サイズのブロック単位で集めて圧縮することによって圧縮ボリュームを生成する段階を含んでよい。 According to one aspect, the storing step may include generating a compressed volume by collecting and compressing the search target documents in blocks of a certain size.

他の側面によると、前記保存する段階は、新規文書が流入する場合、前記新規文書を前記検索対象文書からなる増分ボリューム（ｉｎｃｒｅｍｅｎｔｖｏｌｕｍｅ）に付け足す（ａｐｐｅｎｄ）段階、および前記増分ボリュームを一定サイズのブロック単位に圧縮して圧縮ボリュームを生成する段階を含んでよい。 According to another aspect, the storing step may include, when a new document arrives, appending the new document to an increment volume made up of the search target documents, and compressing the increment volume in blocks of a certain size to generate a compressed volume.

また他の側面によると、前記保存する段階は、前記圧縮ボリュームが生成された後に既存の文書が削除される場合、前記既存の文書に対する削除情報をマーキングする段階をさらに含み、前記マーキングされた文書は検索結果から除外してよい。 According to another aspect, when the existing document is deleted after the compressed volume is generated, the storing step further includes marking the existing document with deletion information, and the marked document may be excluded from search results.
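The append, compress, and delete-marking aspects above could be sketched as follows. This is only one possible reading: the class name `VolumeStore`, the JSON block layout, the use of `zlib`, the tiny block size, and the choice to also scan the not-yet-compressed increment volume are all assumptions made for illustration.

```python
import json
import zlib

class VolumeStore:
    """Toy store: an append-only increment volume plus compressed blocks."""

    def __init__(self, block_size=1 << 20):
        self.block_size = block_size  # bytes gathered before compressing
        self.increment = []           # [doc_id, text] pairs not yet compressed
        self.increment_bytes = 0
        self.compressed = []          # compressed blocks (bytes)
        self.deleted = set()          # ids of documents marked as deleted

    def append(self, doc_id, text):
        """Append a new document; compress once a block's worth has gathered."""
        self.increment.append([doc_id, text])
        self.increment_bytes += len(text.encode("utf-8"))
        if self.increment_bytes >= self.block_size:
            payload = json.dumps(self.increment).encode("utf-8")
            self.compressed.append(zlib.compress(payload))
            self.increment, self.increment_bytes = [], 0

    def mark_deleted(self, doc_id):
        """Record deletion info instead of rewriting a compressed block."""
        self.deleted.add(doc_id)

    def search(self, query):
        """Full scan of every block; marked documents are filtered out."""
        docs = []
        for block in self.compressed:
            docs.extend(json.loads(zlib.decompress(block)))
        docs.extend(self.increment)
        return [d for d, text in docs if query in text and d not in self.deleted]

store = VolumeStore(block_size=32)  # tiny block size so the demo compresses
store.append(1, "lunch meeting moved to Friday")
store.append(2, "Friday deadline for the report")
store.append(3, "see you Friday")
store.mark_deleted(2)
print(store.search("Friday"))  # [1, 3]
```

Marking rather than physically removing a document keeps compressed blocks immutable, so a deletion never forces a block to be decompressed and rewritten.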

また他の側面によると、前記検索する段階は、転置索引（ｉｎｖｅｒｔｅｄｉｎｄｅｘ）資料構造は使用せず、前記ブロック単位の圧縮ボリュームに対するフルスキャン方式によってクエリと部分一致する文書を検索してよい。 According to another aspect, the searching step may search for documents that partially match the query by a full scan of the block-unit compressed volumes, without using an inverted index data structure.

また他の側面によると、前記検索する段階は、前記複数のボリュームを並列にデコードする段階、および前記デコードされたボリュームを対象に文字列ファインド（ｆｉｎｄ）を並列に実行する段階を含んでよい。 According to another aspect, the searching step may include decoding the plurality of volumes in parallel, and performing a string find on the decoded volumes in parallel.

また他の側面によると、前記保存する段階は、サーバの二重化のために、複数のホストに前記個別データに対する複製ボリューム（ｒｅｐｌｉｃａｖｏｌｕｍｅ）を保存する段階を含んでよい。 According to another aspect, the storing step may include storing a replica volume of the individual data in a plurality of hosts for server duplication.

また他の側面によると、前記検索する段階は、前記検索要請に含まれたクエリと前記複数のボリューム内の文書をユニコード正規化する段階、および正規化された文字列を利用して照合（ｃｏｌｌａｔｉｏｎ）検索を行う段階を含んでよい。 According to another aspect, the searching step may include Unicode-normalizing the query included in the search request and the documents in the plurality of volumes, and performing a collation search using the normalized character strings.

また他の側面によると、前記保存する段階は、前記検索対象文書をユニコード正規化する段階を含み、前記検索する段階は、前記検索要請に含まれたクエリをユニコード正規化する段階、および正規化された文字列を利用して照合検索を行う段階を含んでよい。 According to another aspect, the storing step may include Unicode-normalizing the search target documents, and the searching step may include Unicode-normalizing the query included in the search request and performing a collation search using the normalized character strings.

さらに他の側面によると、前記保存する段階は、変換文字位置を示すオフセットと該当の位置の原本文字を含む変換テーブルを生成する段階をさらに含んでよい。 According to still another aspect, the storing step may further include generating a conversion table including an offset indicating a converted character position and an original character at the corresponding position.
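As a rough illustration of the normalization and conversion-table aspects, the sketch below normalizes a document character by character and records, for each converted position, the offset and the original character. The choice of NFKC plus case folding, the per-character processing (which does not compose combining sequences), and the names `normalize_with_table` and `restore` are assumptions; the text above does not specify a normalization form.

```python
import unicodedata

def normalize_with_table(text):
    """Return (normalized_text, table), where table maps an offset in the
    normalized text to the original character found at that position."""
    out, table = [], {}
    for ch in text:
        norm = unicodedata.normalize("NFKC", ch).casefold()
        if norm != ch:
            table[len(out)] = ch  # converted position -> original character
        out.extend(norm)
    return "".join(out), table

def restore(normalized, table, start, end):
    """Rebuild the original characters for normalized[start:end].
    Valid only when every conversion was one-to-one (assumption)."""
    return "".join(table.get(i, normalized[i]) for i in range(start, end))

doc = "Ｎａｖｅｒ Ｍａｉｌ 2023"  # full-width Latin letters
norm_doc, table = normalize_with_table(doc)
query = unicodedata.normalize("NFKC", "MAIL").casefold()
pos = norm_doc.find(query)
print(norm_doc)                                          # naver mail 2023
print(pos)                                               # 6
print(restore(norm_doc, table, pos, pos + len(query)))   # Ｍａｉｌ
```

Normalizing at storage time moves the conversion cost out of the query path, and a table of this kind lets a matched span be shown in its original form.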

前記検索方法をコンピュータ装置に実行させるためのコンピュータプログラムを提供する。 A computer program for causing a computer device to execute the search method is provided.

コンピュータ装置であって、メモリに含まれるコンピュータ読み取り可能な命令を実行するように構成された少なくとも１つのプロセッサを含み、前記少なくとも１つのプロセッサは、ユーザと関連する個別データに該当する検索対象文書をブロック単位のボリュームに圧縮して保存する文書保存部、および検索要請に対応する複数のボリュームを並列にフルスキャン検索する並列検索部を含む、コンピュータ装置を提供する。 Provided is a computer device including at least one processor configured to execute computer-readable instructions contained in a memory, the at least one processor including: a document storage unit that compresses search target documents corresponding to individual data associated with a user into block-unit volumes and stores them; and a parallel search unit that performs a full scan search in parallel on a plurality of volumes corresponding to a search request.

本発明の実施形態によると、個別データ検索サービスに特化したエンジンとして、転置索引資料構造を使用せずに個別データ検索サービスの応答速度を満たすことのできる検索エンジンを提供することができる。 According to embodiments of the present invention, it is possible to provide, as an engine specialized for individual data search services, a search engine that can satisfy the response speed of such services without using an inverted index data structure.

本発明の実施形態によると、検索対象となる文書をブロック単位の圧縮ボリュームで生成して圧縮ボリュームを並列にフルスキャン検索することにより、検索効率の高い、直観的な検索サービスを提供することができる。 According to embodiments of the present invention, it is possible to provide an intuitive search service with high search efficiency by generating the documents to be searched as block-unit compressed volumes and performing a full scan search on the compressed volumes in parallel.

本発明の一実施形態における、ネットワーク環境の例を示した図である。 FIG. 1 is a diagram showing an example of a network environment in an embodiment of the present invention.

本発明の一実施形態における、コンピュータ装置の例を示したブロック図である。 FIG. 2 is a block diagram showing an example of a computer device in an embodiment of the present invention.

本発明の一実施形態における、コンピュータ装置のプロセッサが含むことのできる構成要素の例を示した図である。 FIG. 3 is a diagram showing an example of components that a processor of a computer device may include in an embodiment of the present invention.

本発明の一実施形態における、コンピュータ装置が実行することのできる方法の例を示したフローチャートである。 FIG. 4 is a flowchart showing an example of a method that a computer device may execute in an embodiment of the present invention.

本発明の一実施形態における、入力／出力時間を減らす方法を説明するための例示図である。 FIG. 5 is an exemplary diagram for explaining a method of reducing input/output time in an embodiment of the present invention.

本発明の一実施形態における、索引の代わりをするボリューム生成過程を説明するための例示図である。 FIG. 6 is an exemplary diagram for explaining a volume generation process that replaces indexing in an embodiment of the present invention.

本発明の一実施形態における、索引の代わりをするボリューム生成過程を説明するための例示図である。 FIG. 7 is an exemplary diagram for explaining a volume generation process that replaces indexing in an embodiment of the present invention.

本発明の一実施形態における、ＣＰＵ時間を減らす方法を説明するための例示図である。 FIG. 8 is an exemplary diagram for explaining a method of reducing CPU time in an embodiment of the present invention.

本発明の一実施形態における、フルスキャン方式を利用した個別データ検索サービス構造を示した図である。 FIG. 9 is a diagram showing the structure of an individual data search service using a full scan method in an embodiment of the present invention.

本発明の一実施形態における、検索サーバの二重化を説明するための例示図である。 FIG. 10 is an exemplary diagram for explaining duplication of search servers in an embodiment of the present invention.

本発明の一実施形態における、照合（ｃｏｌｌａｔｉｏｎ）検索が必要とする正規化過程を説明するための例示図である。 FIG. 11 is an exemplary diagram for explaining the normalization process required for a collation search in an embodiment of the present invention.

以下、本発明の実施形態について、添付の図面を参照しながら詳しく説明する。 Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

本発明の実施形態は、個別データ検索サービスを提供する技術に関する。 Embodiments of the present invention relate to techniques for providing individual data search services.

本明細書で具体的に開示する事項を含む実施形態は、個別データ検索サービスに特化したエンジンとして転置索引（ｉｎｖｅｒｔｅｄｉｎｄｅｘ）のない検索エンジンを提供することができ、これにより、検索効率性、サービス直観性、費用節減などの側面において相当な長所を達成することができる。 Embodiments including the matters specifically disclosed in this specification can provide a search engine without an inverted index as an engine specialized for individual data search services, and can thereby achieve considerable advantages in aspects such as search efficiency, service intuitiveness, and cost savings.

本明細書において、個別データとは検索対象となる文書を意味し、特に、メールサービスで生成された個人メール文書、ドライブサービスで生成された個人ファイル、メッセージングサービスで生成された個人トークメッセージなどのようなユーザの個人文書を包括したものを意味してよい。 In this specification, individual data means the documents to be searched and, in particular, may collectively refer to a user's personal documents, such as personal mail documents generated by a mail service, personal files generated by a drive service, and personal talk messages generated by a messaging service.

本発明の実施形態に係る個別データ検索装置は、少なくとも１つのコンピュータ装置によって実現されてよく、本発明の実施形態に係る個別データ検索方法は、個別データ検索装置に含まれる少なくとも１つのコンピュータ装置によって実行されてよい。このとき、コンピュータ装置においては、本発明の一実施形態に係るコンピュータプログラムがインストールされて実行されてよく、コンピュータ装置は、実行されたコンピュータプログラムの制御にしたがって本発明の実施形態に係る個別データ検索方法を実行してよい。上述したコンピュータプログラムは、コンピュータ装置と結合して個別データ検索方法をコンピュータに実行させるためにコンピュータ読み取り可能な記録媒体に記録されてよい。 The individual data search device according to embodiments of the present invention may be realized by at least one computer device, and the individual data search method according to embodiments of the present invention may be executed by at least one computer device included in the individual data search device. A computer program according to an embodiment of the present invention may be installed and executed on the computer device, and the computer device may carry out the individual data search method according to embodiments of the present invention under the control of the executed computer program. The computer program may be recorded on a computer-readable recording medium so as to be combined with the computer device and cause the computer to execute the individual data search method.

図１は、本発明の一実施形態における、ネットワーク環境の例を示した図である。図１のネットワーク環境は、複数の電子機器１１０、１２０、１３０、１４０、複数のサーバ１５０、１６０、およびネットワーク１７０を含む例を示している。このような図１は、発明の説明のための一例に過ぎず、電子機器の数やサーバの数が図１のように限定されることはない。また、図１のネットワーク環境は、本実施形態に適用可能な環境のうちの一例を説明したものに過ぎず、本実施形態に適用可能な環境が図１のネットワーク環境に限定されることはない。 FIG. 1 is a diagram showing an example of a network environment in an embodiment of the present invention. The network environment of FIG. 1 shows an example including a plurality of electronic devices 110, 120, 130, 140, a plurality of servers 150, 160, and a network 170. FIG. 1 is only an example for explaining the invention, and the number of electronic devices and the number of servers are not limited to those shown in FIG. 1. Likewise, the network environment of FIG. 1 merely illustrates one of the environments applicable to this embodiment, and the applicable environments are not limited to the network environment of FIG. 1.

複数の電子機器１１０、１２０、１３０、１４０は、コンピュータ装置によって実現される固定端末や移動端末であってよい。複数の電子機器１１０、１２０、１３０、１４０の例としては、スマートフォン、携帯電話、ナビゲーション、ＰＣ（ｐｅｒｓｏｎａｌｃｏｍｐｕｔｅｒ）、ノート型ＰＣ、デジタル放送用端末、ＰＤＡ（ＰｅｒｓｏｎａｌＤｉｇｉｔａｌＡｓｓｉｓｔａｎｔ）、ＰＭＰ（ＰｏｒｔａｂｌｅＭｕｌｔｉｍｅｄｉａＰｌａｙｅｒ）、タブレットなどがある。一例として、図１では、電子機器１１０の例としてスマートフォンを示しているが、本発明の実施形態において、電子機器１１０は、実質的に無線または有線通信方式を利用し、ネットワーク１７０を介して他の電子機器１２０、１３０、１４０および／またはサーバ１５０、１６０と通信することができる多様な物理的なコンピュータ装置のうちの１つを意味してよい。 The plurality of electronic devices 110, 120, 130, and 140 may be fixed terminals or mobile terminals realized by computer devices. Examples of the electronic devices 110, 120, 130, and 140 include smartphones, mobile phones, navigation devices, PCs (personal computers), notebook PCs, digital broadcasting terminals, PDAs (Personal Digital Assistants), PMPs (Portable Multimedia Players), and tablets. Although FIG. 1 shows a smartphone as an example of the electronic device 110, in embodiments of the present invention the electronic device 110 may be any of various physical computer devices capable of communicating with the other electronic devices 120, 130, 140 and/or the servers 150, 160 via the network 170 using a wireless or wired communication method.

通信方式が限定されることはなく、ネットワーク１７０が含むことのできる通信網（一例として、移動通信網、有線インターネット、無線インターネット、放送網）を利用する通信方式だけではなく、機器間の近距離無線通信が含まれてもよい。例えば、ネットワーク１７０は、ＰＡＮ（ｐｅｒｓｏｎａｌａｒｅａｎｅｔｗｏｒｋ）、ＬＡＮ（ｌｏｃａｌａｒｅａｎｅｔｗｏｒｋ）、ＣＡＮ（ｃａｍｐｕｓａｒｅａｎｅｔｗｏｒｋ）、ＭＡＮ（ｍｅｔｒｏｐｏｌｉｔａｎａｒｅａｎｅｔｗｏｒｋ）、ＷＡＮ（ｗｉｄｅａｒｅａｎｅｔｗｏｒｋ）、ＢＢＮ（ｂｒｏａｄｂａｎｄｎｅｔｗｏｒｋ）、インターネットなどのネットワークのうちの１つ以上の任意のネットワークを含んでよい。さらに、ネットワーク１７０は、バスネットワーク、スターネットワーク、リングネットワーク、メッシュネットワーク、スター－バスネットワーク、ツリーまたは階層的ネットワークなどを含むネットワークトポロジのうちの任意の１つ以上を含んでもよいが、これらに限定されることはない。 The communication method is not limited, and may include not only communication methods that use the communication networks the network 170 can include (for example, a mobile communication network, wired Internet, wireless Internet, or a broadcasting network) but also short-range wireless communication between devices. For example, the network 170 may include any one or more of networks such as a PAN (personal area network), a LAN (local area network), a CAN (campus area network), a MAN (metropolitan area network), a WAN (wide area network), a BBN (broadband network), and the Internet. Furthermore, the network 170 may include any one or more network topologies, including but not limited to a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network.

サーバ１５０、１６０それぞれは、複数の電子機器１１０、１２０、１３０、１４０とネットワーク１７０を介して通信して命令、コード、ファイル、コンテンツ、サービスなどを提供する１つ以上のコンピュータ装置によって実現されてよい。例えば、サーバ１５０は、ネットワーク１７０を介して接続した複数の電子機器１１０、１２０、１３０、１４０にサービス（一例として、金融サービス）を提供するシステムであってよい。 Each of the servers 150 and 160 may be realized by one or more computer devices that communicate with the electronic devices 110, 120, 130, and 140 via the network 170 to provide instructions, code, files, content, services, and the like. For example, the server 150 may be a system that provides a service (for example, a financial service) to the electronic devices 110, 120, 130, and 140 connected via the network 170.

図２は、本発明の一実施形態における、コンピュータ装置の例を示したブロック図である。上述した複数の電子機器１１０、１２０、１３０、１４０それぞれやサーバ１５０、１６０それぞれは、図２に示したコンピュータ装置２００によって実現されてよい。 FIG. 2 is a block diagram illustrating an example of a computer device in an embodiment of the present invention. Each of the plurality of electronic devices 110, 120, 130, and 140 and each of the servers 150 and 160 described above may be realized by the computer device 200 shown in FIG. 2.

このようなコンピュータ装置２００は、図２に示すように、メモリ２１０、プロセッサ２２０、通信インタフェース２３０、および入力／出力インタフェース２４０を含んでよい。メモリ２１０は、コンピュータ読み取り可能な記録媒体であって、ＲＡＭ（ｒａｎｄｏｍａｃｃｅｓｓｍｅｍｏｒｙ）、ＲＯＭ（ｒｅａｄｏｎｌｙｍｅｍｏｒｙ）、およびディスクドライブのような永続的大容量記録装置を含んでよい。ここで、ＲＯＭやディスクドライブのような永続的大容量記録装置は、メモリ２１０とは区分される別の永続的記録装置としてコンピュータ装置２００に含まれてもよい。また、メモリ２１０には、オペレーティングシステムと、少なくとも１つのプログラムコードが記録されてよい。このようなソフトウェア構成要素は、メモリ２１０とは別のコンピュータ読み取り可能な記録媒体からメモリ２１０にロードされてよい。このような別のコンピュータ読み取り可能な記録媒体は、フロッピー（登録商標）ドライブ、ディスク、テープ、ＤＶＤ／ＣＤ－ＲＯＭドライブ、メモリカードなどのコンピュータ読み取り可能な記録媒体を含んでよい。他の実施形態において、ソフトウェア構成要素は、コンピュータ読み取り可能な記録媒体ではない通信インタフェース２３０を通じてメモリ２１０にロードされてもよい。例えば、ソフトウェア構成要素は、ネットワーク１７０を介して受信されるファイルによってインストールされるコンピュータプログラムに基づいてコンピュータ装置２００のメモリ２１０にロードされてよい。 As shown in FIG. 2, the computer device 200 may include a memory 210, a processor 220, a communication interface 230, and an input/output interface 240. The memory 210 is a computer-readable recording medium and may include RAM (random access memory), ROM (read only memory), and a permanent mass storage device such as a disk drive. Here, a permanent mass storage device such as a ROM or a disk drive may be included in the computer device 200 as a separate permanent storage device distinct from the memory 210. An operating system and at least one program code may be recorded in the memory 210. Such software components may be loaded into the memory 210 from a computer-readable recording medium separate from the memory 210. Such a separate computer-readable recording medium may include a floppy (registered trademark) drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, and the like. In other embodiments, the software components may be loaded into the memory 210 through the communication interface 230 rather than from a computer-readable recording medium. For example, the software components may be loaded into the memory 210 of the computer device 200 based on a computer program installed from files received via the network 170.

プロセッサ２２０は、基本的な算術、ロジック、および入出力演算を実行することにより、コンピュータプログラムの命令を処理するように構成されてよい。命令は、メモリ２１０または通信インタフェース２３０によって、プロセッサ２２０に提供されてよい。例えば、プロセッサ２２０は、メモリ２１０のような記録装置に記録されたプログラムコードにしたがって受信される命令を実行するように構成されてよい。 Processor 220 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Instructions may be provided to processor 220 by memory 210 or communication interface 230. For example, processor 220 may be configured to execute instructions received according to program code recorded on a storage device, such as memory 210.

通信インタフェース２３０は、ネットワーク１７０を介してコンピュータ装置２００が他の装置（一例として、上述した記録装置）と互いに通信するための機能を提供してよい。一例として、コンピュータ装置２００のプロセッサ２２０がメモリ２１０のような記録装置に記録されたプログラムコードにしたがって生成した要求や命令、データ、ファイルなどが、通信インタフェース２３０の制御にしたがってネットワーク１７０を介して他の装置に伝達されてよい。これとは逆に、他の装置からの信号や命令、データ、ファイルなどが、ネットワーク１７０を経てコンピュータ装置２００の通信インタフェース２３０を通じてコンピュータ装置２００に受信されてよい。通信インタフェース２３０を通じて受信された信号や命令、データなどは、プロセッサ２２０やメモリ２１０に伝達されてよく、ファイルなどは、コンピュータ装置２００がさらに含むことのできる記録媒体（上述した永続的記録装置）に記録されてよい。 The communication interface 230 may provide a function for the computer device 200 to communicate with other devices (for example, the recording devices described above) via the network 170. For example, requests, instructions, data, files, and the like generated by the processor 220 of the computer device 200 according to program code recorded in a recording device such as the memory 210 may be transmitted to other devices via the network 170 under the control of the communication interface 230. Conversely, signals, instructions, data, files, and the like from other devices may be received by the computer device 200 through its communication interface 230 via the network 170. Signals, instructions, data, and the like received through the communication interface 230 may be passed to the processor 220 or the memory 210, and files and the like may be recorded in a recording medium (the permanent storage device described above) that the computer device 200 may further include.

入力／出力インタフェース２４０は、入力／出力装置２５０とのインタフェースのための手段であってよい。例えば、入力装置は、マイク、キーボード、またはマウスなどの装置を、出力装置は、ディスプレイ、スピーカのような装置を含んでよい。他の例として、入力／出力インタフェース２４０は、タッチスクリーンのように入力と出力のための機能が１つに統合された装置とのインタフェースのための手段であってもよい。入力／出力装置２５０は、コンピュータ装置２００と１つの装置で構成されてもよい。 Input/output interface 240 may be a means for interfacing with input/output device 250. For example, input devices may include devices such as a microphone, keyboard, or mouse, and output devices may include devices such as a display and speakers. As another example, input/output interface 240 may be a means for interfacing with a device that has integrated input and output functionality, such as a touch screen. Input/output device 250 may be configured as one device with computer device 200.

また、他の実施形態において、コンピュータ装置２００は、図２の構成要素よりも少ないか多くの構成要素を含んでもよい。しかし、大部分の従来技術の構成要素を明確に図に示す必要はない。例えば、コンピュータ装置２００は、上述した入力／出力装置２５０のうちの少なくとも一部を含むように実現されてもよいし、トランシーバ、データベースなどのような他の構成要素をさらに含んでもよい。 Also, in other embodiments, computing device 200 may include fewer or more components than those of FIG. However, most prior art components need not be clearly illustrated. For example, computing device 200 may be implemented to include at least some of the input/output devices 250 described above, and may further include other components such as transceivers, databases, and the like.

以下では、個別データ検索サービスを提供する方法およびコンピュータ装置の具体的な実施形態について説明する。 In the following, specific embodiments of a method and a computer device for providing an individual data search service will be described.

図３は、本発明の一実施形態における、コンピュータ装置のプロセッサが含むことのできる構成要素の例を示したブロック図であり、図４は、本発明の一実施形態における、コンピュータ装置が実行することのできる方法の例を示したフローチャートである。 FIG. 3 is a block diagram showing an example of components that a processor of a computer device may include in an embodiment of the present invention, and FIG. 4 is a flowchart showing an example of a method that the computer device may execute in an embodiment of the present invention.

本実施形態に係るコンピュータ装置２００は、クライアントを対象に、クライアント上にインストールされた専用アプリケーションやコンピュータ装置２００と関連するウェブ／モバイルサイトへの接続によって個別データ検索サービスを提供してよい。コンピュータ装置２００には、コンピュータで実現された個別データ検索装置が構成されてよい。 The computer device 200 according to the present embodiment may provide an individual data search service to a client by connecting to a dedicated application installed on the client or a web/mobile site associated with the computer device 200. The computer device 200 may include a computer-implemented individual data search device.

個別データ検索サービスは、文書の流入量は多いが検索要請は相対的に少ない。さらに、個別データ検索サービスは、基本要求事項としてクエリと部分一致する文書を探索する部分一致検索を要求する。個別データ検索サービスの場合、検索時に実際に検索する文書は、全体文書のうちの極一部に過ぎない。 Individual data search services have a large inflow of documents, but relatively few search requests. Furthermore, the individual data search service requires, as a basic requirement, a partial match search to search for documents that partially match the query. In the case of the individual data search service, the documents that are actually searched during a search are only a small portion of the total documents.

個別データ検索サービスは文書の流入量が多いため、検索に一般的に使用する転置索引資料構造を利用する場合には過多な転置索引費用が発生し、検索サーバの資源浪費に繋がるという問題がある。 Because individual data search services have a large inflow of documents, using the inverted index data structure generally used for search incurs excessive inverted indexing costs and wastes search server resources.

さらに、転置索引資料構造で部分一致検索を提供するためには文書のバイグラム（ｂｉｇｒａｍ）分析が実行されなければならないが、バイグラム分析時にはターム（ｔｅｒｍ）の数やボリューム（ｖｏｌｕｍｅ）などが大きくなり、サービスに困難をきたす。 Furthermore, providing partial match search with an inverted index data structure requires bigram analysis of the documents, but bigram analysis increases the number of terms, the volume size, and so on, which makes the service difficult to operate.
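The growth in the number of terms can be seen with a small sketch. The character-level `bigrams` helper and the sample text are assumptions for illustration only.

```python
def bigrams(text):
    """Return every overlapping two-character substring of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

doc = "search engine"        # hypothetical document text
words = doc.split()          # word-level terms
grams = bigrams(doc)         # bigram terms needed for partial matching
print(len(words), len(grams))  # 2 12
print(grams[:4])               # ['se', 'ea', 'ar', 'rc']
```

A text of n characters yields n - 1 bigram postings, so an index built for partial matching grows with the character count of every document rather than with its word count.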

本実施形態では、上述したような個別データ検索サービスの特徴を考慮した上で、個別データ検索サービスに特化したエンジンとして、転置索引資料構造は使用せずに個別データ検索サービスの応答速度を満たすことのできる検索エンジンを提供する。 In this embodiment, in consideration of the characteristics of individual data search services described above, a search engine is provided that is specialized for individual data search services and can satisfy their response speed without using an inverted index data structure.

コンピュータ装置２００のプロセッサ２２０は、図４に示した個別データ検索方法を実行するための構成要素として、図３に示すように、文書保存部３１０および並列検索部３２０を含んでよい。実施形態によって、プロセッサ２２０の構成要素は、選択的にプロセッサ２２０に含まれても除外されてもよい。また、実施形態によって、プロセッサ２２０の構成要素は、プロセッサ２２０の機能の表現のために分離されても併合されてもよい。 The processor 220 of the computer device 200 may include a document storage unit 310 and a parallel search unit 320, as shown in FIG. 3, as components for executing the individual data search method shown in FIG. 4. Depending on the embodiment, components of the processor 220 may be selectively included in or excluded from the processor 220. Also, depending on the embodiment, components of the processor 220 may be separated or merged to express the functions of the processor 220.

このようなプロセッサ２２０およびプロセッサ２２０の構成要素は、図４の個別データ検索方法に含まれる段階Ｓ４１０～Ｓ４２０を実行するようにコンピュータ装置２００を制御してよい。例えば、プロセッサ２２０およびプロセッサ２２０の構成要素は、メモリ２１０が含むオペレーティングシステムのコードと、少なくとも１つのプログラムのコードとによる命令（ｉｎｓｔｒｕｃｔｉｏｎ）を実行するように実現されてよい。 The processor 220 and its components may control the computer device 200 to perform steps S410 to S420 included in the individual data search method of FIG. 4. For example, the processor 220 and its components may be implemented to execute instructions according to the operating system code and the code of at least one program contained in the memory 210.

ここで、プロセッサ２２０の構成要素は、コンピュータ装置２００に記録されたプログラムコードが提供する命令にしたがってプロセッサ２２０によって実行される、互いに異なる機能（ｄｉｆｆｅｒｅｎｔｆｕｎｃｔｉｏｎｓ）の表現であってよい。例えば、コンピュータ装置２００が検索対象となる文書を保存するように上述した命令にしたがってコンピュータ装置２００を制御するプロセッサ２２０の機能的表現として、文書保存部３１０が利用されてよい。 Here, the components of processor 220 may be representations of different functions that are performed by processor 220 according to instructions provided by program code recorded on computer device 200. For example, the document storage unit 310 may be used as a functional representation of the processor 220 that controls the computer device 200 according to the instructions described above so that the computer device 200 stores documents to be searched.

プロセッサ２２０は、コンピュータ装置２００の制御と関連する命令がロードされたメモリ２１０から必要な命令を読み取ってよい。この場合、前記読み取られた命令は、プロセッサ２２０が以下で説明する段階Ｓ４１０～Ｓ４２０を実行するように制御するための命令を含んでよい。 Processor 220 may read the necessary instructions from memory 210 loaded with instructions related to controlling computing device 200 . In this case, the read instructions may include instructions for controlling the processor 220 to perform steps S410 to S420 described below.

以下で説明する段階Ｓ４１０～Ｓ４２０は、図４に示したものとは異なる順序で実行されてもよいし、段階Ｓ４１０～Ｓ４２０のうちの一部が省略されたり追加の過程がさらに含まれたりしてもよい。 The steps S410 to S420 described below may be performed in an order different from that shown in FIG. 4, and some of the steps S410 to S420 may be omitted or additional steps may be included.

図４を参照すると、段階Ｓ４１０で、文書保存部３１０は、個人メール、個人ファイル、個人トークメッセージなどのような個別データに該当する検索対象文書をブロック単位のボリュームで保存することによって検索ボリュームを生成してよい。本発明の一実施形態によると、検索ボリュームは、ファイル形態で不揮発性メモリ２１０（例えば、ディスクのような補助記憶装置）に記録され、並列検索部３２０で検索がなされるときに、他のメモリ２１０（例えば、ＲＡＭのような揮発性の主記憶装置）からボリュームファイルを読み込んで処理してよい。このとき、保存されるファイルを圧縮すれば、ファイルの読み込み（ｒｅａｄ）にかかる時間を減らすことができ、検索応答時間を減らすことができる。 Referring to FIG. 4, in step S410, the document storage unit 310 may generate a search volume by storing search target documents corresponding to individual data, such as personal emails, personal files, and personal talk messages, in block-unit volumes. According to an embodiment of the present invention, the search volume may be recorded in file form in a nonvolatile memory 210 (for example, an auxiliary storage device such as a disk), and when the parallel search unit 320 performs a search, the volume file may be read from another memory 210 (for example, a volatile main storage device such as RAM) and processed. If the stored file is compressed, the time required to read the file, and therefore the search response time, can be reduced.

文書保存部３１０は、検索対象となる流入文書をブロック単位のボリュームで保存するが、このとき、検索応答時間を最小化するために、ボリュームをブロック単位に圧縮して圧縮ボリュームとして保存する。言い換えれば、文書保存部３１０は、検索過程でボリュームの読み込み時間を減らすために、ボリューム生成段階で検索対象文書を圧縮してから保存してよい。このとき、文書保存部３１０は、検索対象文書を事前に定められた一定サイズのブロック単位で集めて圧縮してよい。各ブロックのサイズは、圧縮率が出るように十分に大きくて並列化が可能な水準の経験値や実験値によって決定されてよく、１００ＫＢ～１０ＭＢの値のうち、例えば、１ＭＢのブロック単位に圧縮ボリュームを生成してよい。転置索引構造を使用せずに検索ボリュームを生成することにより、ボリュームの生成過程（すなわち、パッキング（ｐａｃｋｉｎｇ））が軽くなり、ボリュームの生成費用を大幅に減らすことができる。検索対象文書を圧縮する場合、検索にかかる入力／出力時間（Ｉ／Ｏｔｉｍｅ）を減らすことができる上に、ボリュームのサイズとサーバの資源需要を減らすことができる。 The document storage unit 310 stores incoming documents to be searched in block-unit volumes and, to minimize search response time, compresses the volumes block by block and stores them as compressed volumes. In other words, the document storage unit 310 may compress the search target documents at the volume generation stage before storing them, in order to reduce the time needed to read the volumes during a search. The document storage unit 310 may gather the search target documents into blocks of a predetermined fixed size and compress them. The size of each block may be determined from empirical or experimental values so that it is large enough to yield a useful compression ratio yet still allows parallelization; for example, compressed volumes may be generated in 1 MB blocks, chosen from a range of 100 KB to 10 MB. Generating the search volume without an inverted index structure makes the volume generation process (that is, packing) lighter and can greatly reduce the cost of generating volumes. Compressing the search target documents reduces the input/output time (I/O time) required for a search and also reduces the volume size and the demand on server resources.
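The packing step described above might look like the following sketch. The function name `pack_volume`, the NUL byte used as a document separator (which assumes documents never contain NUL), the `zlib` codec, and the generated sample mails are all assumptions; the text above fixes only the idea of gathering documents into fixed-size blocks and compressing each block.

```python
import zlib

def pack_volume(docs, block_size=1 << 20):
    """Gather documents into blocks of roughly block_size bytes and
    compress each block independently."""
    blocks, current, current_size = [], [], 0
    for doc in docs:
        data = doc.encode("utf-8")
        current.append(data)
        current_size += len(data)
        if current_size >= block_size:
            blocks.append(zlib.compress(b"\x00".join(current)))
            current, current_size = [], 0
    if current:  # flush the final, partially filled block
        blocks.append(zlib.compress(b"\x00".join(current)))
    return blocks

# Hypothetical sample data: 1000 short, repetitive mails.
docs = [f"mail {i}: the weekly report is attached" for i in range(1000)]
blocks = pack_volume(docs, block_size=8 * 1024)
raw_size = sum(len(d.encode("utf-8")) for d in docs)
packed_size = sum(len(b) for b in blocks)
print(len(blocks), raw_size, packed_size)
```

Compressing each block on its own is what later allows blocks to be decoded independently; a single stream over the whole volume would usually compress slightly better but could not be split across workers.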

In step S420, when a search request is received, the parallel search unit 320 may read the compressed volumes corresponding to the search request and perform a full scan search on them in parallel. The parallel search unit 320 may perform the full scan search using a string find method that looks for documents partially matching the query. That is, once a search request is received, the parallel search unit 320 may read the corresponding compressed volume and then execute the string find. Executing the string find requires reading every document in the volume; reading the compressed volume cannot be parallelized, but decoding the compressed volume and running the string find can. In other words, the parallel search unit 320 may decode all compressed volumes corresponding to the search request in parallel and then execute the string find. Because the search target in an individual data search service is only a very small part of all documents, a sufficiently fast response can be ensured by performing a full scan search instead of using an inverted index data structure. Since the full scan method can store the original documents in the search volume as they are and then scan them, it greatly reduces the volume generation cost, makes incremental updates simple to implement, and reflects new documents quickly. In particular, in this embodiment, block-level parallelization during the full scan search reduces the CPU time, which includes the volume decoding time and the string find execution time.
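As a rough illustration of step S420, the sketch below decodes compressed blocks and runs a substring find on each block in a worker pool. The names `scan_block` and `full_scan`, the `zlib` codec, the newline-separated block layout, and the thread pool are all assumptions made for this example; a real engine might use processes or native threads instead.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def scan_block(block, query):
    """Decode one compressed block and return the documents that contain the query."""
    docs = zlib.decompress(block).decode("utf-8").split("\n")
    return [doc for doc in docs if query in doc]


def full_scan(blocks, query, workers=4):
    """Decode and scan all blocks in parallel; results keep block order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_block = pool.map(scan_block, blocks, [query] * len(blocks))
        return [doc for hits in per_block for doc in hits]
```

No index is consulted at any point: partial matching falls out of the plain substring test applied to every document.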

This embodiment provides an individual data search service that uses a full scan method and does not use an inverted index structure.

In a full scan search, the volume corresponding to the search request is read, and then all documents in that volume are read to find documents that partially match the query. The search response time therefore includes the input/output time for reading the volume and the CPU time for performing the full scan search.

In an individual data search service using the full scan method, a method of reducing the input/output time and a method of reducing the CPU time are applied in order to minimize the search response time.

FIG. 5 is an exemplary diagram illustrating a method of reducing the input/output time in an embodiment of the present invention.

The method of reducing the input/output time in FIG. 5 corresponds to the document storage step S410 described with reference to FIG. 4.

The processor 220 may compress the search target documents in blocks of a fixed size. Referring to FIG. 5, an incremental volume 50 consisting of search target documents may be compressed in blocks of a fixed size. In other words, the processor 220 may generate the block-unit compressed volume 60 by dividing the search target documents and compressing them separately, which reduces the input/output portion of the search response time. For example, when a compression ratio of 400% (4:1) is applied as shown in FIG. 5, the input/output time and the storage space can both be reduced to 1/4.

The data structures for the individual data search service include the incremental volume 50, which is a volume in which search target documents are collected before compression, and the compressed volume 60, which is a volume compressed in blocks of a fixed size. The incremental volume may be used to minimize the reading and decoding of volume files in an environment where small numbers of documents flow in continuously. As explained below, it is preferable to keep the incremental volume relatively small by compressing it into the compressed volume once it reaches a certain size.

When a new document comes in, the processor 220 may convert the new document into a serviceable data structure and then append it to the incremental volume 50. As shown in FIG. 6, when the incremental volume 50 consisting of search target documents reaches a certain size, for example 1 MB, the processor 220 may compress it as a 1 MB block and append it to the compressed volume 60. When the average document size is 4 KB and the SSD read speed is 500 MB/s, up to 500,000 documents in an individual volume can be searched within one second.
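As a rough check of these figures (an illustrative reading, not a derivation given in the source): assuming decimal units, counting I/O time only, and applying the 4:1 compression ratio from the FIG. 5 example, 500,000 documents of 4 KB each amount to about 2,000 MB of raw text, which compresses to about 500 MB and can be read in roughly one second at 500 MB/s.

```latex
\begin{aligned}
S_{\text{raw}} &= 500{,}000 \times 4\,\text{KB} = 2{,}000\,\text{MB} \\
S_{\text{compressed}} &= \frac{S_{\text{raw}}}{4} = 500\,\text{MB} \\
t_{\text{I/O}} &= \frac{S_{\text{compressed}}}{500\,\text{MB/s}} = 1\,\text{s}
\end{aligned}
```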

By performing the full scan search concurrently with the volume generation process, the processor 220 can maintain the volume even if the server shuts down during volume generation. Because only the added documents are appended to the volume, the volume can be grown incrementally without being regenerated or replaced.

Referring to FIG. 7, when a new document arrives, the processor 220 may append it to the incremental volume 50 and, once the incremental volume 50 reaches a certain size, compress the incremental volume 50 and reflect it in the compressed volume 60.
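The append-then-compress flow of FIGS. 6 and 7 might look like the following minimal sketch. The class name `UserVolume`, its attributes, the `zlib` codec, and the newline framing (which assumes single-line documents) are hypothetical choices for illustration only.

```python
import zlib


class UserVolume:
    """Per-user volume: an uncompressed tail plus a list of compressed blocks."""

    def __init__(self, block_size=1 << 20):
        self.block_size = block_size
        self.incremental = bytearray()  # plays the role of incremental volume 50
        self.compressed = []            # plays the role of compressed volume 60

    def append(self, doc):
        """Append a new document; compress the tail once it reaches block_size."""
        self.incremental += doc.encode("utf-8") + b"\n"
        if len(self.incremental) >= self.block_size:
            self.compressed.append(zlib.compress(bytes(self.incremental)))
            self.incremental.clear()

    def documents(self):
        """Yield every document: compressed blocks first, then the uncompressed tail."""
        for block in self.compressed:
            yield from zlib.decompress(block).decode("utf-8").splitlines()
        yield from self.incremental.decode("utf-8").splitlines()
```

A new document is searchable as soon as it is appended, since the scan covers the uncompressed tail as well as the compressed blocks.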

Meanwhile, when an existing document is deleted after the compressed volume 60 has been generated, the processor 220 may mark deletion information for that document and exclude the marked document when generating search results. An additional data structure may be used for marking the deletion information. Depending on the embodiment, garbage (unnecessary data) such as deleted documents may also be cleaned up periodically or when needed, and the compressed volume 60 regenerated.
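Deletion marking can be sketched as a set of document IDs kept beside the volume. The function names `search_with_deletions` and `compact`, and the choice of a dictionary keyed by document ID, are assumptions for this example rather than structures named in the specification.

```python
def search_with_deletions(docs, deleted_ids, query):
    """docs maps doc_id -> text; deleted_ids is the set of IDs marked as deleted.
    Marked documents stay in the volume but are filtered out of the results."""
    return [doc_id for doc_id, text in docs.items()
            if query in text and doc_id not in deleted_ids]


def compact(docs, deleted_ids):
    """Garbage cleanup: rebuild the document set without the marked entries."""
    return {doc_id: text for doc_id, text in docs.items()
            if doc_id not in deleted_ids}
```

Marking avoids rewriting a compressed block for every deletion; the cost is paid later, in bulk, when the volume is regenerated.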

FIG. 8 is an exemplary diagram illustrating a method of reducing the CPU time in an embodiment of the present invention.

The method of reducing the CPU time in FIG. 8 corresponds to the parallel search step S420 described with reference to FIG. 4.

Referring to FIG. 8, the processor 220 may parallelize both the process of decoding the block-compressed volume, that is, the compressed volume 60, and the process of executing a full-scan string find on the search target documents 50 decompressed by the decoding. By executing parallel decoding and parallel string find in block units, the processor 220 can reduce the CPU portion of the search response time. Through the block-unit parallel decoding and parallel string find, the processor 220 may provide a search result 80 corresponding to the search request as the response.

FIG. 9 is a diagram illustrating the structure of an individual data search service using the full scan method in an embodiment of the present invention.

Referring to FIG. 9, the individual data search device according to the present invention may include a search application server (SAS) 910 and a search server 920 for the individual data search service. The SAS 910 is responsible for providing search results corresponding to a search request, and the search server 920 is responsible for storing the compressed volumes 60 used for searching. The search server 920 may perform the volume generation process (packing) instead of building an inverted index, and a single search server 920 may store tens of thousands of volumes or more.

Because the individual data search device according to the present invention does not use an inverted index, it requires no index server, and because it applies the block-unit compressed volume 60, it can also reduce the demand for search servers 920.

By using the full scan method instead of an inverted index, the individual data search device according to the present invention can naturally provide the partial match search required by an individual data search service.

The individual data search device according to the present invention may also support server redundancy. Referring to FIG. 10, two hosts among the search servers 920 may be randomly assigned to each user, and the user's individual data may be stored on the assigned hosts. The individual data search device may include a database 1030 that stores the server locations mapped to each user. In other words, replica volumes of the individual data may be stored on a plurality of search servers 920, and these serve as standby replication rather than as a means of distributing search requests.
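The host assignment of FIG. 10 could be sketched as follows. The functions `assign_hosts` and `pick_host`, the in-memory `placement` dictionary standing in for the database 1030, and the `is_alive` health check are all hypothetical; the specification does not describe the failover logic in this detail.

```python
import random


def assign_hosts(user_id, hosts, placement, replicas=2, rng=random):
    """Randomly pick `replicas` distinct hosts for a user once and remember them.
    `placement` stands in for the user-to-server mapping database."""
    if user_id not in placement:
        placement[user_id] = rng.sample(hosts, replicas)
    return placement[user_id]


def pick_host(user_id, placement, is_alive):
    """Standby replication: use the first assigned host and fall back to the
    next one only when the earlier host is unavailable."""
    for host in placement[user_id]:
        if is_alive(host):
            return host
    raise RuntimeError("no live replica for user")
```

Requests always go to the first live host rather than being spread across replicas, which matches the standby (not load-distribution) role described above.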

Furthermore, the individual data search device according to the present invention may support a collation function that searches without distinguishing between characters that have the same meaning. The collation function may include case-insensitive search, which does not distinguish uppercase from lowercase letters; search that does not distinguish letters without diacritical marks from letters with them (for example, the characters shown as external character 1 in the original publication); and kana-type-insensitive search, which does not distinguish katakana from hiragana.

In the individual data search device according to the present invention, the volume used for searching and the volume used for document summaries (for example, snippets) are one and the same volume. Consequently, if this volume is normalized, searching remains possible but document summarization does not, which is a limitation.

To provide collation, the individual data search device according to the present invention may apply one of two approaches: performing Unicode normalization at search time, or performing Unicode normalization at volume generation time.

As one example, at search time the processor 220 may Unicode-normalize both the query included in the search request and all documents in the volume, convert the normalized strings into a uniform form (for example, downcase), and then perform the search, thereby providing collation-based search results. The processor 220 may also execute the normalization process in parallel to minimize the search execution time.
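The search-time approach can be sketched with Python's standard `unicodedata` module. The function names `collation_key` and `collated_find` are hypothetical, and the specific choices (NFKD decomposition, dropping only the combining marks in U+0300 to U+036F, shifting katakana to hiragana, `casefold`) are one possible way to realize the collation features listed earlier, not the normalization prescribed by the specification.

```python
import unicodedata


def collation_key(text):
    """Fold case, Latin-style diacritics, and kana type into one comparable form."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop only general combining diacritics; kana voicing marks are kept.
    stripped = "".join(ch for ch in decomposed
                       if not 0x0300 <= ord(ch) <= 0x036F)
    recomposed = unicodedata.normalize("NFC", stripped)
    folded = []
    for ch in recomposed:
        code = ord(ch)
        if 0x30A1 <= code <= 0x30F6:  # katakana -> corresponding hiragana
            ch = chr(code - 0x60)
        folded.append(ch)
    return "".join(folded).casefold()


def collated_find(query, document):
    """Partial match after normalizing both the query and the document."""
    return collation_key(query) in collation_key(document)
```

Here every document is normalized on every search, which is the CPU cost that motivates running the normalization in parallel.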

As another example, the processor 220 may Unicode-normalize documents as a preprocessing step at volume generation time and then store them. Referring to FIG. 11, during volume generation the processor 220 may normalize the documents in the volume and, at the same time, generate a conversion table 1140 for restoring the normalized documents to their original form. The conversion table 1140 may consist of an offset indicating the position of each converted character within the string and the original character at that position. Thereafter, at search time, the processor 220 may normalize the query and search the already normalized documents with it. Since only the query is normalized during the search and the documents to be searched are already normalized, the search load is reduced. Because the original document is needed when producing a document summary for the search results, the conversion table 1140 may be used to restore that document to its original form. Document summarization is performed only on the subset of documents selected by the search results, so few documents need to be restored and the search response time is not greatly affected.
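The conversion table of FIG. 11 can be illustrated with a length-preserving normalization, so that each offset in the normalized text points at the same character position in the original. The functions `normalize_char`, `normalize_with_table`, and `restore` are hypothetical, and the sketch assumes precomposed (NFC) input and handles only case and Latin-style diacritics.

```python
import unicodedata


def normalize_char(ch):
    """Map one character to one normalized character (length-preserving)."""
    decomposed = unicodedata.normalize("NFD", ch)
    # Strip the marks only when everything after the base is a general diacritic.
    if len(decomposed) > 1 and all(0x0300 <= ord(c) <= 0x036F
                                   for c in decomposed[1:]):
        ch = decomposed[0]
    lowered = ch.lower()
    return lowered if len(lowered) == 1 else ch


def normalize_with_table(text):
    """Return the normalized text and a table of (offset, original character)."""
    normalized, table = [], []
    for offset, ch in enumerate(text):
        norm = normalize_char(ch)
        if norm != ch:
            table.append((offset, ch))
        normalized.append(norm)
    return "".join(normalized), table


def restore(normalized, table):
    """Rebuild the original text, e.g. for a snippet of a matching document."""
    chars = list(normalized)
    for offset, original in table:
        chars[offset] = original
    return "".join(chars)
```

Only characters that actually changed are recorded, so the table stays small for text that is mostly lowercase and unaccented.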

As described above, the embodiments of the present invention can provide a search engine specialized for individual data search services that satisfies the response speed required by such services without using an inverted index data structure. In addition, according to the embodiments of the present invention, documents to be searched are generated as block-unit compressed volumes and the compressed volumes are full-scan searched in parallel, which makes it possible to provide an efficient and intuitive search service.

The apparatus described above may be implemented by hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that execute on the OS. The processing device may also access, record, manipulate, process, and generate data in response to execution of the software. For ease of understanding, a single processing device is sometimes described as being used, but those skilled in the art will understand that a processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over computer systems connected by a network and may be stored or executed in a distributed manner. The software and data may be recorded on one or more computer-readable recording media.

The methods according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The medium may store a computer-executable program persistently or store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy (registered trademark) disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples of the medium include recording or storage media managed by application stores that distribute applications, or by sites and servers that supply or distribute various other software.

Although the embodiments have been described above with reference to limited embodiments and drawings, those skilled in the art will be able to make various modifications and variations based on the above description. For example, appropriate results can be achieved even if the described techniques are performed in an order different from the described method, and/or the described components such as systems, structures, devices, and circuits are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other embodiments that are equivalent to the claims also fall within the scope of the appended claims.

220: Processor
310: Document storage unit
320: Parallel search unit

Claims

An individual data search method performed by a computer device, wherein
the computer device includes at least one processor configured to execute computer-readable instructions contained in a memory,
the individual data search method comprising:
a step of compressing and storing, by the at least one processor, search target documents corresponding to individual data related to a user into a volume in block units; and
a step of performing, by the at least one processor, a full scan search in parallel on a plurality of volumes corresponding to a search request,
wherein the storing step includes, when a new document comes in, appending the new document to an incremental volume consisting of the search target documents, and compressing the incremental volume into blocks of a fixed size to generate a compressed volume.

The individual data search method according to claim 1, wherein
the storing step further includes, when an existing document is deleted after the compressed volume is generated, marking deletion information for the existing document, and
the marked document is excluded from search results.

The individual data search method according to claim 1, wherein the full scan search step includes
searching for documents that partially match a query by a full scan of the block-unit compressed volume, without using an inverted index data structure.

The individual data search method according to claim 1, wherein the full scan search step includes:
decoding the plurality of volumes in parallel; and executing a string find in parallel on the decoded volumes.

The individual data search method according to claim 1, wherein the storing step further includes
storing replica volumes of the individual data on a plurality of hosts for server redundancy.

The individual data search method according to claim 1, wherein the full scan search step includes:
Unicode-normalizing a query included in the search request and the documents in the plurality of volumes; and performing a collation search using the normalized character strings.

The individual data search method according to claim 1, wherein
the storing step includes
Unicode-normalizing the search target documents, and
the full scan search step includes:
Unicode-normalizing a query included in the search request; and performing a collation search using the normalized character strings.

An individual data search method performed by a computer device, wherein
the computer device includes at least one processor configured to execute computer-readable instructions contained in a memory,
the individual data search method comprising:
a step of compressing and storing, by the at least one processor, search target documents corresponding to individual data related to a user into a volume in block units; and
a step of performing, by the at least one processor, a full scan search in parallel on a plurality of volumes corresponding to a search request,
wherein the storing step includes:
a step of Unicode-normalizing the search target documents; and
a step of generating a conversion table including an offset indicating the position of each converted character and the original character at that position,
and wherein the full scan search step includes:
a step of Unicode-normalizing a query included in the search request; and
a step of performing a collation search using the normalized character strings.

A computer program for causing a computer device to execute the individual data search method according to any one of claims 1 to 8.

A computer device comprising
at least one processor configured to execute computer-readable instructions contained in a memory,
wherein the at least one processor includes:
a document storage unit that compresses and stores search target documents corresponding to individual data related to a user into a volume in block units; and a parallel search unit that performs a full scan search in parallel on a plurality of volumes corresponding to a search request,
and wherein, when a new document comes in, the document storage unit appends the new document to an incremental volume consisting of the search target documents, and compresses the incremental volume into blocks of a fixed size to generate a compressed volume.

The computer device according to claim 10, wherein the document storage unit
generates a compressed volume by collecting the search target documents into blocks of a fixed size and compressing them.

The computer device according to claim 10, wherein the document storage unit,
when an existing document is deleted after the compressed volume is generated, marks deletion information for the existing document, and
the marked document is excluded from search results.

The computer device according to claim 10, wherein the parallel search unit
searches for documents that partially match a query by a full scan of the block-unit compressed volume, without using an inverted index data structure.

The computer device according to any one of claims 10 to 12, wherein the parallel search unit
decodes the plurality of volumes in parallel, and
executes a string find in parallel on the decoded volumes.

The computer device according to any one of claims 10 to 12, wherein the parallel search unit
Unicode-normalizes a query included in the search request and the documents in the plurality of volumes, and
performs a collation search using the normalized character strings.

The computer device according to any one of claims 10 to 12, wherein
the document storage unit
Unicode-normalizes the search target documents, and
the parallel search unit
Unicode-normalizes a query included in the search request, and
performs a collation search using the normalized character strings.

A computer device comprising
at least one processor configured to execute computer-readable instructions contained in a memory,
wherein the at least one processor includes:
a document storage unit that compresses and stores search target documents corresponding to individual data related to a user into a volume in block units; and
a parallel search unit that performs a full scan search in parallel on a plurality of volumes corresponding to a search request,
wherein the document storage unit
Unicode-normalizes the search target documents, and
generates a conversion table including an offset indicating the position of each converted character and the original character at that position,
and wherein the parallel search unit
Unicode-normalizes a query included in the search request, and
performs a collation search using the normalized character strings.