JP7367139B2 - Data search method and system

Info

Publication number: JP7367139B2
Application number: JP2022121133A
Authority: JP
Inventors: ギウクキム; ウォンソクファン; ミンジュンソ
Original assignee: Naver Corp
Current assignee: Naver Corp
Priority date: 2021-08-02
Filing date: 2022-07-29
Publication date: 2023-10-23
Anticipated expiration: 2042-07-29
Also published as: KR20230019745A; JP2023021946A

Description

The present invention relates to a data search method and system for searching data that includes field values of different categories.

By its dictionary definition, artificial intelligence is technology that implements human abilities such as learning, reasoning, perception, and natural language understanding in computer programs. Artificial intelligence has advanced dramatically through deep learning, which adds neural networks modeled on the human brain to machine learning.

Deep learning is technology that enables a computer to judge and learn as a human does, and thereby to cluster or classify objects and data. In recent years it has become possible to analyze image data as well as text data, and deep learning is now actively used in a wide variety of industrial fields.

With this development of artificial intelligence, various kinds of automation are under way in the field of office automation. In particular, considerable effort is being devoted to converting content printed on paper into data using image data analysis technology based on artificial intelligence. As part of this effort, paper documents are converted into images, and the content of a document is converted into data by image analysis technology (or image data analysis technology) that analyzes the content contained in the image. This requires techniques that analyze the image according to the type of content the document contains.

For example, when a document containing a receipt is converted into data, various elements related to the receipt, such as its format, the content of the text it contains, and the position of that text, must be analyzed accurately.

Accordingly, various techniques have been developed for processing the information contained in an image into data in a form that electronic devices can handle. For example, Patent Document 1 discloses a method of constructing an OCR (Optical Character Reader) database. However, the methods developed so far only classify data according to rules that people have determined empirically. Consequently, if the OCR data contains an error, not only is an inaccurate database constructed, but searches that use the database may also fail to work properly.

Meanwhile, various search services have been offered in recent years. For example, there is a service that uses a receipt to search for the place where the receipt was used. There is therefore a growing need for technology that, given content in various formats such as images, audio, and text, searches for the data corresponding to that content.

Conventionally, correcting OCR recognition errors has relied on preprocessing techniques that use regular expressions and the like. Such methods are time-consuming and costly, and their correction performance is not high.

In addition, for content that includes multiple field values, information from multiple fields must be used in the search to obtain the desired result, and this has generally relied on heuristic, rule-based models. In particular, search performance varies greatly depending on which fields are selected, or the desired search results cannot be obtained at all.

There is therefore a need for technology that yields the desired search results even when text has been misrecognized because of an OCR recognition error or the like.

Patent Document 1: Korean Registered Patent No. 10-1181209

The present invention provides a method and system for embedding data that includes field values belonging to different categories into data in a form usable by electronic devices, and for searching data based on the result of the embedding.

Specifically, the present invention provides a method and system for embedding data into a form usable by electronic devices while preserving the characteristics of the different categories contained in the data.

The present invention also provides a method and system that make it easy to retrieve, from a database, data corresponding to data that includes different field values.

Furthermore, the present invention provides a method and system that can retrieve the desired results from a database with high accuracy even when searching for data corresponding to data that contains erroneous values caused by text recognition errors, speech recognition errors, or the like.

To solve the above problems, the present invention provides a data search method including: receiving content that includes a plurality of field values; generating a model input value by arranging the field values included in the content and adding a plurality of separators that distinguish the categories to which the field values belong; generating a vector of the content using the model input value and a trained deep learning model; and searching a plurality of previously stored data for data corresponding to the search target data, based on the similarity between the generated vector and the vectors corresponding to each of the previously stored data.
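The model-input step of the method can be illustrated with a minimal sketch. It assumes plain-text separator tokens (`[NAME]`, `[ADDR]`, `[TEL]`) and English category keys, none of which are specified in the source; the actual separators and the trained model's tokenizer may differ.

```python
from typing import Dict, List

# Hypothetical separator tokens, one per category (assumed, not from the source).
SEPARATORS: Dict[str, str] = {
    "store_name": "[NAME]",
    "address": "[ADDR]",
    "phone": "[TEL]",
}


def build_model_input(fields: Dict[str, str], order: List[str]) -> str:
    """Arrange field values in a fixed order, prefixing each with its category separator."""
    parts = []
    for category in order:
        value = fields.get(category, "")  # a missing field keeps its separator only
        parts.append(f"{SEPARATORS[category]} {value}".rstrip())
    return " ".join(parts)


receipt = {"store_name": "NLP CAFE", "address": "S City", "phone": "01-234-568"}
model_input = build_model_input(receipt, ["store_name", "address", "phone"])
print(model_input)
```

Keeping the separator even when a field is empty is one possible design choice: it keeps every category at a predictable place in the sequence that the model sees.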

The present invention also provides a data search system including: a communication unit that receives content including a plurality of field values; and a control unit that generates a model input value by arranging the field values included in the content and adding a plurality of separators that distinguish the categories to which the field values belong, generates a vector of the content using the model input value and a trained deep learning model, and searches a plurality of previously stored data for data corresponding to the search target data, based on the similarity between the generated vector and the vectors corresponding to each of the previously stored data.

As described above, the present invention embeds data while distinguishing the categories to which the field values in the data belong, and can therefore generate vectors that preserve the characteristics of the different categories contained in the data. By using such vectors for data search, the present invention is not limited to retrieving only data identical to previously stored data, and can perform searches that take into account the similarity of the values belonging to the multiple categories contained in a target document.

In addition, according to the present invention, data is searched based on the similarity between the vectors generated by the embedding, so the search no longer has to depend on search rules defined by a person.

Furthermore, according to the present invention, a search can take into account the similarity of each category of the data, so data can be retrieved with high accuracy even when the search uses data in which noise and errors occur frequently (for example, character recognition data (OCR data) or speech recognition data).

FIG. 1 is a conceptual diagram for explaining a data search system according to the present invention.

FIG. 2 is a conceptual diagram showing a data search method according to the present invention.

FIG. 3 is a flowchart for explaining a data search method according to the present invention.

FIG. 4 is a conceptual diagram for explaining a data embedding model according to the present invention.

FIG. 5 is a conceptual diagram showing vectors generated as a result of data embedding according to the present invention, plotted in a vector space.

Two further conceptual diagrams each show an embodiment of searching data using OCR data.

A final conceptual diagram shows an embodiment of searching data using vectors generated as a result of data embedding.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the drawing number, and redundant descriptions of them are omitted. The suffixes "module" and "unit" attached to components in the following description are assigned or used interchangeably only for ease of drafting the specification, and do not in themselves carry distinct meanings or roles. In describing the embodiments, detailed descriptions of related known techniques are omitted where they would obscure the gist of the embodiments. Furthermore, the accompanying drawings are intended only to aid understanding of the embodiments. They do not limit the technical idea of the present invention, which should be understood to cover all modifications, equivalents, and substitutes that fall within its spirit and technical scope.

Terms that include ordinal numbers, such as "first" and "second", are used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is described as being "coupled" or "connected" to another component, it may be directly coupled or connected to that component, or other components may be present in between. In contrast, when a component is described as being "directly coupled" or "directly connected" to another component, it should be understood that no other component is present in between.

Singular expressions include the plural unless otherwise specified.

In this specification, terms such as "include" and "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification. They should not be understood to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In this specification, content means any kind of information, or its substance, that can be processed by a computer. Content may take various forms, such as text, images, audio, and files, and is not limited to a specific form.

A category means an item at any level within a defined classification. The criterion that separates data belonging to one category from data belonging to another is not absolute, and varies with the rule that defines the categories. Such a rule may be applied differently to original content (for example, a paper document, a photographed image of a paper document, or audio data) and to the data obtained by structuring it.

For example, documents such as the various kinds of receipts include multiple categories related to the seller and the consumer, such as store name, business registration number, store telephone number, store address, ordered product name, and ordered product quantity. For efficient data processing, the data contained in a paper document needs to be converted into data with items of the same category associated with one another.

For example, a particular receipt in paper form may include three categories, "store name", "ordered product name", and "ordered product quantity", but when the data contained in the paper document is digitized, these may be reduced to two categories, "store name" and "ordered product".

In this specification, data that indicates the attribute of a specific category is defined as a field name (for example, "Store name:", "Quantity:", "Telephone number"), and data that indicates the value of a specific category is defined as a field value (for example, "NLP CAFE", "S City", "01-234-568").

Regardless of its type, each piece of content may include field names and field values. For example, a paper document, an image of a paper document, and audio data may each include field names and field values. Field names and field values can therefore take various forms, such as text, images, and audio data.

From the viewpoint of data in a machine-processable format, the field name is referred to as an "attribute" and the field value is referred to as a "value".

According to these definitions, content may include data in the form of "attribute-value" pairs, each pair belonging to the same category. However, content is not limited to this. For a specific category, content may include only a field value and no field name. In that case the field name has merely been omitted, and the field value still carries the meaning associated with the omitted field name.

Conversely, content may include only a field name and no field value. In that case no value has been assigned to the specific item, but the item itself may still exist.
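The field name and field value conventions described above can be sketched as a small data structure. The class name `Field` and the English category keys are illustrative assumptions, not part of the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    """One category entry; either the name or the value may be absent."""
    category: str
    name: Optional[str] = None   # field name ("attribute"); may be omitted
    value: Optional[str] = None  # field value ("value"); may not be assigned


fields = [
    Field("store_name", "Store name:", "NLP CAFE"),  # full attribute-value pair
    Field("address", None, "S City"),                # name omitted, value only
    Field("phone", "Telephone number", None),        # name only, no value assigned
]

# Collect the values that are actually present, keyed by category.
values_only = {f.category: f.value for f in fields if f.value is not None}
print(values_only)
```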

In the above example, the paper document (or a photographed image of it) is converted by character recognition into text that an electronic device can process, and the converted text is classified into different categories. Data that includes field values belonging to different categories is thereby generated.

Data that includes field values belonging to different categories can also be generated from text collected by methods other than OCR. For example, such data can be generated from the output of speech recognition. Specifically, the user's speech is recognized and converted into text, and the converted text is classified into different categories and stored as computerized data.

As described above, data that includes field values belonging to different categories can be generated in various ways. The present invention provides a method and system for searching a previously stored database for data corresponding to the generated data.

The data search according to the present invention is performed based on a vector generated as a result of embedding the content.

The present invention provides a method of efficiently searching data that includes field values belonging to different categories. Specifically, the present invention improves the accuracy of data search through an efficient embedding that converts data including field values of different categories into information in a machine-understandable form.

The present invention provides a data search method and system that improve the accuracy of data search by using vectors generated with a new data embedding scheme.

The new data embedding scheme is described in detail below with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram for explaining a data search system according to the present invention, and FIG. 2 is a conceptual diagram showing a data search method according to the present invention.

The data search system 100 according to the present invention may be implemented in the form of an application or software. In a software implementation of the data search system 100, embodiments such as the procedures and functions described in this specification may be implemented as separate software modules. Each software module can perform one or more of the functions and operations described in this specification.

The software implementation of the present invention is realized by the data search system 100 shown in FIG. 1. The configuration of the data search system 100 is described in more detail below.

The data search system 100 according to the present invention can receive content that includes a plurality of field values. The received content is converted into data in the form required by the data search system 100.

For example, the data search system 100 according to the present invention can receive an image of a paper document and generate OCR data by performing text recognition on the image. In this specification, OCR data may include the text extracted from the image and position information corresponding to the extracted text. The position information defines the position of the extracted text within the image (or the paper document). The data search system 100 can classify the extracted text into different categories.
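One possible in-memory shape for OCR data, that is, text plus its position in the image, is sketched below. The bounding-box layout `(left, top, right, bottom)` and the helper `reading_order` are assumptions made for illustration; the source does not specify how position information is encoded.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OcrToken:
    text: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels; assumed layout


def reading_order(tokens: List[OcrToken]) -> List[OcrToken]:
    """Sort tokens top-to-bottom, then left-to-right, using their positions."""
    return sorted(tokens, key=lambda t: (t.box[1], t.box[0]))


tokens = [
    OcrToken("01-234-568", (10, 80, 120, 95)),
    OcrToken("CAFE", (60, 10, 110, 25)),
    OcrToken("NLP", (10, 10, 50, 25)),
]
ordered_texts = [t.text for t in reading_order(tokens)]
print(ordered_texts)
```

Position information of this kind is what lets a later step decide, for instance, which text belongs to the same line or region before classifying it into categories.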

As another example, the data search system 100 according to the present invention can receive audio data, convert the audio data into text, and then classify the converted text into different categories.

As shown in FIG. 1, the data search system 100 according to the present invention includes at least one of a communication unit 110, a storage unit 120, an OCR unit 130, a control unit 140, and a speech recognition unit 150. However, the system is not limited to this configuration. The data search system 100 may include more or fewer components than those listed, and at least some of the components may be located at physically separate locations.

First, the communication unit 110 is a means for receiving an image 10 obtained by scanning a paper document, or audio data. It may include at least one of a communication unit, a scanning unit, and an input unit, or may consist of any other means for receiving the image 10.

The data search system 100 can receive content such as the image 10 or audio data via the communication unit 110 and perform data embedding on the content.

Next, the storage unit 120 may store various information according to the present invention. The storage unit 120 can be of many different types, and at least part of it may include a DB (database) 160. The DB 160 may be an external server or a cloud server physically separate from the data search system 100, and the data search system 100 can use the DB 160 in the same way as the storage unit 120 by communicating with it.

In other words, the storage unit 120 may be any space in which information related to the present invention is stored, and is not constrained to a particular physical location. This specification does not distinguish between the storage unit 120 and the DB 160, and data stored in the DB 160 is also described as data stored in the storage unit 120.

The storage unit 120 stores at least one of: i) data used to generate content (the image 10 obtained by scanning a paper document, or audio data) and data related to it; ii) training data used for machine learning of the data embedding model; and iii) embedded data.

Next, the OCR unit 130 is a means for recognizing the text contained in the image 10, and can recognize that text using at least one of various text recognition algorithms. The OCR unit 130 may recognize text using an algorithm based on artificial intelligence.

The OCR unit 130 can extract the text contained in the image 10 and the position information of the text. The position information of the text includes information on where the text is located within the image 10.

Next, the control unit 140 controls the overall operation of the data search system 100 according to the present invention. The control unit 140 may include a processor that processes artificial intelligence algorithms (or an artificial intelligence processor).

The control unit 140 also provides a work area for data embedding. This work area may also be called a "user environment" or "user interface" for performing data embedding or for performing machine learning for data embedding.

Such a work area may be output (or provided) on the display unit of an electronic device. The control unit 140 can perform data embedding, or machine learning for data embedding, based on user input received through a user input unit (for example, a touch screen or a mouse) that is provided in or linked with the electronic device. In addition, the control unit 140 can receive the content 10 and search for data corresponding to the received content 10 among the data 240a and 240b and the other data stored in the storage unit 120.

In the present invention, there is no particular restriction on the type of electronic device on which the work area is output, as long as the device can run the application according to the present invention. For example, the electronic device may be at least one of a smartphone, a mobile phone, a tablet PC, a computer, a notebook computer, a digital broadcasting terminal, a PDA (Personal Digital Assistant), a PMP (Portable Multimedia Player), a smart mirror, and a smart TV.

In the present invention, no reference numerals are assigned to the electronic device or to the display unit and user input unit provided in it. It will nevertheless be obvious to those skilled in the art that the work area is output on the display unit of the electronic device and that user input is received through a user input unit provided in or linked with the electronic device.

The data search system according to the present invention can search previously stored data for data corresponding to content.

As an example of using the data search system according to the present invention, the following describes the process of receiving an image 10 of a paper document and searching previously stored data for data corresponding to that image. As noted above, the received content is not limited to images.

As shown in FIG. 2, the data search system according to the present invention receives an image 10 of a paper document and, by text recognition, extracts the text and its position information from the image 10 (220). Characters in the text may not be recognized correctly, for example because the paper document is damaged (210), and such a case is referred to as containing noise 221. The data search system then classifies the extracted text 220 into a plurality of different categories and generates structured data 230. The structured data is expressed in a predetermined format (for example, JSON or XML).
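As a rough illustration of the structured data 230, the sketch below serializes category-keyed field values to JSON. The key names and the misrecognized store name ("NLP CAEE") are hypothetical; the source states only that a predetermined format such as JSON or XML is used.

```python
import json

# Hypothetical structured data after OCR and category classification.
# "NLP CAEE" stands in for a noisy recognition of a damaged receipt.
structured = {
    "store_name": "NLP CAEE",
    "address": "S City",
    "phone": "01-234-568",
}

payload = json.dumps(structured, ensure_ascii=False, indent=2)
print(payload)

# The same payload can be read back without loss.
restored = json.loads(payload)
```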

Using the structured data 230, the data search system 100 searches the previously stored data 240a and 240b and the other data stored in the storage unit 120 for the data 240a that corresponds to the image 10.
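The lookup described above can be sketched as a nearest-neighbor search over vectors. In this sketch a character-trigram hashing function (`toy_embed`) stands in for the trained deep learning model, which the source does not specify in this section; the record keys, separator tokens, and the contents of the second record are likewise made up for illustration.

```python
import math
import zlib
from typing import Dict, List

DIM = 256  # size of the toy vector space (arbitrary choice)


def toy_embed(text: str) -> List[float]:
    """Stand-in for the trained model: hash character trigrams into a unit vector."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        vec[zlib.crc32(gram.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def search(query: str, stored: Dict[str, str]) -> str:
    """Return the key of the stored record whose vector is most similar to the query."""
    q = toy_embed(query)
    return max(stored, key=lambda key: cosine(q, toy_embed(stored[key])))


stored = {
    "240a": "[NAME] NLP CAFE [ADDR] S City [TEL] 01-234-568",
    "240b": "[NAME] Deep Bakery [ADDR] T Town [TEL] 09-876-543",
}
noisy_query = "[NAME] NLP CAEE [ADDR] S Cify [TEL] 01-234-568"  # simulated OCR noise
best = search(noisy_query, stored)
print(best)
```

Because the comparison is by similarity rather than exact match, a query with a few misrecognized characters can still land closest to the intended record. In practice the stored vectors would be computed once and kept in the storage unit instead of being recomputed for every query.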

Here, the present invention embeds content that includes a plurality of field values belonging to different categories and uses the result for data search. The method of embedding the content is described in more detail below.

以下、前述したデータ検索システムを用いてデータの埋め込みを行う方法についてより具体的に説明する。特に、以下では、フローチャートを参照して、データの埋め込み方法についてまず説明する。 Hereinafter, a method for embedding data using the data search system described above will be explained in more detail. In particular, below, a data embedding method will first be described with reference to a flowchart.

FIG. 3 is a flowchart for explaining the data search method according to the present invention, FIG. 4 is a conceptual diagram for explaining the data embedding model according to the present invention, and FIG. 5 is a conceptual diagram showing vectors generated as a result of data embedding according to the present invention plotted in a vector space.

本発明によるデータ検索方法においては、コンテンツを受信するステップが行われる（Ｓ１１０）。 In the data search method according to the present invention, a step of receiving content is performed (S110).

前記コンテンツは、複数のフィールド値を含み、前記フィールド値は、複数の異なるカテゴリーにそれぞれ対応するようにしてもよい。前述したように、前記コンテンツは、フィールド値のカテゴリーが区分された形態のデータであるか、又はこのような形態に加工される。例えば、複数のフィールド値とそのカテゴリーは、既に定められた形式（例えば、ＪＳＯＮやＸＭＬなど）で表される。 The content may include a plurality of field values, and the field values may each correspond to a plurality of different categories. As described above, the content is data in which field value categories are classified, or is processed into such a format. For example, a plurality of field values and their categories are expressed in a predetermined format (eg, JSON, XML, etc.).
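
As a minimal illustrative sketch (not part of the disclosed embodiments), content in which each field value is tied to its category could be held as a JSON object and read into ordered (category, field value) pairs. The field names `name` and `address` and the helper `parse_content` are assumptions made only for this example.

```python
import json

# Hypothetical structured content; field names and values are illustrative only.
raw = '{"name": "NLP COFFEE", "address": "S City"}'


def parse_content(raw_json: str) -> list[tuple[str, str]]:
    """Return (category, field value) pairs in the order they appear in the JSON object."""
    data = json.loads(raw_json)
    return [(category, str(value)) for category, value in data.items()]


fields = parse_content(raw)
print(fields)
```

Keeping the pairs in a list preserves the predetermined field order that the later arrangement step relies on.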

That is, the data search system according to the present invention either receives, from outside, data in which the categories of the field values are already distinguished, or receives original data (an image of a paper document, or audio data) and then generates, based on the original data, data organized as field name-field value pairs for use in the search.

Next, a step is performed of arranging the field values included in the content and adding, based on the category to which each field value belongs, a plurality of delimiters that distinguish the categories of the field values, thereby generating a model input value (S120).

The data search system 100 according to the present invention arranges the plurality of field values sequentially to generate an input value for the deep learning model used for data embedding. Here, the data search system 100 can use delimiters that distinguish the category to which each field value belongs, so that field values belonging to different categories are kept apart in the model input value. In addition to these category delimiters, a delimiter indicating the start or end of the data input, a delimiter indicating that a field value is absent, and the like may also be used.

例えば、データ検索システム１００は、コンテンツに含まれるフィールド値を所定の順序で連結して１つのデータを生成し、フィールド値の前部又は後部に区分子を配列する。よって、モデル入力値は、複数のフィールド値と複数の区分子が所定の順序で一列に配列されたデータであってもよい。 For example, the data search system 100 generates one piece of data by concatenating field values included in content in a predetermined order, and arranges delimiters at the front or rear of the field values. Therefore, the model input value may be data in which a plurality of field values and a plurality of delimiters are arranged in a line in a predetermined order.

Meanwhile, the field values correspond to a plurality of categories, respectively, and the plurality of delimiters added to the model input value also correspond to the plurality of categories, respectively. That is, the model input value includes a field value and a delimiter for each of the plurality of categories.

Here, a field value and a delimiter belonging to the same category are arranged adjacent to each other. That is, a specific delimiter and a specific field value corresponding to a specific category among the plurality of categories are arranged adjacent to each other. In this specification, when referring to a field value and a delimiter that belong to the same category, the delimiter is described as the delimiter corresponding to the field value, and the field value as the field value corresponding to the delimiter.

The specific delimiter corresponding to a specific field value may be arranged before or after that field value. Accordingly, some of the delimiters included in the model input value may be arranged at the beginning or end of the model input value, or between different field values.

Referring to FIG. 2, for the input image 10, the model input value is generated as "[CLS]NLP COFFEE[SEP_Name]S City[SEP_Address]". Here, each bracketed item (the brackets together with the text inside them) is a delimiter, either one that plays a specific role or one that separates field values belonging to different categories, and text outside brackets is a field value included in the content. In this example, [CLS] is a class delimiter representing the data as a whole, [SEP_Name] is a category delimiter marking the end of the name field value, and [SEP_Address] is a category delimiter marking the end of the address field value.
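
The arrangement described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_model_input` and its `position` parameter are hypothetical, and the sketch simply concatenates field values with a category delimiter placed before or after each value, starting with [CLS].

```python
def build_model_input(fields, position="after"):
    """Concatenate field values with one category delimiter per field.

    fields: list of (category, field value) pairs in the predetermined order.
    position: "after" puts each delimiter behind its field value,
              "before" puts it in front (both variants appear in the text).
    """
    parts = ["[CLS]"]  # delimiter representing the data as a whole
    for category, value in fields:
        delimiter = f"[SEP_{category}]"
        if position == "after":
            parts += [value, delimiter]
        elif position == "before":
            parts += [delimiter, value]
        else:
            raise ValueError("position must be 'before' or 'after'")
    return "".join(parts)


model_input = build_model_input([("Name", "NLP COFFEE"), ("Address", "S City")])
print(model_input)  # [CLS]NLP COFFEE[SEP_Name]S City[SEP_Address]
```

Either placement keeps each delimiter adjacent to the field value of its own category, which is the property the description relies on.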

Meanwhile, when the content includes a field name that defines a specific category but does not include a field value corresponding to that category, the model input value may include a special delimiter (a mask) corresponding to that category. The mask may be arranged adjacent to the delimiter corresponding to that category.

When a field value is absent, the model input value generated from that content may be constructed by placing a mask at the position where the field value of the corresponding category would otherwise be arranged. For example, when the content includes a field name defining "business registration number" but no field value for it, the missing field value is represented by the corresponding mask ([MASK_biz]), and the model input value is generated as "[CLS]NLP CAFE[SEP_name]S city[MASK_biz][SEP_biz]".
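
A hedged sketch of the mask substitution follows. The helper `build_masked_input` is hypothetical, and it emits a delimiter for every category, so its output contains an [SEP_address] delimiter that the example string in the text does not show.

```python
def build_masked_input(field_names, field_values):
    """Build a model input value, substituting a mask for any missing field value.

    field_names: categories present in the content, in the predetermined order.
    field_values: mapping from category to field value; missing or empty
                  entries are replaced by a category-specific mask.
    """
    parts = ["[CLS]"]
    for name in field_names:
        value = field_values.get(name)
        # The mask takes the position the field value would have occupied,
        # so it ends up adjacent to the delimiter of the same category.
        parts.append(value if value else f"[MASK_{name}]")
        parts.append(f"[SEP_{name}]")
    return "".join(parts)


masked_input = build_masked_input(
    ["name", "address", "biz"],
    {"name": "NLP CAFE", "address": "S city"},  # no business registration number
)
print(masked_input)  # [CLS]NLP CAFE[SEP_name]S city[SEP_address][MASK_biz][SEP_biz]
```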

次に、前記モデル入力値及び学習されたディープラーニングモデルを用いて前記コンテンツのベクトルを生成するステップが行われる（Ｓ１３０）。 Next, a step of generating a vector of the content using the model input value and the learned deep learning model is performed (S130).

ここで、前記モデル入力値に含まれる前記フィールド値のそれぞれを少なくとも１つの第１タイプトークンに変換するステップ、及び前記複数の区分子のそれぞれを第２タイプトークンに変換するステップが行われてもよい。 Here, the step of converting each of the field values included in the model input value into at least one first type token, and the step of converting each of the plurality of delimiters into a second type token may be performed. good.

データ検索システム１００は、前記複数のカテゴリーのそれぞれに対応する第１タイプトークンが互いに区分されるように、特定のカテゴリーに対応するフィールド値及び区分子から変換された第１及び第２タイプトークンを互いに隣接して配列する。 The data retrieval system 100 converts first and second type tokens from field values and delimiters corresponding to a specific category so that the first type tokens corresponding to each of the plurality of categories are classified from each other. Arrange adjacent to each other.

ここで、１つのフィールド値に対応する第１タイプトークンは、１つ以上生成されてもよい。 Here, one or more first type tokens corresponding to one field value may be generated.

In one embodiment, a plurality of first type tokens may be generated from a single field value. A field value consisting of one or more words is divided into a plurality of text pieces during token conversion, and a predetermined text may be attached to at least some of the pieces. For example, the field value "NLP COFFEE" included in the model input value is converted into the first type tokens "NL", "#P", "COFF", and "#EE". Here, the text "#" in a first type token indicates that there is no space between that token and the preceding token, and it is attached to some of the pieces divided from the field value.

前記フィールド値のうち特定のフィールド値に対応する第１タイプトークンは、１つ又はそれ以上から構成されてもよく、前記特定のフィールド値に対応する複数の第１タイプトークンは、互いに隣接して配列されてもよい。例えば、フィールド値「ＮＬＰＣＯＦＦＥＥ」から生成された複数の第１タイプトークン（「ＮＬ」、「♯Ｐ」、「ＣＯＦＦ」、「♯ＥＥ」）は、順次配列されてもよい。 The first type tokens corresponding to a particular field value among the field values may be composed of one or more, and the plurality of first type tokens corresponding to the particular field value are adjacent to each other. May be arranged. For example, the plurality of first type tokens ("NL", "#P", "COFF", "#EE") generated from the field value "NLP COFFEE" may be arranged in sequence.

一方、第２タイプトークンは、モデル入力値に含まれる複数の区分子のそれぞれから変換されたものであってもよい。 On the other hand, the second type token may be one converted from each of a plurality of delimiters included in the model input value.

一実施形態において、第２タイプトークンは、異なるカテゴリーに属するフィールド値を区分するようになっているが、第２タイプトークン自体が特定の意味を含まない形態からなるようにしてもよい。例えば、第２タイプトークンは、［ＳＥＰ１］、［ＳＥＰ２］、［ＳＥＰ３］の形態からなるようにしてもよい。 In one embodiment, the second type token is adapted to distinguish field values belonging to different categories, but the second type token itself may be of a form that does not include a specific meaning. For example, the second type token may be in the form of [SEP1], [SEP2], and [SEP3].

他の一実施形態において、第２タイプトークンのそれぞれは、当該第２タイプトークンに対応するカテゴリーの属性を示す値を含んでもよい。具体的には、複数の第２タイプトークンは、前記複数のカテゴリーを示すテキストをそれぞれ含み、前記第２タイプトークンのうちいずれか１つに含まれるテキストと他の１つに含まれるテキストとは異なるものであってもよい。例えば、第２タイプトークンは、［ＳＥＰ＿Ｎａｍｅ］、［ＳＥＰ＿Ａｄｄｒｅｓｓ］のように、特定のカテゴリーのフィールド名を含んでもよい。 In another embodiment, each of the second type tokens may include a value indicating an attribute of the category corresponding to the second type token. Specifically, the plurality of second type tokens each include text indicating the plurality of categories, and the text included in any one of the second type tokens is different from the text included in the other one. They may be different. For example, the second type token may include field names of a particular category, such as [SEP_Name], [SEP_Address].

一方、特定のカテゴリーに属するフィールド値から変換された第１タイプトークンが複数である場合、前記特定のカテゴリーに対応する第２タイプトークンは、前記複数の第１タイプトークンのうち最初に配列された第１タイプトークンの前部又は前記複数の第１タイプトークンのうち最後に配列された第１タイプトークンの後部に配列されるようにしてもよい。 On the other hand, when there are multiple first type tokens converted from field values belonging to a specific category, the second type token corresponding to the specific category is arranged first among the plurality of first type tokens. The first type tokens may be arranged at the front of the first type token or at the rear of the first type token arranged last among the plurality of first type tokens.

例えば、モデル入力値「［ＣＬＳ］ＮＬＰ＿ＣＯＦＦＥＥ［ＳＥＰ＿Ｎａｍｅ］ＳＣｉｔｙ［ＳＥＰ＿Ａｄｄｒｅｓｓ］」から変換された第１及び第２タイプトークンは、「［ＣＬＳ］／ＮＬ／♯Ｐ／ＣＯＦＦ／♯ＥＥ／［ＳＥＰ１］／Ｓ／Ｃｉ／♯ｔｙ／［ＳＥＰ２］」のように配列される。（「／」は単にトークンを区分するための表示である）なお、モデル入力値がマスクを含む場合、マスクトークンは、マスクトークンに対応する第２タイプトークンに隣接して配列される。 For example, the first and second type tokens converted from the model input value "[CLS]NLP_COFFEE[SEP_Name]S City[SEP_Address]" are "[CLS]/NL/#P/COFF/#EE/[SEP1]". /S/Ci/#ty/[SEP2]". (The "/" is simply a display for distinguishing tokens.) Note that when the model input value includes a mask, the mask token is arranged adjacent to the second type token corresponding to the mask token.
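
The token arrangement above can be illustrated with a small sketch. It assumes a greedy longest-match subword split over a hypothetical vocabulary `VOCAB`; the description does not state which tokenization algorithm is used. Unlike the example in the text, the sketch keeps the category delimiters as written ([SEP_Name], [SEP_Address]) rather than renaming them to [SEP1], [SEP2].

```python
import re

# Hypothetical subword vocabulary chosen so the example splits as in the text.
VOCAB = {"NL", "P", "COFF", "EE", "S", "Ci", "ty"}
DELIMITER = re.compile(r"(\[[^\]]+\])")


def tokenize_word(word):
    """Greedy longest-match split; pieces after the first get a '#' prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in VOCAB:
            end -= 1
        if end == start:  # unknown character: emit it as a single-character piece
            end = start + 1
        piece = word[start:end]
        tokens.append(piece if start == 0 else "#" + piece)
        start = end
    return tokens


def tokenize(model_input):
    """Return first type tokens (field text) and second type tokens (delimiters) in order."""
    tokens = []
    for chunk in DELIMITER.split(model_input):
        if not chunk:
            continue
        if DELIMITER.fullmatch(chunk):
            tokens.append(chunk)  # second type token (delimiter)
        else:
            for word in chunk.split():
                tokens.extend(tokenize_word(word))  # first type tokens
    return tokens


tokens = tokenize("[CLS]NLP COFFEE[SEP_Name]S City[SEP_Address]")
print("/".join(tokens))
```

Because the delimiters are split out before the field text is tokenized, all first type tokens of one field value stay contiguous and end at the delimiter of their own category.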

配列された第１、第２タイプトークン及びマスクトークンが既に学習されたディープラーニングモデルに入力され、コンテンツに対応するベクトルが生成される。 The arranged first and second type tokens and mask tokens are input to a trained deep learning model to generate a vector corresponding to the content.

データの埋め込みのためのディープラーニングモデルとしては、シーケンスを埋め込む際に活用できるモデル、具体的にはＲＮＮ又はＴｒａｎｓｆｏｒｍｅｒ類のモデル（例えば、ＢＥＲＴなど）を活用することができる。 As a deep learning model for embedding data, a model that can be used when embedding a sequence, specifically, an RNN or a Transformer type model (for example, BERT) can be used.

学習されたディープラーニングモデルは、異なるカテゴリーに属するフィールド値を含む構造化されたデータのベクトルを生成する。具体的には、前記学習されたディープラーニングモデルは、保存部１２０に保存されたデータのそれぞれのベクトルを生成し、受信したコンテンツのベクトルを生成する。すなわち、コンテンツ及び前記コンテンツを用いて検索しようとする既に保存されたデータをベクトル化する。 The trained deep learning model produces a vector of structured data containing field values belonging to different categories. Specifically, the trained deep learning model generates vectors for each of the data stored in the storage unit 120 and generates a vector for the received content. That is, the content and already stored data to be searched using the content are vectorized.

An embodiment of generating a content vector will be described with reference to FIG. 5. For example, the control unit 140 performs OCR 520 on an image 510 of a target document to generate OCR data, generates from the OCR data structured data 530 including a plurality of field values belonging to different categories, and generates a vector of the content 530 through data embedding 540.

一方、既に保存されたデータも、学習されたディープラーニングモデルによりベクトル化される。既に保存されたデータのそれぞれに対応するベクトル５５１ａ～５５３ａ、５５１ｂ～５５４ｂ、５５１ｃ～５５４ｃは、ベクトル平面上に示される。既に保存されたデータのベクトルの生成時に既に保存されたデータを構造化されたデータに変換するステップ（例えば、５１０及び５２０）は省略される。 Meanwhile, already stored data is also vectorized using the learned deep learning model. Vectors 551a to 553a, 551b to 554b, and 551c to 554c corresponding to the previously saved data are shown on the vector plane. When generating a vector of already stored data, the steps of converting already stored data into structured data (eg, 510 and 520) are omitted.

図５においては、説明の便宜上、データの埋め込みにより生成されるベクトルを２次元的に示すが、データの埋め込みにより生成されるベクトルは２次元より大きい次元のベクトルであってもよい。 In FIG. 5, for convenience of explanation, vectors generated by data embedding are shown two-dimensionally, but vectors generated by data embedding may have dimensions larger than two dimensions.

Meanwhile, in FIG. 5, for convenience of explanation, the data to be embedded is described as containing only two categories (name, tel), but the data to be embedded may contain a larger number of fields.

As shown in the figure, among the plotted vectors, some vectors 551a to 553a are located adjacent to one another within a first region 550a. Other vectors 551b to 554b are located adjacent to one another within a second region 550b, and still other vectors 551c to 554c are located adjacent to one another within a third region 550c.

The deep learning model is trained so that the distance between vectors varies with the similarity between the data. Specifically, the model is trained so that the vectors of two pieces of data are placed closer together the more similar the data are, and farther apart the less similar they are.
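
The description does not specify a training objective. As one common way to obtain this behavior, assumed here purely for illustration, a triplet margin loss penalizes the model whenever an anchor is not closer to a similar example (positive) than to a dissimilar one (negative) by at least a margin. The function names and the margin value are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero when the positive is closer to the anchor than the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Well-separated case: positive at distance 1, negative at distance 5.
good = triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
# Violating case: the "similar" example is farther away than the dissimilar one.
bad = triplet_loss((0.0, 0.0), (3.0, 4.0), (0.0, 1.0))
print(good, bad)  # 0.0 5.0
```

Minimizing such a loss over many triplets pulls vectors of similar data together and pushes vectors of dissimilar data apart.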

The vector generated as a result of data embedding for the content 530, which was generated based on the OCR data, is plotted in the first region 550a.

前述したように、本発明は、データに含まれる複数のフィールド値が属するカテゴリーを区分してデータの埋め込みを行うので、データに含まれる異なるカテゴリーの特徴が維持されたベクトルを生成することができる。 As described above, the present invention embeds data by classifying the categories to which multiple field values included in the data belong, so it is possible to generate vectors that maintain the characteristics of different categories included in the data. .

また、本発明は、生成された異なるカテゴリーの特徴が維持されたベクトルをデータ検索に活用することにより、既に保存されたデータと同じデータのみを検索することに限定されず、対象文書に含まれる複数のカテゴリーに属する値の類似度を考慮したデータ検索を行うことができる。以下、前記ベクトルを活用したデータ検索について具体的に説明する。 Furthermore, by utilizing the generated vectors in which characteristics of different categories are maintained for data retrieval, the present invention is not limited to retrieving only the same data as already stored data, but also the data included in the target document. It is possible to perform data searches that take into account the similarity of values belonging to multiple categories. Hereinafter, a data search using the vectors will be specifically explained.

図６ａ及び図６ｂはＯＣＲデータを用いてデータを検索する一実施形態を示す概念図であり、図７はデータの埋め込みの結果で生成されたベクトルを用いてデータを検索する一実施形態を示す概念図である。 6a and 6b are conceptual diagrams showing an embodiment of searching data using OCR data, and FIG. 7 shows an embodiment of searching data using a vector generated as a result of data embedding. It is a conceptual diagram.

ディープラーニングを用いて前記コンテンツのベクトルを生成し、その後前記生成されたベクトルと既に保存された複数のデータのそれぞれに対応するベクトル間の類似度に基づいて、前記既に保存された複数のデータから検索対象データに対応するデータを検索するステップが行われる（Ｓ１４０）。 A vector of the content is generated using deep learning, and then a vector is generated from the plurality of already stored data based on the similarity between the generated vector and the vector corresponding to each of the plurality of already stored data. A step of searching for data corresponding to the search target data is performed (S140).

前述した作業領域には、イメージを用いてデータを検索するためのインタフェース画面を表示することができる。 In the aforementioned work area, an interface screen for searching data using images can be displayed.

図６ａ及び図６ｂを参照すると、作業領域には、検索対象コンテンツ、例えば領収証のイメージ６００が出力される。イメージ６００は、既に保存されたイメージのいずれかであるか、作業領域を表示する電子機器に内蔵されたカメラにより撮影されたイメージであるか、作業領域を表示する電子機器以外の他の装置から受信されたイメージであってもよい。 Referring to FIGS. 6a and 6b, search target content, such as an image 600 of a receipt, is output in the work area. The image 600 is either an already stored image, an image taken by a camera built into the electronic device displaying the work area, or an image from another device other than the electronic device displaying the work area. It may also be a received image.

一方、作業領域には、イメージ６００に対するＯＣＲの結果で抽出されたテキストを表示することができる。ＯＣＲの結果で抽出されたテキストのうちフィールド名に分類されたデータは、前記抽出されたテキストがそのまま表示されるのではなく、既に保存されたテキストが表示されるようにしてもよく、また、抽出されたテキストに存在しなくても作業領域上に表示されるようにしてもよい。 Meanwhile, text extracted as a result of OCR on the image 600 can be displayed in the work area. For data classified into field names among the text extracted as a result of OCR, the extracted text may not be displayed as is, but already saved text may be displayed. It may be displayed on the work area even if it does not exist in the extracted text.

例えば、図６ａを参照すると、イメージ６００には、売場名に関するカテゴリーが存在するが、当該カテゴリーに関するフィールド名は含まれていない。制御部１４０は、フィールド値「ＨＬＰＣｏｆｆｅｅ」の意味に基づいて、イメージ６００に売場名に関するカテゴリーが存在すると判断し、作業領域に既に保存されたフィールド名（「ｎａｍｅ」６１１）を表示することができる。 For example, referring to FIG. 6a, image 600 includes a category related to store names, but does not include field names related to the category. The control unit 140 determines that a category related to the store name exists in the image 600 based on the meaning of the field value "HLP Coffee", and displays the field name ("name" 611) already saved in the work area. can.

一方、制御部１４０は、第２タイプのデータ「カフェラッテ（ｈｏｔ）」に基づいて、イメージ６００に商品名に関するカテゴリーが存在すると判断する。ここで、制御部１４０は、抽出されたテキストに商品名に関するカテゴリーに対応するフィールド名「商品名」が存在するが、既に保存されたフィールド名「ｉｔｅｍ１」を作業領域上に表示することができる。 On the other hand, the control unit 140 determines that a category related to the product name exists in the image 600 based on the second type of data "cafe latte (hot)." Here, the control unit 140 can display the already saved field name "item1" on the work area, although there is a field name "product name" corresponding to a category related to the product name in the extracted text. .

前述したように、作業領域には、ＯＣＲの結果で抽出されたテキストを、フィールド名６１１～６１５及びフィールド値６２１～６２５に区分して表示することができる。ここで、制御部１４０は、同一のカテゴリーに属するフィールド名及びフィールド値をマッチングさせ、そのマッチングの結果に基づいてデータを表示することができる。例えば、同一のカテゴリーに属するフィールド名「ｎａｍｅ」及びフィールド値「ＨＬＰＣｏｆｆｅｅ」がマッチングされ、作業領域上で互いに隣接して表示される。 As described above, in the work area, the text extracted as a result of OCR can be divided into field names 611 to 615 and field values 621 to 625 and displayed. Here, the control unit 140 can match field names and field values belonging to the same category, and display data based on the matching result. For example, a field name "name" and a field value "HLP Coffee" belonging to the same category are matched and displayed adjacent to each other on the work area.

制御部１４０は、抽出されたテキストの意味に基づいて同一のカテゴリーに属するデータをマッチングさせてコンテンツを生成し、生成されたコンテンツ及びディープラーニングモデルを用いてベクトルを生成する。その後、生成されたベクトルと既に保存された複数のデータのベクトル間の距離を比較し、その比較の結果に基づいて、前記既に保存された複数のデータのベクトルから少なくとも１つを選択する。 The control unit 140 generates content by matching data belonging to the same category based on the meaning of the extracted text, and generates a vector using the generated content and a deep learning model. Thereafter, the distance between the generated vector and the plurality of already stored data vectors is compared, and based on the comparison result, at least one is selected from the plurality of already stored data vectors.

制御部１４０は、学習されたディープラーニングモデルを用いてデータベースに既に保存された「属性－値」形式のデータを前述したベクトルに変換する。データベースに既に保存されたデータは、図５で説明したように、ベクトル空間に表すことができる。 The control unit 140 converts the "attribute-value" format data already stored in the database into the above-mentioned vector using the learned deep learning model. Data already stored in the database can be represented in a vector space as explained in FIG.

The control unit 140 may calculate the distance between the vector of the content and each of the other, already stored vectors, and select at least one of the stored vectors in ascending order of distance.
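
The selection step can be sketched as a nearest-neighbor lookup. The sketch assumes Euclidean distance (the description only speaks of a distance between vectors), and the two-dimensional coordinates and the helper `nearest` are hypothetical; the keys merely reuse reference numerals from the figures as labels.

```python
import math


def nearest(query, stored, k=1):
    """Return the k stored entries closest to `query` as (key, distance) pairs, nearest first."""
    ranked = sorted(stored.items(), key=lambda item: math.dist(query, item[1]))
    return [(key, math.dist(query, vector)) for key, vector in ranked[:k]]


# Hypothetical 2-D vectors for already stored data.
stored_vectors = {
    "551a": (1.0, 1.0),
    "552a": (1.2, 0.9),
    "551b": (5.0, 5.0),
}
query_vector = (1.25, 0.9)  # vector of the received content

results = nearest(query_vector, stored_vectors, k=2)
print([key for key, _ in results])  # ['552a', '551a']
```

A linear scan like this is adequate for a small store; for a large number of stored vectors an approximate nearest-neighbor index would be the usual substitute.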

その後、制御部１４０は、前記選択されたベクトルに対応するデータを出力することができる。前記選択されたベクトルに対応するデータは、前述した作業領域上に表示することができる。 Thereafter, the control unit 140 may output data corresponding to the selected vector. Data corresponding to the selected vector can be displayed on the aforementioned work area.

To this end, the data embedding model according to the present invention may include a decoder that converts the aforementioned vector into data consisting of "attribute-value" pairs. The decoder is trained by machine learning to convert a given vector into data of the same form as the data that was used as input when that vector was generated. The decoder therefore outputs data in a form in which the categories of the field values are distinguished (e.g., JSON or XML).

例えば、図６ｂを参照すると、作業領域には、図６ａで説明したイメージから生成されたコンテンツに対応するベクトルからの距離が最も近い第１ベクトル及び２番目に近い第２ベクトルのそれぞれに対応する「属性－値」対からなるデータ６３１、６３２が表示される。前記データのうち、図６ａで説明したイメージ６００に対応するデータ６３１が含まれる。 For example, referring to FIG. 6b, the work area includes a first vector and a second vector that are the closest in distance from the vector corresponding to the content generated from the image described in FIG. 6a, respectively. Data 631 and 632 consisting of "attribute-value" pairs are displayed. Among the data, data 631 corresponding to the image 600 described in FIG. 6a is included.

より具体的には、図７を参照すると、制御部１４０は、既に保存された複数のデータに対応するベクトル５５１ａ～５５３ｃと前記コンテンツに対応するベクトル５６０間の距離を算出する。その結果、既に保存された複数のデータに対応するベクトル５５１ａ～５５３ｃのそれぞれに対する距離ｄ１～ｄ３が算出される。制御部１４０は、既に保存されたデータに対応するベクトル５５１ａ～５５３ｃからコンテンツに対応するベクトル５６０からの距離が最も近いベクトル５５２ａを選択し、ベクトル５５２ａを「属性－値」対のデータ（Ｎａｍｅ：ＮＬＰＣＯＦＦＥＥ、Ｔｅｌ：０１－２３４－５６７）に変換して出力することができる。 More specifically, referring to FIG. 7, the control unit 140 calculates the distance between vectors 551a to 553c corresponding to a plurality of previously stored data and a vector 560 corresponding to the content. As a result, distances d1 to d3 are calculated for each of the vectors 551a to 553c corresponding to a plurality of already saved data. The control unit 140 selects the vector 552a having the closest distance from the vector 560 corresponding to the content from the vectors 551a to 553c corresponding to already saved data, and converts the vector 552a into "attribute-value" pair data (Name: NLP COFFEE, Tel: 01-234-567) and output.

前述したように、本発明によれば、埋め込みの結果で生成されたベクトル間の類似度に基づいてデータを検索するので、データ検索時に人が定めた検索規則に依存して検索を行う必要がなくなる。 As described above, according to the present invention, data is searched based on the similarity between vectors generated as a result of embedding, so there is no need to rely on search rules determined by a person when searching for data. It disappears.

また、本発明によれば、データのカテゴリー毎の類似度を考慮した検索が可能であるので、ノイズやエラーが頻繁に発生するデータ（例えば、ＯＣＲデータ、音声認識データ）を用いた検索時にも高い正確度でデータ検索を行うことができる。 Furthermore, according to the present invention, since it is possible to perform a search that takes into account the degree of similarity for each category of data, it is possible to perform a search using data that frequently contains noise or errors (for example, OCR data, voice recognition data). Data searches can be performed with high accuracy.

一方、前述した本発明は、コンピュータで１つ以上のプロセスにより実行され、コンピュータ可読媒体（又は記録媒体）に格納可能なプログラムとして実現することができる。 On the other hand, the present invention described above can be implemented as a program that is executed by one or more processes on a computer and can be stored on a computer-readable medium (or recording medium).

また、前述した本発明は、プログラム記録媒体にコンピュータ可読コード又はコマンドとして実現することができる。すなわち、本発明は、プログラムの形態で提供することができる。 Further, the present invention described above can be implemented as computer readable codes or commands on a program recording medium. That is, the present invention can be provided in the form of a program.

一方、コンピュータ可読媒体は、コンピュータシステムにより読み取り可能なデータが記録されるあらゆる種類の記録装置を含む。コンピュータ可読媒体の例としては、ＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｉｓｋ）、ＳＤＤ（ＳｉｌｉｃｏｎＤｉｓｋＤｒｉｖｅ）、ＲＯＭ、ＲＡＭ、ＣＤ－ＲＯＭ、磁気テープ、フロッピー（登録商標）ディスク、光データ記憶装置などが挙げられる。 On the other hand, the computer-readable medium includes any type of recording device on which data that can be read by a computer system is recorded. Examples of computer readable media include HDD (Hard Disk Drive), SSD (Solid State Disk), SDD (Silicon Disk Drive), ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical data storage. Examples include equipment.

また、コンピュータ可読媒体は、ストレージを含み、電子機器が通信によりアクセスできるサーバ又はクラウドストレージであり得る。この場合、コンピュータは、有線又は無線通信により、サーバ又はクラウドストレージから本発明によるプログラムをダウンロードすることができる。 The computer-readable medium also includes storage and can be a server or cloud storage that the electronic device can access via communication. In this case, the computer can download the program according to the invention from the server or cloud storage by wired or wireless communication.

さらに、本発明において、前述したコンピュータは、プロセッサ、すなわち中央処理装置（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ，ＣＰＵ）が搭載された電子機器であり、その種類は特に限定されない。 Furthermore, in the present invention, the computer described above is an electronic device equipped with a processor, that is, a central processing unit (Central Processing Unit, CPU), and its type is not particularly limited.

一方、本発明の詳細な説明は例示的なものであり、あらゆる面で限定的に解釈されてはならない。本発明の範囲は添付の特許請求の範囲の合理的解釈により定められるべきであり、本発明の均等の範囲内でのあらゆる変更が本発明の範囲に含まれる。 On the other hand, the detailed description of the present invention is illustrative and should not be construed as limiting in any respect. The scope of the present invention should be determined by reasonable interpretation of the appended claims, and all modifications within the scope of equivalents of the present invention are included within the scope of the present invention.

10 Image
100 Data search system
110 Communication unit
120 Storage unit
130 OCR unit
140 Control unit
150 Voice recognition unit
160 DB (database)

Claims

1. A data search method comprising:
receiving content including a plurality of field values;
arranging the field values included in the content and adding a plurality of delimiters that distinguish the categories to which the field values belong, thereby generating a model input value;
generating a vector of the content using the model input value and a trained deep learning model; and
searching a plurality of already stored data for data corresponding to the search target data, based on the similarity between the generated vector and the vector corresponding to each of the plurality of already stored data.

2. The data search method according to claim 1, wherein the field values correspond to a plurality of categories, respectively, and the plurality of delimiters correspond to the plurality of categories, respectively.

3. The data search method according to claim 2, wherein a specific delimiter and a specific field value corresponding to a specific category among the plurality of categories are arranged adjacent to each other.

4. The data search method according to claim 3, wherein the specific delimiter is arranged before or after the specific field value.

5. The data search method according to claim 4, further comprising:
converting each of the field values included in the model input value into at least one first type token; and
converting each of the plurality of delimiters into a second type token,
wherein the first and second type tokens converted from the field value and the delimiter corresponding to the specific category are arranged adjacent to each other such that the first type tokens corresponding to the respective categories are distinguished from one another.

6. The data search method according to claim 5, wherein each of the second type tokens includes text indicating one of the plurality of categories, and the text included in any one of the second type tokens differs from the text included in another.

7. The data search method according to claim 5, wherein a plurality of first type tokens correspond to the specific field value among the field values, and the plurality of first type tokens corresponding to the specific field value are arranged adjacent to one another.

8. The data search method according to claim 7, wherein the second type token corresponding to the category to which the specific field value belongs is arranged before the first-arranged of the plurality of first type tokens or after the last-arranged of the plurality of first type tokens.

9. The data search method according to claim 1, wherein, when the content includes a field name defining a specific category and does not include a field value corresponding to the specific category, the model input value includes a mask token corresponding to the specific category, and the mask token is arranged adjacent to the delimiter corresponding to the specific category.

10. The data search method according to claim 1, wherein the step of searching for data corresponding to the search target data from the plurality of already stored data comprises the steps of:
comparing the distance between the vector of the content and the vector corresponding to each of the plurality of already stored data;
selecting at least one vector from the vectors corresponding to each of the plurality of already stored data based on the comparison result; and
outputting data corresponding to the selected vector.
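
The three steps of this claim (compare distances, select, output) can be sketched as a brute-force nearest-neighbor lookup. The claim does not name a distance measure; Euclidean distance is assumed here, and the function names, record identifiers, and vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_by_distance(
    query: Sequence[float],
    stored: Dict[str, Sequence[float]],
    top_k: int = 1,
) -> List[str]:
    """Return the ids of the top_k stored vectors closest to the query vector."""
    # Compare the query against every stored vector, then keep the nearest ones.
    ranked = sorted(stored, key=lambda record_id: euclidean(query, stored[record_id]))
    return ranked[:top_k]


stored_vectors = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
nearest = search_by_distance([0.9, 1.2], stored_vectors, top_k=2)
print(nearest)
```

A production system would more likely use an approximate nearest-neighbor index than a full sort; the exhaustive scan is used here only to keep the sketch short.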

11. The data search method according to claim 10, wherein the data corresponding to the selected vector includes a plurality of field values belonging to different categories, and
the plurality of field values are classified by category.

12. The data search method according to claim 1, wherein the content is data in a form in which the categories to which the plurality of field values belong are classified.

13. The data search method according to claim 12, wherein the content is generated from either OCR data relating to an image or audio data.

14. A data search system comprising:
a communication unit that receives content including a plurality of field values; and
a control unit that arranges the field values included in the content, adds a plurality of delimiters for classifying the categories to which the field values belong to generate a model input value,
generates a vector of the content using the model input value and a trained deep learning model, and
searches for data corresponding to the search target data from the plurality of already stored data based on the similarity between the generated vector and the vector corresponding to each of the plurality of already stored data.

15. A computer program including a plurality of instructions and executed on a computer, wherein the instructions, when executed, cause the computer to perform the steps of:
receiving content including a plurality of field values;
arranging the field values included in the content, adding a plurality of delimiters for classifying the categories to which the field values belong, and generating a model input value;
generating a vector of the content using the model input value and a trained deep learning model; and
searching for data corresponding to the search target data from the plurality of already stored data based on the similarity between the generated vector and the vector corresponding to each of the plurality of already stored data.
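
For orientation, the four steps recited in the system and program claims can be strung together as below. This is a toy sketch, not the claimed implementation: a token-count vector (`toy_embed`) stands in for the trained deep learning model, cosine similarity is an assumed similarity measure, and all names and sample records are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def build_model_input(fields: Dict[str, str]) -> List[str]:
    """Arrange field values, each preceded by a delimiter naming its category."""
    sequence: List[str] = []
    for category, value in fields.items():
        sequence.append(f"[{category.upper()}]")
        sequence.extend(value.split())
    return sequence


def toy_embed(tokens: List[str]) -> Counter:
    """Stand-in for the trained model: a sparse token-count vector."""
    return Counter(tokens)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(content: Dict[str, str], stored: List[Dict[str, str]]) -> Dict[str, str]:
    """Return the stored record whose vector is most similar to the content's vector."""
    query_vector = toy_embed(build_model_input(content))
    return max(stored, key=lambda rec: cosine(query_vector, toy_embed(build_model_input(rec))))


stored_records = [
    {"name": "Alpha Shop", "city": "Seoul"},
    {"name": "Beta Mart", "city": "Busan"},
]
best_match = search({"name": "Alpha", "city": "Seoul"}, stored_records)
print(best_match)
```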