JP2023507881A

JP2023507881A - Domain-based text extraction method and system

Info

Publication number: JP2023507881A
Application number: JP2022525481A
Authority: JP
Inventors: マドゥスダンシン，; カウシクハルダー，; ベンカタラメッシュラユル，ニルマルバナパリ; ダスティダル，アリトラゴーシュ; アジェイシャ，
Original assignee: L&T Technology Services Limited
Priority date: 2019-12-30
Filing date: 2020-12-30
Publication date: 2023-02-28
Also published as: CA3156204A1; AU2020418619A1; EP4085343A4; EP4085343A1; WO2021137166A1

Abstract

The present disclosure relates to methods and systems for extracting information from the content of an input file. The method may include identifying text data from the input file, receiving text input from a user in order to identify relevant text entities from a plurality of text entities, and automatically generating a search pattern corresponding to the text input. The method may further include determining a pattern associated with each of the plurality of text entities, and matching the search pattern corresponding to the text input against the patterns associated with the plurality of text entities. The method may further include identifying, based on the matching, one or more matching patterns from the patterns associated with the plurality of text entities, and extracting, from the plurality of text entities, the relevant text entities corresponding to the one or more matching patterns.
[Selection drawing] Fig. 2

Description

Technical Field

The present disclosure relates generally to data extraction, and more particularly to methods and systems for extracting textual information from the content of an input file using one or more data extraction approaches.

Text extraction techniques have recently gained importance. For example, extraction techniques such as Optical Character Recognition (OCR) may allow a user to extract text data from files such as images or Portable Document Format (PDF) files. Further, it may be desirable to extract relevant information and to generate a representation using the extracted relevant information.

According to some available techniques, it may be possible to extract textual information from an input file, and to generate a representation using the extracted information, when the input file contains tags or when a pattern can be identified in the identified text. However, in complex documents in which tags may be absent or no pattern can be identified, it is difficult to extract relevant information or to generate a representation.

The method and system according to the present invention extract information from the content of an input file, and include: identifying text data from the input file; receiving text input from a user in order to identify relevant text entities from a plurality of text entities; automatically generating a search pattern corresponding to the text input; determining a pattern associated with each of the plurality of text entities; matching the search pattern corresponding to the text input against the patterns associated with the plurality of text entities; identifying, based on the matching, one or more matching patterns from the patterns associated with the plurality of text entities; and extracting, from the plurality of text entities, the relevant text entities corresponding to the one or more matching patterns. The identified text data includes the plurality of text entities.

Other problems disclosed in the present application, and their solutions, will become apparent from the description of embodiments and from the drawings.

The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.

FIG. 1 is a functional block diagram of an exemplary system for extracting information from the content of an input file, according to some embodiments of the present disclosure.

FIG. 2 is a functional block diagram of a Natural Language Processing (NLP) metadata extraction framework, according to some embodiments of the present disclosure.

FIG. 3 is a block diagram of a spell check system, according to some embodiments of the present disclosure.

FIG. 4 is a block diagram of a recommendation system, according to some embodiments of the present disclosure.

FIG. 5 is a block diagram of a metadata update system, according to some embodiments of the present disclosure.

Three further figures are flowcharts of methods for extracting information from the content of an input file, according to various embodiments of the present disclosure.

A further figure is a snapshot of an exemplary input file having noisy entities, from which information is to be extracted, according to some embodiments of the present disclosure.

A further figure is a snapshot of another exemplary input file from which information is to be extracted, according to various embodiments of the present disclosure.

Exemplary embodiments are described with reference to the accompanying drawings. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of the disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. The following detailed description is intended to be considered exemplary only, with the true scope and spirit being indicated by the following claims.

Referring now to FIG. 1, a functional block diagram of an exemplary system 100 for extracting information from the content of an input file is shown, according to some embodiments of the present disclosure. The system 100 may include a text identification device 102. In some embodiments, the text identification device 102 may identify text data from the content of an input file. For example, the input file may be an image or a Portable Document Format (PDF) file. The identified text data may include a plurality of text entities. In some embodiments, the text identification device 102 may use an Optical Character Recognition (OCR) technique to identify the text data from the input file. In alternative embodiments, the text identification device 102 may use any other technique known in the art to identify the text data from the input file.

The system 100 may further include an information extraction device 104 capable of communicating with the text identification device 102. For example, the information extraction device 104 may be configured to extract relevant information from the plurality of text entities identified from the input file by the text identification device 102. In some embodiments, to extract the information, the information extraction device 104 may receive text input from a user in order to identify relevant text entities from the plurality of text entities, and may automatically generate a search pattern corresponding to the text input. The information extraction device 104 may further determine a pattern associated with each of the plurality of text entities, and may match the search pattern corresponding to the text input against the patterns associated with the plurality of text entities. The information extraction device 104 may further identify, based on the matching, one or more matching patterns from the patterns associated with the plurality of text entities, and may extract, from the plurality of text entities, the relevant text entities corresponding to the one or more matching patterns.

The system 100 may further include a display 114. In some embodiments, the information extraction device 104 may include one or more processors 110 and a computer-readable storage medium (e.g., a memory) 112. The computer-readable storage medium 112 may store instructions that, when executed by the one or more processors 110, cause the one or more processors 110 to generate a representation based on the content of the input file, in accordance with aspects of the present disclosure. The computer-readable storage medium 112 may also store various data (e.g., identified text data, value entity data, pattern entity data, etc.) that may be captured, processed, and/or required by the system 100. The system 100 may interact with a user via a user interface 116 accessible via the display 114. The system 100 may also interact with one or more external devices 106 over a communication network 108 to send or receive various data. The external devices 106 may include, but are not limited to, a remote server, a digital device, or another computing system.

Referring now to FIG. 2, a functional block diagram of a Natural Language Processing (NLP) metadata extraction framework 200 is shown, according to some embodiments of the present disclosure. The NLP metadata extraction framework 200 may be implemented in the system 100, and in particular in the information extraction device 104. The NLP metadata extraction framework 200 may include various modules that perform various functions for extracting information from the content of an input file. In some embodiments, an extraction engine 202 may extract/identify text data from the content of an input file 204. The input file 204 may be a flat file 206, a PDF file 208, a database file 210, or an OCR output 212.

The extraction engine 202 may further receive one or more domain rules from a domain rules database 214. In some embodiments, the domain rules database 214 may be based on "if-then" rules (as opposed to conventional procedural code). It should be noted that the domain rules may be used to determine a domain-based name associated with each of the plurality of text entities. The domain-based name may be determined by matching each of the plurality of text entities against a list of domain-based names stored in a data repository (not shown in FIG. 1). For example, if the identified data includes a text entity that is the PIN code of a city in a country, the domain-based database may determine the domain name, i.e., the name of the city having that PIN code, for example by matching the PIN code against the list of domain-based names stored in the data repository.
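The PIN-code example above can be sketched as a small "if-then" rule set. This is only an illustration under stated assumptions: the names `PIN_TO_CITY`, `pin_code_rule`, `DOMAIN_RULES`, and `domain_names` are hypothetical, and the lookup table is sample data standing in for the list of domain-based names held in the data repository.

```python
import re

# Sample data standing in for the domain-based name list in the data repository.
# (Illustrative entries only; not taken from the disclosure.)
PIN_TO_CITY = {"560001": "Bengaluru", "400001": "Mumbai"}

def pin_code_rule(entity):
    """IF the entity looks like a six-digit PIN code THEN return the mapped city name."""
    if re.fullmatch(r"\d{6}", entity):
        return PIN_TO_CITY.get(entity)  # None when the PIN code is not in the list
    return None

# Hypothetical rule list; further if-then rules could be appended here.
DOMAIN_RULES = [pin_code_rule]

def domain_names(entities):
    """Map each entity to the first domain-based name produced by any rule."""
    names = {}
    for entity in entities:
        for rule in DOMAIN_RULES:
            name = rule(entity)
            if name is not None:
                names[entity] = name
                break
    return names
```

Entities that no rule recognizes are simply left out of the result, so later approaches (position, POS, or pattern based) remain free to handle them.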

The extraction engine 202 may further receive one or more position rules from a position rules database 216. It should be noted that, in some embodiments, the position rules may be used to determine, based on the OCR output, a position associated with an entity from the text data in the input file. The position may be determined by first identifying a type associated with the input file, and then determining the position of one or more relevant text entities in the input file based on that type. Once the position of the relevant text entities is determined, the relevant text entities may be extracted from the input file based on the position. It should be noted that, although the one or more position rules may be more relevant for input data that includes tables, i.e., input data arranged in tabular form, they may equally be applicable to any other type of input data, for example documents containing free text.

As an example, the type of the input file may be that of an "Aadhar" card. It should further be understood that, in a document such as an "Aadhar" card, data may be present in a particular format. Accordingly, the position of the relevant text entities may be predetermined based on a template that is stored in the data repository and corresponds to the type associated with the input file. For example, the position of the text entities "name" and "address" (of the person concerned) may be in the middle of the document, and this information may be available and stored in a database of the system 100. In particular, the position may be based on (x, y) coordinates. In the above example, the position of the text entities "name" and "address" may be within the region defined by the (x, y) coordinates (5, 10) to (15, 15). Accordingly, once the type of the document is identified as an "Aadhar" card, the system 100 may access that position and extract the relevant text entities (i.e., "name" and "address") from it.
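A minimal sketch of this template-based position lookup follows. It assumes that the OCR stage yields tokens carrying a single (x, y) coordinate each; the `OcrToken` class, the `TEMPLATES` dictionary, and `extract_by_position` are hypothetical names introduced for illustration, and only the region (5, 10) to (15, 15) comes from the example above.

```python
from dataclasses import dataclass

@dataclass
class OcrToken:
    """A piece of OCR output with an assumed (x, y) position on the page."""
    text: str
    x: float
    y: float

# Hypothetical template store: document type -> field -> (x_min, y_min, x_max, y_max).
TEMPLATES = {
    "aadhar_card": {"name_and_address": (5, 10, 15, 15)},
}

def extract_by_position(doc_type, tokens):
    """For each field in the template, collect the tokens lying inside its region."""
    result = {}
    for field, (x0, y0, x1, y1) in TEMPLATES.get(doc_type, {}).items():
        result[field] = [
            t.text for t in tokens if x0 <= t.x <= x1 and y0 <= t.y <= y1
        ]
    return result
```

A document type without a stored template yields an empty result, which a caller could treat as a signal to try one of the other extraction approaches.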

The extraction engine 202 may further receive one or more first NLP rules from a first NLP rules database 218. It should be noted that the first NLP rules database 218 may create regular expressions (Regex) based on the NLP rules provided for ground-truth values/attributes. As will be appreciated by those skilled in the art, a Regex is a special string that describes a search pattern.

The extraction engine 202 may further receive one or more second NLP rules from a second NLP rules database 220. It should be noted that, in some embodiments, the second NLP rules database 220 may define a part of speech (POS) for ground-truth values/attributes. For example, the POS associated with each of the plurality of text entities may be identified. As will be appreciated by those skilled in the art, the POS may be one of noun, pronoun, verb, adverb, adjective, conjunction, preposition, and interjection. Once the POS is identified, one or more unique text entities may be selected from those text entities whose POS is a noun. The one or more (selected) unique text entities whose POS is identified as a noun may be extracted as relevant data.

For example, a salary slip document (the input data) may include a plurality of text entities, including the name of the person concerned. Extracting that name may be the objective of the extraction engine 202. It will be understood that a name belongs to the noun POS category. Accordingly, the text entities whose POS is a noun are selected, and the required name is then extracted from the selected entities.
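The selection step in the salary slip example can be sketched as follows. The sketch assumes that some POS tagger has already produced (text, tag) pairs; the tag labels ("NOUN", "VERB", and so on) and the function name `select_noun_entities` are hypothetical, and `None` is used here to mark entities for which no POS could be identified.

```python
def select_noun_entities(tagged_entities):
    """Keep unique entities tagged as nouns, preserving first-seen order."""
    seen = set()
    selected = []
    for text, pos in tagged_entities:
        if pos == "NOUN" and text not in seen:
            seen.add(text)
            selected.append(text)
    return selected

# Assumed tagger output for a fragment of a salary slip (illustrative only).
tagged = [
    ("Salary", "NOUN"),
    ("paid", "VERB"),
    ("to", "ADP"),
    ("John", "NOUN"),
    ("John", "NOUN"),   # duplicate occurrence, dropped by the uniqueness check
    ("AB-123", None),   # no POS identified
]
candidates = select_noun_entities(tagged)
```

The required name would then be picked from `candidates`; how that final choice is made is not specified in the text above.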

In some embodiments, when an attempt is made to identify the POS associated with the plurality of text entities, one or more unique text entities for which no POS can be identified may be found. An entity type associated with each of these one or more unique text entities may be determined. The entity type may be a value entity or a pattern entity. Each value entity may have an associated value, and each pattern entity may have an associated pattern.

For example, an identified text entity may consist of a combination of text characters denoting an email ID or a code name such as an oil field name (e.g., "john@gmail.com" or "AB-123", respectively). For such text entities, no POS may be identified. Such text entities may therefore be identified as either value entities or pattern entities. The email ID ("john@gmail.com") may be determined to be a value entity, the value being an email ID. This may be because the pattern of an email ID is usually a typical, well-known pattern, and all email IDs may share the same or a similar pattern. On the other hand, a text entity such as "AB-123", corresponding to an oil field name, may be determined to be a pattern entity. This may be because, although oil field names may share the same pattern, that pattern may not be commonly known, i.e., the NLP metadata extraction framework 200 may not have stored such a pattern in advance. Once a value entity is identified, the text entities in the input file that have the same associated value as the value entity may be identified automatically. In other words, all email ID text entities may be identified and extracted.
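One way to picture the value/pattern distinction is a registry of pre-stored, well-known patterns: an entity matching a registered pattern is treated as a value entity, and anything else as a pattern entity. This is an assumption made for illustration, not the classification method of the disclosure; `KNOWN_VALUE_PATTERNS`, `entity_type`, and `same_value_entities` are hypothetical names, and the email regex is deliberately simplified.

```python
import re

# Hypothetical registry of pre-stored patterns (simplified email check).
KNOWN_VALUE_PATTERNS = {
    "email_id": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}

def entity_type(entity):
    """Return ("value", name) for a known pattern, otherwise ("pattern", None)."""
    for value_name, pattern in KNOWN_VALUE_PATTERNS.items():
        if pattern.fullmatch(entity):
            return ("value", value_name)
    return ("pattern", None)

def same_value_entities(seed, entities):
    """Given a value entity, collect all entities sharing its associated value."""
    kind, value_name = entity_type(seed)
    if kind != "value":
        return []
    return [e for e in entities if entity_type(e) == (kind, value_name)]
```

With "john@gmail.com" as the seed, every other email ID in the input would be collected, while "AB-123" would be left for pattern-based handling.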

Once a pattern entity is identified, a search pattern (Regex) corresponding to the pattern associated with the pattern entity may be generated automatically. Further, a pattern associated with each of the plurality of text entities in the input file may be determined. The search pattern (Regex) corresponding to the pattern entity may be matched against the patterns associated with the plurality of text entities, and, based on the matching, one or more matching patterns may be determined from the patterns associated with the plurality of text entities. The matching text entities corresponding to the one or more matching patterns may then be extracted from the plurality of text entities.

For example, for the text entity "AB-123" corresponding to an oil field, the search pattern (Regex) "apd" may be generated automatically, where "a" denotes one or more alphabetic characters, "p" denotes one or more punctuation marks, and "d" denotes one or more digits. Further, the patterns associated with the other text entities in the input file may be determined, the search pattern "apd" may be matched against those patterns, and the matching patterns may be identified. All text entities corresponding to the matching patterns are then extracted. In this way, all oil field names in the input file may be extracted.
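The "apd" notation can be illustrated by collapsing each run of same-class characters into one letter and comparing the resulting signatures. This is a minimal sketch; the function names `char_class`, `shape`, and `extract_matching`, and the use of "?" for characters outside the three classes, are assumptions made here.

```python
import string
from itertools import groupby

def char_class(ch):
    """Map a character to 'a' (alphabetic), 'd' (digit) or 'p' (punctuation)."""
    if ch.isalpha():
        return "a"
    if ch.isdigit():
        return "d"
    if ch in string.punctuation:
        return "p"
    return "?"  # anything else, e.g. whitespace (not covered by the apd notation)

def shape(entity):
    """Collapse runs of the same class, so 'AB-123' becomes 'apd'."""
    classes = (char_class(ch) for ch in entity)
    return "".join(key for key, _ in groupby(classes))

def extract_matching(seed, entities):
    """Return the entities whose pattern equals the search pattern of the seed."""
    target = shape(seed)
    return [e for e in entities if shape(e) == target]
```

Because each letter stands for "one or more" characters, "XY-9" and "CD_45" share the pattern of "AB-123" even though their lengths and punctuation differ, whereas "john@gmail.com" yields "apapa" and is not matched.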

In some embodiments, once the text data from the input file is identified, the first NLP rules may cause text input to be received from the user in order to identify relevant text entities from the plurality of text entities, and may cause a search pattern corresponding to the text input to be generated automatically. The first NLP rules may further cause a pattern associated with each of the plurality of text entities to be determined, and the search pattern corresponding to the text input to be matched against the patterns associated with the plurality of text entities. The first NLP rules may further cause one or more matching patterns to be identified, based on the matching, from the patterns associated with the plurality of text entities, and the relevant text entities corresponding to the one or more matching patterns to be extracted from the plurality of text entities.

For example, the user may provide the text input "AB-123" in order to identify relevant text entities from the plurality of text entities extracted from the input file. The search pattern (Regex) "apd" may be identified for the entered text entity "AB-123". Further, the patterns associated with the other text entities in the input file may be determined, the search pattern (Regex) "apd" may be matched against those patterns, and the matching patterns may be identified. All text entities corresponding to the matching patterns may then be extracted. In this way, all oil field names in the input file may be extracted.
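As a complementary sketch, the user's text input can be turned into an actual regular expression for Python's `re` module, one character-class fragment per run of characters. The mapping in `CLASS_REGEX` and the helper names are assumptions; in particular, "p" is interpreted here as any run of non-alphanumeric ASCII or other characters, which is broader than punctuation proper.

```python
import re
import string
from itertools import groupby

# Assumed regex fragment for each class of the apd notation.
CLASS_REGEX = {"a": "[A-Za-z]+", "d": "[0-9]+", "p": "[^A-Za-z0-9]+"}

def classify(ch):
    """Classify a character as 'a', 'd', or (for everything else) 'p'."""
    if ch in string.ascii_letters:
        return "a"
    if ch in string.digits:
        return "d"
    return "p"

def regex_from_input(text_input):
    """Build a compiled regex from the user's example, one fragment per run."""
    classes = [key for key, _ in groupby(classify(ch) for ch in text_input)]
    return re.compile("".join(CLASS_REGEX[c] for c in classes))

def extract_related(text_input, entities):
    """Return the entities that fully match the regex derived from the input."""
    pattern = regex_from_input(text_input)
    return [e for e in entities if pattern.fullmatch(e)]
```

Using `fullmatch` rather than `search` keeps an entity such as "123-AB" from matching merely because part of it resembles the input.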

It should be noted that, in some embodiments, the text input from the user may be received in the form of entries in a Microsoft Excel® sheet. One or more rules may be defined that automatically generate the search pattern (Regex) corresponding to the text input. The one or more rules may then determine a pattern associated with each of the plurality of text entities of the input file, match the search pattern corresponding to the text input against the patterns associated with the plurality of text entities, identify, based on the matching, one or more matching patterns from the patterns associated with the plurality of text entities, and extract, from the plurality of text entities, the relevant text entities corresponding to the one or more matching patterns.

Accordingly, in some embodiments, Microsoft Excel® data may be received from one or more Microsoft Excel® sheets 222. The various rules (domain rules, NLP rules, position rules, etc.) may be applied to obtain extracted data 226. In some embodiments, a Microsoft Excel® macro may be used for the extraction. Further, in some embodiments, the macro may be embedded in Python (registered trademark). Here, the extracted data 226 may refer to the relevant text entities extracted from the plurality of text entities, or to a representation that may be generated based on the relevant text entities extracted using the various rules. In some additional embodiments, if the above approaches, i.e., the domain-based, position-based, POS-based, and Regex-based approaches, fail to provide text extraction, a Machine Learning (ML) based approach may be used. An artificial intelligence (AI) model (i.e., an ML model) 224 may be used for the ML-based approach.
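The fallback to the ML-based approach can be sketched as a simple chain. The text does not say in which order the rule-based approaches run or whether the first success ends the search, so the sequential first-success policy below is an assumption, as are the name `extract_with_fallback` and the toy extractors standing in for the real rule sets and for the ML model 224.

```python
def extract_with_fallback(text_entities, rule_extractors, ml_extractor):
    """Try each rule-based extractor in order; use the ML extractor only if all return nothing."""
    for name, extractor in rule_extractors:
        extracted = extractor(text_entities)
        if extracted:
            return name, extracted
    return "ml", ml_extractor(text_entities)

# Hypothetical stand-ins for the rule-based approaches and the ML model.
rule_extractors = [
    ("domain", lambda ents: [e for e in ents if e.isdigit() and len(e) == 6]),
    ("regex", lambda ents: [e for e in ents if "-" in e]),
]
ml_extractor = lambda ents: ents[:1]
```

Returning the name of the approach alongside the result makes it possible to record which path produced a given piece of extracted data.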

Referring now to FIG. 3, a block diagram of a spell check system 300 is shown, according to some embodiments of the present disclosure. In some embodiments, the text entities extracted by an extraction module 302 (corresponding to the text identification device 102) may be received by the spell check system 300. The representations generated by the extraction module 302 may serve as input text 304 to the spell check system 300. The input text 304 may be pre-processed by a pre-processing module 306.

Once pre-processing is performed, a cosine similarity module 308 may perform a cosine similarity analysis on the pre-processed data. As an example, the cosine similarity module 308 may perform the cosine similarity analysis using a threshold. A minimum edit distance module 310 may perform a minimum edit distance analysis on the data received from the cosine similarity module 308. In some embodiments, the minimum edit distance analysis may be performed on the filtered words.

A maximum cosine similarity module 312 may perform a maximum cosine similarity analysis on the data received from the minimum edit distance module 310. It should be noted that the maximum cosine similarity analysis may be based on the words among those having the minimum edit distance. A replacement module 314 may replace an incorrect word with the corrected word based on the analysis performed by the maximum cosine similarity module 312.
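The three-stage spell check (threshold filter by cosine similarity, minimum edit distance, then maximum cosine similarity among the closest words) can be sketched as follows. The text does not state how words are vectorized, so character bigrams are assumed here; the threshold of 0.3, the vocabulary, and all function names are likewise illustrative.

```python
import math
from collections import Counter

def bigrams(word):
    """Character-bigram counts of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word.lower()}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

def cosine(a, b):
    """Cosine similarity between the bigram count vectors of two words."""
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def correct(word, vocabulary, threshold=0.3):
    """Filter by cosine threshold, keep minimum-edit-distance words, pick the most similar."""
    if word in vocabulary:
        return word
    candidates = [v for v in vocabulary if cosine(word, v) >= threshold]
    if not candidates:
        return word  # nothing similar enough; leave the word unchanged
    best = min(edit_distance(word, v) for v in candidates)
    closest = [v for v in candidates if edit_distance(word, v) == best]
    return max(closest, key=lambda v: cosine(word, v))
```

The cosine filter is cheap and prunes most of the vocabulary before the costlier edit distance is computed, and the final cosine comparison only breaks ties among words at the same edit distance.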

Referring now to FIG. 4, a block diagram of a recommendation system 400 is shown, according to some embodiments of the present disclosure. For example, the recommendation system 400 may be used to identify the entity type, i.e., value entity or pattern entity. In some embodiments, the recommendation system 400 may use a classification-based machine learning (ML) model 402. The recommendation system 400 may receive input data (NLP-extracted and spell-checked data) 406. It should be noted that the input data may be data on which NLP extraction and spell checking have been performed. The recommendation system 400 may include a configuration (config) file 408. The config file 408 may trigger the determination of whether a value or a pattern is to be recommended for the input data.

The recommendation system 400 may feed the input data to the ML model 402. The ML model 402 may use historical data from an archive database 404. Based on the historical data, the ML model 402 may provide either prediction data 410 or recommendation data 412. Accordingly, a recommended value 414 may be generated based on the prediction data 410, or a recommended pattern 416 may be generated based on the recommendation data 412.

Referring now to FIG. 5, a block diagram of a metadata update system 500 is shown in accordance with some embodiments of the present disclosure. In some embodiments, the metadata update system 500 may receive NLP-extracted data 502. The metadata update system 500 may include a system date and time module 504. The metadata update system 500 may further include a confidence module 506 and a probability module 508. The confidence module 506 may determine a confidence score associated with the accuracy of the extracted data and of the representation generated by the NLP rules. Based on the confidence score, the probability module 508 may determine an accuracy score associated with the precision of the extracted data and of the representation generated by the NLP rules. In other words, the probability module 508 may determine a probability of how accurate or useful the extracted data, or the representation generated using the NLP rules, is.

If the probability determined by the probability module 508 is sufficiently high (i.e., greater than a threshold), the extracted data and the representation generated using the NLP rules, i.e., a rule-based value 514, may be provided. On the other hand, if the probability determined by the probability module 508 is not sufficiently high (i.e., less than the threshold), the extracted data and the representation generated using the ML model, i.e., a machine learning (ML) value 516, may be provided. As previously mentioned, the ML value may be generated by the ML model using historical data that may be stored in an archive database 510. The extracted data and the generated representation may be stored in a final database 512.
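
The threshold decision described above can be sketched as follows. This is a minimal sketch, not the disclosed implementation: the threshold value of 0.8 is a hypothetical default, the names are illustrative, and a probability exactly equal to the threshold is assumed to fall back to the ML value because the disclosure does not address that case.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    value: str
    source: str  # "rule" for rule-based value 514, "ml" for ML value 516


def select_value(rule_value: str, ml_value: str,
                 probability: float, threshold: float = 0.8) -> Extraction:
    """Return the rule-based value when the probability exceeds the threshold.

    The 0.8 default is an assumption for illustration only.
    """
    if probability > threshold:
        return Extraction(rule_value, "rule")
    # Assumed: equality with the threshold is treated as "not sufficiently high".
    return Extraction(ml_value, "ml")


high = select_value("United Airlines", "United States", probability=0.9)
low = select_value("United Airlines", "United States", probability=0.5)
```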

ここで図６を参照すると、本開示の一実施形態に係る、入力ファイルのコンテンツから情報を抽出する方法６００のフローチャートが示されている。理解されるように、方法６００は、テキスト抽出のＲｅｇｅｘベースの方法を提供してよい。ステップ６０２において、入力ファイルからテキストデータが識別されてよい。識別されたテキストデータは、複数のテキストエンティティを含んでよい。いくつかの例において、入力ファイルは、画像ファイルおよびポータブル・ドキュメント・フォーマット（ＰＤＦ）ファイルのうちの少なくとも１つを含んでよいことに留意されたい。 Referring now to FIG. 6, a flowchart of a method 600 for extracting information from the content of an input file is shown, according to one embodiment of the present disclosure. As will be appreciated, method 600 may provide a Regex-based method of text extraction. At step 602, text data may be identified from the input file. The identified text data may include multiple text entities. Note that in some examples, the input files may include at least one of image files and Portable Document Format (PDF) files.

ステップ６０４において、複数のテキストエンティティから関連テキストエンティティを識別するために、テキスト入力がユーザから受け取られてよい。例えば、ユーザは、テキストを「ＭｉｃｒｏｓｏｆｔＥｘｃｅｌ（登録商標）」シートにおけるエントリとして提供してよい。ステップ６０６において、テキスト入力に対応する検索パターン（Ｒｅｇｅｘ）が、自動的に生成されてよい。ステップ６０８において、複数のテキストエンティティの各々に関連付けられたパターンが決定されてよい。ステップ６１０において、テキスト入力に対応する検索パターンが、複数のテキストエンティティに関連付けられたパターンと対応付けられてよい。ステップ６１２において、複数のテキストエンティティに関連付けられたパターンからの１つまたは複数の合致パターンが、対応付けに基づいて識別されてよい。ステップ６１４において、１つまたは複数の合致パターンに対応する関連テキストエンティティが、複数のテキストエンティティから抽出されてよい。 At step 604, text input may be received from a user to identify related text entities from a plurality of text entities. For example, a user may provide text as an entry in a Microsoft Excel® sheet. At step 606, a search pattern (Regex) corresponding to the text input may be automatically generated. At step 608, patterns associated with each of the plurality of text entities may be determined. At step 610, a search pattern corresponding to a text input may be matched with patterns associated with a plurality of text entities. At step 612, one or more matching patterns from patterns associated with multiple text entities may be identified based on the correspondence. At step 614, related text entities corresponding to one or more matching patterns may be extracted from the plurality of text entities.
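
Steps 604 to 614 can be illustrated with a short sketch. It rests on several assumptions: the disclosure does not state how the search pattern is derived from the text input, so the sketch generalizes each run of letters, digits, or whitespace in the input into a character class and keeps punctuation literal; it also collapses steps 608 to 612 into a direct regular expression match against each entity. All names are hypothetical.

```python
import re
from itertools import groupby
from typing import List, Sequence


def char_class(ch: str) -> str:
    """Map one character to the regex fragment that generalizes it."""
    if ch.isascii() and ch.isalpha():
        return r"[A-Za-z]+"
    if ch.isdigit():
        return r"\d+"
    if ch.isspace():
        return r"\s+"
    return re.escape(ch) + "+"  # punctuation is kept literal


def generate_search_pattern(text_input: str) -> str:
    """Step 606: build a Regex from the user's text input (assumed rule)."""
    # groupby merges consecutive characters that share the same fragment.
    return "".join(key for key, _ in groupby(text_input, key=char_class))


def extract_related(entities: Sequence[str], text_input: str) -> List[str]:
    """Steps 608-614, simplified: keep entities that fully match the pattern."""
    pattern = re.compile(generate_search_pattern(text_input))
    return [entity for entity in entities if pattern.fullmatch(entity)]


entities = ["RO-2041", "Ajay", "32145643", "PO-77"]  # hypothetical entities
related = extract_related(entities, "AB-1234")
```

In this sketch the user supplies an example such as "AB-1234", and every entity with the same letters, hyphen, digits shape is returned.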

加えて、いくつかの実施形態においては、複数のテキストエンティティから１つまたは複数の合致パターンに対応する関連テキストエンティティを抽出すると、抽出された関連テキストエンティティを用いて、表現が生成されてよい。 Additionally, in some embodiments, upon extracting related text entities corresponding to one or more matching patterns from the plurality of text entities, a representation may be generated using the extracted related text entities.

さらに、例えば方法６００では関連エンティティを抽出することができない場合に、関連エンティティを抽出するために機械学習（ＭＬ）モデルが用いられてよい。換言すると、データ（関連テキストエンティティ）を抽出することができない場合、ＭＬベースのアプローチが開始してよい。 Additionally, a machine learning (ML) model may be used to extract related entities, for example, if the method 600 fails to extract related entities. In other words, if the data (relevant textual entities) cannot be extracted, the ML-based approach may start.

ここで図７を参照すると、本開示の別の実施形態に係る、入力ファイルのコンテンツから情報を抽出する方法７００のフローチャートが示されている。ステップ７０２において、入力ファイルからテキストデータが識別されてよい。識別されたテキストデータは、複数のテキストエンティティを含んでよい。入力ファイルは、画像ファイルおよびポータブル・ドキュメント・フォーマット（ＰＤＦ）ファイルのうちの少なくとも１つを含んでよいことに留意されたい。 Referring now to FIG. 7, shown is a flowchart of a method 700 for extracting information from the content of an input file, according to another embodiment of the present disclosure. At step 702, text data may be identified from the input file. The identified text data may include multiple text entities. Note that the input files may include at least one of image files and Portable Document Format (PDF) files.

ステップ７０４において、複数のテキストエンティティの各々に関連付けられたドメインベースの名前が決定されてよい。ドメインベースの名前は、複数のテキストエンティティの各々を、データデポジトリに記憶されたドメインベースの名前のリストと対応付けることに基づいて決定されてよい。ステップ７０６において、入力ファイルに関連付けられたタイプが識別されてよい。ステップ７０８において、入力ファイルに関連付けられたタイプに基づいて、入力ファイルにおける１つまたは複数の関連テキストエンティティの位置が決定されてよい。ステップ７１０において、位置に基づいて、１つまたは複数の関連テキストエンティティが入力ファイルから抽出されてよい。 At step 704, domain-based names associated with each of the plurality of textual entities may be determined. The domain-based name may be determined based on associating each of the plurality of textual entities with a list of domain-based names stored in the data repository. At step 706, a type associated with the input file may be identified. At step 708, the location of one or more related textual entities in the input file may be determined based on the type associated with the input file. At step 710, one or more relevant textual entities may be extracted from the input file based on location.

ステップ７１２において、複数のテキストエンティティに関連付けられた品詞（ＰＯＳ）が識別されてよい。ＰＯＳは、名詞、代名詞、動詞、副詞、形容詞、接続詞、前置詞、および間投詞のうちの１つであってよいことを理解されたい。ステップ７１４において、ＰＯＳが名詞として識別される１つまたは複数のテキストエンティティが決定されてよい。 At step 712, parts of speech (POS) associated with the plurality of text entities may be identified. It should be appreciated that POS may be one of nouns, pronouns, verbs, adverbs, adjectives, conjunctions, prepositions, and interjections. At step 714, one or more textual entities for which POS is identified as a noun may be determined.

ステップ７１６において、ＰＯＳが識別されない１つまたは複数のユニークなテキストエンティティが選択されてよい。ステップ７１８において、１つまたは複数のユニークなテキストエンティティの各々に関連付けられたエンティティタイプが決定されてよい。エンティティタイプは、値エンティティおよびパターンエンティティのうちの１つであってよい。さらに、各値エンティティは関連付けられた値を有していてよく、各パターンエンティティは関連付けられたパターンを有する。図４に関連して説明したように、１つまたは複数のユニークなテキストエンティティの各々に関連付けられたエンティティタイプを決定するために、（ＭＬベースの）推奨システム４００が用いられてよい。 At step 716, one or more unique textual entities for which no POS is identified may be selected. At step 718, an entity type associated with each of the one or more unique text entities may be determined. An entity type may be one of a value entity and a pattern entity. Furthermore, each value entity may have an associated value and each pattern entity has an associated pattern. As described in connection with FIG. 4, a (ML-based) recommendation system 400 may be used to determine the entity type associated with each of the one or more unique textual entities.

ステップ７２０において、各値エンティティについて、値エンティティの関連付けられた値と同じ関連付けられた値を有する１つまたは複数のテキストエンティティが、自動的に識別されてよい。各パターンエンティティについて、ステップ７２２～７２８が行われてよい。したがって、ステップ７２２において、パターンエンティティに関連付けられたパターンに対応する検索パターン（Ｒｅｇｅｘ）が、自動的に生成されてよい。ステップ７２４において、パターンエンティティに対応する検索パターンが、複数のテキストエンティティに関連付けられたパターンと対応付けられてよい。ステップ７２６において、対応付けに基づいて、複数のテキストエンティティに関連付けられたパターンから１つまたは複数の合致パターンが決定されてよい。ステップ７２８において、１つまたは複数の合致パターンに対応する合致テキストエンティティが、複数のテキストエンティティから抽出されてよい。 At step 720, for each value entity, one or more text entities having an associated value that is the same as the value entity's associated value may be automatically identified. Steps 722-728 may be performed for each pattern entity. Accordingly, at step 722, a search pattern (Regex) corresponding to the pattern associated with the pattern entity may be automatically generated. At step 724, a search pattern corresponding to the pattern entity may be matched with patterns associated with multiple text entities. At step 726, one or more matching patterns may be determined from the patterns associated with the plurality of text entities based on the matching. At step 728, matching text entities corresponding to one or more matching patterns may be extracted from the plurality of text entities.

方法７００は、方法６００と組み合わせて行われてよいことに留意されたい。例示的シナリオにおいて、方法６００は、方法７００のステップ７１０が完了した後、方法７００のステップ７１２が始まる前に開始してよい。 Note that method 700 may be performed in combination with method 600. In an exemplary scenario, method 600 may begin after step 710 of method 700 is completed but before step 712 of method 700 begins.

ここで図８を参照すると、本開示の別の実施形態に係る、入力ファイルのコンテンツから情報を抽出する方法８００のフローチャートが示されている。ステップ８０２において、入力ファイルのコンテンツからテキストデータが抽出されてよい。抽出されたテキストデータは、複数のテキストエンティティを含んでよい。 Referring now to FIG. 8, shown is a flowchart of a method 800 for extracting information from the content of an input file, according to another embodiment of the present disclosure. At step 802, text data may be extracted from the contents of the input file. The extracted text data may include multiple text entities.

At step 804, a domain-based name associated with each of the plurality of text entities may be determined. The domain-based name may be determined by matching each of the plurality of text entities against a list of domain-based names stored in the data repository. For example, if a PIN code is available, the city may be determined from the PIN code using the domain-based approach, e.g., by matching the PIN code against a list (of PIN codes and associated city names) stored in a knowledge base. Similarly, if a city name (e.g., Delhi) is identified, the country name (e.g., India) may be determined. In some embodiments, a tag may be assigned to each of the second plurality of entities accordingly.
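
The PIN code example can be sketched as a pair of lookups. The two tables below are small hypothetical stand-ins for the knowledge base, and the function name and tag keys are illustrative assumptions.

```python
from typing import Dict

# Hypothetical knowledge-base excerpts; a real repository would be far larger.
PIN_TO_CITY = {"110001": "Delhi", "400001": "Mumbai"}
CITY_TO_COUNTRY = {"Delhi": "India", "Mumbai": "India"}


def resolve_domain_names(entity: str) -> Dict[str, str]:
    """Return domain-based tags for an entity, or an empty dict on failure."""
    tags: Dict[str, str] = {}
    if entity in PIN_TO_CITY:
        tags["pin_code"] = entity
        tags["city"] = PIN_TO_CITY[entity]
    elif entity in CITY_TO_COUNTRY:
        tags["city"] = entity
    # Chain the second lookup: a known city implies a country.
    country = CITY_TO_COUNTRY.get(tags.get("city", ""))
    if country:
        tags["country"] = country
    return tags


from_pin = resolve_domain_names("110001")
from_city = resolve_domain_names("Delhi")
unknown = resolve_domain_names("32145643")
```

An empty result here would correspond to the "No" path of the check that follows the domain-based step.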

ステップ８０６において、ドメインベースの名前を決定することに成功したか否かを確認するためのチェックが行われてよい。ドメインベースの名前を決定することに成功した場合、本方法はステップ８２２（「Ｙｅｓ」の経路）に進んでよく、このとき、識別されたテキストエンティティが抽出されてよい。さらに、抽出されたテキストエンティティを用いて、表現が生成されてよい。ステップ８０６において、ドメインベースの名前が決定されない場合、本方法はステップ８０８（「Ｎｏ」の経路）に進んでよい。換言すると、ドメインベースの名前を識別することができない場合、本方法は、次の代替的アプローチの実行に進んでよい。 At step 806, a check may be made to see if the domain-based name was successfully resolved. If the domain-based name is successfully determined, the method may proceed to step 822 (“Yes” path), at which time the identified textual entity may be extracted. Additionally, a representation may be generated using the extracted text entities. In step 806, if the domain-based name is not determined, the method may proceed to step 808 (“No” path). In other words, if the domain-based name cannot be identified, the method may proceed to perform the next alternative approach.

ステップ８０８において、複数のテキストエンティティの各々の位置に基づいて、所定のフィールドの値が決定されてよい。位置は、（ｘおよびｙ）座標の形態であってよい。一例として、位置は、データリポジトリに記憶されたテンプレートに基づいて予め決定されてよい。例えば、いくつかのシナリオにおいて、特定の名前の値がその名前の横または特定の名前ヘッダの下に記載されていてよく、これは（ｘおよびｙ）座標に基づいて識別可能であってよい。例えば、「Ａａｄｈａｒカード」、または「ＰＡＮカード」、または運転免許証において、エンティティ、例えばユーザの名前が特定の位置において取得可能であってよく、そこから名前を識別することができる。位置の技法は、表の場合に特に有用であり得る。これは、表におけるデータは構造化された形式で利用可能であり、エンティティの位置が容易かつ正確に特定され得るためである。したがって、入力ファイルに関連付けられたタイプ、すなわち入力ファイルが「Ａａｄｈａｒカード」であるか、または「ＰＡＮカード」であるか、または運転免許証であるかが、まず識別されてよい。さらに、入力ファイルに関連付けられたタイプに基づいて、入力ファイルにおける１つまたは複数の関連テキストエンティティの位置が決定されてよい。 At step 808, a value for a given field may be determined based on the position of each of the plurality of text entities. The position may be in the form of (x and y) coordinates. As an example, the locations may be predetermined based on templates stored in the data repository. For example, in some scenarios, the value of a particular name may be written next to that name or under a particular name header, which may be identifiable based on (x and y) coordinates. For example, in an "Aadhar card", or a "PAN card", or a driver's license, the name of an entity, eg a user, may be obtainable at a particular location, from which the name can be identified. The location technique can be particularly useful for tables. This is because the data in the table is available in a structured format so that entities can be easily and accurately located. Therefore, the type associated with the input file may first be identified, ie whether the input file is an "Aadhar card" or a "PAN card" or a driver's license. Additionally, the location of one or more associated textual entities in the input file may be determined based on the type associated with the input file.
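
The location-based approach can be sketched as a template lookup followed by a bounding-box filter. The template coordinates, document type labels, and the Token structure below are hypothetical; the disclosure only states that positions are (x, y) coordinates predetermined by a stored template for the input file type.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Token:
    text: str
    x: float
    y: float


Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

# Hypothetical templates: document type -> field -> bounding box.
TEMPLATES: Dict[str, Dict[str, Box]] = {
    "PAN card": {"name": (40, 100, 300, 130)},
    "driving licence": {"name": (120, 60, 400, 90)},
}


def extract_by_location(doc_type: str, field: str,
                        tokens: Sequence[Token]) -> Optional[str]:
    """Return the text found inside the template box, or None if not located."""
    box = TEMPLATES.get(doc_type, {}).get(field)
    if box is None:
        return None
    x0, y0, x1, y1 = box
    inside = [t for t in tokens if x0 <= t.x <= x1 and y0 <= t.y <= y1]
    if not inside:
        return None
    # Read tokens top to bottom, then left to right.
    return " ".join(t.text for t in sorted(inside, key=lambda t: (t.y, t.x)))


tokens = [Token("Thakur", 110, 110), Token("Ajay", 50, 110),
          Token("ABCDE1234F", 50, 200)]
name = extract_by_location("PAN card", "name", tokens)
```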

At step 810, a check may be made to determine whether the position of the one or more related text entities in the input file was successfully determined. If the position was successfully determined, method 800 may again proceed to step 822 ("Yes" path), where the one or more related text entities may be extracted from the input file based on the position, and a representation may be generated using the extracted text entities. If it is found at step 810 that the position has not been determined, method 800 may proceed to step 812.

At step 812, a Regex-based approach may be attempted on the input file to extract the related text entities. To this end, text input may be received from a user to identify related text entities from the plurality of text entities, and a search pattern corresponding to the text input may be automatically generated. In other words, a Regex may be generated. Further, a pattern associated with each of the plurality of text entities may be generated, the search pattern (Regex) corresponding to the text input may be matched against the patterns associated with the plurality of text entities, and one or more matching patterns may be identified from the patterns associated with the plurality of text entities based on the matching. In some embodiments, method 800 may include calculating a probability of determining the related entity, corresponding to the Regex, from the extracted text data.

At step 814, a check may be made to determine whether the Regex-based approach was successfully applied. If the Regex-based approach was successfully applied, method 800 may proceed to step 822 ("Yes" path), where the related text entities corresponding to the one or more matching patterns may be extracted from the plurality of text entities, and a representation may be generated using the extracted text entities. If the Regex-based approach is not successful, method 800 may proceed to step 816 ("No" path).

At step 816, a part-of-speech (POS) based approach may be attempted. For example, a POS associated with each of the plurality of text entities may be determined. In other words, at step 816, an attempt may be made to determine which POS an entity may be associated with, e.g., whether the POS may be a noun, an adverb, an adjective, and so on. Further, at step 816, an attempt may be made to determine the entity type associated with each of one or more unique text entities. To this end, one or more unique text entities for which no POS is identified may be selected. For example, a POS may not be identified for a unique entity (such as the email ID or oil field name mentioned above) that has a combination of letters, digits, punctuation marks, and the like. For such text entities, the entity type associated with each of the one or more unique text entities may be determined. The entity type may be one of a value entity and a pattern entity. Each value entity may have an associated value, and each pattern entity may have an associated pattern. Further, for each value entity, an attempt may be made to automatically identify one or more text entities that have the same associated value (e.g., an email ID) as the associated value of the value entity.

At step 818, a check may be made to determine whether the attempt was successful (i.e., whether the POS-based approach was successful). If the attempt is found to be successful, method 800 may proceed to step 822 ("Yes" path), where the one or more text entities (related text entities) having the same associated value as the associated value of the value entity may be extracted, and a representation may be generated using the extracted text entities.

Further, for each pattern entity, an attempt may be made to automatically generate a search pattern corresponding to the pattern associated with the pattern entity, to match the search pattern corresponding to the pattern entity against the patterns associated with the plurality of text entities, and to determine one or more matching patterns from the patterns associated with the plurality of text entities based on the matching. At step 818, a check may be made to determine whether the attempt was successful (i.e., whether the POS-based approach was successful). If the attempt is found to be successful, method 800 may proceed to step 822 ("Yes" path), where the matching text entities corresponding to the one or more matching patterns may be extracted from the plurality of text entities, and a representation may be generated using the extracted text entities. If it is found at step 818 that the POS-based approach is not successful, method 800 may proceed to step 820 ("No" path).

At step 820, an attempt may be made to extract the related text entities (information) from the plurality of text entities using a machine learning (ML) model. It will be appreciated that the ML model may first be trained to extract the type of related text entity. Based on this training, ML-based classification may be applied to the input file to extract the related text entities. Accordingly, at step 822, the related text entities may be extracted based on the ML-based classification, and a representation may be generated using the extracted text entities.

Note that, in some embodiments, the various steps (i.e., steps 804, 808, 812, 816, and 820) may be performed in the order described above or in any other order.
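
The overall control flow of method 800 can be sketched as an ordered list of extractors that is tried until one succeeds. This is a sketch under stated assumptions: each approach is represented by a hypothetical function that returns None on failure, and the stand-in extractors below are placeholders rather than implementations of the disclosed approaches.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

Extractor = Callable[[str], Optional[str]]


def run_cascade(text: str, extractors: Iterable[Tuple[str, Extractor]]
                ) -> Tuple[Optional[str], Optional[str]]:
    """Try each approach in the given order; stop at the first success."""
    for name, extractor in extractors:
        result = extractor(text)
        if result is not None:      # "Yes" path: proceed to extraction
            return name, result
    return None, None               # every approach failed


# Placeholder approaches (assumptions, for illustration only).
def domain_based(text: str) -> Optional[str]:
    return None  # pretend no domain-based name was found


def location_based(text: str) -> Optional[str]:
    return None  # pretend no template position was found


def regex_based(text: str) -> Optional[str]:
    match = re.search(r"[\w.]+@[\w.]+", text)
    return match.group(0) if match else None


approach, value = run_cascade(
    "Contact: ajay.thakur@gmail.com",
    [("domain", domain_based), ("location", location_based), ("regex", regex_based)],
)
```

Because the order is just the order of the list, reordering the approaches, as the paragraph above permits, only requires passing the pairs in a different sequence.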

Further, note that the NLP-based approach may be used in the following scenarios:
(i) The search key (i.e., text input from the user) is available, and the associated text extraction (i.e., related entity) is expected in the input file.
(ii) The search key (i.e., text input from the user) is not available, but the associated text extraction (i.e., related entity) is expected in the input file.
(iii) The search key (i.e., text input from the user) is available, but the associated text extraction (i.e., related entity) is not expected in the input file.

Further, note that a confidence score (probability score) may be calculated for each approach. For example, for the "domain-based" approach, the maximum probability score (e.g., 1) may be obtained when the search key (i.e., text input from the user) exactly matches the remaining text entities in the input file. For the "location-based" approach, the probability score (p) may be based on the probability of a search key match (w) and the identification of the location (l), i.e., p = w * l. For the Regex-based approach, the probability for location identification may be either 0 or 1, and the extraction probability score (e) may be 1 or 0. Further, there may be multiple (n) extracted values, of which one may be selected. Hence, the probability of the selected value may be 1/n, and the confidence score may be e * (1/n). When both the "location-based" and "Regex-based" approaches are used, each approach may be given a 50% weight. Hence, the confidence score may be (w * l)/2 + (e * (1/n))/2. For the ML-based approach, the probability score may be derived based on a confusion matrix (accuracy).
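
The scoring formulas above translate directly into code. The following sketch uses the symbols w, l, e, and n as defined in the text; the function names and the sample values are hypothetical.

```python
def location_score(w: float, l: float) -> float:
    """p = w * l for the location-based approach."""
    return w * l


def regex_score(e: float, n: int) -> float:
    """e * (1/n) for the Regex-based approach with n extracted values."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return e * (1 / n)


def combined_score(w: float, l: float, e: float, n: int) -> float:
    """Equal 50% weighting when both approaches are used."""
    return location_score(w, l) / 2 + regex_score(e, n) / 2


# Sample values (illustrative only): a 0.9 key match at an identified
# location, and a successful Regex extraction with four candidate values.
score = combined_score(w=0.9, l=1.0, e=1.0, n=4)
```

With these sample values the location term contributes 0.45 and the Regex term contributes 0.125, so a Regex that yields many candidate values lowers the combined confidence.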

Further, a composite probability score (P) may be calculated when the following approaches are used: a pure word-matching approach, a rule-based approach (i.e., one or more of the domain-based, location-based, POS-based, and Regex-based approaches), and an ML-based approach.
Accordingly,
Composite probability score (P) = (P1 * 0.3) * (P3 * 0.7), or
Composite probability score (P) = (P1 * 0.3) * (P2 * 0.7)
where
P1 = probability score of the pure word-matching approach,
P2 = probability score of the rule-based approach, and
P3 = probability score of the ML-based approach.
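
As a worked example, the composite score can be computed as follows. The sketch takes the formula exactly as printed, i.e., as a product of the two weighted terms; the function name and the sample probabilities are hypothetical.

```python
def composite_score(p1: float, p_other: float) -> float:
    """(P1 * 0.3) * (P_other * 0.7), where P_other is P2 or P3 as applicable."""
    return (p1 * 0.3) * (p_other * 0.7)


# Illustrative inputs: exact word match (P1 = 1.0) and an ML score of 0.9.
with_ml = composite_score(1.0, 0.9)      # uses P3
with_rules = composite_score(1.0, 1.0)   # uses P2
```

With these inputs the first expression evaluates to 0.3 * 0.63 = 0.189, and the second to 0.3 * 0.7 = 0.21, which is also the largest value the printed formula can produce.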

<Case Scenario 1>
Referring now to FIG. 9, a snapshot of an exemplary input file 900 is shown in which a noisy entity 902 is present. For example, a first portion of the entity may be illegible due to smudging, fading, or the like, and a second portion of the entity may contain the word "United". Thus, the actual entity could be "United Airlines (registered trademark)", "United States", and so on. One or more approaches (i.e., domain-based, location-based, POS-based, Regex-based, or ML-based) may be used to determine the actual entity. For example, if the domain-based, location-based, POS-based, and Regex-based approaches fail to provide an extraction, the ML-based approach may be initiated and may provide text extraction based on historical data using a classification-based machine learning (ML) model. The ML model may be a deterministic model and/or a probabilistic model.

<Case Scenario 2>
Referring now to FIG. 10, a snapshot of an input file "Local Order" 1000 is shown, from which various text entities may need to be extracted. As shown in FIG. 10, the input file "Local Order" 1000 has a tabular structure.

例えば、属性「ＲＯＮｕｍｂｅｒ」についての値を抽出するために、ＲＯｄａｔｅの属性を対応付けることにより、「ＲＯＮｕｍｂｅｒ」がドメインディクショナリデータベース（ルックアップテーブル）１００２から抽出されてよい。 For example, to extract the value for the attribute "RO Number", "RO Number" may be extracted from the domain dictionary database (lookup table) 1002 by mapping the attribute of RO date.

属性ＰａｒｔＮｕｍｂｅｒを抽出するために、入力ファイル１０００に存在するキー「ＰａｒｔＮｕｍｂｅｒ」が識別され、またはユーザ入力として受け取られてよい。その後、位置ベースのアプローチにより、データベースに記憶されたテンプレート情報を用いて、ＰａｒｔＮｕｍｂｅｒの可能性のある位置が決定されてよい。例えば、位置ベースのアプローチにより、ＰａｒｔＮｕｍｂｅｒが入力ファイル１０００の「右」側に存在し得ることが示唆されてよい。したがって、必要とされるＰａｒｔＮｕｍｂｅｒ「３２１４５６４３」が、入力ファイル「ＬｏｃａｌＯｒｄｅｒ」１０００の表形式構造の右のセル座標から抽出されてよい。 To extract the attribute Part Number, the key "Part Number" present in the input file 1000 may be identified or received as user input. A location-based approach may then determine the possible locations of the Part Number using the template information stored in the database. For example, a position-based approach may suggest that the Part Number may be on the “right” side of the input file 1000 . Therefore, the required Part Number '32145643' may be extracted from the right cell coordinates of the tabular structure of the input file 'Local Order' 1000 .

A Regex-based approach may be used to extract the email ID attribute present in the input file 1000. Accordingly, a key (Regex) associated with the email attribute may be identified or received from a database. The Regex corresponding to the email ID "ajay.thakur@gmail.com" may thus be "apapapa" (where "a" denotes one or more letters, "p" denotes one or more punctuation marks, and "d" denotes one or more digits). Accordingly, the text entity having the matching pattern (Regex), i.e., the email ID text entity, may be identified, and the value "ajay.thakur@gmail.com" may be extracted from the corresponding cell of the tabular structure of the input file 1000.
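
The "a", "p", "d" notation can be sketched as a run-length-collapsed character signature. The function names and the sample cell values below are hypothetical, and treating every character in Python's string.punctuation as "p" is an assumption about what counts as punctuation.

```python
import string
from itertools import groupby
from typing import List, Sequence


def symbol(ch: str) -> str:
    """Classify one character as letter (a), digit (d), or punctuation (p)."""
    if ch.isalpha():
        return "a"
    if ch.isdigit():
        return "d"
    if ch in string.punctuation:
        return "p"
    return "?"  # anything else, e.g. whitespace


def signature(text: str) -> str:
    """Collapse each run of same-class characters into a single symbol."""
    return "".join(key for key, _ in groupby(text, key=symbol))


def match_by_signature(cells: Sequence[str], key: str) -> List[str]:
    """Return the cells whose signature equals the given key."""
    return [cell for cell in cells if signature(cell) == key]


cells = ["32145643", "Ajay", "ajay.thakur@gmail.com", "RO-2041"]
emails = match_by_signature(cells, "apapapa")
```

For the email ID, the runs are "ajay", ".", "thakur", "@", "gmail", ".", and "com", which gives the seven-symbol signature "apapapa" quoted above.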

入力ファイル１０００から属性Ｎａｍｅを抽出するために、ＰＯＳベースのアプローチ用いられてよい。例えば、まず属性（Ｎａｍｅ）について存在するキーが入力ファイル１０００において識別されてよい。その後、例えば固有表現抽出（Named Entity Recognizer; ＮＥＲ）タガーを用いて、入力ファイル１０００における識別されたテキストエンティティのＰＯＳが決定されてよい。属性Ｎａｍｅについてデータベースにおいて指定されるＮＥＲタガーが検索されてよい。したがって、名詞のＰＯＳを有する属性Ｎａｍｅ（「ＡｊａｙＴｈａｋｕｒ」）に対応するテキストエンティティが抽出されてよい。したがって、抽出されたテキストエンティティ「ＡｊａｙＴｈａｋｕｒ」を用いる表現が生成されてよい。 A POS-based approach may be used to extract the attribute Name from the input file 1000 . For example, first the keys that exist for the attribute (Name) may be identified in the input file 1000 . The POS of the identified text entities in the input file 1000 may then be determined using, for example, Named Entity Recognizer (NER) taggers. A NER tagger specified in the database for the attribute Name may be retrieved. Therefore, the text entity corresponding to the attribute Name (“Ajay Thakur”) with the noun POS may be extracted. Therefore, an expression using the extracted text entity "Ajay Thakur" may be generated.

One or more text extraction techniques for extracting information from the content of an input file are disclosed above. These techniques provide one or more advantages over conventional techniques. For example, they provide various approaches for performing text extraction, namely the domain-based, location-based, POS-based, Regex-based, and ML-based approaches. Text extraction can therefore be performed using a single approach or using multiple approaches applied in any order. Further, these techniques make it possible to generate a representation using the extracted information. Moreover, these techniques make it possible to extract information, and to generate a representation using the extracted information, even from complex documents in which tags are absent or in which no pattern can be identified in the extracted information.

One or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term "computer-readable medium" should be understood to include tangible items and to exclude carrier waves and transient signals, i.e., to be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, non-volatile memory, hard drives, CD-ROMs, DVDs, flash drives, disks, and any other known physical storage media.

The disclosure and examples are intended to be considered exemplary only, with the true scope and spirit of the disclosed embodiments being indicated by the following claims.

Claims

1. A method for extracting information from the content of an input file, comprising:
identifying text data from the input file, wherein the identified text data includes a plurality of text entities;
receiving text input from a user to identify related text entities from the plurality of text entities;
automatically generating a search pattern corresponding to the text input;
determining a pattern associated with each of the plurality of text entities;
matching the search pattern corresponding to the text input against the patterns associated with the plurality of text entities;
identifying one or more matching patterns from the patterns associated with the plurality of text entities based on the matching; and
extracting, from the plurality of text entities, related text entities corresponding to the one or more matching patterns.
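
The pattern-generation and matching steps recited above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a "pattern" is a coarse character-class shape (letters become `A`, digits become `9`, other characters are kept); the claim does not prescribe this encoding, and the names `shape` and `extract_related` are hypothetical.

```python
def shape(text: str) -> str:
    """Reduce a string to a coarse pattern: letters -> 'A', digits -> '9', others kept.

    This encoding is an assumption made for illustration only.
    """
    return "".join("9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in text)


def extract_related(entities: list[str], user_text: str) -> list[str]:
    """Return the entities whose pattern equals the pattern of the user's text input."""
    # Automatically generate a search pattern from the user's text input.
    search_pattern = shape(user_text)
    # Determine a pattern for each text entity identified in the input file.
    entity_patterns = [(entity, shape(entity)) for entity in entities]
    # Match the search pattern against the entity patterns and extract the hits.
    return [entity for entity, pattern in entity_patterns if pattern == search_pattern]


entities = ["INV-2041", "Total", "PO-7731", "2020-12-30", "INV-3377"]
print(extract_related(entities, "ABC-0000"))  # ['INV-2041', 'INV-3377']
```

Under this assumption the user supplies one example of the kind of value wanted, and every entity with the same shape is returned; a regular-expression encoding of the pattern would serve equally well.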

2. The method of claim 1, further comprising generating a representation using the extracted related text entities.

3. The method of claim 1, wherein the input file comprises at least one of an image file and a Portable Document Format (PDF) file.

4. The method of claim 1, further comprising determining a domain-based name associated with each of the plurality of text entities,
wherein the domain-based name is determined based on matching each of the plurality of text entities with a list of domain-based names stored in a data repository.
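
As a hedged illustration of the domain-based lookup recited above, the sketch below treats the data repository as an in-memory list and matches entities case-insensitively. The repository contents, the normalization rule, and the function name `determine_domain_names` are assumptions, not details taken from the disclosure.

```python
def determine_domain_names(entities, repository_names):
    """Map each text entity to a domain-based name from the repository, or None."""
    # Index the repository list by a case-folded key (assumed matching rule).
    lookup = {name.casefold(): name for name in repository_names}
    return {entity: lookup.get(entity.strip().casefold()) for entity in entities}


# Hypothetical repository contents for an invoice-processing domain.
REPOSITORY_NAMES = ["Invoice Number", "Purchase Order", "Total Amount"]

result = determine_domain_names(
    ["invoice number", "ACME Corp", "TOTAL AMOUNT "], REPOSITORY_NAMES
)
print(result)
```

Entities with no counterpart in the list map to `None`, which leaves them available for the other extraction approaches.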

5. The method of claim 1, further comprising:
identifying a type associated with the input file;
determining a location of one or more related text entities in the input file based on the type associated with the input file; and
extracting the one or more related text entities from the input file based on the location.

6. The method of claim 5, wherein the location is based on (x, y) coordinates and is predetermined based on a template stored in the data repository, the template corresponding to the type associated with the input file.
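
The location-based steps recited above can be sketched as follows, assuming each recognized word carries an (x, y) coordinate and that a template stores one rectangular region per field for each file type. The `Word` class, the template layout, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # horizontal coordinate of the word (assumed to come from OCR output)
    y: float  # vertical coordinate of the word


# Hypothetical templates: file type -> field name -> (x0, y0, x1, y1) region.
TEMPLATES = {
    "invoice": {
        "invoice_no": (400, 40, 560, 60),
        "total": (450, 680, 560, 720),
    }
}


def extract_by_location(words, file_type, templates):
    """Collect, for each templated field, the words whose coordinates fall in its region."""
    fields = {}
    for name, (x0, y0, x1, y1) in templates[file_type].items():
        hits = [w.text for w in words if x0 <= w.x <= x1 and y0 <= w.y <= y1]
        fields[name] = " ".join(hits)
    return fields


words = [Word("INV-2041", 420, 50), Word("ACME", 40, 50), Word("1,250.00", 500, 700)]
print(extract_by_location(words, "invoice", TEMPLATES))
```

Because the regions are predetermined per file type, this approach needs no tags or labels in the document itself, only a reliable identification of the file type.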

7. The method of claim 1, further comprising:
identifying a part of speech (POS) associated with the plurality of text entities, wherein the POS is one of a noun, pronoun, verb, adverb, adjective, conjunction, preposition, and interjection; and
extracting one or more text entities for which the identified POS is a noun.

8. The method of claim 7, further comprising:
identifying one or more unique text entities for which no POS is identified; and
determining an entity type associated with each of the one or more unique text entities, wherein the entity type is one of a value entity and a pattern entity, each value entity having an associated value and each pattern entity having an associated pattern.

9. The method of claim 8, further comprising, for each value entity, automatically identifying one or more text entities whose associated value is the same as the associated value of the value entity.

10. The method of claim 8, further comprising, for each pattern entity:
automatically generating a search pattern corresponding to the pattern associated with the pattern entity;
matching the search pattern corresponding to the pattern entity against the patterns associated with the plurality of text entities;
determining one or more matching patterns from the patterns associated with the plurality of text entities based on the matching; and
extracting, from the plurality of text entities, matching text entities corresponding to the one or more matching patterns.
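
The POS-based steps recited above (noun extraction, then handling of entities with no identified POS) can be sketched under several loudly stated assumptions: a toy lexicon stands in for a real POS tagger, an entity containing a digit is treated as a pattern entity and any other untagged entity as a value entity, and patterns use a simple letter/digit shape. None of these choices comes from the disclosure, and all function names are hypothetical.

```python
def shape(text):
    """Coarse pattern (assumed encoding): letters -> 'A', digits -> '9', others kept."""
    return "".join("9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in text)


def pos_extract(entities, lexicon):
    """Split entities into nouns and 'unique' entities for which no POS is identified."""
    nouns, unique = [], []
    for entity in entities:
        tag = lexicon.get(entity.casefold())  # toy lookup standing in for a POS tagger
        if tag == "noun":
            nouns.append(entity)
        elif tag is None:
            unique.append(entity)
    return nouns, unique


def entity_type(entity):
    """Assumed rule: entities containing a digit are pattern entities, others value entities."""
    return "pattern" if any(ch.isdigit() for ch in entity) else "value"


def expand_unique(unique, entities):
    """For each unique entity, find entities with the same value or the same pattern."""
    matches = {}
    for u in unique:
        if entity_type(u) == "value":
            matches[u] = [e for e in entities if e == u]
        else:
            matches[u] = [e for e in entities if shape(e) == shape(u)]
    return matches


LEXICON = {"invoice": "noun", "paid": "verb", "quickly": "adverb"}  # hypothetical
entities = ["Invoice", "paid", "ACME", "TX-1001", "quickly", "TX-2002", "ACME"]
nouns, unique = pos_extract(entities, LEXICON)
print(nouns)
print(expand_unique(unique, entities))
```

In this sketch a value entity such as a company name is propagated by exact equality, while a pattern entity such as a transaction identifier pulls in every entity of the same shape.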

11. The method of claim 1, further comprising using a classification-based machine learning (ML) model to generate recommendations based on historical data.
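
One way to picture a classification-based model that generates recommendations from historical data is a small k-nearest-neighbour classifier. The sketch below is an illustrative stand-in, not the model used in the disclosure: the feature set (length, digit ratio, letter ratio), the label names, and the form of the history (previously labelled entities) are all assumptions.

```python
import math
from collections import Counter


def features(text):
    """Assumed features: scaled length, fraction of digits, fraction of letters."""
    n = max(len(text), 1)
    digits = sum(ch.isdigit() for ch in text)
    letters = sum(ch.isalpha() for ch in text)
    return (len(text) / 20.0, digits / n, letters / n)


def recommend(history, candidate, k=3):
    """Recommend a label for `candidate` by majority vote of its k nearest historical examples."""
    target = features(candidate)
    ranked = sorted(history, key=lambda item: math.dist(features(item[0]), target))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical historical data: (entity text, label chosen in the past).
HISTORY = [
    ("INV-2041", "invoice_no"),
    ("INV-3377", "invoice_no"),
    ("INV-9001", "invoice_no"),
    ("2020-12-30", "date"),
    ("2019-12-30", "date"),
    ("2021-01-15", "date"),
]

print(recommend(HISTORY, "INV-5555"))    # invoice_no
print(recommend(HISTORY, "2022-02-28"))  # date
```

A production system would more plausibly train a library classifier on richer features; the point here is only that past labelling decisions drive the recommendation for a new entity.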

12. A system for extracting information from the content of an input file, comprising:
a processor; and
a memory communicatively coupled to the processor,
wherein the memory stores processor-executable instructions that, when executed by the processor, cause the processor to:
identify text data from the input file, wherein the identified text data includes a plurality of text entities;
receive text input from a user to identify related text entities from the plurality of text entities;
automatically generate a search pattern corresponding to the text input;
determine a pattern associated with each of the plurality of text entities;
match the search pattern corresponding to the text input against the patterns associated with the plurality of text entities;
identify one or more matching patterns from the patterns associated with the plurality of text entities based on the matching; and
extract, from the plurality of text entities, related text entities corresponding to the one or more matching patterns.

13. The system of claim 12, wherein the input file comprises at least one of an image file and a Portable Document Format (PDF) file.

14. The system of claim 12, wherein the processor-executable instructions, when executed by the processor, further cause the processor to perform at least one of:
domain-based extraction, wherein performing the domain-based extraction comprises:
determining a domain-based name associated with each of the plurality of text entities, wherein the domain-based name is determined based on matching each of the plurality of text entities with a list of domain-based names stored in a data repository;
location-based extraction, wherein performing the location-based extraction comprises:
identifying a type associated with the input file;
determining a location of one or more related text entities in the input file based on the type associated with the input file; and
extracting the one or more related text entities from the input file based on the location, wherein the location is predetermined based on a template stored in the data repository, the template corresponding to the type associated with the input file;
part-of-speech (POS)-based extraction, wherein performing the POS-based extraction comprises:
identifying a POS associated with the plurality of text entities, wherein the POS is one of a noun, pronoun, verb, adverb, adjective, conjunction, preposition, and interjection;
extracting one or more text entities for which the identified POS is a noun;
identifying one or more unique text entities for which no POS is identified;
determining an entity type associated with each of the one or more unique text entities, wherein the entity type is one of a value entity and a pattern entity, each value entity having an associated value and each pattern entity having an associated pattern;
for each value entity, automatically identifying one or more text entities whose associated value is the same as the associated value of the value entity; and
for each pattern entity:
automatically generating a search pattern corresponding to the pattern associated with the pattern entity;
matching the search pattern corresponding to the pattern entity against the patterns associated with the plurality of text entities;
determining one or more matching patterns from the patterns associated with the plurality of text entities based on the matching; and
extracting, from the plurality of text entities, matching text entities corresponding to the one or more matching patterns; and
machine learning (ML)-based extraction, wherein performing the ML-based extraction comprises using a classification-based ML model to generate recommendations based on historical data.

15. A non-transitory computer-readable storage medium storing a set of computer-executable instructions that cause a computer including one or more processors to perform steps comprising:
identifying text data from an input file, wherein the identified text data includes a plurality of text entities;
receiving text input from a user to identify related text entities from the plurality of text entities;
automatically generating a search pattern corresponding to the text input;
determining a pattern associated with each of the plurality of text entities;
matching the search pattern corresponding to the text input against the patterns associated with the plurality of text entities;
identifying one or more matching patterns from the patterns associated with the plurality of text entities based on the matching; and
extracting, from the plurality of text entities, related text entities corresponding to the one or more matching patterns.

16. The non-transitory computer-readable storage medium of claim 15, wherein the set of computer-executable instructions further causes the computer including the one or more processors to perform:
domain-based extraction, wherein performing the domain-based extraction comprises:
determining a domain-based name associated with each of the plurality of text entities, wherein the domain-based name is determined based on matching each of the plurality of text entities with a list of domain-based names stored in a data repository;
location-based extraction, wherein performing the location-based extraction comprises:
identifying a type associated with the input file;
determining a location of one or more related text entities in the input file based on the type associated with the input file; and
extracting the one or more related text entities from the input file based on the location, wherein the location is predetermined based on a template stored in the data repository, the template corresponding to the type associated with the input file;
part-of-speech (POS)-based extraction, wherein performing the POS-based extraction comprises:
identifying a POS associated with the plurality of text entities, wherein the POS is one of a noun, pronoun, verb, adverb, adjective, conjunction, preposition, and interjection;
extracting one or more text entities for which the identified POS is a noun;
identifying one or more unique text entities for which no POS is identified;
determining an entity type associated with each of the one or more unique text entities, wherein the entity type is one of a value entity and a pattern entity, each value entity having an associated value and each pattern entity having an associated pattern;
for each value entity, automatically identifying one or more text entities whose associated value is the same as the associated value of the value entity; and
for each pattern entity:
automatically generating a search pattern corresponding to the pattern associated with the pattern entity;
matching the search pattern corresponding to the pattern entity against the patterns associated with the plurality of text entities;
determining one or more matching patterns from the patterns associated with the plurality of text entities based on the matching; and
extracting, from the plurality of text entities, matching text entities corresponding to the one or more matching patterns; and
machine learning (ML)-based extraction, wherein performing the ML-based extraction comprises using a classification-based ML model to generate recommendations based on historical data.