JP6253352B2

JP6253352B2 - Document analysis support system

Info

Publication number: JP6253352B2
Application number: JP2013227045A
Authority: JP
Inventors: 大島　修; 修大島; 績央渡邊; 由守谷
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2013-10-31
Filing date: 2013-10-31
Publication date: 2017-12-27
Anticipated expiration: 2033-10-31
Also published as: JP2015088022A

Description

The present invention relates to data processing technology, and more particularly to technology for supporting the analysis of document data.

When a company decides the direction of its research and development, it is important to clarify the positioning of its own core technologies and to understand the technology development trends of its competitors. Analyzing the patent applications of the company and of its competitors is useful for this purpose. Conventionally, a technique is known in which the patent documents to be analyzed are acquired, processing such as morphological analysis and dependency analysis is performed on the abstract and claims of each patent document, and statistical analysis is applied to the results (for example, Patent Document 1).

Patent Document 1: JP2011-257817A

Morphological analysis, dependency analysis, and similar processing impose a relatively high load. Consequently, when a large number of patent documents are to be processed, this processing can take a long time. This is an obstacle to providing the above technique as an ASP-type service.

The present invention has been made in view of this problem, and its object is to provide a document analysis support system that makes it possible to analyze patent documents in a relatively short time.

To solve the above problem, a document analysis support system according to one aspect of the present invention includes: a morpheme information holding unit that holds morpheme information obtained by dividing each of a plurality of document data items held in a database into morphemes; a list acquisition unit that acquires a list of the document data to be analyzed from among the document data held in the database; and an analysis information extraction unit that extracts, from the morpheme information holding unit, the morpheme information of the document data included in the list.

Any combination of the above components, and any conversion of the expression of the present invention among a method, a system, a program, a recording medium storing the program, and the like, are also valid as aspects of the present invention.

According to the present invention, patent documents can be analyzed in a relatively short time.

FIG. 1 is a diagram showing the configuration of the document analysis support system according to the embodiment.
FIG. 2 is a block diagram showing the functional configuration of the document analysis support apparatus of FIG. 1.
FIG. 3 is a data structure diagram showing morpheme information.
FIG. 4 is a data structure diagram showing dependency information.
FIG. 5 is a flowchart showing a series of processes related to the analysis processing in the document analysis support apparatus of FIG. 1.
FIG. 6 is a flowchart showing a series of processes related to the extraction and totaling processing in the document analysis support apparatus of FIG. 1.

The outline of the document analysis support system according to the present embodiment is as follows.
The document analysis support system according to the present embodiment holds in advance the results of morphological analysis of all patent documents held in the patent database. It also holds in advance the results of dependency analysis performed using those morphological analysis results. The document analysis support system searches the patent database under given search conditions and acquires a list of the patent documents to be analyzed. For example, when the patent applications of "Applicant A" are to be analyzed, "Applicant A" is set in the "Applicant / right holder" field of the search conditions and the patent database is searched. From the morpheme information and dependency information held in advance, the document analysis support system extracts the morpheme information and dependency information for the patent documents included in the list, and then analyzes them by totaling and similar operations. In other words, because the patent documents to be analyzed have already undergone morphological analysis and dependency analysis, neither needs to be run at analysis time. It is sufficient to retrieve the stored results and total them. Patent documents can therefore be analyzed in a relatively short time.
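
The outline above can be sketched roughly as follows. This is a minimal illustration, not the implementation described in the specification: the store `PRECOMPUTED_MORPHEMES`, the document IDs, and the function `analyze` are all hypothetical names chosen for the example. The point it shows is that analysis time involves only lookups and counting, with no morphological analysis.

```python
from collections import Counter

# Hypothetical pre-computed store: document ID -> list of (morpheme, part of speech).
# In the described system this would be filled in advance for every document
# in the patent database; the contents here are made-up sample data.
PRECOMPUTED_MORPHEMES = {
    "DOC-1": [("tape", "noun"), ("print", "noun"), ("lever", "noun")],
    "DOC-2": [("lever", "noun"), ("cut", "verb")],
    "DOC-3": [("satellite", "noun"), ("positioning", "noun")],
}


def analyze(doc_ids):
    """Count morphemes for the listed documents using only stored results."""
    counts = Counter()
    for doc_id in doc_ids:
        # Lookup only; no morphological analysis runs at this point.
        for surface, _pos in PRECOMPUTED_MORPHEMES.get(doc_id, []):
            counts[surface] += 1
    return counts


# The list of document IDs stands in for the search result.
result = analyze(["DOC-1", "DOC-2"])
```

Because the expensive step is moved to a batch job, the cost at analysis time in this sketch grows with the size of the list rather than with the size of the whole database.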

FIG. 1 shows the configuration of a document analysis support system 10 according to the embodiment. The document analysis support system 10 includes a document analysis support apparatus 100, a patent database 200, and a user terminal 300. These devices are connected via a known communication network such as a LAN, a WAN, or the Internet.

The document analysis support apparatus 100 is an apparatus for supporting the analysis of patent documents held in the patent database 200. Its detailed functional configuration is described later with reference to FIG. 2. The user terminal 300 is an information processing terminal operated by a user. The user terminal 300 is a general PC (Personal Computer) on which a web browser is installed. The user terminal 300 may instead be a tablet terminal, a smartphone, or the like.

The patent database 200 holds patent documents whose applications have been published. The patent documents referred to here also include accompanying bibliographic information such as prosecution history information. The patent database 200 holds the patent documents in text format. The patent database 200 may be an existing database such as NRI Cyber Patent Desk (registered trademark) or the Industrial Property Digital Library provided by the National Center for Industrial Property Information and Training.

FIG. 2 is a block diagram showing the functional configuration of the document analysis support apparatus 100 of FIG. 1. In terms of hardware, each of these blocks can be realized by elements such as a computer CPU and by mechanical devices; in terms of software, each is realized by a computer program or the like. What is drawn here are functional blocks realized by their cooperation. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by combinations of hardware and software.

The document analysis support apparatus 100 includes an analysis unit 120, a patent document search unit 130, a patent document list acquisition unit 140, an analysis information extraction unit 150, a totaling unit 160, a display control unit 170, a morpheme information holding unit 181, a dependency information holding unit 182, an extracted morpheme data holding unit 183, an extracted dependency data holding unit 184, a dictionary data holding unit 185, a morpheme total data holding unit 186, and a dependency total data holding unit 187.

The analysis unit 120 analyzes patent documents. The analysis unit 120 includes a morphological analysis unit 121 and a dependency analysis unit 122. The morphological analysis unit 121 performs morphological analysis on all patent documents held in the patent database 200. Here, morphological analysis means dividing a sentence into the smallest meaningful character strings (morphemes) and classifying each resulting string by part of speech. For example, decomposing 「私の名前は鈴木です」 ("My name is Suzuki") into morphemes yields 「私 (pronoun)」, 「の (particle)」, 「名前 (common noun)」, 「は (binding particle)」, 「鈴木 (proper noun)」, and 「です (auxiliary verb)」.
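
As a rough illustration of what such an analysis result looks like as data, the example sentence can be written out by hand as a list of surface strings paired with part-of-speech labels. The `Morpheme` type and the `nouns` helper are hypothetical names for this sketch, and the list is typed in manually rather than produced by an analyzer.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    surface: str  # the character string of the morpheme
    pos: str      # part-of-speech label


# Hand-written analysis result for the example sentence (not produced by an analyzer).
ANALYZED = [
    Morpheme("私", "pronoun"),
    Morpheme("の", "particle"),
    Morpheme("名前", "common noun"),
    Morpheme("は", "binding particle"),
    Morpheme("鈴木", "proper noun"),
    Morpheme("です", "auxiliary verb"),
]


def nouns(morphemes):
    """Return the surfaces of common and proper nouns only."""
    return [m.surface for m in morphemes if m.pos in {"common noun", "proper noun"}]


# Concatenating the surfaces in order gives back the original sentence.
sentence = "".join(m.surface for m in ANALYZED)
noun_list = nouns(ANALYZED)
```

Keeping the part of speech next to each surface string is what later allows totaling to be restricted to, say, nouns that are likely to be technical terms.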

The morphological analysis unit 121 performs morphological analysis with reference to the dictionary data held in the dictionary data holding unit 185. This allows patent documents to be divided into morphemes more accurately, as described later. In the present embodiment, the morphological analysis unit 121 analyzes the "problem" section of the abstract and claim 1 of each patent document. Instead of, or in addition to, these, it may analyze the "solution" section of the abstract, other claims, and the individual sections of the specification. The morpheme information obtained as a result of the morphological analysis is held in the morpheme information holding unit 181. The morpheme information is described later with reference to FIG. 3.

When a patent document held in the patent database 200 is updated, the morphological analysis unit 121 performs morphological analysis on the updated patent document. Specifically, when a patent document is updated by an amendment to the specification or the like, or when a patent application is newly published and its patent document is added to the patent database 200, the morphological analysis unit 121 analyzes the updated or added patent documents and updates the morpheme information held in the morpheme information holding unit 181. In general, the patent database 200 is updated periodically, once every two weeks. The morphological analysis unit 121 may be scheduled to analyze the updated or added patent documents after each periodic update. It may of course instead run the morphological analysis upon receiving an instruction.

The morphological analysis unit 121 also performs morphological analysis when the dictionary data is updated. In this case, the analysis results of all patent documents may be affected, so the morphological analysis unit 121 analyzes all patent documents again and updates the morpheme information held in the morpheme information holding unit 181.

The dependency analysis unit 122 performs dependency analysis using the analysis results of the morphological analysis unit 121, that is, the morpheme information. Specifically, it determines the dependency relations between morphemes. The dependency information obtained as a result is held in the dependency information holding unit 182. The dependency information is described later with reference to FIG. 4. Like the morphological analysis unit 121, when a patent document held in the patent database 200 is updated, the dependency analysis unit 122 performs dependency analysis on the updated patent document and updates the dependency information held in the dependency information holding unit 182. When the dictionary data is updated, it performs dependency analysis on all patent documents again and updates the dependency information. The dependency analysis unit 122 may be configured to start dependency analysis upon completion of the morphological analysis by the morphological analysis unit 121.

The patent document search unit 130 searches for the patent documents to be analyzed. The patent document search unit 130 includes a search condition acquisition unit 131, a search expression generation unit 132, and a search execution unit 133. The search condition acquisition unit 131 receives, from the user terminal 300, input of search conditions for finding the patent documents to be analyzed. Various search conditions can be entered freely, such as applicant or right-holder name, applicant identification number, inventor name, publication date, filing date, and keywords for searching the description of embodiments, the claims, the abstract, and the like. For example, to analyze the patent applications of "Applicant A" in the technical field of "satellite positioning", "Applicant A" is entered as the applicant or right-holder name, and "satellite positioning" is entered as the search keyword for the description of embodiments, the claims, the abstract, and the like.

The search expression generation unit 132 generates a search expression 21 based on the search conditions received by the search condition acquisition unit 131. The search execution unit 133 searches the patent database 200 based on the search expression 21 generated by the search expression generation unit 132.
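
One way to picture the step from search conditions to a search expression is sketched below. The specification does not state the syntax of the search expression 21, so the `field="value"` form joined with `AND`, the field names, and the function `build_search_expression` are all assumptions made for illustration.

```python
def build_search_expression(conditions):
    """Join the non-empty field conditions with AND into one expression string.

    The field="value" syntax is a hypothetical one chosen for this sketch;
    a real patent database would define its own query language.
    """
    parts = []
    for field, value in conditions.items():
        if value:  # skip fields the user left blank
            parts.append(f'{field}="{value}"')
    return " AND ".join(parts)


# Conditions corresponding to the "Applicant A" / "satellite positioning" example.
expr = build_search_expression(
    {"applicant": "Applicant A", "inventor": "", "keyword": "satellite positioning"}
)
```

Separating condition capture from expression generation, as the specification does, means the same stored conditions can be turned into an expression again later without asking the user to re-enter them.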

The patent document list acquisition unit 140 acquires, in the form of document IDs such as publication numbers, either the list of patent documents obtained as a result of the search by the search execution unit 133 or a list of patent documents specified from the user terminal 300 (hereinafter collectively referred to as the patent document list 22).

The analysis information extraction unit 150 extracts the analysis results of patent documents. The analysis information extraction unit 150 includes a morpheme information extraction unit 151 and a dependency information extraction unit 152. The morpheme information extraction unit 151 extracts, from the morpheme information holding unit 181, the morpheme information of the patent documents identified by the document IDs included in the patent document list 22. The morpheme information extraction unit 151 records the extracted morpheme information in the extracted morpheme data holding unit 183.

The dependency information extraction unit 152 extracts, from the dependency information holding unit 182, the dependency information of the patent documents identified by the document IDs 24 included in the patent document list 22. The dependency information extraction unit 152 records the extracted dependency information in the extracted dependency data holding unit 184.

The totaling unit 160 totals the extracted analysis results. The totaling unit 160 includes a morpheme totaling unit 161 and a dependency totaling unit 162. The morpheme totaling unit 161 totals the frequency of occurrence of each morpheme included in the morpheme information held in the extracted morpheme data holding unit 183, grouping by arbitrary items. The user sets which items are used for grouping. For example, the morpheme totaling unit 161 may total the frequency of each morpheme grouped by applicant. In this case, the technical terms that appear frequently in each applicant's patent documents can be identified, as can technical terms that appear only in the patent documents of a particular applicant. From these, the technologies on which each applicant is focusing can be understood.

As another example, the morpheme totaling unit 161 may total the frequency of each morpheme grouped by period and by applicant. In this case, frequently occurring technical terms can be identified for each period and each applicant, and changes in each applicant's technology trends can be tracked. The morpheme totaling unit 161 records the totaling results in the morpheme total data holding unit 186.
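
The grouped totaling described above can be sketched as follows. The row layout `(applicant, filing year, morpheme)`, the sample rows, and the helpers `count_by` and `exclusive_terms` are hypothetical and serve only to illustrate grouping by applicant, grouping by year and applicant, and finding terms that occur for a single applicant.

```python
from collections import Counter, defaultdict

# Hypothetical extracted rows: (applicant, filing year, morpheme). Sample data only.
ROWS = [
    ("Applicant A", 2011, "antenna"),
    ("Applicant A", 2011, "antenna"),
    ("Applicant A", 2012, "battery"),
    ("Applicant B", 2011, "antenna"),
    ("Applicant B", 2012, "sensor"),
]


def count_by(rows, key):
    """Count morphemes per group, where key(row) returns the group label."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[key(row)][row[2]] += 1
    return groups


def exclusive_terms(groups):
    """Map each term that occurs in exactly one group to that group."""
    owners = defaultdict(set)
    for group, counter in groups.items():
        for term in counter:
            owners[term].add(group)
    return {term: next(iter(gs)) for term, gs in owners.items() if len(gs) == 1}


by_applicant = count_by(ROWS, key=lambda r: r[0])
by_applicant_year = count_by(ROWS, key=lambda r: (r[0], r[1]))
only_one_applicant = exclusive_terms(by_applicant)
```

Passing the grouping as a `key` function mirrors the idea that the user chooses which items to group by; adding a new grouping needs no change to the counting logic.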

The dependency totaling unit 162 totals the frequency of occurrence of each dependency relation included in the dependency information held in the extracted dependency data holding unit 184, grouping by arbitrary items. The user sets which items are used for grouping. For example, the dependency totaling unit 162 may total, grouped by applicant, the frequency of morphemes that are in a dependency relation with a specific morpheme (for example, "problem").

As another example, the dependency totaling unit 162 may total the frequency of morphemes that are in a dependency relation with a specific morpheme while grouping them by synonym. For example, it may total the frequency of morphemes that are in a dependency relation with the word "problem" and are synonymous with "usability", "ease of viewing", "confidentiality", "accuracy improvement", "smaller and lighter", or "low power consumption", separately for each of those six strings. In this case, it is possible to see which of "usability", "ease of viewing", "confidentiality", "accuracy improvement", "smaller and lighter", and "low power consumption" is being addressed as the problem. The dependency totaling unit 162 records the totaling results in the dependency total data holding unit 187.
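
A minimal sketch of synonym-grouped totaling follows. The synonym table, the sample dependency pairs, and the function `count_related` are hypothetical; the specification does not say how synonyms are determined, so a plain lookup table is assumed here.

```python
from collections import Counter

# Hypothetical synonym table: surface form -> canonical label.
SYNONYMS = {
    "usability": "usability",
    "ease of use": "usability",
    "miniaturization": "smaller and lighter",
    "weight reduction": "smaller and lighter",
    "power saving": "low power consumption",
}

# Hypothetical dependency pairs: (source morpheme, target morpheme).
PAIRS = [
    ("ease of use", "problem"),
    ("usability", "problem"),
    ("power saving", "problem"),
    ("weight reduction", "problem"),
    ("power saving", "effect"),  # depends on a different word, so not counted
    ("cost", "problem"),         # belongs to no synonym group, so not counted
]


def count_related(pairs, target, synonyms):
    """Count sources that depend on `target`, grouped by canonical synonym label."""
    counts = Counter()
    for source, dest in pairs:
        if dest == target and source in synonyms:
            counts[synonyms[source]] += 1
    return counts


counts = count_related(PAIRS, "problem", SYNONYMS)
```

Here the two different wordings "ease of use" and "usability" end up in one bucket, which is the effect the paragraph above describes.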

The display control unit 170 causes the user terminal 300 to display the results totaled by the totaling unit 160. The display control unit 170 may also apply principal component analysis to the morpheme information extracted by the morpheme information extraction unit 151 and the dependency information extracted by the dependency information extraction unit 152, and visualize the morpheme information and dependency information by placing them on a two-dimensional map.

The morpheme information holding unit 181 holds morpheme information. In particular, it holds morpheme information for all patent documents held in the patent database 200. FIG. 3 is a data structure diagram showing the morpheme information. The document ID 24 is an ID that uniquely identifies a patent document. The item 28 indicates the section of the patent document in which each morpheme appears. The sentence number 30 is a number that uniquely identifies a sentence within its item. The morpheme ID 32 is an ID that uniquely identifies a morpheme within its sentence. The morpheme 34 is a morpheme contained in the sentence. The part of speech 36 is the part of speech of the morpheme. For example, the first sentence of the abstract of the patent document whose document ID is "JP 2003-0001" contains the morpheme "lever (noun)".
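
The record layout described for FIG. 3 can be sketched as a small data class. The field names are English renderings chosen for this example, and apart from the "lever" row taken from the text, the sample rows are invented. The composite key reflects the statement that a morpheme ID is unique only within its sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MorphemeRecord:
    document_id: str  # document ID 24
    item: str         # item 28 (section of the patent document)
    sentence_no: int  # sentence number 30, unique within the item
    morpheme_id: int  # morpheme ID 32, unique within the sentence
    morpheme: str     # morpheme 34
    pos: str          # part of speech 36

    def key(self):
        """Composite key that identifies one morpheme occurrence."""
        return (self.document_id, self.item, self.sentence_no, self.morpheme_id)


# The first row follows the example in the text; the others are made-up samples.
records = [
    MorphemeRecord("JP 2003-0001", "abstract", 1, 1, "lever", "noun"),
    MorphemeRecord("JP 2003-0001", "abstract", 1, 2, "operate", "verb"),
    MorphemeRecord("JP 2003-0001", "abstract", 2, 1, "tape", "noun"),
]

index = {r.key(): r for r in records}
```

Note that morpheme ID 1 appears twice without conflict, because the two rows belong to different sentences.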

The dependency information holding unit 182 holds dependency information. In particular, it holds dependency information for all patent documents held in the patent database 200. FIG. 4 is a data structure diagram showing the dependency information. The morpheme ID (source) 40 and the morpheme (source) 42 indicate the morpheme ID and the morpheme of the dependency source, respectively. The morpheme ID (target) 44 and the morpheme (target) 46 indicate the morpheme ID and the morpheme of the dependency target, respectively. For example, the data shows that "tape (noun)" and "print (noun)" in the first sentence of the title of the invention of the patent document whose document ID is "JP 2003-0001" are in a dependency relation.
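
A dependency record along the lines of FIG. 4 might be modelled as below. This is a sketch under assumptions: the text names only the source and target fields explicitly, so the presence of document ID, item, and sentence number columns is inferred from the example, the direction "tape" to "print" is chosen arbitrarily for illustration, and the second row is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyRecord:
    document_id: str  # assumed to be carried over from the morpheme information
    item: str         # assumed column: section of the patent document
    sentence_no: int  # assumed column: sentence number within the item
    source_id: int    # morpheme ID (source) 40
    source: str       # morpheme (source) 42
    target_id: int    # morpheme ID (target) 44
    target: str       # morpheme (target) 46


records = [
    # Pair taken from the example in the text; the direction is an assumption.
    DependencyRecord("JP 2003-0001", "title", 1, 1, "tape", 2, "print"),
    # Made-up second row for illustration.
    DependencyRecord("JP 2003-0001", "title", 1, 2, "print", 3, "device"),
]


def sources_of(recs, target):
    """Return the morphemes that depend on the given target morpheme."""
    return [r.source for r in recs if r.target == target]


depends_on_print = sources_of(records, "print")
```

Storing both the ID and the surface string on each side lets a query like `sources_of` run without joining back to the morpheme table.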

The extracted morpheme data holding unit 183 holds the morpheme information of the patent documents to be analyzed, that is, the morpheme information that the morpheme information extraction unit 151 has extracted from the morpheme information holding unit 181. The data structure of the information held in the extracted morpheme data holding unit 183 is the same as in FIG. 3.

The extracted dependency data holding unit 184 holds the dependency information of the patent documents to be analyzed, that is, the dependency information that the dependency information extraction unit 152 has extracted from the dependency information holding unit 182. The data structure of the information held in the extracted dependency data holding unit 184 is the same as in FIG. 4.

The dictionary data holding unit 185 holds dictionary data containing technical and specialized terms. By referring to this dictionary data, the morphological analysis unit 121 can divide text into morphemes accurately. For example, if the dictionary data contains the technical term 「気液界面」 ("gas-liquid interface"), meaning the interface between a gas and a liquid, then when this term appears in a patent document it can be treated as the single word 「気液界面」 rather than as the two words 「気液」 ("gas-liquid") and 「界面」 ("interface").
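
The effect of adding a technical term to the dictionary can be illustrated with a toy greedy longest-match segmenter. This is a deliberate simplification and an assumption of this sketch: the specification does not say how the morphological analysis unit 121 uses the dictionary, and practical analyzers typically use more elaborate methods. The function `segment` and both dictionaries are hypothetical.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation.

    At each position the longest dictionary entry is taken; a character that
    starts no dictionary entry becomes a single-character token.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


base_dictionary = {"気液", "界面"}
technical_dictionary = base_dictionary | {"気液界面"}

without_term = segment("気液界面", base_dictionary)
with_term = segment("気液界面", technical_dictionary)
```

With only the general entries the term splits in two; once the compound is registered it comes out as one token, which is why a dictionary update can change the results for every document.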

The morpheme total data holding unit 186 holds the data totaled by the morpheme totaling unit 161. The dependency total data holding unit 187 holds the data totaled by the dependency totaling unit 162.

The operation of the document analysis support apparatus 100 configured as above will now be described.
FIG. 5 is a flowchart showing a series of processes related to the analysis processing in the document analysis support apparatus 100. If the patent documents held in the patent database 200 have never undergone morphological analysis (Y in S10), or if the dictionary data held in the dictionary data holding unit 185 has been updated (Y in S11), the morphological analysis unit 121 performs morphological analysis on all patent documents held in the patent database 200 and updates the morpheme information held in the morpheme information holding unit 181 (S12). The dependency analysis unit 122 then performs dependency analysis on all patent documents and updates the dependency information held in the dependency information holding unit 182 (S13).

If a patent document has been added to the patent database 200 (Y in S14), or if an existing patent document has been updated (Y in S15), the morphological analysis unit 121 performs morphological analysis on the added or updated patent documents and updates the morpheme information (S16). The dependency analysis unit 122 performs dependency analysis on those patent documents and updates the dependency information (S17).
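
The branching of steps S10 to S17 can be summarized in a small planning function. This is a sketch of the decision logic only; the function `plan_analysis`, its parameters, and the idea of returning a list of document IDs to process are assumptions made for illustration.

```python
def plan_analysis(never_analyzed, dictionary_updated, added_ids, updated_ids, all_ids):
    """Return the document IDs to (re)analyze, following the S10 to S17 branches."""
    if never_analyzed or dictionary_updated:  # S10 / S11
        # S12, S13: full morphological and dependency analysis of every document.
        return sorted(all_ids)
    if added_ids or updated_ids:  # S14 / S15
        # S16, S17: analyze only the documents that were added or updated.
        return sorted(set(added_ids) | set(updated_ids))
    return []  # nothing to do


full = plan_analysis(False, True, [], [], ["B", "A", "C"])
incremental = plan_analysis(False, False, ["C"], ["A"], ["A", "B", "C"])
idle = plan_analysis(False, False, [], [], ["A", "B", "C"])
```

A dictionary change forces the full path because any stored segmentation may be affected, whereas a database change touches only the listed documents.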

FIG. 6 is a flowchart showing a series of processes related to the extraction and totaling processing in the document analysis support apparatus 100. The patent document search unit 130 searches for patent documents (S20). The analysis information extraction unit 150 extracts the analysis results (morpheme information and dependency information) for the patent documents included in the search results (S21). The totaling unit 160 totals the analysis results under arbitrary conditions (S22). The display control unit 170 causes the user terminal 300 to display the totaling results and the like (S23).

According to the document analysis support apparatus 100 of the present embodiment, analysis results (morpheme information and dependency information) are held for all patent documents held in the patent database 200. Therefore, morphological analysis and dependency analysis need not be run at analysis time; it is sufficient to extract the desired results from those held in advance and total them. This makes it possible to analyze patent documents in a relatively short time.

(Second Embodiment)
The main difference between the document analysis support apparatus according to the first embodiment and that according to the second embodiment lies in how each component operates when patent documents held in the patent database 200 are added or updated.
The document analysis support apparatus 100 according to the second embodiment has the same configuration as in FIG. 2, except that it further includes a search condition holding unit. The following description focuses on the differences from the first embodiment.

The search condition holding unit holds the search conditions received by the search condition acquisition unit 131.
When the patent documents held in the patent database 200 are updated, the patent document search unit 130 searches for the updated patent documents based on the search conditions held in the search condition holding unit. Specifically, the search condition acquisition unit 131 first acquires the search conditions from the search condition holding unit. The search expression generation unit 132 generates the search expression 21 based on these search conditions. The search execution unit 133 searches the patent database 200 based on the search expression 21 generated by the search expression generation unit 132. At this time, the search execution unit 133 refers to the update date and time of each patent document to find the patent documents newly added since the previous search and those updated since the previous search. These processes may be scheduled to run after each periodic update of the patent database 200.
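
The re-search with stored conditions can be sketched as a filter on both the saved conditions and the update timestamp. The document dictionaries, the field names, the dates, and the function `search_updated` are hypothetical; an in-memory list stands in for the patent database 200.

```python
from datetime import datetime

# Hypothetical stand-in for the patent database; sample data only.
DOCS = [
    {"id": "DOC-1", "applicant": "Applicant A", "updated": datetime(2013, 10, 1)},
    {"id": "DOC-2", "applicant": "Applicant A", "updated": datetime(2013, 10, 20)},
    {"id": "DOC-3", "applicant": "Applicant B", "updated": datetime(2013, 10, 25)},
]


def search_updated(docs, saved_conditions, last_search):
    """Return IDs of documents that match the saved conditions and changed after last_search."""
    return [
        d["id"]
        for d in docs
        if d["updated"] > last_search
        and all(d.get(field) == value for field, value in saved_conditions.items())
    ]


saved = {"applicant": "Applicant A"}  # conditions kept by the search condition holding unit
hits = search_updated(DOCS, saved, last_search=datetime(2013, 10, 15))
```

Only DOC-2 is returned: DOC-1 matches the conditions but has not changed since the previous search, and DOC-3 changed but belongs to a different applicant.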

The patent document list acquisition unit 140 acquires the list of patent documents obtained as a result of the search by the search execution unit 133. This list is referred to as the patent document list 22'. That is, the patent document list 22' contains the document IDs of the added and updated patent documents. The morpheme information extraction unit 151 extracts the morpheme information of the patent documents identified by the document IDs 24 included in the patent document list 22' and updates the extracted morpheme data holding unit 183. If the morphological analysis unit 121 has not yet finished analyzing the updated patent documents, the morpheme information extraction unit 151 waits for it to finish before extracting the morpheme information.

The dependency information extraction unit 152 extracts the dependency information of the patent documents identified by the document IDs 24 included in the patent document list 22', and updates the extracted dependency data holding unit 184. If the dependency analysis unit 122 has not yet completed the dependency analysis of an updated patent document, the dependency information extraction unit 152 waits for it to complete before extracting the dependency information.

The aggregation unit 160 performs the aggregation process again on the data held in the updated extracted morpheme data holding unit 183 and extracted dependency data holding unit 184. When a patent document held in the patent database 200 is added or updated, the series of processes from the patent document search unit 130 through the analysis information extraction unit 150 is performed automatically.
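As a rough sketch of this update flow, the following example copies the morphemes of added or updated documents into the extracted data and then recounts. The dictionaries `morpheme_store` and `extracted` and the function `refresh` are hypothetical stand-ins for the morpheme information holding unit 181, the extracted morpheme data holding unit 183, and the re-aggregation step; they are not taken from the embodiment.

```python
from collections import Counter

# Hypothetical morpheme information store: document ID -> morphemes from prior analysis.
morpheme_store = {
    "JP-A": ["tape", "print"],
    "JP-B": ["tape", "cut", "tape"],
}

# Hypothetical extracted data for the current analysis target.
extracted = {"JP-A": ["tape", "print"]}


def refresh(extracted, morpheme_store, updated_ids):
    """Copy morphemes of added/updated documents into the extracted data, then recount."""
    for doc_id in updated_ids:
        extracted[doc_id] = morpheme_store[doc_id]
    counts = Counter()
    for morphemes in extracted.values():
        counts.update(morphemes)
    return counts


counts = refresh(extracted, morpheme_store, ["JP-B"])
print(counts.most_common(1))  # [('tape', 3)]
```

Because the morphemes were already produced by the earlier analysis, the refresh only copies and counts; no morphological analysis is run at this point.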

The document analysis support system 10 according to the present embodiment achieves the same operational effects as the first embodiment. In addition, every time the patent database 200 is updated, the system automatically performs a search based on the pre-registered search condition and updates the morpheme information and dependency information of the analysis targets as well as the aggregation results. As a result, the latest analysis results are always available.

The present invention has been described above based on the embodiments. These embodiments are illustrative, and those skilled in the art will understand that various modifications can be made to the combinations of their constituent elements and processing steps, and that such modifications are also within the scope of the present invention. Modifications are described below.

A first modification will be described. In the first and second embodiments, the case where the document data is patent documents has been described, but the document data is not limited to this. For example, the document data may be academic papers, newspapers, magazines, or other documents.

A second modification will be described. In the first and second embodiments, the case has been described in which the morpheme information held in the morpheme information holding unit 181 is configured as one record per morpheme, but the configuration is not limited to this. For example, the morpheme information may be configured as one record per patent document. Specifically, for example, the data in each field of each record in FIG. 3 may be joined with space delimiters, as in 「特開２００３−０００１発明の名称１１テープ２印字特開２００３−０００１発明の名称１２印字３装置・・・」, so that there is one record per patent document. This reduces the number of morpheme information records and therefore improves the speed of searching them. As a result, the morpheme information extraction unit 151 can extract the morpheme information of the patent documents to be analyzed from the morpheme information holding unit 181 more quickly.
The same applies to the dependency information held in the dependency information holding unit 182.
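The record consolidation of the second modification might look like the following sketch. The row layout (document ID, section, sentence number, morpheme number, morpheme) is an assumption made for illustration, since FIG. 3 is not reproduced here, and `to_document_records` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical per-morpheme rows: (doc_id, section, sentence_no, morpheme_no, morpheme)
rows = [
    ("JP2003-0001", "title", 1, 1, "tape"),
    ("JP2003-0001", "title", 1, 2, "print"),
    ("JP2003-0002", "title", 1, 1, "ink"),
]


def to_document_records(rows):
    """Join the fields of every row with spaces and keep one string per document."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(" ".join(str(field) for field in row))
    return {doc_id: " ".join(parts) for doc_id, parts in grouped.items()}


records = to_document_records(rows)
print(len(rows), "->", len(records))  # 3 -> 2
```

With one record per document, a lookup by document ID returns a single record instead of one record per morpheme, which is the source of the speed-up described above.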

A third modification will be described. Although not mentioned in the first and second embodiments, the document analysis support apparatus 100 may further include a template holding unit that holds templates specifying the items by which the aggregation unit 160 groups the analysis results when aggregating them. As an example, a template specifies that "the morpheme aggregation unit 161 aggregates the appearance frequency of each morpheme in the morpheme information while grouping by period and by applicant, and the dependency aggregation unit 162 aggregates the appearance frequency of morphemes that are in a dependency relationship with the morpheme "課題" (problem) while grouping by applicant." The template holding unit may hold a plurality of such templates, and the user simply selects the desired one. The user then no longer needs to specify the grouping items for aggregation, which reduces the burden on the user.
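A minimal sketch of template-driven aggregation, under stated assumptions: the `TEMPLATES` mapping, the document fields `period` and `applicant`, and the `aggregate` function are all hypothetical names chosen for the example, and only the morpheme frequency count is shown.

```python
from collections import Counter

# Hypothetical templates: template name -> fields to group by.
TEMPLATES = {
    "by_period_and_applicant": ("period", "applicant"),
    "by_applicant": ("applicant",),
}

docs = [
    {"period": "2000s", "applicant": "X Corp", "morphemes": ["tape", "print"]},
    {"period": "2000s", "applicant": "X Corp", "morphemes": ["tape"]},
    {"period": "2010s", "applicant": "Y Corp", "morphemes": ["tape"]},
]


def aggregate(docs, template_name):
    """Count morpheme occurrences, grouped by the fields the selected template names."""
    group_fields = TEMPLATES[template_name]
    counts = Counter()
    for doc in docs:
        group = tuple(doc[field] for field in group_fields)
        for morpheme in doc["morphemes"]:
            counts[(group, morpheme)] += 1
    return counts


result = aggregate(docs, "by_applicant")
print(result[(("X Corp",), "tape")])  # 2
```

Selecting a different template name changes the grouping without the user having to list the grouping items each time.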

A fourth modification will be described. In the first and second embodiments, the case has been described in which the document analysis support apparatus 100 includes the analysis unit 120, the patent document search unit 130, the patent document list acquisition unit 140, the analysis information extraction unit 150, the aggregation unit 160, the display control unit 170, the morpheme information holding unit 181, the dependency information holding unit 182, the extracted morpheme data holding unit 183, the extracted dependency data holding unit 184, the dictionary data holding unit 185, the morpheme aggregation data holding unit 186, and the dependency aggregation data holding unit 187. However, the configuration is not limited to this, and some of the functions of the document analysis support apparatus 100 may be moved to other apparatuses. For example, a patent document search apparatus may be provided separately from the document analysis support apparatus and given the function of the patent document search unit 130. As another example, an analysis apparatus may be provided separately from the document analysis support apparatus and given the function of the analysis unit 120.

Any combination of the above-described embodiments and modifications is also useful as an embodiment of the present invention. A new embodiment produced by such a combination has the effects of each of the combined embodiments and modifications. Those skilled in the art will also understand that the functions to be fulfilled by the constituent elements recited in the claims are realized by the individual constituent elements shown in the embodiments and modifications, or by their cooperation.

100 document analysis support apparatus, 120 analysis unit, 121 morphological analysis unit, 122 dependency analysis unit, 130 patent document search unit, 131 search condition acquisition unit, 132 search formula generation unit, 133 search execution unit, 140 patent document list acquisition unit, 150 analysis information extraction unit, 151 morpheme information extraction unit, 152 dependency information extraction unit, 160 aggregation unit, 161 morpheme aggregation unit, 162 dependency aggregation unit, 170 display control unit, 181 morpheme information holding unit, 182 dependency information holding unit, 183 extracted morpheme data holding unit, 184 extracted dependency data holding unit, 185 dictionary data holding unit, 200 patent database, 300 user terminal.

Claims

A document analysis support system comprising:
an analysis unit that performs morphological analysis on a plurality of document data held in a database and generates morpheme information obtained by dividing each of the plurality of document data into morphemes;
a morpheme information holding unit that holds the morpheme information generated by the analysis unit;
a search unit that receives input of a search condition for searching for document data and searches for document data under the search condition;
a list acquisition unit that acquires a list of the document data, among the document data held in the database, obtained as a result of the search by the search unit;
an analysis information extraction unit that extracts, from the morpheme information holding unit, the morpheme information of the document data included in the list; and
an aggregation unit that aggregates the morpheme information extracted by the analysis information extraction unit,
wherein, through the morphological analysis by the analysis unit of the document data held in the database, the morpheme information holding unit holds the morpheme information of all document data targeted by the system,
when document data held in the database is updated, the analysis unit performs morphological analysis on the updated document data and updates the morpheme information, and
the aggregation unit groups the morphemes included in the morpheme information by an item designated by a template and aggregates their appearance frequencies.

The document analysis support system according to claim 1, wherein the analysis unit further performs dependency analysis to generate dependency information between the morphemes included in the morpheme information and, when document data held in the database is updated, performs dependency analysis on the updated document data and updates the dependency information,
the system further comprises a dependency information holding unit that holds the dependency information generated by the analysis unit, and
the analysis information extraction unit extracts, from the dependency information holding unit, the dependency information of the document data included in the list.

The document analysis support system according to claim 2, further comprising a dictionary data holding unit that holds dictionary data used by the analysis unit for morphological analysis,
wherein, when the dictionary data is updated, the analysis unit performs morphological analysis and dependency analysis on the plurality of document data held in the database and updates the morpheme information and the dependency information.

The document analysis support system according to claim 3, further comprising a search condition holding unit that holds a search condition for searching for document data,
wherein, when document data held in the database is updated, the search unit searches the updated document data for document data that matches the search condition held in the search condition holding unit.

A computer program causing a computer to implement:
an analysis function that performs morphological analysis on a plurality of document data held in a database and generates morpheme information obtained by dividing each of the plurality of document data into morphemes;
a holding function that holds the generated morpheme information;
a search function that receives input of a search condition for searching for document data and searches for document data under the search condition;
a list acquisition function that acquires a list of the document data, among the document data held in the database, obtained as a result of the search;
an analysis information extraction function that extracts the morpheme information of the document data included in the list; and
an aggregation function that aggregates the extracted morpheme information,
wherein, through the morphological analysis by the analysis function of the document data held in the database, the holding function holds the morpheme information of all document data targeted by the system,
when document data held in the database is updated, the analysis function performs morphological analysis on the updated document data and updates the morpheme information, and
the aggregation function groups the morphemes included in the morpheme information by an item designated by a template and aggregates their appearance frequencies.