JP2018169753A

JP2018169753A - Document sorting apparatus, document sorting method and document sorting program

Info

Publication number: JP2018169753A
Application number: JP2017065917A
Authority: JP
Inventors: 大樹清水; Daiki Shimizu
Original assignee: Toyota Technical Development Corp
Current assignee: Toyota Technical Development Corp
Priority date: 2017-03-29
Filing date: 2017-03-29
Publication date: 2018-11-01
Anticipated expiration: 2037-03-29
Also published as: JP6735247B2

Abstract

PROBLEM TO BE SOLVED: To provide a document sorting apparatus for sorting documents with a small processing load.

SOLUTION: The document sorting apparatus includes: a storage unit for storing, in association with each other, first document information indicating a plurality of document data retrieved according to a retrieval formula, first feature information given to the document data and indicating one or more technical features of the document data, and relevant information indicating whether or not the document data is a document desired by the user; an acquisition unit for acquiring second document information that indicates other document data newly retrieved in accordance with the retrieval formula and given second feature information indicating one or more technical features; an extraction unit for extracting a predetermined number of document data from the plurality of document data on the basis of a consistency degree between the second feature information and the first feature information; a determination unit for determining whether or not the other document data is a document desired by the user on the basis of the relevant information associated with each of the predetermined number of document data extracted by the extraction unit; and an output unit for outputting a determination result of the determination unit.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a document classification device, a document classification method, and a document classification program for classifying documents.

In recent years, engineers have often read a variety of patent documents relevant to their own purposes every year in order to keep up with the latest technological trends. Although an engineer narrows down the documents with a search formula, the number of documents obtained frequently remains enormous, and reading all of them is not realistic. The user therefore sometimes screens the documents, that is, sorts the enormous set into documents that should be read and documents that need not be read.

Various techniques exist to assist such screening. For example, a technique has been disclosed in which a seed document representing the content to be searched for is added to the document set obtained as the result of a search by patent classification, clustering based on the similarity of document content is performed, and the patent documents displayed by cluster are screened in turn (see, for example, Patent Document 1). To calculate the similarity, each document is broken down into terms by morphological analysis, and clustering is performed by calculating the similarity between the terms.

In addition, to make screening after a document search efficient, a method has been disclosed that automatically pages through the content representing each document, document by document (see, for example, Patent Document 2).

Furthermore, there is a technique that performs classification automatically: the similarity between a document to be classified and a set of documents that have already been assigned classifications is calculated based on keywords in the documents; a specified number of documents most similar to the input document are extracted; a score is calculated for each classification of the extracted documents based on the number of occurrences of that classification, taking the similarity into account; and the classifications whose scores exceed a specified value are extracted and assigned to the document to be classified (see, for example, Patent Document 3).

Patent Document 1: JP 2003-157270 A
Patent Document 2: JP 2004-234430 A
Patent Document 3: JP 2007-323454 A

In the case of Patent Document 2, although pages are turned automatically document by document, the number of documents the user must read is not reduced, and the screening is still performed by a person, so the processing load on that person remains large.

In the case of Patent Documents 1 and 3, morphological analysis is performed on each document and a similarity is then calculated for each extracted term, so the processor bears an enormous processing load from the morphological analysis and from calculating similarities for a very large number of terms.

The present invention has been made in view of the above problems, and its object is to provide a document classification device, a document classification method, and a document classification program that impose a smaller processing load on people and processors than Patent Documents 1 to 3.

To solve the above problems, a document classification device according to one aspect of the present invention includes: a storage unit that stores, in association with one another, first document information indicating a plurality of document data retrieved according to a search formula, first feature information that is assigned to the document data and indicates one or more technical features of the document data, and relevance information indicating whether or not the document data is a document desired by the user; an acquisition unit that acquires second document information indicating other document data that is newly retrieved according to the search formula and to which second feature information indicating one or more technical features is assigned; an extraction unit that extracts a predetermined number of document data from the plurality of document data based on the degree of coincidence between the second feature information and the first feature information; a determination unit that determines whether or not the other document data is a document desired by the user based on the relevance information associated with each of the predetermined number of document data extracted by the extraction unit; and an output unit that outputs the determination result of the determination unit.

A document classification method according to one aspect of the present invention includes: a storage step of storing, in association with one another, first document information indicating a plurality of document data retrieved according to a search formula, first feature information that is assigned to the document data and indicates one or more technical features of the document data, and relevance information indicating whether or not the document data is a document desired by the user; an acquisition step of acquiring second document information indicating other document data that is newly retrieved according to the search formula and to which second feature information indicating one or more technical features is assigned; an extraction step of extracting a predetermined number of document data from the plurality of document data based on the degree of coincidence between the second feature information and the first feature information; a determination step of determining whether or not the other document data is a document desired by the user based on the relevance information associated with each of the predetermined number of document data extracted in the extraction step; and an output step of outputting the determination result of the determination step.

A document classification program according to one aspect of the present invention causes a computer to realize: a storage function of storing, in association with one another, first document information indicating a plurality of document data retrieved according to a search formula, first feature information that is assigned to the document data and indicates one or more technical features of the document data, and relevance information indicating whether or not the document data is a document desired by the user; an acquisition function of acquiring second document information indicating other document data that is newly retrieved according to the search formula and to which second feature information indicating one or more technical features is assigned; an extraction function of extracting a predetermined number of document data from the plurality of document data based on the degree of coincidence between the second feature information and the first feature information; a determination function of determining whether or not the other document data is a document desired by the user based on the relevance information associated with each of the predetermined number of document data extracted by the extraction function; and an output function of outputting the determination result of the determination function.

In the above document classification device, the extraction unit may extract, from the plurality of document data, a predetermined number of document data in descending order of the degree of coincidence between the second feature information and the first feature information. The determination unit may then determine that the other document data is a document desired by the user when, among the relevance information associated with the extracted documents, the number of items indicating a document desired by the user exceeds a threshold, and may determine that the other document data is not a document desired by the user when the number of items indicating a document not desired by the user exceeds the threshold.
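The threshold comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the label strings `"relevant"` and `"noise"`, the function name `judge_by_threshold`, and the third outcome `"undetermined"` (returned when neither count exceeds the threshold) are hypothetical and do not come from the specification. If the threshold is set below half the number of extracted documents, both counts could exceed it; the sketch simply checks the "relevant" count first.

```python
from collections import Counter


def judge_by_threshold(neighbor_labels, threshold):
    """Judge a new document from the labels of the extracted past documents.

    neighbor_labels: labels ("relevant" or "noise") of the predetermined
    number of past documents with the highest degree of coincidence.
    threshold: count that a label must exceed to decide the outcome.
    """
    counts = Counter(neighbor_labels)
    if counts["relevant"] > threshold:
        return "relevant"
    if counts["noise"] > threshold:
        return "noise"
    # Assumption: the source does not say what happens in this case.
    return "undetermined"


labels = ["relevant", "noise", "relevant", "relevant", "noise"]
print(judge_by_threshold(labels, threshold=2))
```

With five extracted documents and a threshold of 2, the comparison behaves like a simple majority vote.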

In the above document classification device, the determination unit may weight the relevance information associated with the documents extracted by the extraction unit according to the degree of coincidence, and may determine whether or not the other document data is a document desired by the user based on the weighted relevance information.
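One way to picture such weighting is a signed, weighted vote. The sketch below assumes that the weight is the degree of coincidence itself, that labels are the hypothetical strings `"relevant"` and `"noise"`, and that a tie counts as noise; the specification fixes none of these details.

```python
def weighted_judgement(neighbors):
    """Weighted vote over extracted past documents.

    neighbors: list of (degree_of_coincidence, label) pairs, where label is
    "relevant" or "noise". Each label votes with a weight equal to its
    degree of coincidence (an assumed weighting scheme).
    """
    score = 0.0
    for degree, label in neighbors:
        score += degree if label == "relevant" else -degree
    # Assumption: a tie (score == 0) is treated as noise.
    return "relevant" if score > 0 else "noise"


# Two weakly matching relevant documents are outweighed by one strongly
# matching noise document, unlike in an unweighted majority vote.
print(weighted_judgement([(3, "noise"), (1, "relevant"), (1, "relevant")]))
```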

In the above document classification device, the determination unit may determine that other document data is a document not desired by the user when it has second feature information that matches first feature information whose non-relevance rate exceeds a first threshold. Here, the non-relevance rate of a piece of first feature information is the ratio of the document data whose relevance information indicates a document not desired by the user to all the document data with which that first feature information is associated.

In the above document classification device, the determination unit may determine that other document data is a document desired by the user when it has second feature information that matches first feature information whose relevance rate exceeds a second threshold. Here, the relevance rate of a piece of first feature information is the ratio of the document data whose relevance information indicates a document desired by the user to all the document data with which that first feature information is associated.

In the above document classification device, the first threshold may be larger than the second threshold.
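The per-feature rates and the two thresholds can be illustrated with a short sketch. It assumes binary labels (so the relevance rate is one minus the non-relevance rate), and the threshold values 0.9 and 0.7, the function names, and the sample IPC-style codes are all hypothetical choices made only so that the first threshold is larger than the second.

```python
from collections import defaultdict


def noise_rates(past_docs):
    """Return {feature: non-relevance rate} over the classified past documents.

    past_docs: list of (feature_codes, label) pairs, label "relevant" or "noise".
    """
    totals = defaultdict(int)
    noise = defaultdict(int)
    for codes, label in past_docs:
        for code in set(codes):
            totals[code] += 1
            if label == "noise":
                noise[code] += 1
    return {code: noise[code] / totals[code] for code in totals}


def label_features(rates, first_threshold=0.9, second_threshold=0.7):
    """Mark each feature as "noise" or "relevant" when a threshold is exceeded."""
    labels = {}
    for code, noise_rate in rates.items():
        relevant_rate = 1.0 - noise_rate  # valid only for binary labels
        if noise_rate > first_threshold:
            labels[code] = "noise"
        elif relevant_rate > second_threshold:
            labels[code] = "relevant"
    return labels


past = [
    (["G06F 17/30", "H04L 12/00"], "noise"),
    (["H04L 12/00"], "noise"),
    (["G06F 17/30"], "relevant"),
    (["G06F 17/30"], "relevant"),
    (["G06F 17/30"], "relevant"),
]
print(label_features(noise_rates(past)))
```

Features whose rates exceed neither threshold are left unlabeled in this sketch; a stricter first threshold makes it harder to discard a new document than to keep it.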

In the above document classification device, when feature information is used in the search formula, the determination unit may make the determination based on the first feature information excluding that feature information and on the second feature information.
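The effect of this exclusion can be shown with a small sketch. It assumes, purely for illustration, that the degree of coincidence is the number of feature codes shared by the new document and a past document; the function name `coincidence` and the sample codes are hypothetical. A code that appears in the search formula is attached to every retrieved document, so it cannot help distinguish relevant documents from noise.

```python
def coincidence(new_codes, past_codes, query_codes=()):
    """Count shared feature codes, ignoring codes used in the search formula.

    The codes in query_codes are removed from the past document's features
    before the overlap with the new document's features is counted.
    """
    usable_past = set(past_codes) - set(query_codes)
    return len(set(new_codes) & usable_past)


new_doc = ["G06F 17/30", "G06N 5/04"]
past_doc = ["G06F 17/30", "H04L 12/00"]

print(coincidence(new_doc, past_doc))                              # code shared via the search formula counts
print(coincidence(new_doc, past_doc, query_codes=["G06F 17/30"]))  # that code is ignored
```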

The document classification device according to one aspect of the present invention can determine whether or not newly retrieved other document data is a document desired by the user, based on the degree of coincidence with the feature information, indicating technical information, that is assigned to documents previously retrieved with the same search formula, and on the relevance information indicating whether or not each of those documents is a document desired by the user. The device can therefore screen out documents that the user does not need to read, which reduces the number of documents the user must read and thus the load on the user. In addition, because the device can screen documents without examining their contents, the processing load on the device itself is also reduced.

FIG. 1 is a block diagram showing a configuration example of the document classification device.
FIG. 2 is a block diagram showing a detailed configuration example of the document classification device.
FIG. 3(a) is a data conceptual diagram showing an example of a past document list, which is a list of classified documents. FIG. 3(b) is a data conceptual diagram showing an example of a new document list, which is a list of new documents.
FIG. 4 is a data conceptual diagram showing an example of a coincidence table in which past documents are sorted by degree of coincidence with a new document.
FIG. 5 is a data conceptual diagram showing an example of IPC relevance information indicating, for each classification, whether it is noise or relevant.
FIG. 6 is a flowchart showing the operation of the document classification device for advance preparation.
FIG. 7 is a flowchart showing the processing that follows FIG. 6.
FIG. 8 is a flowchart showing the operation of the document classification device when classifying a new document.
FIG. 9 is a flowchart showing the processing that follows FIG. 8.
FIG. 10 is a flowchart showing the processing that follows FIG. 9.
FIG. 11 is a block diagram showing a configuration example of the document classification device.

Hereinafter, a document classification device according to an embodiment of the present invention will be described in detail with reference to the drawings.

<Embodiment>
<Configuration>
FIG. 1 is a block diagram showing a configuration example of the document classification device. As shown in FIG. 1, the document classification device includes a storage unit 130, an acquisition unit 110, an extraction unit 121, a determination unit 122, and an output unit 140.

The storage unit 130 stores, in association with one another, first document information indicating a plurality of document data retrieved according to a search formula, first feature information that is assigned to the document data and indicates one or more technical features of the document data, and relevance information indicating whether or not the document data is a document desired by the user. The storage unit 130 can be realized by, for example, an HDD (Hard Disc Drive), an SSD (Solid State Drive), or a flash memory, but is not limited to these. The first document information may be data representing the document itself or identification information identifying the document. The first feature information indicates the technical features of each document data; examples include IPC, CPC, ECLA, ICO, USC, FI, and F-term, but it is not limited to these. The relevance information may be any information that indicates whether each document obtained by the search is a document desired by the user, as judged by the user who performed the search by visually inspecting the documents. For example, it may be information such as "relevant" or "noise", or information such as "neither" or "unknown". The relevance information may also be assigned by the document classification device. As described above, each piece of first document information is associated with at least one technical feature as the first feature information.

The acquisition unit 110 acquires second document information indicating other document data that is newly retrieved according to the search formula and to which second feature information indicating one or more technical features is assigned. The acquisition unit 110 may, for example, acquire the second document information by wired or wireless communication, may acquire other document data stored in advance in the storage unit 130, or may acquire other document data from another storage medium that stores it and is connected to the document classification device. The other storage medium is, for example, a portable storage medium such as a flash memory. The second document information may be data representing the document itself or identification information identifying the document. The second feature information indicates the technical features of the other document data; examples include IPC, CPC, ECLA, ICO, USC, FI, and F-term, but it is not limited to these. The acquisition unit 110 may acquire one or more pieces of other document data.

The extraction unit 121 extracts a predetermined number of document data from the plurality of document data based on the degree of coincidence between the second feature information and the first feature information. The extraction unit 121 can be realized by, for example, a processor that executes an extraction program stored in the storage unit 130. For example, the extraction unit 121 may extract, from the plurality of document data, the document data with a high degree of coincidence between the second feature information and the first feature information, or may extract the document data whose degree of coincidence is at or above a certain level.

The determination unit 122 determines whether or not the other document data is a document desired by the user based on the relevance information associated with each of the predetermined number of document data extracted by the extraction unit 121. The determination unit 122 can be realized by, for example, a processor that executes a determination program stored in the storage unit 130. For example, when much of the relevance information associated with the document data extracted by the extraction unit 121 indicates relevance, the determination unit 122 can determine that the other document data also corresponds to a document desired by the user. The condition "whether the document is desired by the user" may instead be a condition such as "whether the document matches a predetermined viewpoint" or "whether the document meets a predetermined condition".

The output unit 140 outputs the determination result of the determination unit 122. The output unit 140 only needs to be able to output the determination result to the outside. For example, the document classification device 100 may include a monitor or a speaker as an output device and output the determination result as image information on the monitor or as audio information from the speaker. Alternatively, an external device may be connected to the document classification device 100, and the output unit 140 may output the result by transmitting information indicating the determination result to the external device by wired or wireless communication.

The document classification device 100 is described in more detail below.

FIG. 2 is a block diagram showing a detailed configuration example of the document classification device 100. As shown in FIG. 2, the document classification device 100 includes an acquisition unit 110, a control unit 120, a storage unit 130, and an output unit 140. The document classification device 100 is a computer system that, when new document data is input, determines whether the new document data corresponds to a document desired by the user or is noise.

The acquisition unit 110 has a function of acquiring information indicating a patent document as new other document data to be classified by the document classification device 100. This information may be any information that indicates the patent document, such as identification information or the document itself. Information indicating a patent classification is assigned to it as the second feature information indicating the technical information of the patent document. As one example, the acquisition unit 110 is a communication interface that acquires unclassified other document data from an external device (not shown).

The control unit 120 is a processor that controls each unit of the document classification device 100 by executing various programs stored in the storage unit 130. The control unit 120 functions as the extraction unit 121 and the determination unit 122. To determine whether a document retrieved according to the search formula is a document desired by the user, the control unit 120 has a function of generating, as prior information, IPC relevance information used to judge whether each IPC code assigned to the documents counts as "noise" or "relevant". Through the functions of the extraction unit 121 and the determination unit 122, the control unit 120 also has a function of determining whether a patent document that is newly retrieved by the search formula and has not yet been classified as noise or relevant is a document desired by the user.

The extraction unit 121 extracts, from the past document list, which is a list of previously classified documents, the documents used to determine whether or not a new document is noise. Specifically, it extracts a predetermined number of documents from the top of the past document list sorted in descending order of IPC coincidence with the new document.
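The sort-and-take-top step can be sketched as below. The specification does not define here how the degree of IPC coincidence is computed, so the sketch assumes it is the number of IPC codes the two documents share; the dictionary keys `number`, `label`, and `ipc`, the placeholder publication numbers, and the function name `top_matches` are likewise hypothetical.

```python
def top_matches(new_codes, past_docs, n):
    """Return the n past documents whose IPC codes best match the new document.

    past_docs: list of dicts with keys "number", "label", and "ipc".
    The degree of coincidence is assumed to be the count of shared IPC codes.
    """
    new_set = set(new_codes)
    scored = [(len(new_set & set(doc["ipc"])), doc) for doc in past_docs]
    # list.sort is stable, so documents with equal scores keep their list order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:n]]


past_list = [
    {"number": "P1", "label": "noise", "ipc": ["H04L 12/00"]},
    {"number": "P2", "label": "relevant", "ipc": ["G06F 17/30", "G06N 5/04"]},
    {"number": "P3", "label": "relevant", "ipc": ["G06F 17/30"]},
]
for doc in top_matches(["G06F 17/30", "G06N 5/04"], past_list, n=2):
    print(doc["number"], doc["label"])
```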

The determination unit 122 determines whether or not the new document is a document desired by the user, that is, whether it is "noise" or "relevant", based on the classification ("noise" or "relevant") assigned as relevance information to the documents extracted by the extraction unit 121. The determination unit 122 assigns to the new document the classification held by the majority of the extracted documents.

The details of the processing by which the control unit 120 determines whether a new document is a document desired by the user, and of the processing by which it generates the IPC relevance information, are described later.

The storage unit 130 is a recording medium that stores the various data and programs the document classification device 100 needs in order to operate. The storage unit 130 is realized by, for example, an HDD, an SSD, or a flash memory, but is not limited to these. The storage unit 130 stores, for example, a program with which the control unit 120 generates the prior information on whether each IPC code is noise or relevant, and a program with which the control unit 120 determines, when a new document is input, whether that document is noise or relevant. The storage unit 130 also stores a past document list 300, which is a list of past patent documents each associated with relevance information indicating whether it corresponds to a document desired by the user; a new document list 350, which is a list of new documents acquired by the acquisition unit 110; and IPC relevance information 500, which is the prior information generated by the control unit 120. It further stores a coincidence table 400 generated when determining whether a new document is noise or relevant.

The output unit 140 is a communication interface that outputs, to an external device, information on the determination result of the control unit 120 for the new documents. Here, for example, it outputs the new document list with classifications assigned, in a form such as that shown in FIG. 3(a) (a form in which at least the publication number and the classification of each new document are associated).

This concludes the description of the configuration of the document classification device 100.

<Data>
The various data used in the document classification device 100 are described below.

FIG. 3(a) is a data conceptual diagram showing a configuration example of the past document list 300, which concerns the classified document data stored in the storage unit 130. The past document list 300 is information on documents retrieved in the past with a predetermined search formula and includes information indicating whether or not each document is a document desired by the user. As shown in FIG. 3(a), the past document list 300 associates three items with one another: a publication number 301, serving as the first document information indicating the plurality of document data retrieved according to the search formula; a classification 302, corresponding to the relevance information indicating whether or not the patent document in question describes the content the user was searching for; and an IPC classification 303, corresponding to the first feature information indicating one or more technical features assigned to each document data.

The publication number 301 is information that uniquely identifies a patent document, that is, classified document data retrieved according to the search formula. The publication number of the patent document to be classified is used here, but any identification information that uniquely identifies the document may be used instead.

The classification 302 is what may be called relevance information, indicating whether the corresponding patent document is one the user wants. It takes one of two values: "relevant" when the corresponding patent document is one the user wants, and "noise" when it is not.

The IPC classification 303 is information indicating the IPC codes assigned to the corresponding patent document. It indicates one or more technical features of that document, using the internationally unified classification of patent documents by technical content.

FIG. 3B is a data conceptual diagram showing a configuration example of the new document list 350, an example of the new document data acquired by the acquisition unit 110 of the document classification device 100. The new document list 350 is a list of the new document data, that is, the patent documents to be classified. It associates the publication number 351, corresponding to the information indicating other document data newly retrieved by the search, with the IPC classification 352, corresponding to the second feature information indicating one or more technical features.

The publication number 351 is information that uniquely identifies document data. The publication number of the patent document to be classified is used here, but any identification information that uniquely identifies the document may be used instead.

The IPC classification 352 is information indicating the IPC codes assigned to the corresponding patent document. It indicates one or more technical features of that document, using the internationally unified classification of patent documents by technical content.

FIG. 4 is a data conceptual diagram showing a configuration example of the degree-of-match table 400, in which one document in the new document list 350 is associated with its degree of technical-classification match against each document in the past document list 300, with the entries sorted in descending order of match. The degree-of-match table 400 is information that the document classification device 100 generates in the course of determining whether new document data is "noise" or "relevant".

The publication number 401 is information that uniquely identifies a patent document that is classified document data. The publication number of the patent document to be classified is used here, but any identification information that uniquely identifies the document may be used instead.

The classification 402 is what may be called relevance information, indicating whether the corresponding patent document is one the user wants. It takes one of two values: "relevant" when the corresponding patent document is one the user wants, and "noise" when it is not.

The IPC classification 403 is information indicating the IPC codes assigned to the corresponding patent document. It indicates one or more technical features of that document, using the internationally unified classification of patent documents by technical content.

The degree of match 404 indicates, for one new document in the new document list 350, how closely the IPC classification 352 assigned to that new document matches the IPC classification 303 associated with each document in the past document list 300.

A degree-of-match table 400 is generated for each document in the new document list 350. The determination unit 122 then uses the corresponding degree-of-match table 400 to determine whether each document is one the user wants.

FIG. 5 is a data conceptual diagram showing a configuration example of the IPC relevance information 500, which indicates, for each IPC classification, the probability of being noise or relevant. As shown in FIG. 5, the IPC relevance information 500 associates the IPC classification 501, the publication count 502, the noise count 503, the noise rate 504, and the noise determination 505 with each other.

The IPC classification 501 is information indicating the technical features of patent documents, using the internationally unified classification of patent documents by technical content. The IPC classification 501 is obtained by extracting the IPC codes assigned to the past documents in the past document list 300.

The publication count 502 is the total number of documents in the past document list 300 to which the corresponding IPC classification 501 is assigned.

The noise count 503 is the total number of documents that the user judged to be "noise" among the past documents to which the corresponding IPC classification 501 is assigned.

The noise rate 504 suggests the probability that document data carrying the corresponding IPC classification 501 is not a document the user wants. It is the corresponding noise count 503 divided by the corresponding publication count 502.

The noise determination 505 is a value for determining whether a document carrying the corresponding IPC classification 501 is noise or relevant. If the noise determination is "100", the document classification device 100 can determine that the document is "noise", that is, not one the user wants. If the noise determination is "0", the device can determine that the document is "relevant", that is, one the user wants. In the present embodiment, an IPC classification whose noise rate 504 is 95 or more is given a noise determination 505 of 100, and an IPC classification whose noise rate 504 is 10 or less is given a noise determination 505 of 0. The thresholds 95 and 10 are values set by the document classification device 100, and the settings may be changed as appropriate, for example by using an input device connected to the document classification device 100.
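The threshold rule in this paragraph can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the function name `noise_determination`, its parameters, and the choice to return the raw noise rate for IPC classifications that fall between the two thresholds are hypothetical. The default thresholds 95 and 10 are the ones given in the text.

```python
def noise_determination(noise_count, publication_count, high=95.0, low=10.0):
    """Map one IPC classification's counts to a noise determination value.

    Returns 100.0 ("noise"), 0.0 ("relevant"), or the raw noise rate in
    percent when the rate lies between the two thresholds (an assumption).
    """
    # Noise rate 504: noise count 503 divided by publication count 502,
    # expressed here as a percentage.
    noise_rate = 100.0 * noise_count / publication_count
    if noise_rate >= high:
        return 100.0
    if noise_rate <= low:
        return 0.0
    return noise_rate
```

Keeping `high` and `low` as parameters mirrors the statement that the thresholds may be changed as appropriate.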

The document classification device 100 can use the IPC relevance information 500 to identify, for particular IPC classifications, documents that are highly likely to be noise or relevant. For example, the document classification device 100 can identify a document carrying an IPC classification whose noise determination is 100 as one the user does not want. Conversely, it can identify a document carrying an IPC classification whose noise determination is 0 as one the user wants. The values shown in FIG. 5 are examples.

<Operation>
The operation of the document classification device 100 is described below. The flowchart spanning FIGS. 6 and 7 shows the advance preparation performed before the document classification device 100 classifies new documents. This processing is executed by the control unit 120 of the document classification device 100 and generates the IPC relevance information 500 shown in FIG. 5. Details follow.

(Step S601)
In step S601, the control unit 120 of the document classification device 100 sets the variable i used in the processing to 1. The variable i identifies which document in the past document list 300 is being processed. After setting i to 1, the process proceeds to step S602.

(Step S602)
In step S602, the control unit 120 determines whether all documents in the past document list 300 have been processed. This can be determined by whether the variable i matches the total number of documents in the past document list 300. If not all documents have been processed (NO), the process proceeds to step S603; if all documents have been processed (YES), the process proceeds to step S609.

(Step S603)
In step S603, the control unit 120 determines whether the classification of the publication in the i-th row of the past document list 300 is "noise" by referring to the corresponding classification 302. If the classification is "noise" (YES), the process proceeds to step S604; if it is not "noise", that is, if it is "relevant" (NO), the process proceeds to step S605.

(Step S604)
In step S604, the control unit 120 sets the count setting value C to 1 and proceeds to step S606.

(Step S605)
In step S605, the control unit 120 sets the count setting value C to 0 and proceeds to step S607.

(Step S606)
In step S606, the count setting value C set in step S604 or step S605 is added to the noise count of each IPC listed in the IPC classification 303 of the i-th document in the past document list 300. The noise count is a value that serves as an index for determining whether each IPC is noise. The process then proceeds to step S607.

(Step S607)
In step S607, the control unit 120 adds 1 to the total count of each IPC listed in the IPC classification of the i-th document in the past document list 300. The total count indicates how many documents carry that IPC. The process then proceeds to step S608.

(Step S608)
In step S608, the control unit 120 adds 1 to the variable i, uses the result as the next value of i, and returns to step S602.
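The counting loop of steps S601 to S608 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `count_ipcs`, the tuple layout of each past document, and the literal strings "noise" and "relevant" are assumptions made for the example, and each document is assumed to list a given IPC at most once.

```python
from collections import Counter


def count_ipcs(past_documents):
    """Tally, per IPC, the noise count and the total count.

    past_documents: iterable of (publication_number, classification, ipcs),
    a hypothetical in-memory form of the past document list 300.
    """
    noise_count = Counter()
    total_count = Counter()
    for _publication_number, classification, ipcs in past_documents:
        # S603 to S605: count setting value C is 1 for noise, 0 otherwise.
        c = 1 if classification == "noise" else 0
        for ipc in ipcs:
            noise_count[ipc] += c   # S606 (adding 0 leaves the count unchanged)
            total_count[ipc] += 1   # S607
    return noise_count, total_count
```

For instance, with the made-up rows `("JP-A", "noise", ["G06F17/30", "G06Q50/10"])` and `("JP-B", "relevant", ["G06F17/30"])`, the IPC `G06F17/30` ends up with a total count of 2 and a noise count of 1.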

(Step S609)
In step S609, the control unit 120 sets the threshold T to 2.5% of the total number of documents listed in the past document list 300 and proceeds to step S610. The threshold T is set for each IPC.

(Step S610)
In step S610, the control unit 120 determines whether the threshold T calculated in step S609 exceeds 50. If T exceeds 50 (YES), the process proceeds to step S611; if not (NO), T is left unchanged and the process proceeds to step S701 in FIG. 7.

(Step S611)
In step S611, the control unit 120 resets the threshold T to 50 and proceeds to step S701 in FIG. 7.
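Steps S609 to S611 amount to taking 2.5% of the number of past documents and capping the result at 50. A minimal sketch, assuming a hypothetical function name `ipc_threshold` and hypothetical parameter names:

```python
def ipc_threshold(total_documents, percent=2.5, cap=50.0):
    """Threshold T: a percentage of the past document count, capped at `cap`."""
    # S609: T is 2.5% of the total number of past documents.
    # S610/S611: if T exceeds 50, it is reset to 50.
    return min(total_documents * percent / 100.0, cap)
```

With 1000 past documents this gives T = 25, and with 4000 past documents the uncapped value of 100 is reduced to 50.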

(Step S701)
In step S701 shown in FIG. 7, the control unit 120 sets the variable j to 1 and proceeds to step S702. The variable j identifies which IPC is being processed.

(Step S702)
In step S702, the control unit 120 determines whether the variable j equals the total number of IPCs to be processed plus 1. If it does (YES), the process ends; if it does not (NO), the process proceeds to step S703.

(Step S703)
In step S703, the control unit 120 determines, for each IPC, whether its total count is less than the threshold T. If the total count is less than T (YES), the process proceeds to step S709; if it is not (NO), the process proceeds to step S704.

(Step S704)
In step S704, the control unit 120 calculates the noise rate of each IPC by dividing the noise count of that IPC by its total count. The noise count is the value accumulated in step S606 as steps S602 to S608 in FIG. 6 are repeated. The total count of the IPC is the value accumulated in step S607 as steps S602 to S608 are repeated. After calculating the noise rate, the process proceeds to step S705.

(Step S705)
In step S705, the control unit 120 determines, for each IPC, whether its noise rate is less than 10%. If it is less than 10% (YES), the process proceeds to step S706; if it is not less than 10%, that is, if it is 10% or more (NO), the process proceeds to step S707.

(Step S706)
In step S706, the control unit 120 sets the noise determination of an IPC whose noise rate was less than 10% to 0%. The process then proceeds to step S709.

(Step S707)
In step S707, the control unit 120 determines, for each IPC whose noise rate was not less than 10%, whether the noise rate is 95% or more. If the noise rate is 95% or more (YES), the process proceeds to step S708; if it is not (NO), the noise rate is kept at the value calculated in step S704 and the process proceeds to step S709.

(Step S708)
In step S708, the control unit 120 sets the noise determination of an IPC whose noise rate was 95% or more to 100%. The process then proceeds to step S709.

(Step S709)
In step S709, the control unit 120 adds 1 to j, uses the result as the new j, and returns to step S702.

By executing the above processing, the control unit 120 calculates the noise rate for each IPC, generates the IPC relevance information 500 shown in FIG. 5, and stores it in the storage unit 130.

Next, the operation by which the document classification device 100 determines whether newly input document data (a publication) is "noise" or "relevant" is described. The flowchart spanning FIGS. 8 to 10 shows this processing, which the extraction unit 121 and the determination unit 122 execute after the acquisition unit 110 obtains the new document list 350, a set of new documents. Details follow.

(Step S801)
In step S801, the determination unit 122 sets the variable l, which identifies the documents not yet judged, to 1. The process then proceeds to step S802.

(Step S802)
In step S802, the determination unit 122 determines whether any documents remain unjudged. This is determined by whether the variable l matches the number of documents in the new document list 350. If unjudged documents remain (YES), the process proceeds to step S803; if none remain (NO), the process ends.

(Step S803)
In step S803, the determination unit 122 extracts the IPCs of the l-th publication from the IPC classification 352 of the new document list 350. The extracted IPCs are managed individually. After extracting the IPCs, the process proceeds to step S804.

(Step S804)
In step S804, the determination unit 122 sets the variable m, which identifies the IPCs not yet processed, to 1. The maximum value of m corresponds to the total number of IPCs extracted in step S803. The process then proceeds to step S805.

(Step S805)
In step S805, the determination unit 122 determines whether the last IPC has been handled, that is, whether m equals the total number of IPCs plus 1. If it does (YES), the process proceeds to step S806; if it does not (NO), the process proceeds to step S812.

(Step S806)
In step S806, the determination unit 122 determines whether the count of IPCs that match the search-formula IPCs equals m. If it does (YES), the process proceeds to step S807; if it does not (NO), the process proceeds to step S901 in FIG. 9.

(Step S807)
In step S807, the determination unit 122 sets the noise determination of the target IPC to 0% and proceeds to step S811.

(Step S808)
In step S808, the determination unit 122 determines whether the noise determination is 0. If the noise determination is 0 (YES), the process proceeds to step S809; if it is not 0 (NO), the process proceeds to step S810. Although the determination here is based on whether the noise determination is 0, it may instead be based on whether the noise determination is 100. In that case, the process proceeds to step S810 when the noise determination is 100 and to step S809 when it is not.

(Step S809)
In step S809, the determination unit 122 determines that Pl, that is, the l-th document, is "relevant", that is, a document the user wants, and stores the result in association with the l-th document. The process then proceeds to step S811.

(Step S810)
In step S810, the determination unit 122 determines that Pl, that is, the l-th document, is "noise", that is, not a document the user wants, and stores the result in association with the l-th document. The process then proceeds to step S811.

(Step S811)
In step S811, the determination unit 122 adds 1 to the variable l, uses the result as the new l, and returns to step S802.

(Step S812)
In step S812, the determination unit 122 determines whether the noise rate of the IPC being processed is set to 0% or 100% in the IPC relevance information 500. If it is set to 0% or 100% (YES), the process proceeds to step S813; if it is not (NO), the process proceeds to step S814.

(Step S813)
In step S813, the determination unit 122 sets the noise determination of the new document to the noise determination value of the target IPC (that is, either 0% or 100%) and proceeds to step S808. When the new document carries an IPC classification that the IPC relevance information 500 marks as 100% noise or 100% relevant, this processing applies the classification indicated by the IPC relevance information 500 to the new document as it is.
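The shortcut formed by steps S812, S813, and S808 to S810 can be sketched as follows. This is an illustrative reading, not the implementation of the embodiment: the function name `shortcut_classification`, the dictionary form of the IPC relevance information 500, and the treatment of IPCs missing from that dictionary as undecided are assumptions.

```python
def shortcut_classification(new_ipcs, determination_by_ipc):
    """Classify a new document directly from a decisive IPC, if it has one.

    determination_by_ipc: hypothetical mapping from an IPC code to its
    noise determination 505 (0, 100, or an intermediate noise rate).
    Returns "relevant", "noise", or None when no IPC is decisive.
    """
    for ipc in new_ipcs:                      # loop over m (S805, S816)
        value = determination_by_ipc.get(ipc)
        if value == 0:                        # S812 -> S813 -> S808 -> S809
            return "relevant"
        if value == 100:                      # S812 -> S813 -> S808 -> S810
            return "noise"
    return None                               # fall through to the match-based path
```

Because the IPCs are examined in order, the first IPC with a determination of 0 or 100 decides the result in this sketch.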

(Step S814)
In step S814, the determination unit 122 determines whether the IPC being processed matches an IPC used in the search formula. If it matches (YES), the process proceeds to step S815; if it does not (NO), the process proceeds to step S816.

(Step S815)
In step S815, the determination unit 122 adds 1 to the count of IPCs that match the search-formula IPCs. The process then proceeds to step S816.

(Step S816)
In step S816, the determination unit 122 adds 1 to the variable m, uses the result as the new m, and returns to step S805.

(Step S901)
In step S901, the determination unit 122 sets the variable k to 1. The variable k identifies which past document in the past document list 300 is being processed. After setting k to 1, the process proceeds to step S902.

(Step S902)
In step S902, the determination unit 122 determines whether all past documents in the past document list 300 have been handled, by checking whether the variable k equals the total number of past documents in the past document list 300 plus 1. If the last past document in the past document list 300 has been reached (YES), the process proceeds to step S1001 in FIG. 10; if not (NO), the process proceeds to step S903.

(Step S903)
In step S903, the determination unit 122 extracts the IPCs of the k-th publication in the past document list 300, that is, each IPC in the IPC classification 303 of the k-th row of the past document list 300. The process then proceeds to step S904.

(Step S904)
In step S904, each IPC (h) of the h-th publication in the new document list 350 is extracted from the IPC classification 352 of the new document list 350. The process then proceeds to step S905.

(Step S905)
In step S905, the determination unit 122 sets the variable n to 1. The variable n identifies which of the IPCs assigned to the new document being processed is currently being handled. After setting n to 1, the process proceeds to step S906.

(Step S906)
In step S906, the determination unit 122 determines whether n has reached the end, that is, whether all IPCs assigned to the new document have been processed. If they have (YES), the process proceeds to step S912; if they have not (NO), the process proceeds to step S907.

(Step S907)
In step S907, the determination unit 122 determines whether IPC (h) n matches an IPC of the search formula. If it matches (YES), the process proceeds to step S908; if it does not (NO), the process proceeds to step S909.

(Step S908)
In step S908, the determination unit 122 subtracts 1 from the target IPC count and proceeds to step S911.

(Step S909)
In step S909, the determination unit 122 determines whether IPC (h) n matches IPC (k). That is, it determines whether the n-th IPC among those assigned to the h-th new document in the new document list 350 matches any of the IPCs assigned to the k-th past document in the past document list 300. If it matches (YES), the process proceeds to step S910; if it does not (NO), the process proceeds to step S911.

(Step S910)
In step S910, the determination unit 122 adds 1 to the IPC match count and then proceeds to step S911.

(Step S911)
In step S911, the determination unit 122 adds 1 to k, uses the result as the new k, and returns to step S902.

(Step S912)
In step S912, the determination unit 122 calculates the match rate between the IPCs of the h-th new document in the new document list 350 and the IPCs assigned to the k-th past document in the past document list 300, by dividing the IPC match count accumulated so far by the target IPC count. The process then proceeds to step S913.

(Step S913)
In step S913, the determination unit 122 adds 1 to the variable k, uses the result as the new k, and returns to step S902.
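One way to read the match rate computed across steps S905 to S912 is sketched below. It is an interpretation under stated assumptions rather than the implementation of the embodiment: the function name `ipc_match_rate` and its parameters are hypothetical, the target IPC count is assumed to start at the number of IPCs on the new document (so that subtracting 1 in step S908 simply excludes search-formula IPCs), and a rate of 0.0 is returned when no target IPC remains.

```python
def ipc_match_rate(new_ipcs, past_ipcs, search_ipcs=()):
    """Match rate between one new document and one past document."""
    past = set(past_ipcs)
    search = set(search_ipcs)
    target_count = 0   # IPCs of the new document that are compared
    match_count = 0    # IPC match count
    for ipc in new_ipcs:
        if ipc in search:
            # S907/S908: an IPC that appears in the search formula is
            # removed from the target IPC count and not compared.
            continue
        target_count += 1
        if ipc in past:
            match_count += 1   # S909/S910
    # S912: IPC match count divided by target IPC count.
    return match_count / target_count if target_count else 0.0
```

Excluding the search-formula IPCs keeps codes that every retrieved document shares from inflating the rate.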

(Step S1001)
In step S1001, the determination unit 122 sorts, in descending order, the per-document match rates between the IPCs assigned to the new document in the new document list 350 and the IPCs assigned to each past document in the past document list 300. The process then proceeds to step S1002.

(Step S1002)
In step S1002, the extraction unit 121 sets the variable q to 1. The variable q specifies how many past documents are extracted, starting from those with the highest degree of IPC coincidence. After q is set to 1, the process proceeds to step S1003.

(Step S1003)
In step S1003, the extraction unit 121 determines whether q has reached 8. If q is 8 (YES), the process proceeds to step S1004; if not (NO), the process proceeds to step S1009.

(Step S1004)
In step S1004, the determination unit 122 calculates an index t, which is used to judge whether the newly retrieved patent document in question is a document desired by the user, by dividing the noise count by the comparative publication count. The noise count is the number accumulated in step S1011; it indicates how many of the past documents extracted, in a predetermined number, from the top of the ranking by patent classification match rate are noise. The comparative publication count is the number accumulated in step S1012 and equals the maximum value of q; that is, it is the number of publications extracted. Once t has been calculated, the process proceeds to step S1005.

(Step S1005)
In step S1005, the determination unit 122 determines whether t calculated in step S1004 exceeds a predetermined threshold α. If t exceeds the threshold α (YES), the process proceeds to step S1006; if it does not (NO), the process proceeds to step S1007.

(Step S1006)
In step S1006, the determination unit 122 assigns to the new publication in question information indicating that it is a document desired by the user (classifies it as applicable). The process then proceeds to step S1008.

(Step S1007)
In step S1007, the determination unit 122 treats the new publication in question as not being a document desired by the user and assigns to it information indicating that it is noise (classifies it as noise). The process then proceeds to step S1008.

(Step S1008)
In step S1008, the determination unit 122 sets the value obtained by adding 1 to the variable l as the new l, and proceeds to step S802 in FIG. 8.

(Step S1009)
In step S1009, the determination unit 122 determines whether the number of documents to be processed has reached the total number of documents in the past document list 300 plus one. This check handles the case where the past document list 300 does not contain q documents. If the number of documents to be processed has reached the total number in the past document list 300 plus one (YES), the process proceeds to step S1004; if it has not (NO), the process proceeds to step S1010.

(Step S1010)
In step S1010, the determination unit 122 determines whether the classification 302 of the q-th publication in the past document list is "noise". If it is noise (YES), the process proceeds to step S1011; if it is not noise (NO), the process proceeds to step S1012.

(Step S1011)
In step S1011, the determination unit 122 adds 1 to the noise count, and proceeds to step S1012.

(Step S1012)
In step S1012, the determination unit 122 adds 1 to the comparative publication count, and proceeds to step S1013.

(Step S1013)
In step S1013, the determination unit 122 sets the value obtained by adding 1 to the variable q as the new q, and proceeds to step S1003.

By executing the processing shown in FIGS. 8 to 10, the document classification device 100 can determine, for every new document included in the new document list 350, whether that document is noise.
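The ranking and counting in steps S1001 to S1013 can be pictured with the short sketch below, which stops at the index t and leaves the comparison against the threshold α to the caller. The function name `noise_ratio`, the tuple layout, and the label strings are hypothetical. The default of seven documents follows from the loop ending when q reaches 8 after starting at 1.

```python
def noise_ratio(rated_past_docs, q_max=7):
    """Compute t = noise count / comparative publication count.

    rated_past_docs: list of (match_rate, label) pairs, where label is
    "noise" or "applicable" (the classification 302 of a past document).
    """
    # S1001: sort the past documents by match rate in descending order.
    ranked = sorted(rated_past_docs, key=lambda doc: doc[0], reverse=True)
    noise_count = 0
    compared_count = 0
    # S1003/S1009: stop after q_max documents, or earlier if the list is shorter.
    for _, label in ranked[:q_max]:
        if label == "noise":   # S1010 -> S1011
            noise_count += 1
        compared_count += 1    # S1012
    # S1004: None is returned when there is nothing to compare (a choice of this sketch).
    return noise_count / compared_count if compared_count else None


docs = [(0.9, "noise"), (0.5, "applicable"), (0.8, "noise"), (0.1, "applicable")]
t = noise_ratio(docs, q_max=3)
```

Slicing the sorted list covers the short-list case that step S1009 guards against, since a slice never runs past the end of the list.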

The above is the description of the operation of the document classification device 100.

<Summary>
The document classification device according to the above embodiment relies on the patent classifications originally assigned to patent publications. Documents obtained in advance by a search formula are each given classification information, "noise" or "applicable", indicating whether the document is a desired one. When a new patent publication is input, a predetermined number of documents are extracted based on the degree of coincidence between the patent classifications assigned to the new patent publication and those of the already classified patent publications. Depending on whether "noise" or "applicable" is the more frequent classification among the extracted documents, the new patent publication can be classified as "noise" or "applicable" without examining its contents. As a result, even for documents retrieved by a search formula that the user set, the user no longer needs to check the contents of documents judged to be noise, so the time required for document screening can be shortened. In addition, the document classification device does not need to examine the text of the publication (it does not need to perform morphological analysis or evaluate the match rate of the very large number of words that such analysis extracts), so the processing load on the processor can be made smaller than with the classification devices described in Patent Documents 1 to 3.

<Supplement>
It goes without saying that the document classification device is not limited to the above embodiment and may be realized by other methods. Various modifications are described below.

(1) Although not specifically described in the above embodiment, the number of documents q extracted by the extraction unit 121 is preferably odd, because an odd number guarantees that either "noise" or "applicable" can be determined. The variable q may be calculated using the so-called k-nearest neighbor method.

When q is set to an even number, the numbers of "noise" and "applicable" documents may turn out to be equal. The document classification device may therefore classify documents using the following method. The base value of a "noise" document is set to "−1" and the base value of an "applicable" document to "+1". The base value multiplied by the degree of coincidence as a weight becomes the classification value of that document. The determination unit 122 then sums the classification values of the documents extracted by the extraction unit 121, and may classify the new document as "applicable" if the sum is positive and as "noise" if it is negative. With this method, the processing load of deciding between "noise" and "applicable" is larger than with the processing of the above embodiment, but the decision can be made more accurately. In other words, the document classification device 100 may perform classification after applying a correction by weighting. Although the degree of coincidence itself is used as the weight here, this is not a limitation, and any value may be used as the weight.
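As an illustration of the weighted method just described, the following sketch sums the base values −1 and +1 weighted by the degree of coincidence. The function name `weighted_vote` and the label strings are hypothetical, and returning `None` when the sum is exactly zero is an assumption, since the text only covers positive and negative sums.

```python
def weighted_vote(neighbors):
    """Classify a new document from (match_rate, label) pairs of extracted past documents."""
    base = {"noise": -1.0, "applicable": 1.0}  # base values given in the text
    # Classification value of each document = base value * degree of coincidence.
    total = sum(base[label] * rate for rate, label in neighbors)
    if total > 0:
        return "applicable"
    if total < 0:
        return "noise"
    return None  # assumption: a zero sum is left undecided


# One strongly matching noise document outweighs two weakly matching applicable ones.
result = weighted_vote([(0.9, "noise"), (0.4, "applicable"), (0.4, "applicable")])
```

A plain majority over the same three documents would pick "applicable", which is the difference the weighting is meant to capture.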

(2) In the above embodiment, the classification held by the majority of the extracted documents is adopted as the classification of the new document, but this is not a limitation. For example, the device may be configured to judge the new document to be "noise" if at least a predetermined number of the q documents extracted by the extraction unit 121 are classified as "noise". For example, with 10 extracted documents, the new document may be classified as "noise" if 8 or more of them are classified as "noise".
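A minimal sketch of this count-based variant is shown below. The function name `classify_by_noise_count` is hypothetical, and returning "applicable" when the count falls short is an assumption, because the text only states the condition under which the new document becomes "noise".

```python
def classify_by_noise_count(labels, min_noise=8):
    """Judge a new document as noise when enough extracted documents are noise."""
    noise = sum(1 for label in labels if label == "noise")
    # Assumption: anything below the required count is treated as "applicable".
    return "noise" if noise >= min_noise else "applicable"


extracted = ["noise"] * 8 + ["applicable"] * 2  # 10 extracted documents
verdict = classify_by_noise_count(extracted)
```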

(3) In the above embodiment, when determining whether each patent classification, which is a technical feature, is noise or applicable, a classification whose noise rate is less than 10% is treated as noise determination 0%, and a classification whose IPC noise rate is 95% or more is treated as noise determination 100%. The 10% threshold can be regarded as the second threshold for determining whether a document to which the corresponding classification is assigned is a document desired by the user. In other words, the determination in step S705 can also be described as a determination of whether the applicable rate is 90% or more. The same holds for the first threshold used in the determination in step S707.

That is, the document classification device 100 makes one determination based on whether the applicable rate, which indicates how often documents carrying a given patent classification are applicable, is at least the second threshold of 90%, and another based on whether the non-applicable rate is at least the first threshold of 95%. Providing a difference between the first threshold and the second threshold makes it possible to always assign a classification to either noise or applicable. The first and second thresholds may also be varied depending on whether priority is given to identifying noise or to identifying applicable documents. For this purpose, the document classification device 100 may include a setting unit for setting the first and second thresholds. The values input to the setting unit may be set to appropriate values by the document classification device 100 through learning, or may be set by the user of the document classification device 100. The threshold percentages used for these determinations are not limited to the values shown in the above embodiment, and the operator of the document classification device 100 can change them as appropriate.

(4) In the above embodiment, a document to which a classification with an IPC noise rate of 100% is assigned is classified as noise, and a document to which a classification with an IPC noise rate of 0% is assigned is classified as applicable. In some cases, however, a document may carry both a classification with a noise rate of 100% and a classification with a noise rate of 0%. In such a case, the document classification device 100 may judge the document to be either "noise" or "applicable" according to a criterion set in advance by the user. For example, the document may be judged to be "noise" when the setting gives priority to "noise", and "applicable" when the setting gives priority to "applicable".
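Modifications (3) and (4) can be read together as in the sketch below, which rounds per-classification noise rates to 100% or 0% using the 95% and 10% thresholds and then resolves a conflict with a priority setting. The function names, the `prefer` parameter, the data layout, and the sample codes "A", "B", "C" are hypothetical. Returning `None` for a document with no decisive classification is an assumption; in the embodiment such documents would go on to the comparison with past documents.

```python
from collections import defaultdict


def ipc_noise_judgement(past_docs, first_threshold=0.95, second_threshold=0.10):
    """Map each decisive IPC code to 1.0 (noise) or 0.0 (applicable).

    past_docs: iterable of (ipc_codes, label), label "noise" or "applicable".
    """
    totals = defaultdict(int)
    noise = defaultdict(int)
    for ipcs, label in past_docs:
        for ipc in set(ipcs):
            totals[ipc] += 1
            if label == "noise":
                noise[ipc] += 1
    verdict = {}
    for ipc, n in totals.items():
        rate = noise[ipc] / n
        if rate >= first_threshold:    # noise rate of 95% or more -> treated as 100%
            verdict[ipc] = 1.0
        elif rate < second_threshold:  # noise rate below 10% -> treated as 0%
            verdict[ipc] = 0.0
    return verdict


def classify_by_ipc(new_ipcs, verdict, prefer="noise"):
    """Classify a new document from its decisive IPC codes, if it has any."""
    has_noise = any(verdict.get(ipc) == 1.0 for ipc in new_ipcs)
    has_applicable = any(verdict.get(ipc) == 0.0 for ipc in new_ipcs)
    if has_noise and has_applicable:
        return prefer  # conflict resolved by the user's priority setting
    if has_noise:
        return "noise"
    if has_applicable:
        return "applicable"
    return None  # assumption: undecided here


past = [(["A"], "noise"), (["A", "B"], "noise"), (["B"], "applicable"), (["C"], "applicable")]
table = ipc_noise_judgement(past)
label = classify_by_ipc(["A", "C"], table, prefer="applicable")
```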

(5) In the above embodiment, the document classification device classifies new document data by having a processor, which functions as each functional unit of the document classification device 100, execute a document classification program or the like. This may instead be realized by logic circuits (hardware) or dedicated circuits formed in an integrated circuit (an IC (Integrated Circuit) chip or an LSI (Large Scale Integration)) or the like in the device. These circuits may be realized by one or more integrated circuits, and the functions of the plurality of functional units described in the above embodiment may be realized by a single integrated circuit. Depending on the degree of integration, an LSI may be called a VLSI, a super LSI, an ultra LSI, or the like. That is, as shown in FIG. 11, each functional unit of the document classification device 100 may be realized by a physical circuit. As shown in FIG. 11, the document classification device 100 includes a storage circuit 130a, an acquisition circuit 110a, an extraction circuit 121a, a determination circuit 122a, and an output circuit 140a, and each circuit has the same function as the functional unit of the same name described above.

The document classification program may be recorded on a processor-readable recording medium. As the recording medium, a "non-transitory tangible medium" such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used. The document classification program may also be supplied to the processor via any transmission medium capable of transmitting it (such as a communication network or a broadcast wave). The present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the document classification program is embodied by electronic transmission.

The document classification program can be implemented using, for example, a script language such as ActionScript or JavaScript (registered trademark), an object-oriented programming language such as Objective-C or Java (registered trademark), or a markup language such as HTML5.

(6) The configurations described in the above embodiment and in each supplementary item may be combined as appropriate.

100 Document classification device
110 Acquisition unit
121 Extraction unit
122 Determination unit
140 Output unit

Claims

A storage unit that stores, in association with each other, first document information indicating a plurality of document data searched according to a search formula, first feature information assigned to the document data and indicating one or more technical features of the document data, and corresponding information indicating whether or not the document data is a document desired by the user;
An acquisition unit that acquires second document information indicating other document data that is newly searched according to the search formula and is given second feature information indicating one or more technical features;
An extraction unit that extracts a predetermined number of document data from the plurality of document data based on the degree of coincidence between the second feature information and the first feature information;
A determination unit that determines whether the other document data is a document desired by the user based on the corresponding information associated with each of the predetermined number of document data extracted by the extraction unit;
A document classification apparatus comprising: an output unit that outputs a determination result of the determination unit.

The document classification device according to claim 1, wherein the extraction unit extracts the predetermined number of document data from the plurality of document data in descending order of the degree of coincidence between the second feature information and the first feature information, and
the determination unit determines that the other document data is a document desired by the user when, among the corresponding information associated with the documents extracted by the extraction unit, the number indicating a document desired by the user exceeds a threshold, and determines that the other document data is not a document desired by the user when the number indicating a document not desired by the user exceeds a threshold.

The document classification device according to claim 1, wherein the determination unit weights the corresponding information associated with each document extracted by the extraction unit according to the degree of coincidence, and determines, based on the weighted corresponding information, whether or not the other document data is a document desired by the user.

The document classification device according to claim 1, wherein the determination unit determines that other document data having second feature information that matches first feature information is a document not desired by the user when a non-applicable rate for that first feature information exceeds a first threshold, the non-applicable rate indicating the ratio of document data whose corresponding information indicates a document not desired by the user to all document data associated with that first feature information.

The document classification device according to claim 4, wherein the determination unit determines that other document data having second feature information that matches first feature information is a document desired by the user when an applicable rate for that first feature information exceeds a second threshold, the applicable rate indicating the ratio of document data whose corresponding information indicates a document desired by the user to all document data associated with that first feature information.

The document classification apparatus according to claim 5, wherein the first threshold is larger than the second threshold.

The document classification device according to any one of claims 1 to 6, wherein, when feature information is used in the search formula, the determination unit performs the determination based on the first feature information and the second feature information from which that feature information has been excluded.

A storage step of storing, in association with each other, first document information indicating a plurality of document data searched according to a search formula, first feature information assigned to the document data and indicating one or more technical features of the document data, and corresponding information indicating whether or not the document data is a document desired by the user;
An acquisition step of acquiring second document information indicating other document data newly searched according to the search formula and provided with second feature information indicating one or more technical features;
An extraction step of extracting a predetermined number of document data from the plurality of document data based on the degree of coincidence between the second feature information and the first feature information;
A determination step of determining whether the other document data is a document desired for the user based on the corresponding information associated with each of the predetermined number of document data extracted in the extraction step;
A document classification method including an output step of outputting a determination result in the determination step.

A document classification program for causing a computer to realize:
A storage function of storing, in association with each other, first document information indicating a plurality of document data searched according to a search formula, first feature information assigned to the document data and indicating one or more technical features of the document data, and corresponding information indicating whether or not the document data is a document desired by the user;
An acquisition function of acquiring second document information indicating other document data that is newly searched according to the search formula and to which second feature information indicating one or more technical features is assigned;
An extraction function of extracting a predetermined number of document data from the plurality of document data based on the degree of coincidence between the second feature information and the first feature information;
A determination function of determining whether the other document data is a document desired by the user based on the corresponding information associated with each of the predetermined number of document data extracted by the extraction function; and
An output function of outputting a determination result of the determination function.