JP6735247B2

JP6735247B2 - Document classification device, document classification method, and document classification program

Info

Publication number: JP6735247B2
Application number: JP2017065917A
Authority: JP
Inventors: 大樹清水
Original assignee: Toyota Technical Development Corp
Current assignee: Toyota Technical Development Corp
Priority date: 2017-03-29
Filing date: 2017-03-29
Publication date: 2020-08-05
Anticipated expiration: 2037-03-29
Also published as: JP2018169753A

Description

The present invention relates to a document classification device, a document classification method, and a document classification program for classifying documents.

In recent years, engineers often read a variety of patent documents relevant to their own purposes every year in order to keep up with the latest technological trends. Although an engineer narrows down the documents with a search formula, the number of documents obtained is often enormous, and reading all of them is not realistic. The user therefore sometimes has to sort the documents that should be read from those that need not be read, that is, perform screening.

Various techniques exist to assist such screening. For example, a technique has been disclosed in which a seed document representing the content to be searched for is added to the document set obtained as the result of a search by patent classification, clustering is performed based on the similarity of document contents, and the patent documents displayed in clusters are screened in order (see, for example, Patent Document 1). To calculate the similarity, morphological analysis is performed on each document to break it down into terms, and clustering is performed by calculating the similarity between the terms.

In addition, to make screening after a document search more efficient, a method has been disclosed that automatically pages through the content representing each document, document by document (see, for example, Patent Document 2).

Furthermore, there is a technique that classifies documents automatically. The similarity between a document to be classified and a set of documents to which classifications have already been assigned is calculated based on keywords in the documents, and a specified number of documents most similar to the input document are extracted. A score is then calculated for each classification of the extracted documents based on the number of classifications, taking the similarity into account, and the classifications whose score exceeds a specified value are extracted and assigned to the document to be classified (see, for example, Patent Document 3).

JP 2003-157270 A
JP 2004-234430 A
JP 2007-323454 A

In the case of Patent Document 2, although the documents are paged through automatically, the number of documents the user must read does not decrease, and a person still performs the screening. The processing load on that person therefore remains large.

In the case of Patent Documents 1 and 3, morphological analysis is performed on each document and a similarity is then calculated for each of the extracted terms. The morphological analysis and the calculation of similarities for an enormous number of terms therefore place an enormous processing load on the processor.

The present invention has been made in view of the above problems, and its object is to provide a document classification device, a document classification method, and a document classification program that place a smaller processing load on people and processors than the techniques of Patent Documents 1 to 3.

To solve the above problems, a document classification device according to one aspect of the present invention includes: a storage unit that stores, in association with one another, first document information indicating a plurality of document data retrieved according to a search formula, first feature information assigned to the document data and indicating one or more technical features of the document data, and relevance information indicating whether the document data is a document desired by a user; an acquisition unit that acquires second document information indicating other document data that is newly retrieved according to the search formula and to which second feature information indicating one or more technical features is assigned; an extraction unit that extracts a predetermined number of document data from the plurality of document data based on the degree of matching between the second feature information and the first feature information; a determination unit that determines whether the other document data is a document desired by the user, based on the relevance information associated with each of the predetermined number of document data extracted by the extraction unit; and an output unit that outputs the determination result of the determination unit.

A document classification method according to one aspect of the present invention includes: a storage step of storing, in association with one another, first document information indicating a plurality of document data retrieved according to a search formula, first feature information assigned to the document data and indicating one or more technical features of the document data, and relevance information indicating whether the document data is a document desired by a user; an acquisition step of acquiring second document information indicating other document data that is newly retrieved according to the search formula and to which second feature information indicating one or more technical features is assigned; an extraction step of extracting a predetermined number of document data from the plurality of document data based on the degree of matching between the second feature information and the first feature information; a determination step of determining whether the other document data is a document desired by the user, based on the relevance information associated with each of the predetermined number of document data extracted in the extraction step; and an output step of outputting the determination result of the determination step.

A document classification program according to one aspect of the present invention causes a computer to realize: a storage function of storing, in association with one another, first document information indicating a plurality of document data retrieved according to a search formula, first feature information assigned to the document data and indicating one or more technical features of the document data, and relevance information indicating whether the document data is a document desired by a user; an acquisition function of acquiring second document information indicating other document data that is newly retrieved according to the search formula and to which second feature information indicating one or more technical features is assigned; an extraction function of extracting a predetermined number of document data from the plurality of document data based on the degree of matching between the second feature information and the first feature information; a determination function of determining whether the other document data is a document desired by the user, based on the relevance information associated with each of the predetermined number of document data extracted by the extraction function; and an output function of outputting the determination result of the determination function.

In the above document classification device, the extraction unit may extract, from the plurality of document data, a predetermined number of document data in descending order of the degree of matching between the second feature information and the first feature information. The determination unit may then determine that the other document data is a document desired by the user when, among the relevance information associated with the extracted documents, the number of items indicating a document desired by the user exceeds a threshold, and may determine that the other document data is not a document desired by the user when the number of items indicating a document not desired by the user exceeds the threshold.
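The count-based determination described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `judge_by_counts`, the label strings "relevant" and "noise", and the "undetermined" fallback are hypothetical, since the text does not say what happens when neither count exceeds the threshold.

```python
from typing import Sequence


def judge_by_counts(labels: Sequence[str], threshold: int) -> str:
    """Judge a new document from the labels of the extracted past documents.

    labels: relevance labels ("relevant" or "noise") of the extracted documents.
    threshold: count that one label must exceed (hypothetical parameter).
    """
    relevant = sum(1 for label in labels if label == "relevant")
    noise = sum(1 for label in labels if label == "noise")
    if relevant > threshold and noise <= threshold:
        return "relevant"
    if noise > threshold and relevant <= threshold:
        return "noise"
    # Assumption: the source does not specify this case.
    return "undetermined"
```

With five extracted documents and a threshold of 2, for example, three "relevant" labels are enough to judge the new document as desired by the user.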

In the above document classification device, the determination unit may weight the relevance information associated with the documents extracted by the extraction unit according to the degree of matching, and may determine whether the other document data is a document desired by the user based on the weighted relevance information.
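One way to read the weighting described above is a weighted vote. The sketch below assumes that the weight is simply the degree of matching itself and that ties are left undetermined. Neither choice is stated in the text, and the name `judge_weighted` is hypothetical.

```python
from typing import Sequence, Tuple


def judge_weighted(extracted: Sequence[Tuple[float, str]]) -> str:
    """Weighted vote over (match_degree, label) pairs of extracted documents.

    Assumption: each label counts with a weight equal to its match degree.
    """
    score = {"relevant": 0.0, "noise": 0.0}
    for degree, label in extracted:
        score[label] += degree
    if score["relevant"] > score["noise"]:
        return "relevant"
    if score["noise"] > score["relevant"]:
        return "noise"
    # Assumption: the source does not specify how ties are resolved.
    return "undetermined"
```

Under this reading, one closely matching "relevant" document can outweigh two loosely matching "noise" documents, which a plain count would not allow.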

In the above document classification device, the determination unit may determine that other document data is a document not desired by the user when its second feature information matches first feature information whose non-relevance rate exceeds a first threshold. The non-relevance rate is the ratio of the document data whose relevance information indicates a document not desired by the user to all the document data associated with that first feature information.

In the above document classification device, the determination unit may determine that other document data is a document desired by the user when its second feature information matches first feature information whose relevance rate exceeds a second threshold. The relevance rate is the ratio of the document data whose relevance information indicates a document desired by the user to all the document data associated with that first feature information.

In the above document classification device, the first threshold may be larger than the second threshold.
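The non-relevance rate and relevance rate can be computed per feature (for example, per IPC code) from the already classified documents. The following is a minimal sketch, not the patented implementation: the function name `feature_verdicts`, the label strings, and the default thresholds 0.9 and 0.7 are hypothetical values chosen only so that the first threshold is larger than the second.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def feature_verdicts(
    past: Iterable[Tuple[str, List[str]]],
    first_threshold: float = 0.9,   # hypothetical value for the non-relevance rate
    second_threshold: float = 0.7,  # hypothetical value for the relevance rate
) -> Dict[str, str]:
    """Map each feature code to "noise" or "relevant" when its rate exceeds a threshold.

    past: (label, feature codes) pairs of already classified documents.
    """
    total: Dict[str, int] = defaultdict(int)
    noise: Dict[str, int] = defaultdict(int)
    relevant: Dict[str, int] = defaultdict(int)
    for label, codes in past:
        for code in set(codes):  # count each code once per document
            total[code] += 1
            if label == "noise":
                noise[code] += 1
            elif label == "relevant":
                relevant[code] += 1
    verdicts: Dict[str, str] = {}
    for code, n in total.items():
        if noise[code] / n > first_threshold:
            verdicts[code] = "noise"
        elif relevant[code] / n > second_threshold:
            verdicts[code] = "relevant"
    return verdicts
```

A new document carrying a code mapped to "noise" would then be judged as not desired by the user. How to resolve a new document that carries both a "noise" code and a "relevant" code is not specified in this passage, so the sketch stops at the per-feature table.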

In the above document classification device, when feature information is used in the search formula, the determination unit may make the determination based on the second feature information and the first feature information excluding the feature information used in the search formula.
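Excluding the features used in the search formula can be illustrated as below. The sketch assumes that the degree of matching is the number of shared feature codes, which this passage does not define, and the function name `match_degree_excluding` is hypothetical. The presumable motivation is that a code appearing in the search formula is shared by every retrieved document and so does not help distinguish them.

```python
from typing import Iterable


def match_degree_excluding(
    first_features: Iterable[str],
    second_features: Iterable[str],
    search_features: Iterable[str],
) -> int:
    """Count shared feature codes, ignoring codes used in the search formula.

    Assumption: the degree of matching is the number of shared codes.
    """
    remaining = set(first_features) - set(search_features)
    return len(remaining & set(second_features))
```

For a past document and a new document that share two codes, one of which was used in the search formula, this returns 1 rather than 2.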

The document classification device according to one aspect of the present invention can determine whether newly retrieved document data is a document desired by the user, based on the degree of matching with the feature information, which indicates technical information, assigned to documents previously retrieved with the same search formula, and on the relevance information indicating whether each of those documents is a document desired by the user. The document classification device can therefore screen out documents that the user does not need to read, which reduces the number of documents the user must read and thus the processing load on the user. Moreover, because the document classification device can screen documents without examining their content, the processing load on the document classification device itself is also reduced.

FIG. 1 is a block diagram showing a configuration example of the document classification device.
FIG. 2 is a block diagram showing a detailed configuration example of the document classification device.
FIG. 3(a) is a data conceptual diagram showing an example of a past list, which is a list of classified documents. FIG. 3(b) is a data conceptual diagram showing an example of a new document list, which is a list of new documents.
FIG. 4 is a data conceptual diagram showing an example of a matching degree table in which past documents are sorted by their degree of matching with a new document.
FIG. 5 is a data conceptual diagram showing an example of IPC relevance information, which indicates for each classification whether it is noise or relevant.
FIG. 6 is a flowchart showing the operation of the document classification device in the advance preparation process.
FIG. 7 is a flowchart showing the process following FIG. 6.
FIG. 8 is a flowchart showing the operation of the document classification device when classifying a new document.
FIG. 9 is a flowchart showing the process following FIG. 8.
FIG. 10 is a flowchart showing the process following FIG. 9.
FIG. 11 is a block diagram showing a configuration example of the document classification device.

Hereinafter, a document classification device according to one embodiment of the present invention will be described in detail with reference to the drawings.

<Embodiment>
<Configuration>
FIG. 1 is a block diagram showing a configuration example of the document classification device. As shown in FIG. 1, the document classification device includes a storage unit 130, an acquisition unit 110, an extraction unit 121, a determination unit 122, and an output unit 140.

The storage unit 130 stores, in association with one another, first document information indicating a plurality of document data retrieved according to a search formula, first feature information assigned to the document data and indicating one or more technical features of the document data, and relevance information indicating whether the document data is a document desired by the user. The storage unit 130 can be realized by, for example, an HDD (hard disk drive), an SSD (solid state drive), or a flash memory, but is not limited to these. The first document information may be data representing the document itself or identification information identifying the document. The first feature information indicates the technical features of each document data. Examples include IPC, CPC, ECLA, ICO, USC, FI, and F-term classifications, but the first feature information is not limited to these. The relevance information is information indicating whether each document obtained by the search is a document desired by the user, as judged by the user who performed the search and visually inspected the documents. It may be, for example, information such as "relevant" or "noise", or information such as "neither" or "unknown". The relevance information may also be assigned by the document classification device. As described above, at least one technical feature is associated with each first document information as the first feature information.

The acquisition unit 110 acquires second document information indicating other document data that is newly retrieved according to the search formula and to which second feature information indicating one or more technical features is assigned. The acquisition unit 110 may acquire the second document information by, for example, wired or wireless communication, may acquire other document data stored in advance in the storage unit 130, or may acquire the other document data from another storage medium that stores it and is connected to the document classification device. The other storage medium is, for example, a portable storage medium such as a flash memory. The second document information may be data representing the document itself or identification information identifying the document. The second feature information indicates the technical features of the other document data. Examples include IPC, CPC, ECLA, ICO, USC, FI, and F-term classifications, but the second feature information is not limited to these. The acquisition unit 110 may acquire either a single item or multiple items of other document data.

The extraction unit 121 extracts a predetermined number of document data from the plurality of document data based on the degree of matching between the second feature information and the first feature information. The extraction unit 121 can be realized by, for example, a processor that executes an extraction program stored in the storage unit 130. For example, the extraction unit 121 may extract, from the plurality of document data, the document data with a high degree of matching between the second feature information and the first feature information, or may extract the document data whose degree of matching is at or above a certain level.
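The two extraction strategies mentioned above, taking the highest-matching documents or taking every document at or above a certain degree, might look like the following sketch. It assumes that the degree of matching is the number of shared classification codes, which this passage does not define. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# A past document is modeled as (identifier, feature codes); this shape is an assumption.
PastDoc = Tuple[str, Sequence[str]]


def match_degree(first: Sequence[str], second: Sequence[str]) -> int:
    """Assumed degree of matching: number of feature codes shared by both documents."""
    return len(set(first) & set(second))


def extract_top_n(past: Sequence[PastDoc], new_codes: Sequence[str], n: int) -> List[str]:
    """Return identifiers of the n past documents with the highest match degree."""
    ranked = sorted(past, key=lambda doc: match_degree(doc[1], new_codes), reverse=True)
    return [doc[0] for doc in ranked[:n]]


def extract_min_degree(
    past: Sequence[PastDoc], new_codes: Sequence[str], min_degree: int
) -> List[str]:
    """Return identifiers of past documents whose match degree is at least min_degree."""
    return [doc[0] for doc in past if match_degree(doc[1], new_codes) >= min_degree]
```

Because Python's `sorted` is stable, past documents with equal degrees keep their original order in `extract_top_n`. How ties should be broken is not addressed in the text.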

The determination unit 122 determines whether the other document data is a document desired by the user, based on the relevance information associated with each of the predetermined number of document data extracted by the extraction unit 121. The determination unit 122 can be realized by, for example, a processor that executes a determination program stored in the storage unit 130. For example, when most of the relevance information associated with the document data extracted by the extraction unit 121 indicates "relevant", the determination unit 122 can determine that the other document data is also a document desired by the user. The condition "whether the document is desired by the user" may instead be a condition such as "whether the document matches a predetermined viewpoint" or "whether the document satisfies a predetermined condition".

The output unit 140 outputs the determination result of the determination unit 122. The output unit 140 only needs to be able to output the determination result to the outside. For example, the document classification device 100 may include a monitor or a speaker as an output device, and the determination result may be output to the monitor as image information or from the speaker as audio information. Alternatively, when an external device is connected to the document classification device 100, the output unit 140 may output the result by transmitting information indicating the determination result to the external device wirelessly or by wire.

The document classification device 100 is described in more detail below.

FIG. 2 is a block diagram showing a detailed configuration example of the document classification device 100. As shown in FIG. 2, the document classification device 100 includes an acquisition unit 110, a control unit 120, a storage unit 130, and an output unit 140. The document classification device 100 is a computer system having a function of determining, when new document data is input, whether the new document data is a document desired by the user or is noise.

The acquisition unit 110 has a function of acquiring information indicating a patent document as new, other document data to be classified by the document classification device 100. This information only needs to indicate the patent document, and may be identification information of the patent document or the document itself. Information indicating a patent classification is assigned to it as the second feature information, which indicates technical information of the patent document. As one example, the acquisition unit 110 is a communication interface that acquires unclassified other document data from an external device (not shown).

The control unit 120 is a processor having a function of controlling each unit of the document classification device 100 by executing various programs stored in the storage unit 130. The control unit 120 functions as the extraction unit 121 and the determination unit 122. To determine whether a document retrieved according to the search formula is a document desired by the user, the control unit 120 has a function of generating, as prior information, IPC relevance information for judging whether the IPC assigned to each document indicates "noise" or "relevant". Through the functions of the extraction unit 121 and the determination unit 122, the control unit 120 also has a function of determining whether a patent document that has been newly retrieved by the search formula, and that has not yet been classified as noise or relevant, is a document desired by the user.

The extraction unit 121 extracts, from the past document list, which is a list of previously classified documents, the documents used to determine whether a new document is noise. The extraction unit 121 extracts a predetermined number of documents from the top of the past document list sorted in descending order of the degree of IPC matching with the new document.

The determination unit 122 determines whether the new document is a document desired by the user, that is, whether it is "noise" or "relevant", based on the classification ("noise" or "relevant") assigned as relevance information to the documents extracted by the extraction unit 121. The determination unit 122 assigns to the new document the classification held by the majority of the extracted documents.

The details of the processing by which the control unit 120 determines whether a new document is a document desired by the user, and of the processing by which it generates the IPC relevance information, are described later.

The storage unit 130 is a recording medium having a function of storing the various data and programs that the document classification device 100 needs in order to operate. The storage unit 130 is realized by, for example, an HDD, an SSD, or a flash memory, but is not limited to these. The storage unit 130 stores, for example, a program with which the control unit 120 generates the prior information on whether each IPC is noise or relevant, and a program with which the control unit 120 determines, when a new document is input, whether that document is noise or relevant. The storage unit 130 also stores a past document list 300, which is a list of past patent documents associated with relevance information indicating whether each document is a document desired by the user; a new document list 350, which is a list of new documents acquired by the acquisition unit 110; and IPC relevance information 500, which is the prior information generated by the control unit 120. It also stores a matching degree table 400, which is generated when determining whether a new document is noise or relevant.

The output unit 140 is a communication interface having a function of outputting information on the determination result of the control unit 120 for a new document to an external device. Here, for example, the new document list with classifications assigned is output in the form shown in FIG. 3(a), that is, a form in which at least the publication number of each new document is associated with its classification.

This concludes the description of the configuration of the document classification device 100.

<Data>
The various data used in the document classification device 100 are described next.

FIG. 3(a) is a data conceptual diagram showing a configuration example of the past document list 300, which concerns the classified document data stored in the storage unit 130. The past document list 300 is information about documents retrieved in the past with a predetermined search formula, and includes information indicating whether each document is a document desired by the user. As shown in FIG. 3(a), the past document list 300 associates a publication number 301, which serves as the first document information indicating the plurality of document data retrieved according to the search formula; a classification 302, which corresponds to the relevance information indicating whether the patent document in question describes the content the user was looking for in the search; and an IPC classification 303, which corresponds to the first feature information indicating the one or more technical features assigned to each document data.

The publication number 301 is information for uniquely identifying a patent document, that is, classified document data retrieved by the search formula. The publication number of the patent document is used here, but any other identification information may be used instead as long as it uniquely identifies the document.

The classification 302 is what may be called applicability information, indicating whether or not the corresponding patent document is one the user wants. Here it takes one of two values: "applicable" when the corresponding patent document is one the user wants, and "noise" when it is not.

The IPC classification 303 indicates the IPC codes assigned to the corresponding patent document. It represents one or more technical features of that document, using the internationally unified classification of patent documents by technical content.

FIG. 3B is a data conceptual diagram showing a configuration example of the new document list 350, an example of the new document data acquired by the acquisition unit 110 of the document classification device 100. The new document list 350 is a list of the new document data to be classified, each entry being a patent document. It associates the publication number 351, corresponding to information indicating other document data newly retrieved by a search, with the IPC classification 352, corresponding to second feature information indicating one or more technical features.

The publication number 351 is information for uniquely identifying the document data. The publication number of the patent document to be classified is used here, but any other identification information may be used instead as long as it uniquely identifies the document.

The IPC classification 352 indicates the IPC codes assigned to the corresponding patent document. It represents one or more technical features of that document, using the internationally unified classification of patent documents by technical content.
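The two lists described above can be pictured as simple records. The following is a minimal sketch, not the format used by the device: the class names, field names, publication numbers, and IPC codes are all hypothetical and chosen only to mirror the reference numerals 301 to 303 and 351 to 352.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PastDocument:
    """One row of the past document list 300 (hypothetical layout)."""
    publication_number: str   # publication number 301
    classification: str       # classification 302: "applicable" or "noise"
    ipc_codes: List[str]      # IPC classification 303 (one or more codes)


@dataclass
class NewDocument:
    """One row of the new document list 350 (hypothetical layout)."""
    publication_number: str               # publication number 351
    ipc_codes: List[str]                  # IPC classification 352
    classification: Optional[str] = None  # filled in once the device has decided


# Illustrative values only; these are not real publications.
past_document_list = [
    PastDocument("EXAMPLE-0001", "applicable", ["B60L 3/00", "H02J 7/00"]),
    PastDocument("EXAMPLE-0002", "noise", ["G06F 17/30"]),
]
new_document_list = [
    NewDocument("EXAMPLE-1001", ["B60L 3/00", "G06F 17/30"]),
]
```

The only structural difference between the two records is that a past document already carries its classification, while a new document receives one as the result of the processing.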

FIG. 4 is a data conceptual diagram showing a configuration example of the matching degree table 400, in which one document in the new document list 350 is paired with each document in the past document list 300 according to the degree of matching of their technical classifications, sorted in descending order of that degree. The matching degree table 400 is generated by the document classification device 100 in the course of determining whether new document data is "noise" or "applicable".

The publication number 401 is information for uniquely identifying a patent document that is classified document data. The publication number of the patent document is used here, but any other identification information may be used instead as long as it uniquely identifies the document.

The classification 402 is what may be called applicability information, indicating whether or not the corresponding patent document is one the user wants. Here it takes one of two values: "applicable" when the corresponding patent document is one the user wants, and "noise" when it is not.

The IPC classification 403 indicates the IPC codes assigned to the corresponding patent document. It represents one or more technical features of that document, using the internationally unified classification of patent documents by technical content.

The matching degree 404 indicates, for one new document in the new document list 350, the degree of matching between the IPC classification 352 assigned to that new document and the IPC classification 303 associated with each document in the past document list 300.

One matching degree table 400 is generated for each document in the new document list 350. The determination unit 122 then uses the corresponding matching degree table 400 to determine whether or not each document is one the user wants.

FIG. 5 is a data conceptual diagram showing a configuration example of the IPC applicable information 500, which indicates, for each IPC classification, the probability that a document is noise or applicable. As shown in FIG. 5, the IPC applicable information 500 associates the IPC classification 501, the number of publications 502, the noise count 503, the noise rate 504, and the noise determination 505.

The IPC classification 501 indicates a technical feature of patent documents, using the internationally unified classification of patent documents by technical content. The IPC classifications 501 are the IPCs extracted from those assigned to the past documents in the past document list 300.

The number of publications 502 indicates the total number of documents in the past document list 300 to which the corresponding IPC classification 501 is assigned.

The noise count 503 indicates, among the past documents to which the corresponding IPC classification 501 is assigned, the total number that the user judged to be "noise".

The noise rate 504 suggests the probability that a document is not one the user wants when the corresponding IPC classification 501 is assigned to it. It is the corresponding noise count 503 divided by the corresponding number of publications 502.

The noise determination 505 is a value for determining whether a document to which the corresponding IPC classification 501 is assigned is noise or applicable. If the noise determination is "100", the document classification device 100 can determine that the document is "noise", that is, not one the user wants. If the noise determination is "0", it can determine that the document is "applicable", that is, one the user wants. In the present embodiment, the noise determination 505 is set to 100 for IPC classifications whose noise rate 504 is 95 or more, and to 0 for those whose noise rate 504 is 10 or less. The thresholds 95 and 10 are values set by the document classification device 100, and they may be made changeable as appropriate, for example by using an input device connected to the document classification device 100.

The document classification device 100 can use the IPC applicable information 500 to identify documents that are highly likely to be noise, or highly likely to be applicable, on the basis of specific IPC classifications. For example, it can identify a document to which an IPC classification with a noise determination of 100 is assigned as one the user does not want. Conversely, it can identify a document to which an IPC classification with a noise determination of 0 is assigned as one the user wants. The values shown in FIG. 5 are examples.
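The lookup described above might be sketched as follows. This is an illustration under assumptions rather than the device's implementation: the function name `decisive_classification`, the dictionary layout, and the placeholder IPC keys are hypothetical, and returning the result for the first decisive IPC encountered is an assumption, since the text does not say what happens when one document carries both a 100 and a 0 classification.

```python
from typing import Dict, List, Optional


def decisive_classification(
    ipc_codes: List[str],
    noise_determination: Dict[str, Optional[float]],
) -> Optional[str]:
    """Return "noise" or "applicable" if any IPC of the document is decisive.

    noise_determination maps an IPC code to its noise determination 505
    (100, 0, or another value / None when the IPC is not decisive).
    """
    for ipc in ipc_codes:
        value = noise_determination.get(ipc)
        if value == 100:
            return "noise"        # document the user does not want
        if value == 0:
            return "applicable"   # document the user wants
    return None                   # no decisive IPC; another method is needed


# Hypothetical table: "X" is always noise, "Y" is always applicable.
table = {"X": 100, "Y": 0, "Z": 42.0}
print(decisive_classification(["Z", "X"], table))
print(decisive_classification(["Z"], table))
```

A `None` result corresponds to the case where the IPC applicable information alone cannot settle the classification.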

<Operation>
The operation of the document classification device 100 is described next. The flowcharts in FIGS. 6 and 7 show the preparatory processing performed before the document classification device 100 classifies new documents. This processing is executed by the control unit 120 of the document classification device 100 and generates the IPC applicable information 500 shown in FIG. 5. The details are as follows.

(Step S601)
In step S601, the control unit 120 of the document classification device 100 sets the variable i used in the processing to 1. The variable i identifies which document in the past document list 300 is currently being processed. After setting i to 1, the process proceeds to step S602.

(Step S602)
In step S602, the control unit 120 determines whether all documents in the past document list 300 have been processed. This can be determined by whether the value of the variable i matches the total number of documents in the past document list 300. If not all documents have been processed (NO), the process proceeds to step S603; if all have been processed (YES), it proceeds to step S609.

(Step S603)
In step S603, the control unit 120 refers to the classification 302 in the past document list 300 to determine whether the publication in the i-th row is classified as "noise". If it is "noise" (YES), the process proceeds to step S604; if it is not "noise", that is, if it is "applicable" (NO), the process proceeds to step S605.

(Step S604)
In step S604, the control unit 120 sets the count setting value C to 1 and proceeds to step S606.

(Step S605)
In step S605, the control unit 120 sets the count setting value C to 0 and proceeds to step S607.

(Step S606)
In step S606, the control unit 120 adds the count setting value C set in step S604 or step S605 to the noise count of each IPC listed in the IPC classification 303 of the i-th document in the past document list 300. The noise count serves as an index for judging whether each IPC indicates noise. The process then proceeds to step S607.

(Step S607)
In step S607, the control unit 120 adds 1 to the total count value, which indicates the total number of occurrences, of each IPC listed in the IPC classification of the i-th document in the past document list 300. The process then proceeds to step S608.

(Step S608)
In step S608, the control unit 120 sets the value of i plus 1 as the next value of i and returns to step S602.
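The loop of steps S601 to S608 can be sketched as below. This is a minimal illustration, assuming each past document is given as a (classification, IPC codes) pair; the function name `count_ipc_occurrences` and the placeholder IPC keys are hypothetical. Adding C = 0 for an applicable document has the same effect as skipping step S606.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def count_ipc_occurrences(
    past_documents: Iterable[Tuple[str, List[str]]],
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Accumulate the per-IPC noise count and total count (steps S601 to S608)."""
    noise_count: Dict[str, int] = defaultdict(int)
    total_count: Dict[str, int] = defaultdict(int)
    for classification, ipc_codes in past_documents:   # loop over i (S602, S608)
        c = 1 if classification == "noise" else 0      # S603 to S605
        for ipc in ipc_codes:
            noise_count[ipc] += c                      # S606
            total_count[ipc] += 1                      # S607
    return dict(noise_count), dict(total_count)


docs = [("noise", ["A", "B"]), ("applicable", ["A"])]
noise, total = count_ipc_occurrences(docs)
print(noise, total)
```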

(Step S609)
In step S609, the control unit 120 sets the threshold T to 2.5% of the total number of documents listed in the past document list 300 and proceeds to step S610. Here, the threshold T is set for each IPC.

(Step S610)
In step S610, the control unit 120 determines whether the threshold T calculated in step S609 exceeds 50. If it does (YES), the process proceeds to step S611; if not (NO), the threshold is left unchanged and the process proceeds to step S701 in FIG. 7.

(Step S611)
In step S611, the control unit 120 resets the threshold T to 50 and proceeds to step S701 in FIG. 7.
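Steps S609 to S611 amount to taking 2.5% of the total number of past documents and capping the result at 50. A minimal sketch follows; the function name `minimum_count_threshold` and its parameter names are hypothetical, while the values 2.5 and 50 come from the steps above.

```python
def minimum_count_threshold(total_documents: int,
                            percent: float = 2.5,
                            cap: float = 50.0) -> float:
    """Threshold T: percent of the total number of past documents, capped."""
    t = total_documents * percent / 100   # S609
    return min(t, cap)                    # S610, S611


print(minimum_count_threshold(1000))   # 2.5% of 1000 documents
print(minimum_count_threshold(4000))   # 2.5% would be 100, so the cap applies
```

With this rule, the threshold grows with the size of the past document list only up to 2000 documents, after which it stays at 50.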

(Step S701)
In step S701 shown in FIG. 7, the control unit 120 sets the variable j to 1 and proceeds to step S702. The variable j identifies which IPC is currently being processed.

(Step S702)
In step S702, the control unit 120 determines whether the variable j equals the total number of IPCs to be processed plus 1. If it does (YES), the processing ends; if not (NO), the process proceeds to step S703.

(Step S703)
In step S703, the control unit 120 determines, for each IPC, whether its total count is less than the threshold T. If the total count is less than T (YES), the process proceeds to step S709; if not (NO), it proceeds to step S704.

(Step S704)
In step S704, the control unit 120 calculates the noise rate of each IPC by dividing the noise count value of that IPC by its total count value. The noise count value is accumulated in step S606, and the total count value in step S607, as steps S602 to S608 of FIG. 6 are repeated. After calculating the noise rate, the process proceeds to step S705.

(Step S705)
In step S705, the control unit 120 determines, for each IPC, whether its noise rate is less than 10%. If it is less than 10% (YES), the process proceeds to step S706; if it is not less than 10%, that is, if it is 10% or more (NO), the process proceeds to step S707.

(Step S706)
In step S706, the control unit 120 sets the noise determination of an IPC whose noise rate was less than 10% to 0%. The process then proceeds to step S709.

(Step S707)
In step S707, the control unit 120 determines, for each IPC whose noise rate was not less than 10%, whether the noise rate is 95% or more. If it is 95% or more (YES), the process proceeds to step S708; if not (NO), the noise rate remains the value calculated in step S704 and the process proceeds to step S709.

(Step S708)
In step S708, the control unit 120 sets the noise determination of an IPC whose noise rate was 95% or more to 100%. The process then proceeds to step S709.

(Step S709)
In step S709, the control unit 120 sets the value of j plus 1 as the new j and returns to step S702.

By executing the above processing, the control unit 120 calculates the noise rate for each IPC, generates the IPC applicable information 500 shown in FIG. 5, and stores it in the storage unit 130.
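The per-IPC processing of FIG. 7 might be sketched as follows. This is one reading of the steps, not the device's implementation: the function name `build_ipc_applicable_info`, the dictionary layout, and the placeholder IPC keys are hypothetical. It assumes that IPCs whose total count is below the threshold T are simply left out of the table, and that an IPC whose noise rate is neither below 10% nor at least 95% keeps its calculated noise rate with no decisive determination (represented here by `None`).

```python
from typing import Dict, Optional


def build_ipc_applicable_info(
    noise_count: Dict[str, int],
    total_count: Dict[str, int],
    threshold_t: float,
    low: float = 10.0,
    high: float = 95.0,
) -> Dict[str, Dict[str, Optional[float]]]:
    """Compute the noise rate and noise determination per IPC (S701 to S709)."""
    table: Dict[str, Dict[str, Optional[float]]] = {}
    for ipc, total in total_count.items():          # loop over j (S702, S709)
        if total < threshold_t:                     # S703: too few documents
            continue
        noise_rate = 100.0 * noise_count.get(ipc, 0) / total   # S704, in percent
        determination: Optional[float]
        if noise_rate < low:                        # S705, S706
            determination = 0.0
        elif noise_rate >= high:                    # S707, S708
            determination = 100.0
        else:
            determination = None                    # assumption: not decisive
        table[ipc] = {"noise_rate": noise_rate,
                      "noise_determination": determination}
    return table


info = build_ipc_applicable_info(
    noise_count={"A": 19, "B": 1, "C": 5, "D": 1},
    total_count={"A": 20, "B": 20, "C": 10, "D": 2},
    threshold_t=5,
)
print(info)
```

In this example, "D" is excluded because only two past documents carry it, which is too few for its noise rate to be meaningful.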

Next, the operation by which the document classification device 100 determines whether a publication is "noise" or "applicable" when new document data (a publication) is actually input is described. The flowcharts in FIGS. 8 to 10 show this processing, which the extraction unit 121 and the determination unit 122 execute after the acquisition unit 110 obtains the new document list 350, a set of new documents. The details are as follows.

(Step S801)
In step S801, the determination unit 122 sets the variable l, which identifies the undetermined document being processed, to 1. The process then proceeds to step S802.

(Step S802)
In step S802, the determination unit 122 determines whether any undetermined documents remain. This is determined by whether the variable l matches the number of documents in the new document list 350. If undetermined documents remain (YES), the process proceeds to step S803; if none remain (NO), the processing ends.

(Step S803)
In step S803, the determination unit 122 extracts the IPCs of the l-th publication from the IPC classification 352 of the new document list 350. The extracted IPCs are managed individually. After extracting the IPCs, the process proceeds to step S804.

(Step S804)
In step S804, the determination unit 122 sets the variable m, used to identify the IPCs that have not yet been processed, to 1. The maximum value of m corresponds to the total number of IPCs extracted in step S803. The process then proceeds to step S805.

(Step S805)
In step S805, the determination unit 122 determines whether the last IPC has been handled, that is, whether m equals the total number of IPCs plus 1. If it does (YES), the process proceeds to step S806; if not (NO), it proceeds to step S812.

(Step S806)
In step S806, the determination unit 122 determines whether the count of IPCs that match the search-formula IPCs has reached m. If it has (YES), the process proceeds to step S807; if not (NO), it proceeds to step S901 in FIG. 9.

(Step S807)
In step S807, the determination unit 122 sets the noise determination of the target IPC to 0% and proceeds to step S811.

(Step S808)
In step S808, the determination unit 122 determines whether the noise determination is 0. If it is 0 (YES), the process proceeds to step S809; if not (NO), it proceeds to step S810. Although the determination here is based on whether the noise determination is 0, it may instead be based on whether the noise determination is 100, in which case the process proceeds to step S810 if it is 100 and to step S809 if it is not.

(Step S809)
In step S809, the determination unit 122 determines that Pl, that is, the l-th document, is "applicable", meaning it is a document the user wants, and stores this result in association with the l-th document. The process then proceeds to step S811.

(Step S810)
In step S810, the determination unit 122 determines that Pl, that is, the l-th document, is "noise", meaning it is not a document the user wants, and stores this result in association with the l-th document. The process then proceeds to step S811.

(Step S811)
In step S811, the determination unit 122 sets the value of l plus 1 as the new l and returns to step S802.

(Step S812)
In step S812, the determination unit 122 determines whether the noise rate of the IPC being processed is set to 0% or 100% in the IPC applicable information 500. If it is (YES), the process proceeds to step S813; if not (NO), it proceeds to step S814.

(Step S813)
In step S813, the determination unit 122 sets the noise determination of the new document to the noise determination value of the target IPC (that is, either 0% or 100%) and proceeds to step S808. When the new document has an IPC classification that the IPC applicable information 500 marks as 100% noise or 100% applicable, this step applies that classification to the new document as is.

(Step S814)
In step S814, the determination unit 122 determines whether the IPC being processed matches an IPC used in the search formula. If it matches (YES), the process proceeds to step S815; if not (NO), it proceeds to step S816.

(Step S815)
In step S815, the determination unit 122 adds 1 to the count of IPCs that match the search-formula IPCs. The process then proceeds to step S816.

(Step S816)
In step S816, the determination unit 122 sets the value of m plus 1 as the new m and returns to step S805.

(Step S901)
In step S901, the determination unit 122 sets the variable k to 1. The variable k identifies which past document in the past document list 300 is being processed. After setting k to 1, the process proceeds to step S902.

(Step S902)
In step S902, the determination unit 122 determines whether the end of the past document list 300 has been reached, based on whether the variable k equals the total number of past documents in the past document list 300 plus 1. If the end has been reached (YES), the process proceeds to step S1001 in FIG. 10; if not (NO), it proceeds to step S903.

(Step S903)
In step S903, the determination unit 122 extracts the IPCs of the k-th publication in the past document list 300, that is, each IPC in the IPC classification 303 of the k-th row. The process then proceeds to step S904.

(Step S904)
In step S904, the determination unit 122 extracts each IPC, denoted IPC(h), of the h-th publication in the new document list 350 from the IPC classification 352 of the new document list 350. The process then proceeds to step S905.

(Step S905)
In step S905, the determination unit 122 sets the variable n to 1. The variable n distinguishes which of the IPCs assigned to the new document being processed is currently being handled. After setting n to 1, the process proceeds to step S906.

(Step S906)
In step S906, the determination unit 122 determines whether n has reached the last IPC, that is, whether all IPCs assigned to the new document have been processed. If they have (YES), the process proceeds to step S912; if not (NO), it proceeds to step S907.

(Step S907)
In step S907, the determination unit 122 determines whether IPC(h)n matches an IPC of the search formula. If it matches (YES), the process proceeds to step S908; if not (NO), it proceeds to step S909.

(Step S908)
In step S908, the determination unit 122 subtracts 1 from the target IPC count and proceeds to step S911.

(Step S909)
In step S909, the determination unit 122 determines whether IPC(h)n matches IPC(k), that is, whether the n-th IPC assigned to the h-th new document in the new document list 350 matches any of the IPCs assigned to the k-th past document in the past document list 300. If it matches (YES), the process proceeds to step S910; if not (NO), it proceeds to step S911.

(Step S910)
In step S910, the determination unit 122 adds 1 to the IPC match count and then proceeds to step S911.

(Step S911)
In step S911, the determination unit 122 sets the value of k plus 1 as the new k and returns to step S902.

(Step S912)
In step S912, the determination unit 122 calculates the match rate between the IPCs of the h-th new document in the new document list 350 and the IPCs assigned to the k-th past document in the past document list 300 by dividing the IPC match count accumulated so far by the target IPC count. The process then proceeds to step S913.

(Step S913)
In step S913, the determination unit 122 sets the value of k plus 1 as the new k and returns to step S902.
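The match rate of steps S905 to S912 for one pair of documents might be computed as in the sketch below. This is one possible reading, with several assumptions: the function name `ipc_match_rate` and the placeholder IPC keys are hypothetical; the target IPC count is assumed to start at the number of IPCs assigned to the new document, since its initial value is not stated in the steps above; and a rate of 0.0 is returned when every IPC of the new document appears in the search formula, a case the steps do not address.

```python
from typing import Iterable, List


def ipc_match_rate(new_ipcs: List[str],
                   past_ipcs: Iterable[str],
                   search_ipcs: Iterable[str]) -> float:
    """Match rate between one new document and one past document."""
    past = set(past_ipcs)
    search = set(search_ipcs)
    target_count = len(new_ipcs)    # assumed initial value of the target IPC count
    match_count = 0
    for ipc in new_ipcs:            # loop over n (S905, S906)
        if ipc in search:           # S907: IPC already used in the search formula
            target_count -= 1       # S908: exclude it from the comparison
        elif ipc in past:           # S909
            match_count += 1        # S910
    if target_count == 0:
        return 0.0                  # assumption: nothing left to compare
    return match_count / target_count   # S912


rate = ipc_match_rate(["A", "B", "C", "D"], ["B", "C", "X"], ["A"])
print(rate)
```

Excluding the search-formula IPCs keeps codes that every retrieved document shares from inflating the match rate.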

(Step S1001)
In step S1001, the determination unit 122 sorts, in descending order, the per-document match rates between the IPCs assigned to the new document in the new document list 350 and the IPCs assigned to each past document in the past document list 300. The process then proceeds to step S1002.

(Step S1002)
In step S1002, the extraction unit 121 sets the variable q to 1. The variable q specifies how many past documents are extracted, starting from the one with the highest IPC match rate. After setting q to 1, the process proceeds to step S1003.

(Step S1003)
In step S1003, the extraction unit 121 determines whether q has reached 8. If q is 8 (YES), the process proceeds to step S1004; if not (NO), the process proceeds to step S1009.

(Step S1004)
In step S1004, the determination unit 122 calculates an index t, used to determine whether the newly retrieved patent document is a document desired by the user, by dividing the noise count by the comparative publication count. The noise count is the number incremented in step S1011 and indicates how many of the past documents extracted, a predetermined number from the top in order of patent classification match rate, are noise. The comparative publication count is the number incremented in step S1012 and equals the maximum value of q; that is, it is the number of publications extracted. After calculating t, the process proceeds to step S1005.

(Step S1005)
In step S1005, the determination unit 122 determines whether t calculated in step S1004 exceeds a predetermined threshold α. If t exceeds the threshold α (YES), the process proceeds to step S1006; if it does not (NO), the process proceeds to step S1007.

(Step S1006)
In step S1006, the determination unit 122 attaches to the new publication information indicating that it is a document desired by the user (classifies it as "relevant"). The process then proceeds to step S1008.

(Step S1007)
In step S1007, the determination unit 122 treats the new publication as not being a document desired by the user and attaches information indicating that it is noise (classifies it as "noise"). The process then proceeds to step S1008.

(Step S1008)
In step S1008, the determination unit 122 sets the variable l to l + 1 and proceeds to step S802 in FIG. 8.

(Step S1009)
In step S1009, the determination unit 122 determines whether the number of documents processed has reached the total number of documents in the past document list 300 plus 1. This check handles the case where the past document list 300 does not contain q documents. If the number of documents processed has reached that value (YES), the process proceeds to step S1004; if not (NO), the process proceeds to step S1010.

(Step S1010)
In step S1010, the determination unit 122 determines whether the classification 302 of the q-th publication in the past document list is "noise". If it is noise (YES), the process proceeds to step S1011; if it is not noise (NO), the process proceeds to step S1012.

(Step S1011)
In step S1011, the determination unit 122 adds 1 to the noise count and proceeds to step S1012.

(Step S1012)
In step S1012, the determination unit 122 adds 1 to the comparative publication count and proceeds to step S1013.

(Step S1013)
In step S1013, the determination unit 122 sets the variable q to q + 1 and proceeds to step S1003.
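The counting loop of steps S1001 to S1004 and S1009 to S1013 can be sketched as below. This is an illustrative reading rather than the patented code: the function name, the `(match_rate, label)` pair layout, and the label string `"noise"` are hypothetical, and the default `limit=7` assumes that the loop examines ranks 1 to 7 before q reaches 8. The sketch stops at the index t and leaves the comparison with the threshold α to the caller.

```python
from typing import List, Tuple


def noise_ratio_of_top_matches(past_docs: List[Tuple[float, str]],
                               limit: int = 7) -> float:
    """Return t = noise count / compared count over the best-matching past documents.

    past_docs holds (match_rate, label) pairs; label is "noise" or "relevant".
    """
    # S1001: sort past documents by match rate, highest first.
    ranked = sorted(past_docs, key=lambda doc: doc[0], reverse=True)
    noise_count = 0      # incremented in S1011
    compared_count = 0   # incremented in S1012
    # Slicing also covers a list shorter than `limit` (the guard of S1009).
    for _rate, label in ranked[:limit]:
        if label == "noise":   # S1010
            noise_count += 1
        compared_count += 1
    if compared_count == 0:
        raise ValueError("no past documents to compare against")
    return noise_count / compared_count   # S1004


docs = [(0.9, "noise"), (0.8, "relevant"), (0.7, "noise"), (0.1, "relevant")]
t = noise_ratio_of_top_matches(docs, limit=3)  # two of the top three are noise
```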

By executing the processes shown in FIGS. 8 to 10, the document classification device 100 can determine, for every new document included in the new document list 350, whether or not that document is noise.

The above is the description of the operation of the document classification device 100.

<Summary>
The document classification device according to the above embodiment assigns in advance, based on the patent classifications originally assigned to patent publications, the classification information "noise" or "relevant" indicating whether each document obtained by the search formula is a desired one. When a new patent publication is input, the device extracts a predetermined number of documents based on the degree of coincidence between the patent classifications assigned to the new publication and those of the already classified publications. Depending on whether "noise" or "relevant" is the more frequent classification among the extracted documents, the device can classify the new publication as "noise" or "relevant" without examining its content. As a result, the user no longer needs to check the content of documents judged to be noise, even if they were retrieved by the search formula the user set, so the time required for screening is shortened. Furthermore, since the device does not need to examine the body of a publication (it does not perform morphological analysis or compare the enormous number of words such analysis extracts), the processing load on the processor is lower than in the classification devices of Patent Documents 1 to 3.

<Supplement>
It goes without saying that the document classification device is not limited to the above embodiment and may be realized by other methods. Various modifications are described below.

(1) Although not specifically described in the above embodiment, the number q of documents extracted by the extraction unit 121 is desirably odd, because an odd number guarantees that either "noise" or "relevant" can always be identified. The variable q may be determined using the so-called k-nearest neighbor method.

When q is set to an even number, the numbers of "noise" and "relevant" documents may be equal. The document classification device may therefore classify documents as follows. A "noise" document is given a base value of -1 and a "relevant" document a base value of +1. The base value multiplied by the degree of coincidence, used as a weight, becomes the classification value of that document. The determination unit 122 then sums the classification values of the documents extracted by the extraction unit 121 and classifies the new document as "relevant" if the sum is positive and as "noise" if it is negative. With this method, the processing load of the "noise" or "relevant" determination is larger than that of the processing described in the above embodiment, but the determination is more accurate. In other words, the document classification device 100 may perform classification after applying a correction by weighting. Although the degree of coincidence itself is used as the weight here, this is not a limitation, and any value may be used as the weight.
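The weighted vote described above can be sketched as follows. The function name, the label strings, and the `"undecided"` result for a sum of exactly zero are assumptions made for illustration; the text only specifies the positive and negative cases.

```python
from typing import Iterable, Tuple

# Base values from the text: -1 for "noise", +1 for "relevant".
BASE_VALUE = {"noise": -1.0, "relevant": 1.0}


def weighted_vote(neighbors: Iterable[Tuple[float, str]]) -> str:
    """Classify a new document from (match_rate, label) pairs of extracted documents."""
    # Each classification value is the base value weighted by the degree of coincidence.
    score = sum(BASE_VALUE[label] * rate for rate, label in neighbors)
    if score > 0:
        return "relevant"
    if score < 0:
        return "noise"
    return "undecided"  # assumption: the zero case is not specified in the text


# A tie in counts (one of each) is resolved by the weights.
result = weighted_vote([(0.2, "relevant"), (0.5, "noise")])
```

Here one "relevant" and one "noise" document are extracted, and the higher match rate of the "noise" document decides the outcome.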

(2) In the above embodiment, the classification that holds the majority among the extracted documents is adopted as the classification of the new document, but this is not a limitation. For example, the device may be configured to judge the new document to be "noise" if a predetermined number or more of the q documents extracted by the extraction unit 121 are classified as "noise". For instance, with 10 extracted documents, the new document may be classified as "noise" if 8 or more of them are classified as "noise".
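A minimal sketch of this count-based rule is shown below. The function name and label strings are hypothetical, and returning "relevant" when the count is not reached is an assumption, since the text only states the "noise" case.

```python
from typing import Sequence


def classify_by_noise_count(labels: Sequence[str], min_noise: int = 8) -> str:
    """Label the new document "noise" when at least min_noise extracted documents are noise."""
    noise_count = sum(1 for label in labels if label == "noise")
    # Assumption: anything below the required count is treated as "relevant".
    return "noise" if noise_count >= min_noise else "relevant"


# Ten extracted documents, eight of which are noise.
result = classify_by_noise_count(["noise"] * 8 + ["relevant"] * 2)
```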

(3) In the above embodiment, when determining whether each patent classification, which is a technical feature, is noise or relevant, a classification whose noise rate is less than 10% is given a noise determination of 0%, and a classification whose IPC noise rate is 95% or more is given a noise determination of 100%. Here, the 10% threshold can be regarded as the second threshold for determining whether a document to which the corresponding classification is assigned is a document desired by the user. In other words, the determination in step S705 can also be regarded as determining whether the relevance rate is 90% or more. The same applies to the first threshold used for the determination in step S707.

That is, the document classification device 100 can be understood to make one determination based on whether the relevance rate, which indicates whether the patent classification assigned to a document is relevant, is at or above the second threshold of 90%, and another based on whether the non-relevance rate is at or above the first threshold of 95%. By providing a difference between the first threshold and the second threshold, a classification can always be assigned to either noise or relevant. The first and second thresholds may also be varied depending on whether priority is given to determining that a classification is noise or to determining that it is relevant. For this purpose, the document classification device 100 may include a setting unit for setting the first and second thresholds. The values given to the setting unit may be set to appropriate values by the document classification device 100 through learning, or may be set by the user of the document classification device 100. The threshold percentages used in these determinations are not limited to the values shown in the above embodiment, and the operator of the document classification device 100 can change them as appropriate.
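The two-threshold determination for a single patent classification can be sketched as follows, with the thresholds exposed as parameters in the spirit of the setting unit. The function and parameter names are hypothetical, and returning `None` for noise rates between the two thresholds is an assumption about how undetermined classifications are represented.

```python
from typing import Optional


def noise_determination(noise_rate: float,
                        relevant_below: float = 0.10,
                        noise_at_or_above: float = 0.95) -> Optional[int]:
    """Map a classification's noise rate to a noise determination of 0, 100, or None."""
    if noise_rate < relevant_below:
        return 0      # noise rate under 10%: treated as noise determination 0%
    if noise_rate >= noise_at_or_above:
        return 100    # noise rate of 95% or more: treated as noise determination 100%
    return None       # assumption: in-between rates are left undetermined


determinations = [noise_determination(r) for r in (0.05, 0.50, 0.95)]
```

Raising `noise_at_or_above` makes the device more cautious about discarding documents as noise, while raising `relevant_below` makes it more willing to mark documents as relevant.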

(4) In the above embodiment, a document to which a classification with an IPC noise rate of 100% is assigned is classified as noise, and a document to which a classification with an IPC noise rate of 0% is assigned is classified as relevant. In some cases, however, a document may carry both a classification with a noise rate of 100% and a classification with a noise rate of 0%. In such a case, the document classification device 100 may judge the document to be "noise" or "relevant" according to a criterion set in advance by the user. For example, if the setting gives priority to "noise", the document is judged to be "noise"; if it gives priority to "relevant", the document is judged to be "relevant".
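The priority rule for conflicting classifications can be sketched as below. The function name, the use of 0 and 100 as per-classification noise determinations, and the `None` result when no classification is decisive are illustrative assumptions.

```python
from typing import Iterable, Optional


def classify_by_ipc_determinations(determinations: Iterable[Optional[int]],
                                   priority: str = "noise") -> Optional[str]:
    """Combine the noise determinations (0, 100, or None) of a document's classifications."""
    if priority not in ("noise", "relevant"):
        raise ValueError("priority must be 'noise' or 'relevant'")
    values = set(determinations)
    has_noise = 100 in values      # some classification has a 100% noise rate
    has_relevant = 0 in values     # some classification has a 0% noise rate
    if has_noise and has_relevant:
        return priority            # conflict: follow the user's preset priority
    if has_noise:
        return "noise"
    if has_relevant:
        return "relevant"
    return None                    # assumption: no decisive classification


conflict = classify_by_ipc_determinations([100, 0], priority="relevant")
```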

(5) In the above embodiment, the document classification device classifies new document data by having a processor, which functions as each functional unit of the document classification device 100, execute the document classification program and the like. However, this may instead be realized by a logic circuit (hardware) or a dedicated circuit formed in an integrated circuit (IC (Integrated Circuit) chip, LSI (Large Scale Integration)) or the like in the device. These circuits may be realized by one or more integrated circuits, and the functions of the plurality of functional units described in the above embodiment may be realized by a single integrated circuit. An LSI may be called a VLSI, super LSI, ultra LSI, or the like depending on the degree of integration. That is, as shown in FIG. 11, each functional unit of the document classification device 100 may be realized by a physical circuit. As shown in FIG. 11, the document classification device 100 includes a storage circuit 130a, an acquisition circuit 110a, an extraction circuit 121a, a determination circuit 122a, and an output circuit 140a, and each circuit has the same function as the functional unit of the same name described above.

The document classification program may be recorded on a processor-readable recording medium. As the recording medium, a "non-transitory tangible medium" such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used. The document classification program may also be supplied to the processor via any transmission medium (a communication network, a broadcast wave, or the like) capable of transmitting it. The present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the document classification program is embodied by electronic transmission.

The document classification program can be implemented using, for example, a script language such as ActionScript or JavaScript (registered trademark), an object-oriented programming language such as Objective-C or Java (registered trademark), or a markup language such as HTML5.

(6) The configurations shown in the above embodiment and in each supplement may be combined as appropriate.

100 document classification device
110 acquisition unit
121 extraction unit
122 determination unit
140 output unit

Claims

A document classification device comprising:
a storage unit that stores first document information indicating a plurality of document data retrieved according to a search formula, first feature information attached to each document data and indicating one or more technical features of that document data, and relevance information indicating whether or not the document data is a document desired by a user;
an acquisition unit that acquires second document information indicating other document data that is newly retrieved according to the search formula and to which second feature information indicating one or more technical features is attached;
an extraction unit that extracts a predetermined number of document data from the plurality of document data based on the degree of coincidence between the second feature information and the first feature information;
a determination unit that determines whether or not the other document data is a document desired by the user, based on the relevance information associated with each of the predetermined number of document data extracted by the extraction unit; and
an output unit that outputs a determination result of the determination unit.

The document classification device according to claim 1, wherein the extraction unit extracts, from the plurality of document data, a predetermined number of document data in descending order of the degree of coincidence between the second feature information and the first feature information, and
the determination unit determines that the other document data is a document desired by the user when, among the relevance information associated with the documents extracted by the extraction unit, the entries indicating a document desired by the user exceed a threshold, and determines that the other document data is not a document desired by the user when the entries indicating a document not desired by the user exceed a threshold.

The document classification device according to claim 1 or 2, wherein the determination unit weights the relevance information associated with each document extracted by the extraction unit according to the degree of coincidence, and determines, based on the weighted relevance information, whether or not the other document data is a document desired by the user.

The document classification device according to claim 1, wherein the determination unit determines that other document data having second feature information that matches first feature information is not a document desired by the user when a non-relevance rate for that first feature information exceeds a first threshold, the non-relevance rate being the ratio of the document data associated with that first feature information whose relevance information indicates a document not desired by the user to all document data associated with that first feature information.

The document classification device according to claim 4, wherein the determination unit determines that other document data having second feature information that matches first feature information is a document desired by the user when a relevance rate for that first feature information exceeds a second threshold, the relevance rate being the ratio of the document data associated with that first feature information whose relevance information indicates a document desired by the user to all document data associated with that first feature information.

The document classification device according to claim 5, wherein the first threshold value is larger than the second threshold value.

The document classification device according to any one of claims 1 to 6, wherein, when feature information is used in the search formula, the determination unit makes the determination based on the first feature information and the second feature information from which that feature information has been excluded.

A document classification method comprising:
a storage step of storing first document information indicating a plurality of document data retrieved according to a search formula, first feature information attached to each document data and indicating one or more technical features of that document data, and relevance information indicating whether or not the document data is a document desired by a user;
an acquisition step of acquiring second document information indicating other document data that is newly retrieved according to the search formula and to which second feature information indicating one or more technical features is attached;
an extraction step of extracting a predetermined number of document data from the plurality of document data based on the degree of coincidence between the second feature information and the first feature information;
a determination step of determining whether or not the other document data is a document desired by the user, based on the relevance information associated with each of the predetermined number of document data extracted in the extraction step; and
an output step of outputting a determination result of the determination step.

A document classification program that causes a computer to realize:
a storage function of storing first document information indicating a plurality of document data retrieved according to a search formula, first feature information attached to each document data and indicating one or more technical features of that document data, and relevance information indicating whether or not the document data is a document desired by a user;
an acquisition function of acquiring second document information indicating other document data that is newly retrieved according to the search formula and to which second feature information indicating one or more technical features is attached;
an extraction function of extracting a predetermined number of document data from the plurality of document data based on the degree of coincidence between the second feature information and the first feature information;
a determination function of determining whether or not the other document data is a document desired by the user, based on the relevance information associated with each of the predetermined number of document data extracted by the extraction function; and
an output function of outputting a determination result of the determination function.