JPH1153397A

JPH1153397A - Device and method for document processing and storage medium storing document processing program

Info

Publication number: JPH1153397A
Application number: JP9219303A
Authority: JP
Inventors: Naoyuki Nomura; 直之野村
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 1997-07-29
Filing date: 1997-07-29
Publication date: 1999-02-26

Abstract

PROBLEM TO BE SOLVED: To decide the similarity with high accuracy between an acquired object document and another document by using one of discriminated text and non-text data for decision of the similarity. SOLUTION: A CPU 111 performs a morphological analysis to extract the independent words of a retrieval object document A, or the like, or a document B of interest, or the like. Then the CPU 111 decides the importance of every candidate word (phrase) based on the frequency of occurrence, the evaluation function, etc., set in every document for the candidate words (phrases) which are extracted from the document A or B and then decides every document vector based on the importance with the importance of every keyword of the document A or B used as an element. Furthermore, the CPU 111 decides the similarity between both documents A and B based on the cosine that is dependent on the angle set between the vectors of both documents. Then it is decided that the document A is similar to the document B when the similarity value is larger than its threshold.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0002]

[Technical Field of the Invention] The present invention relates to a document processing apparatus, a storage medium storing a document processing program, and a document processing method, and more particularly to improving the accuracy of similarity determination for acquired documents.

[0003]

[Prior Art] Conventionally, computers have been used to perform various kinds of processing on documents such as books, papers, and reports, including automatic retrieval by the vector space method or other methods, retrieval of similar documents, and association with other documents.

[0004]

[Problem to Be Solved by the Invention] However, when a conventional document processing apparatus searches for documents similar to a target document, for example, the similarity determination does not achieve very high accuracy if the target document or the documents in the database contain non-text data other than text data, such as programming languages, source programs, table data, or graph schemas. This is because, for example, when similarity to a file created by spreadsheet software is determined, most of that file is numeric data, so even if keywords and important words for determining similarity are extracted, numeric data accounts for most of them.

[0005] The present invention has been made to solve this conventional problem. A first object of the invention is to provide a document processing apparatus capable of determining with high accuracy the similarity between an acquired target document and other documents. A second object is to provide a storage medium storing a computer-readable document processing program capable of determining with high accuracy the similarity between an acquired target document and other documents. A third object is to provide a document processing method capable of determining with high accuracy the similarity between an acquired target document and other documents.

[0006]

[Means for Solving the Problem] In the invention described in claim 1, the first object is achieved by providing a document processing apparatus with: target document acquiring means for acquiring a target document to be processed; determining means for determining whether the target document acquired by the target document acquiring means contains non-text data other than text data; distinguishing means for distinguishing text data from non-text data in the target document when the determining means determines that it contains non-text data; and similarity determining means for determining similarity to another document using at least one of the text data and the non-text data distinguished by the distinguishing means.
In the invention described in claim 2, in the document processing apparatus according to claim 1, the non-text data is table data, graph schema information, a programming language, a source program, or the like.
In the invention described in claim 3, the document processing apparatus according to claim 1 further includes converting means for converting a source program into an intermediate language when the non-text data distinguished by the distinguishing means is a source program, and the similarity determining means determines the similarity between the intermediate-language programs produced by the converting means.
In the invention described in claim 4, in the document processing apparatus according to claim 1, 2, or 3, the similarity determining means has calculating means for calculating a degree of similarity by the vector space method, and determines similarity from the degree of similarity calculated by the calculating means.
In the invention described in claim 5, the second object is achieved by storing on a storage medium a computer-readable document processing program that causes a computer to implement: a target document acquiring function of acquiring a target document to be processed; a determining function of determining whether the target document acquired by the target document acquiring function contains non-text data other than text data; a distinguishing function of distinguishing text data from non-text data in the target document when the determining function determines that it contains non-text data; and a similarity determining function of determining similarity to another document using at least one of the text data and the non-text data distinguished by the distinguishing function.
In the invention described in claim 6, in the storage medium according to claim 5, the non-text data is table data, graph schema information, a programming language, a source program, or the like.
In the invention described in claim 7, the program on the storage medium according to claim 5 further includes a converting function of converting a source program into an intermediate language when the non-text data distinguished by the distinguishing function is a source program, and the similarity determining function determines the similarity between the intermediate-language programs produced by the converting function.
In the invention described in claim 8, in the storage medium according to claim 5, 6, or 7, the similarity determining function has a calculating function of calculating a degree of similarity by the vector space method, and determines similarity from the degree of similarity calculated by the calculating function.
In the invention described in claim 9, the third object is achieved by distinguishing text data from non-text data in a target document to be processed, and determining similarity to another document using at least one of the distinguished text data and non-text data.

[0007]

[Embodiments of the Invention] Preferred embodiments of the document processing apparatus and of the storage medium storing the document processing program according to the present invention are described below with reference to FIGS. 1 to 9.
(1) Outline of the embodiment
In this embodiment, the schemas of table data and graphs are semantically interpreted, so that numerical data and graphs related to a given document of interest (target document) can be retrieved on the basis of accurate clues, centered on quantity expressions in particular. Specifically, for each table data file, real values with units are calculated from schema information (×100 yen, Kg/m, etc.) that is either given or was attached to another table data file with the same structure and similar numeric content. The result is claritized (claritize) and made a target of similarity search. In doing so, graph axis names, their unit labels, and graph titles are also made targets of similarity determination, with especially high weights. Likewise, applying the same processing to normalize quantity expressions in text improves the performance of similarity search centered on numeric data. In addition, a reserved word conversion table or a syntax interpretation routine that supplies the semantics of a programming language converts a program into local semantics and an intermediate language; claritizing that result makes source programs of different kinds targets of similarity search as well. Source programs undergo the same language conversion. Furthermore, regions containing natural-language comments are claritized as ordinary English or Japanese text, and a two-layer inverted file is prepared for the same file. As a result, the file can be hit by similarity to a text-centered document and also by similarity to another source program.
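The step of computing real values with units from schema information can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Schema` class, its fields, and the sample column are assumptions introduced here, and only the idea of applying a multiplier such as "×100 yen" to raw cell values comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """Hypothetical schema record for one table column."""
    multiplier: float  # e.g. 100 for a column labeled "x100 yen"
    unit: str          # e.g. "yen"


def to_real_values(raw_cells, schema):
    """Apply the schema multiplier and attach the unit to each raw cell."""
    return [(cell * schema.multiplier, schema.unit) for cell in raw_cells]


# Illustrative column whose header says the figures are in units of 100 yen.
column = [12, 3.5, 40]
real_values = to_real_values(column, Schema(multiplier=100, unit="yen"))
```

Values normalized this way can be compared across tables whose raw figures use different scales, which is the motivation the paragraph gives for interpreting the schema before similarity search.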

[0008] (2) Details of the embodiment
FIG. 1 is a block diagram showing the configuration of the document processing apparatus. The document processing apparatus of this embodiment can be configured in a computer system including a personal computer, a word processor, or the like. As shown in FIG. 1, the document processing apparatus includes a control unit 11 for controlling the entire apparatus. A keyboard 12 and a mouse 13 as input devices, a display device 14, a printing device 15, a storage device 16, a storage medium drive device 17, a communication control device 18, an input/output interface (I/F) 19, and a character recognition device 20 are connected to the control unit 11 via a bus line 21 such as an address bus and a data bus.

[0009] The control unit 11 includes a central processing unit (CPU) 111, a ROM 112, and a RAM 113. The ROM 112 is a read-only memory in which the various programs and data that the CPU 111 needs for control and computation are stored in advance. The RAM 113 is a random access memory used by the CPU 111 as working memory. Various areas for the processing of this embodiment are reserved in the RAM 113.

[0010] The keyboard 12 has various keys, including kana keys and a numeric keypad for entering kana characters, numbers, symbols, and the like, function keys for executing various functions, and cursor keys. The mouse 13 is a pointing device: an input device with which the corresponding function is designated by left-clicking a key, icon, or the like displayed on the display device 14. A CRT, a liquid crystal display, or the like is used as the display device 14. The display device shows the results of input from the keyboard 12 and the mouse 13 and the results of processing by the CPU 111. The printing device 15 prints documents displayed on the display device 14, documents stored in the document database 164 of the storage device 16, results of processing by the CPU 111, and the like. Various printing devices, such as laser printers, dot printers, inkjet printers, page printers, thermal printers, and thermal transfer printers, are used as this printing device.

[0011] The storage device 16 consists of a readable and writable storage medium and a drive device for reading and writing various kinds of information, such as programs and data, to and from that medium. A hard disk is mainly used as the storage medium of the storage device 16, but any readable and writable medium among the various storage media used in the storage medium drive device 17 described later may be used instead. The storage device 16 has a kana-kanji conversion dictionary 161, a program storage unit 162, a data storage unit 163, a document database 164, a document vector database 166, and other storage units not shown (for example, a storage unit for backing up the programs and data stored in the storage device 16). The program storage unit 162 stores the document vector program of this embodiment and other programs. The data storage unit 163 stores various data.

[0012] The document database 164 stores typical documents and ordinary documents. As typical documents, for example, documents with typical content characterizing each workflow are stored. These are used to support workflow decisions, such as when, in deciding the workflow for a target document, a similar document is retrieved and its workflow is used as a reference. For classifying target documents, documents with typical content representing the characteristics of each category are stored as typical documents. The format of the documents stored in the document database 164 is not particularly limited. In addition to document data in various formats, such as text documents, HTML (Hyper Text Markup Language) documents, and JIS-format documents, the database also stores various data, programs, source programs, and documents and data in which these are mixed.

[0013] The document vector database 166 stores the document vector corresponding to each document, data item, program, or source program stored in the document database 164.

[0014] FIG. 2 conceptually shows the contents of the document vector database 166. As shown in FIG. 2, the importance f(x) obtained for each keyword x automatically extracted from a document, from data containing words and schema information, from a program, or the like is stored as an element value of the document vector for each document, data item, program, or source program. Document vectors are stored for each document (Aa, Ba, Ca, …), each data item (Ab, Bb, Cb, …), each program (Ac, Bc, Cc, …), and each source program (Ad, Bd, Cd, …), and are associated with the corresponding documents, data items, programs, and source programs stored in the document database 164. The dimension of each document vector is the number of adopted keywords x (important words and phrases). However, when the degree of similarity between two documents is obtained from their document vectors, the dimension of both vectors is the size of the union of the keywords of the two documents. In this case, for a keyword contained in only one document vector, the corresponding element value of the other document vector is defined as "0".
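The union rule described above can be sketched in a few lines. This is an illustrative assumption about data layout, not the patent's storage format: each document vector is held as a dictionary from keyword to importance f(x), and the function name `align_vectors` and the sample keywords are hypothetical.

```python
def align_vectors(vec_a, vec_b):
    """Expand two sparse keyword->importance maps onto the union of their keywords.

    A keyword present in only one document gets the element value 0 in the other.
    """
    keywords = sorted(set(vec_a) | set(vec_b))
    dense_a = [vec_a.get(k, 0) for k in keywords]
    dense_b = [vec_b.get(k, 0) for k in keywords]
    return keywords, dense_a, dense_b


# Illustrative sparse vectors (keywords and values are made up for the sketch).
doc_a = {"important": 1, "keyword": 18}
doc_b = {"important": 18, "politics": 21}
keywords, dense_a, dense_b = align_vectors(doc_a, doc_b)
```

Storing the vectors sparsely and expanding them only when two documents are compared keeps each stored vector at the size of its own keyword set.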

[0015] For example, in FIG. 2, the keywords of document Aa are "important, important word, importance, …" and the keywords of document Ca are "important, …, politics, …", and the document vectors of the two documents are as follows.
Document vector of document Aa = (1, 18, 19, …)
Document vector of document Ca = (18, …, 21, …)
When the degree of similarity between document Aa and document Ca is calculated, the keywords of both documents are taken to be "important, important word, importance, …, politics, …", and the document vectors of the two documents are defined as follows.
Document vector of document Aa = (1, 18, 19, …, 0, …)
Document vector of document Ca = (18, 0, 0, …, 21, …)
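As a rough worked example, the cosine measure named in the abstract can be applied to the elements of Aa and Ca that are visible above. The elided ("…") elements are unknown and are simply left out here, so the resulting number is illustrative only. The English keyword labels, the function name, and the threshold value are assumptions made for this sketch.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse keyword->importance vectors.

    Keywords missing from one vector count as 0, per the union rule.
    """
    dot = sum(value * vec_b.get(key, 0) for key, value in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Only the elements shown in the example; elided elements are omitted.
doc_aa = {"important": 1, "important word": 18, "importance": 19}
doc_ca = {"important": 18, "politics": 21}

similarity = cosine_similarity(doc_aa, doc_ca)
THRESHOLD = 0.5  # hypothetical value; the source does not state one
is_similar = similarity > THRESHOLD
```

With only these elements, the single shared keyword contributes a small dot product relative to the two vector lengths, so the documents would fall below any moderate threshold.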

[0016] The storage medium drive device 17 (FIG. 1) is a drive device with which the CPU 111 reads computer programs, data including documents, and the like from an external storage medium. The computer programs and other content stored on the storage medium include the programs for the various kinds of processing executed by the document processing apparatus of this embodiment, as well as the dictionaries, data, and the like used in that processing. Here, a storage medium means a medium on which computer programs, data, and the like are stored. Specifically, it includes magnetic storage media such as floppy disks, hard disks, and magnetic tape; semiconductor storage media such as memory chips and IC cards; media from which information is read optically, such as CD-ROM, MO, and PD (phase-change rewritable optical disc); media using paper, such as paper cards and paper tape (and media with a function equivalent to paper); and other media on which computer programs and the like are stored by various methods. The storage media mainly used in the document processing apparatus of this embodiment are CD-ROMs, floppy disks, and the like. In addition to reading computer programs from these storage media, the storage medium drive device 17 can write data stored in the RAM 113 or the storage device 16 to a writable storage medium such as a floppy disk.

[0017] In the document processing apparatus of this embodiment, the CPU 111 of the control unit 11 reads a computer program from an external storage medium set in the storage medium drive device 17 and stores (installs) it in the relevant units of the storage device 16. To execute the various kinds of processing of this embodiment, the CPU 111 reads the relevant program from the storage device 16 into the RAM 113 and executes it. However, the program can also be read directly into the RAM 113 from the external storage medium by the storage medium drive device 17, rather than from the storage device 16, and executed. Depending on the document processing apparatus, the programs of this embodiment may be stored in the ROM 112 in advance and executed by the CPU 111. Furthermore, the processing program of this embodiment and other programs and data may be downloaded from another storage medium via the communication control device 18 and executed.

[0018] The communication control device 18 can exchange documents in various formats, such as text and HTML, and various data, such as bitmap data, with other personal computers, word processors, and the like. The input/output I/F 19 is an interface for connecting various devices, such as a speaker that outputs voice, music, and the like. The character recognition device 20 is a device that recognizes characters written on paper or the like and outputs them in various formats, such as text or HTML; it consists of an image scanner, a character recognition program, and the like.

[0019] The similarity determination operation, by which the document processing apparatus of this embodiment configured as described above determines the similarity of documents and the like, is described below with reference to FIGS. 3 to 9. FIG. 3 shows the main operation of the processing that determines whether documents and the like are similar, and FIGS. 4 to 9 conceptually show the processing that determines the degree of similarity. The labels (A) to (H) on the right side of the flowchart in FIG. 3 correspond to (A) to (H) in FIGS. 4 to 8.

[0020] The CPU 111 acquires the target document A to be processed (FIG. 4(A)) and stores it in the RAM 113 (step 11). In accordance with the user's instruction, the target document A is acquired from the RAM 113 (when the document was created in the apparatus itself), from the document database 164 of the storage device 16, from the storage medium drive device 17 (when the document has already been created in the apparatus itself or in another apparatus), or from the communication control device 18 (when it is obtained by communication such as personal computer communication or the Internet). The target document A may be a text document, an HTML document, a JIS document, a document in another format, various data, a program, a source program, or a document or data in which these are mixed.

[0021] Next, as shown in FIG. 4(B), the CPU 111 reads from the document database 164 the document of interest B (the other document) whose similarity to the target document A is to be determined, and stores it in the RAM 113 (step 12). If a document vector for the document of interest B has already been created and stored in the document vector database 166, that existing document vector is used and no new one needs to be created. Like the target document, the document of interest B may be a text document, an HTML document, a JIS document, a document in another format, various data, a program, a source program, or a document or data in which these are mixed.

[0022] Next, if the user has entered the necessary parameters (for example, the threshold for similarity determination) from the keyboard 12 or the like, the CPU 111 acquires the entered values; if the user has entered nothing, the CPU 111 acquires the default parameter values stored in the data storage unit 163. In either case it stores the values in the RAM 113 (step 13). Next, the CPU 111 determines whether the target document A is a document, data, or a program or the like (step 14). When the CPU 111 determines that it is a document (step 14; document), the CPU 111 obtains the document vector V for the target document A stored in the RAM 113 (see reference numeral 50 in FIG. 5(C)) (step 15, see reference numeral 60 in FIG. 5(C)). The document vector V generally has as many dimensions as there are keywords (for example, n dimensions), but for ease of display it is shown as a two-dimensional vector in FIGS. 5 to 8 (the same applies below).
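Steps 13 and 14 can be pictured with the short sketch below. It is an assumption-laden illustration: the parameter name `similarity_threshold`, its default value, and both function names are invented here, and the source does not say how the CPU 111 decides which kind of content it is looking at, so the kind is passed in as a plain string.

```python
# Hypothetical defaults standing in for the values held in the data storage unit 163.
DEFAULT_PARAMETERS = {"similarity_threshold": 0.5}


def resolve_parameters(user_input):
    """Step 13: prefer values entered by the user, fall back to stored defaults."""
    params = dict(DEFAULT_PARAMETERS)
    params.update({k: v for k, v in user_input.items() if v is not None})
    return params


def next_step(kind):
    """Step 14: route the content to the step that handles its kind."""
    routes = {
        "document": "step 15",  # build the document vector directly
        "data": "step 16",      # extract words and schema information first
        "program": "step 18",   # decide between program and source program
    }
    return routes[kind]


params_default = resolve_parameters({})
params_user = resolve_parameters({"similarity_threshold": 0.8})
```

Keeping the defaults separate from user input means a missing entry never leaves the threshold undefined, which matches the fallback behavior the paragraph describes.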

[0023] When the CPU 111 determines that it is data (step 14; data), the CPU 111 extracts from the target document A stored in the RAM 113 words and phrases such as titles and notes, together with schema information such as the four arithmetic operations, 100 yen, and 10 kg/m (step 16, see reference numeral 51 in FIG. 5(D)). The CPU 111 then obtains the document vector V for the target document A stored in the RAM 113 (step 17, see reference numeral 61 in FIG. 5(D)).

[0024] When the CPU 111 determines that it is a program or the like (step 14; program or the like), it further determines whether it is a program or a source program (step 18). When the CPU 111 determines that it is a program (step 18; Y), the CPU 111 takes the relevant program of the target document A stored in the RAM 113 (see reference numeral 52 in FIG. 6(E)) and, using a reserved word conversion table or a syntax interpretation routine that supplies semantics (see reference numeral 521 in FIG. 6(E)), gives meaning to the programming language and converts it into a language (step 19, see reference numeral 522 in FIG. 6(E)). It stores the result in a predetermined area of the RAM 113 (step 20). The CPU 111 then obtains the document vector V for the converted language (reference numeral 522) stored in the predetermined area (step 21, see reference numeral 62 in FIG. 6(E)).
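One way to picture the reserved word conversion table is as a lookup that replaces language keywords with ordinary words before keyword extraction. The sketch below is only a guess at the idea: the table entries, the tokenizing regular expression, and the function name are all hypothetical, and the source does not describe the actual table or the syntax interpretation routine.

```python
import re

# Hypothetical table mapping reserved words to natural-language terms.
RESERVED_WORD_TABLE = {
    "if": "condition",
    "while": "loop",
    "for": "loop",
    "return": "result",
    "print": "output",
}


def convert_program(source):
    """Replace reserved words with their table entries; keep other identifiers as is."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    return [RESERVED_WORD_TABLE.get(token, token) for token in tokens]


words = convert_program("while (n > 0) { print(n); n = n - 1; }")
```

Because programs in different languages can map to the same natural-language terms, the converted output can be fed to the same document vector processing as ordinary text.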

[0025] On the other hand, when the CPU 111 determines that the data is a source program (step 18; N), the CPU 111 analyzes the corresponding source program of the target document A stored in the RAM 113 (see reference numeral 53 in FIG. 6(F)) (step 22, see reference numeral 530 in FIG. 6(F)). When the region is determined to be a program region in the determination of whether it is a natural-language region or a program region (step 22; N), the CPU 111 gives meaning to the programming language by using a reserved conversion table or a syntax interpretation routine that supplies semantics (see reference numeral 531 in FIG. 6(F)), converts it into a language, and stores it in a predetermined area of the RAM 113 (step 24, see reference numeral 532 in FIG. 6(F); step 25). The CPU 111 then obtains the document vector V for the converted language (reference numeral 532) stored in the predetermined area (step 25, see reference numeral 63a in FIG. 6(F)). For a comment region written in natural language (see reference numeral 533 in FIG. 6(F)), identified by analyzing the corresponding source program of the target document A stored in the RAM 113 (see reference numeral 53 in FIG. 6(F)) (step 22; Y, see reference numeral 530 in FIG. 6(F)), the CPU 111 obtains the document vector V from the text as it is (step 27, see reference numeral 63b in FIG. 6(F)).
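
The separation of a source program into a program region and a comment region, and the conversion of the program region into a language, can be sketched as follows. This is a minimal illustration only: the `#` comment syntax, the contents of `RESERVED_TO_WORDS`, and the function names are assumptions made for the example, since the text does not specify the conversion table or the programming language.

```python
import re

# Hypothetical conversion table mapping reserved words to terms (not from the patent).
RESERVED_TO_WORDS = {
    "if": "condition",
    "for": "loop",
    "while": "loop",
    "def": "function definition",
    "return": "return value",
}

def split_source(source: str) -> tuple[list[str], list[str]]:
    """Separate a source program into program-region lines and comment text.

    Assumes '#' starts a comment and never appears inside a string literal.
    """
    code_lines, comments = [], []
    for line in source.splitlines():
        code, sep, comment = line.partition("#")
        if code.strip():
            code_lines.append(code.strip())
        if sep and comment.strip():
            comments.append(comment.strip())
    return code_lines, comments

def to_converted_language(code_lines: list[str]) -> list[str]:
    """Replace reserved words with terms from the table; keep other identifiers."""
    words = []
    for line in code_lines:
        for token in re.findall(r"[A-Za-z_]\w*", line):
            words.append(RESERVED_TO_WORDS.get(token, token))
    return words

sample = "def area(w, h):  # compute rectangle area\n    return w * h\n"
code_lines, comments = split_source(sample)
converted = to_converted_language(code_lines)
```

In this sketch, `converted` would feed the document vector for the program region, while `comments` would be vectorized as ordinary natural-language text.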

[0026] Next, it is determined whether all documents have been processed (step 28). In this case, neither the target document A nor the document of interest B has been fully processed (step 28; N), so the process returns to step 14 and either processes the remaining documents of the target document A or calculates the document vector of the document of interest B. This operation is exactly the same as that described above, so its description is omitted. As already noted, this processing is unnecessary when the document vector of the document of interest B has been calculated in advance.

[0027] FIG. 9 is a flowchart showing the operation of the document vector creation processing and details the processing of steps 15, 17, 21, 25, and 27 in FIG. 3. The CPU 111 extracts the independent words of the search target document A or the document of interest B by performing morphological analysis (step 151). The CPU 111 also extracts candidate words (phrases), including noun phrases and compound noun phrases, from the target document A and stores them in a predetermined work area of the RAM 113 (step 152). It then determines the importance f(x) of each candidate word (phrase) from the frequency with which the extracted candidate word (phrase) appears in the target document A and from an evaluation function (step 153). Examples of the evaluation function include weighting applied to an important word when predetermined important words have been designated in advance, and weighting according to the type of candidate word (phrase), such as word, noun phrase, or compound noun phrase.
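
One way the importance f(x) of step 153 could be computed is sketched below. The weight values, the names `TYPE_WEIGHT` and `IMPORTANT_WORD_WEIGHT`, and the choice to multiply frequency by the weights are all assumptions for illustration; the text names the kinds of weighting but gives neither values nor a formula.

```python
from collections import Counter

# Hypothetical weights; the patent gives no concrete values.
TYPE_WEIGHT = {"word": 1.0, "noun_phrase": 1.5, "compound_noun_phrase": 2.0}
IMPORTANT_WORD_WEIGHT = 2.0

def importance(candidates, important_words=frozenset()):
    """Return {candidate: f(x)} from (text, kind) pairs.

    Assumed form: f(x) = frequency * type weight * important-word weight.
    """
    freq = Counter(text for text, _ in candidates)
    kind_of = {text: kind for text, kind in candidates}
    scores = {}
    for text, count in freq.items():
        weight = TYPE_WEIGHT[kind_of[text]]
        if text in important_words:
            weight *= IMPORTANT_WORD_WEIGHT
        scores[text] = count * weight
    return scores

candidates = [
    ("document", "word"),
    ("document", "word"),
    ("document vector", "noun_phrase"),
    ("similarity", "word"),
]
scores = importance(candidates, important_words={"similarity"})
```

Here "document" scores 2.0 from frequency alone, while "similarity" reaches the same score with a single occurrence because it was designated as an important word.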

[0028] Further, the CPU 111 determines the keywords a, b, ... of the target document A from the determined importance values f(x) (step 154). With the importance f(x) of each keyword as an element, the CPU 111 stores the document vector V = (f(a), f(b), ...) in the document vector storage area of the RAM 113 (step 155) and returns to the processing routine of FIG. 3. Once the document vectors V have been obtained, the CPU 111 obtains the similarity S between the target document A and the document of interest B from the cosine, which depends on the angle (q) between the document vector Av of the target document A and the document vector Bv of the document of interest B (step 29). That is, letting the two document vectors be Av and Bv and their absolute values be |Av| and |Bv|, the similarity S of the two document vectors is given by Equation 1 below.

[0029]

[Equation 1] Similarity S = cos(q) = (Av · Bv) / (|Av| × |Bv|)

[0030] The similarity S takes a value in the range -1 ≤ S ≤ 1; the closer it is to 1, the closer the two document vectors are to parallel, and the more similar the target document A and the document of interest B can be considered to be. For example, in the case of an ordinary document, data, or program, the similarity S is calculated using Equation 1 from the document vector Av of the target document A (reference numeral 70 in FIG. 7(G)) and the document vector Bv of the document of interest B (reference numeral 80 in FIG. 7(G)) (reference numeral 90 in FIG. 7(G)). As indicated by reference numeral 900 in FIG. 7(G), it is calculated by taking the cosine of the angle (q) between the two document vectors Av and Bv.
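
The cosine calculation of Equation 1 can be sketched as follows. The representation of a document vector as a keyword-to-importance dictionary, the treatment of a missing keyword as 0, and the return value for an empty vector are assumptions of this example; the text does not say how the elements of Av and Bv are aligned.

```python
import math

def cosine_similarity(av: dict[str, float], bv: dict[str, float]) -> float:
    """S = (Av . Bv) / (|Av| * |Bv|) for vectors keyed by keyword."""
    dot = sum(value * bv.get(key, 0.0) for key, value in av.items())
    norm_a = math.sqrt(sum(v * v for v in av.values()))
    norm_b = math.sqrt(sum(v * v for v in bv.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty vector is treated as dissimilar
    return dot / (norm_a * norm_b)

av = {"document": 2.0, "vector": 1.0}
bv = {"document": 4.0, "vector": 2.0}   # same direction as av
cv = {"spreadsheet": 3.0}               # no keyword in common with av

s_parallel = cosine_similarity(av, bv)
s_disjoint = cosine_similarity(av, cv)
```

Because importance values are non-negative in this sketch, S falls between 0 and 1 in practice, with parallel vectors giving a value of 1 up to rounding.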

[0031] In the case of a source program, the document vector Av of the target document A (reference numeral 71 in FIG. 8(H)) and the document vector Bv of the document of interest B (reference numeral 81 in FIG. 8(H)) are each classified into a program system and a natural-language system. The cosine is taken using Equation 1 for the angles (qa, qb) between the document vectors Ava and Bva and between the document vectors Avb and Bvb, respectively, and a similarity S is calculated for each (reference numeral 91 in FIG. 8(H)). As indicated by reference numerals 910a and 910b in FIG. 8(H), the similarities Sa and Sb are calculated by taking the cosine of the angles (qa, qb) between the two document vectors Av and Bv of the program system and of the natural-language system, respectively.
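
The two-part comparison for source programs can be sketched as follows. The dictionary keys `"program"` and `"comment"`, the dense list representation, and the function names are assumptions for illustration; only the idea of computing Sa and Sb separately comes from the text.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def source_program_similarity(doc_a, doc_b):
    """Return (Sa, Sb): program-system and natural-language-system similarity."""
    sa = cosine(doc_a["program"], doc_b["program"])
    sb = cosine(doc_a["comment"], doc_b["comment"])
    return sa, sb

doc_a = {"program": [1.0, 0.0, 2.0], "comment": [1.0, 1.0]}
doc_b = {"program": [2.0, 0.0, 4.0], "comment": [1.0, 0.0]}
sa, sb = source_program_similarity(doc_a, doc_b)
```

In this example the program parts point in the same direction (Sa close to 1) while the comments differ (Sb about 0.71), which shows why keeping the two values separate can be informative.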

[0032] When the CPU 111 determines that the similarity S is larger than a predetermined threshold (for example, 0.8) (step 30; Y), it determines that the target document A is similar to the document of interest B and stores the result in a predetermined storage area of the RAM 113 (step 31). When the CPU 111 determines that the similarity S is smaller than the predetermined threshold (for example, 0.8) (step 30; N), it determines that the target document A is dissimilar to the document of interest B and stores the result in a predetermined storage area of the RAM 113 (step 32).
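
The threshold decision of step 30 can be sketched as follows. The function name `judge` is hypothetical, and treating a similarity exactly equal to the threshold as dissimilar is an assumption, because the text covers only the "larger" and "smaller" cases.

```python
THRESHOLD = 0.8  # example value given in the text

def judge(similarity: float, threshold: float = THRESHOLD) -> str:
    """Classify a similarity value as 'similar' or 'dissimilar'.

    Assumption: S exactly equal to the threshold counts as dissimilar.
    """
    return "similar" if similarity > threshold else "dissimilar"

results = {s: judge(s) for s in (0.95, 0.8, 0.3)}
```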

[0033] When the above processing is complete, the CPU 111 saves the data stored in the predetermined storage areas of the RAM 113 in accordance with the user's instruction. That is, the CPU 111 reads from the RAM 113 the document vector V obtained in the document vector creation processing (step 13 in FIG. 3, FIG. 7) and stores it in the document vector database 166 of the storage device 16 in association with the target document A and the document of interest B stored in the document database 164. The determination result is likewise stored in the document vector database 166 or the like of the storage device 16 in association with the target document A and the document of interest B.

[0034] As described above, according to this embodiment, files created with spreadsheet software are also made search targets for natural language processing, and numerical data is placed in an inverted file together with information (schema) such as units and graph axis names, which improves the search accuracy for numerical data. In addition, by entering the data into the inverted file with information such as units taken into account (a kind of normalization), spreadsheet files can also be made targets of similarity search. Further, by using axis information, graphs and the like can also be made targets of similarity search. Thus, according to this embodiment, when the similarity between the target document and other documents is judged, the judgment is made against valid data without meaningless judgments, so the accuracy of similarity judgment can be improved. In particular, search accuracy for numerical data improves, and the reusability of existing spreadsheet data files increases dramatically.
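
The idea of indexing numerical data together with its schema can be sketched as follows. The tuple layout of `cells`, the `axis|unit` key format, and the function name are hypothetical; the text states only that units and axis names are included when the inverted file is built.

```python
from collections import defaultdict

def build_inverted_file(cells):
    """Index numeric cells under 'axis|unit' keys so values keep their schema.

    `cells` holds (doc_id, axis_name, unit, value) tuples (hypothetical layout).
    """
    index = defaultdict(list)
    for doc_id, axis_name, unit, value in cells:
        index[f"{axis_name}|{unit}"].append((doc_id, value))
    return dict(index)

cells = [
    ("sheet1", "sales", "yen", 1200.0),
    ("sheet2", "sales", "yen", 900.0),
    ("sheet2", "weight", "kg", 3.5),
]
index = build_inverted_file(cells)
```

With keys of this kind, a bare number such as 1200.0 is only matched against values that share the same axis name and unit, rather than against every occurrence of the number.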

[0035] The configuration and operation of this embodiment have been described above, but the present invention is not limited to these embodiments, and various modifications are possible within the scope of the invention described in the claims.

[Effects of the Invention] According to the present invention, text data and non-text data are distinguished from a target document to be processed, and similarity to another document is determined by using at least one of the distinguished text data and non-text data, so the similarity between the acquired target document and another document can be determined with high accuracy.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a document processing apparatus according to one embodiment of the present invention.

FIG. 2 is an explanatory diagram conceptually showing the contents of the document vector database in the embodiment.

FIG. 3 is a flowchart showing the main operation of the search processing in the embodiment.

FIG. 4 is part of an explanatory diagram conceptually showing the processing corresponding to each step of the search processing shown in FIG. 3 in the embodiment.

FIG. 5 is part of an explanatory diagram conceptually showing the processing corresponding to each step of the search processing shown in FIG. 3 in the embodiment.

FIG. 6 is part of an explanatory diagram conceptually showing the processing corresponding to each step of the search processing shown in FIG. 3 in the embodiment.

FIG. 7 is part of an explanatory diagram conceptually showing the processing corresponding to each step of the search processing shown in FIG. 3 in the embodiment.

FIG. 8 is part of an explanatory diagram conceptually showing the processing corresponding to each step of the search processing shown in FIG. 3 in the embodiment.

FIG. 9 is a flowchart showing the operation of the document vector creation processing in the embodiment.

[Explanation of symbols]

11 control unit
111 CPU
112 ROM
113 RAM
12 keyboard
13 mouse
14 display device
15 printing device
161 kana-kanji conversion dictionary
162 program storage unit
163 data storage unit
164 document database
165 document vector database
17 storage medium drive device
18 communication control device
19 input/output I/F
20 character recognition device

Claims

[Claims]

1. A document processing apparatus comprising: target document acquisition means for acquiring a target document to be processed; determination means for determining whether the target document acquired by the target document acquisition means has non-text data other than text data; distinguishing means for distinguishing text data and non-text data from the target document when the determination means so determines; and similarity determination means for determining similarity to another document by using at least one of the text data and the non-text data distinguished by the distinguishing means.

2. The document processing apparatus according to claim 1, wherein the non-text data is table data, graph schema information, a programming language, a source program, or the like.

3. The document processing apparatus according to claim 1 or claim 2, further comprising conversion means for converting the source program into an intermediate language when the non-text data distinguished by the distinguishing means is a source program, wherein the similarity determination means determines the similarity between the intermediate-language programs converted by the conversion means.

4. The document processing apparatus according to claim 1 or claim 2, wherein the similarity determination means includes calculation means for calculating a degree of similarity by a vector space method and determines similarity from the degree of similarity calculated by the calculation means.

5. A storage medium storing a computer-readable document processing program that causes a computer to realize: a target document acquisition function for acquiring a target document to be processed; a determination function for determining whether the target document acquired by the target document acquisition function has non-text data other than text data; a distinguishing function for distinguishing text data and non-text data from the target document when the determination function so determines; and a similarity determination function for determining similarity to another document by using at least one of the text data and the non-text data distinguished by the distinguishing function.

6. The storage medium according to claim 5, wherein the non-text data is table data, graph schema information, a programming language, a source program, or the like.

7. The storage medium storing the document processing program according to claim 5, wherein the program further causes the computer to realize a conversion function for converting the source program into an intermediate language when the non-text data distinguished by the distinguishing function is a source program, and the similarity determination function determines the similarity between the intermediate-language programs converted by the conversion function.

8. The storage medium storing the document processing program according to claim 6, wherein the similarity determination function has a calculation function for calculating a degree of similarity by a vector space method and determines similarity from the degree of similarity calculated by the calculation function.

9. A document processing method comprising: distinguishing text data and non-text data from a target document to be processed; and determining similarity to another document by using at least one of the distinguished text data and non-text data.