JPH07117961B2

JPH07117961B2 - Document data registration method

Info

Publication number: JPH07117961B2
Application number: JP2003582A
Authority: JP
Inventors: 好博嶋; 達也村上; 昌史古賀; 浩道藤澤
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1990-01-12
Filing date: 1990-01-12
Publication date: 1995-12-18
Anticipated expiration: 2010-12-18
Also published as: JPH03209564A

Description

Detailed Description of the Invention

[Industrial applications]

The present invention relates to a document data registration method for a document database, in which bibliographic items are registered automatically from documents printed or written on paper.

[Prior art]

A conventional method of extracting bibliographic items from documents printed on paper is discussed in the Proceedings of the 8th International Conference on Pattern Recognition, pp. 745-748 (1986). That paper describes extracting character strings such as author names and affiliations from the image data of a document.

[Problems to be Solved by the Invention]

When author names and affiliations are extracted with the above conventional technique and registered in a document database as attribute information for retrieval, the notation of author names and affiliations varies from document to document. Consequently, if the extracted author names and affiliations are registered as attribute information exactly as they appear, a relevant document may not be found when the database is searched, and retrieval omissions occur.

An object of the present invention is to register attribute information that allows relevant documents to be retrieved without omission even when the notation of author names and affiliations in printed documents varies from document to document. A further object is to provide a document database that analyzes the reference section of a printed document and automatically associates the document with other documents.

[Means for Solving the Problems]

To achieve the above objects, the extracted bibliographic items are transformed to generate variants, and the variants are used to search for the desired document. To associate a document with other documents, the references listed in the document are analyzed and their bibliographic items are extracted, so that the relationships between documents are registered. Furthermore, search conditions are set from the generated variants so that the relevant documents are retrieved.
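
As a rough illustration of transforming a bibliographic item into variants and setting a search condition from them, the following minimal Python sketch reorders and abbreviates the parts of an author name in the manner of the "John Miles Smith" example of FIG. 4. The function names (`initial`, `author_variants`, `search_condition`), the exact set of abbreviation patterns, and the `field="value" OR ...` query syntax are assumptions made for illustration only; the patent does not specify them.

```python
def initial(name: str) -> str:
    """Abbreviate a name part to its initial, e.g. 'John' -> 'J.'."""
    return name[0].upper() + "."


def author_variants(first: str, middle: str, last: str) -> list[str]:
    """Generate notation variants of one author name (illustrative rules only)."""
    f, m = initial(first), initial(middle)
    candidates = [
        # first, middle, last order and its abbreviated forms
        f"{first} {middle} {last}",
        f"{first} {m} {last}",
        f"{f}{m} {last}",
        f"{first} {last}",
        f"{f} {last}",
        # last, first, middle order and its abbreviated forms
        f"{last}, {first} {middle}",
        f"{last}, {first}",
        f"{last}, {f}{m}",
        f"{last}, {f}",
        # middle, last, first order and its abbreviated forms
        f"{middle} {last}, {first}",
        f"{m} {last}, {f}",
    ]
    # Remove duplicates while keeping the generation order.
    return list(dict.fromkeys(candidates))


def search_condition(field: str, variants: list[str]) -> str:
    """Join the variants into one OR condition (hypothetical query syntax)."""
    return " OR ".join(f'{field}="{v}"' for v in variants)


variants = author_variants("John", "Miles", "Smith")
print(search_condition("author", variants))
```

Because every variant ends up in a single OR condition, a document registered under any one of these notations would match the same query, which is the retrieval-omission problem the method is meant to address.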

[Action]

The document database generates variants from the extracted bibliographic items. Documents related to the document of interest are thereby registered together with their relationships, so that other documents related to a given document can be retrieved.
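
The registration of related documents can be sketched as follows. This is a minimal in-memory sketch, not the patented implementation: the `Document` record, the `normalize` helper, and the `register_relations` function are hypothetical names, and matching on author variants alone is a simplifying assumption (a practical system would presumably also compare titles, journal names, and years).

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    title: str
    author_variants: set[str]                        # registered notations of the author
    related: set[str] = field(default_factory=set)   # ids of related documents


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so notations compare consistently."""
    return " ".join(text.lower().split())


def register_relations(citing: Document,
                       cited_authors: list[set[str]],
                       database: dict[str, Document]) -> None:
    """Link `citing` to every stored document whose author variants overlap
    the variants generated for one of its cited references."""
    for variants in cited_authors:
        wanted = {normalize(v) for v in variants}
        for doc in database.values():
            if doc.doc_id == citing.doc_id:
                continue
            stored = {normalize(v) for v in doc.author_variants}
            if wanted & stored:
                # Register the relationship in both directions.
                citing.related.add(doc.doc_id)
                doc.related.add(citing.doc_id)


db = {
    "D1": Document("D1", "Paper A", {"Smith, J.M.", "John M. Smith"}),
    "D2": Document("D2", "Paper B", {"Doe, A."}),
}
new_doc = Document("D3", "Paper C", {"Lee, K."})
db[new_doc.doc_id] = new_doc
register_relations(new_doc, [{"J.M. Smith", "john m. smith"}], db)
print(new_doc.related)
```

Storing the relation on both records means that a later search starting from either document can reach the other.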

[Example]

An embodiment of the present invention is described below with reference to FIG. 1. The image input unit 100 consists of an optical input device such as an image scanner, and the surface image of a document printed on paper is input as image data one page at a time. The input image data is sent to the bibliographic item extraction unit 101, where the title, author names, affiliations, and references are separated and extracted as bibliographic items to be used for retrieval. The variant generation unit 102 then generates variants with different notations for the extracted bibliographic items. The document file 103 stores not only the image data and the extracted bibliographic items but also the variants generated for them, so that the variants can be used for retrieval. The search terminal 104 issues instructions to search the many documents stored in the document file 103 using bibliographic items and their variants, and displays the search results on its screen. According to this embodiment, bibliographic items are extracted from a document printed on paper and variants with different notations are generated for them, so retrieval omissions are eliminated when a desired document is searched for among many documents using bibliographic items.

FIG. 2 illustrates the procedure for creating a search command for document retrieval. First, in step 200, the user of the document database enters a search word. Next, in step 201, variants are generated for the entered search word. Here the variants are generated for the search word entered by the user, but they may instead be generated in advance inside the document database. In step 202, the generated variants are displayed on the screen, which allows the user to correct or add to them. In step 203, a command file for searching the document database is generated using the variants. The document search is then executed using this command file.

FIG. 3 illustrates the generation of variants of an affiliation name. When the affiliation name 300 is, for example, "Smith Kline and French", the affiliation-name variant generation unit 301 generates variants for it. The generated variants are shown at 302. Here, for the original affiliation name 300, the following variants are generated: 303 "Smith, Kline French", 304 "Smith, Kline & French", 305 "Smith, Kline, and French", 306 "Smith, Kline, and French", 307 "Smith Kline, French", 308 "SK and F", 309 "SK & F", 310 "SK F", 311 "Smith Kline", 312 "Smith Kline and French", and 313 "Smith Kline French". According to this embodiment, even when the name of the same affiliation is written differently from document to document, variants of the affiliation name are generated, so retrieval omissions caused by differences in notation can be prevented.

FIG. 4 illustrates the generation of variants of an author name. When the author name 400 is, for example, "John Miles Smith", the author-name variant generation unit 401 generates variants for it. The generated variants are shown at 402. Here, for the original author name 400, the following variants are generated: 403 "Miles Smith, John", 404 "Smith, John", 405 "Smith, J.", 406 "Smith, J.M.", 407 "M. Smith, J.", 408 "J.M. Smith", 409 "J. Smith", 410 "John Smith", and 411 "John M. Smith". According to this embodiment, even when the name of the same author is written differently from document to document, variants of the author name are generated, so retrieval omissions caused by differences in notation can be prevented.

FIG. 5 illustrates the procedure for registering relationships between documents. In step 500, the surface image of a document printed on paper is input. In step 501, characters are extracted and recognized. In step 502, the layout of the character strings (sequences of characters) is analyzed, and in step 503 bibliographic items are extracted from the character strings. In step 504, variants are generated from the extracted bibliographic items, and in step 505 search conditions for related documents are generated using those variants. In step 506, documents satisfying the generated search conditions are searched for among the many stored documents. In step 507, the documents found in the preceding step are registered as related documents. According to this embodiment, the bibliographic items of a printed document are read and relationships between documents are registered automatically, so registration work in the document database can be automated and labor can be saved.

FIG. 6 illustrates the search procedure of the document database. Bibliographic items are entered in block 600, and in block 601 variants are created from the bibliographic items and a search command is generated. In block 602, a search condition file is created from the generated search command, and in block 604 the search is executed. At this point, a search execution instruction is given in block 603, and the relevant documents are retrieved from file 605, which stores many documents. The search results are stored in search result file 606, and block 607 performs the display processing of the results. The display format of the search results is specified in block 608, and the results are shown on display 609.

FIG. 7 shows the procedure for extracting bibliographic items from references. In step 700, the reference section is input. At this point the reference section has already been extracted from the printed document and character-recognized, so it is available as character codes. In step 701, blanks are first deleted from the reference section. The reference number is then extracted in step 702, the author names in step 703, the title in step 704, the journal name in step 705, the volume/issue number in step 706, the page numbers in step 707, the publication year in step 708, and the publication month in step 709. In step 710, the bibliographic items extracted from the reference section are registered. According to this embodiment, bibliographic items can be extracted automatically from the reference section, so the work of linking the document to the other documents cited in its reference section is performed automatically.

FIG. 8 shows the procedure of the blank deletion processing of step 701. Step 800 repeats the processing from step 801 onward until the characters are exhausted. In step 801, the character in one column is read at a time, and step 802 determines whether that character is a blank. If it is a blank, step 803 is executed; if it is not, processing returns to step 801 and the character in the next column is read. Step 803 determines whether the characters before and after the blank satisfy any of the following conditions: the blank is followed by "&"; the blank is followed by "and"; the blank is preceded by "."; the blank is preceded by ","; the blank is preceded by a double quotation mark; or the blank is followed by "(". If none of these conditions is satisfied, the blank is deleted in step 804 and processing returns to step 801. If one of them is satisfied, processing returns to step 801 with the blank left in place.

FIG. 9 shows the procedure for extracting the reference number of step 702. In step 900, the character in the first column is read. Step 901 then determines whether that character is "[", "(", or a digit. If so, processing proceeds to step 902; otherwise, extraction of the reference number ends as unsuccessful in step 906. Step 902 repeats the processing from step 903 onward, column by column, up to column n, where n is, for example, 15. In step 903, the character in the current column is read. Step 904 then determines whether that character is "]", ")", or ".". If so, step 905 registers the current column as the end column of the reference number, and processing ends in step 907. If the condition of step 904 is not satisfied, processing returns to step 903. If the loop of step 902 reaches column n without the condition being satisfied, step 906 registers that extraction of the reference number was unsuccessful.

FIG. 10 illustrates the author name extraction procedure of step 703. Step 1000 determines whether column i satisfies the condition for the start column of the author name. The condition is that column i is the nearest character column to the right of the last column of the reference number; if it is satisfied, step 1001 registers column i as the start column of the author name. Step 1002 determines whether column j satisfies the condition for the end column of the author name. The condition is: (a) column j is a character and column j+1 is a double quotation mark; or (b) column j is a character and the columns from j+1 onward are ".:"; or (c) column j is a character, the columns from j+1 onward are ".(", and this is the leftmost such position; or (d) column j is a character, column j+1 is ".", column j+2 is ",", and column j+3 is an uppercase letter; and (e) there is no "and" immediately to the right; and (f) a blank within a certain range to the right is not followed by ".," and a blank. If this condition is satisfied, step 1003 registers column j as the end column of the author name.

FIG. 11 illustrates the title extraction procedure of step 704. Step 1100 determines whether column i is the start column of the title. The condition is: (a) column i-1 is a double quotation mark, column i is a character, and column i is the leftmost such position; or (b) column i is the leftmost such position, column i-1 is ":", and there is no blank between column i and the nearest period "." to its left; or (c) column i is the nearest such column to the right of the end column of the author name, and either column i-1 is a blank and column i is an uppercase letter, or column i-1 is "," and column i is an uppercase letter, or column i-1 is "." and column i is an uppercase letter. If the condition is satisfied, step 1101 registers column i as the start column of the title. Step 1102 determines whether column j is the end column of the title. The condition is: (a) column j is a character, column j+1 is ".", and column j+2 is a blank; or (b) column j is a character, column j+1 is ".", and column j+2 is a lowercase letter; or (c) column j is a character, column j+1 is ",", column j+2 is a blank, and column j+3 is an uppercase letter; or (d) column j is a character, column j+1 is ",", and column j+2 is an uppercase letter. If the condition is satisfied, step 1103 registers column j as the end column of the title.

FIG. 12 illustrates the journal name extraction procedure of step 705. Step 1200 determines whether column i is the start column of the journal name. The condition is: (a) column i-1 is ","; or (b) column i-1 is a double quotation mark; or (c) column i-1 is a blank; or (d) column i-1 is ".". If the condition is satisfied, step 1201 registers column i as the start column of the journal name. Step 1202 determines whether column j is the end column of the journal name. The condition is: (a) column j+1 is "("; or (b) column j+1 is "," and column j+2 is a lowercase letter; or (c) column j+1 is "," and column j+2 is a digit; or (d) column j+1 is ",", column j+2 is a blank, and column j+3 is a lowercase letter; or (e) column j+1 is ",", column j+2 is a blank, and column j+3 is a digit. If the condition is satisfied, step 1203 registers column j as the end column of the journal name.

FIG. 13 illustrates the volume/issue number extraction procedure of step 706. Columns i and j are assumed to lie after the end column of the journal name. Step 1300 determines whether column i is the start column of the volume/issue number. The condition is: (a) column i-1 is a non-digit, columns i and i+1 are digits, and column i+2 is a non-digit; or (b) column i-1 is a non-digit, column i is a digit, and column i+1 is a non-digit; or (c) column i-1 is "," and column i is "v"; or (d) column i-1 is ",", columns i and i+1 are digits, and column i+2 is a non-digit; or (e) column i-2 is ",", column i-1 is a blank, and column i is "v"; or (f) column i-2 is ",", column i-1 is a blank, and column i is a digit. If the condition is satisfied, step 1301 registers column i as the start column of the volume/issue number. Step 1302 then determines whether column j is the end column of the volume/issue number. The condition is: (a) column j is a digit and column j+1 is ","; or (b) column j is a digit and column j+1 is a blank; or (c) column j is a digit and column j+1 is "(". If the condition is satisfied, step 1303 registers column j as the end column of the volume/issue number.

FIG. 14 shows the page number extraction procedure of step 707. Step 1400 determines whether column i is the start column of the page number. The condition is: (a) columns i and i+1 are "p" and column i+2 is "."; or (b) columns i through i+2 are digits and column i+3 is "-"; or (c) columns i and i+1 are digits and column i+2 is "-"; or (d) column i is a digit and column i+1 is "-". If the condition is satisfied, step 1401 registers column i as the start column of the page number. Step 1402 then determines whether column j is the end column of the page number. The condition is: (a) column j is a blank; or (b) column j is ","; or (c) column j is "."; or (d) column j is "("; or (e) column j-3 is "-" and columns j-2 through j are digits; or (f) column j-2 is "-" and columns j-1 and j are digits; or (g) column j-1 is "-" and column j is a digit. If the condition is satisfied, step 1403 registers column j as the end column of the page number.

FIG. 15 illustrates the procedure for extracting the publication year of step 708. The condition for column i to be the start column and column j to be the end column of the publication year is as follows: (a) column i is "1", column i+1 is "9", and columns i+2 and j (= i+3) are digits; and (b) column i-1 is a blank, ",", "(", or "."; and (c) column j+1 is a blank, ",", "(", or "."; or (d) column i-1 is an apostrophe ('), column i is a digit from 4 to 9, and column j (= i+1) is a digit. If the condition is satisfied, step 1501 registers column i as the start column and column j as the end column of the publication year.

FIG. 16 illustrates the procedure for extracting the publication month of step 709. First, step 1600 determines whether column j is the nearest character column to the left of the end column of the publication year; if so, column j is registered as the end column of the publication month. Step 1602 then determines whether column i is the start column of the publication month. The condition is: (a) column i is the nearest character column to the left of the publication month; and (b) column i-1 is "," and column i is a character; or (c) column i-1 is "(" and column i is a character. If the condition is satisfied, step 1603 registers column i as the start column of the publication month.

FIG. 17 shows the results of extracting bibliographic items from references. The results of extracting bibliographic items from references 1700, 1702, and 1704 are shown at 1701, 1703, and 1705, respectively. In each case, bibliographic items such as the reference number, author names, and title are extracted.

FIG. 18 illustrates the screen processing procedure of the document database. Step 1800 displays the search input window, and step 1801 displays the search result window. Search conditions are entered in step 1802, and the search is executed according to them in step 1803. In step 1804, a menu for displaying the search results is selected, and in step 1805 the search results are displayed in the window.

FIG. 19 is a flowchart of variant generation, taking an author name as an example. First, the author name is input in step 1900, and the first name is extracted in step 1901. The middle name is extracted in step 1902 and the last name in step 1903. The names are then reordered. In step 1904, the author name is arranged in the order first name, middle name, last name, and abbreviated names are generated in step 1905. In step 1906, the author name is arranged in the order last name, first name, middle name, and abbreviated names are generated in step 1907. Next, in step 1908, the author name is arranged in the order middle name, last name, first name, and abbreviated names are generated in step 1909. As another way of generating variants, a dictionary of variants may be provided and searched to retrieve a headword and its variants.
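
The blank deletion of FIG. 8 and the reference number extraction of FIG. 9 can be sketched in Python as below. This is a literal reading of the stated conditions, not the authors' code: the function names `delete_blanks` and `reference_number_end` are hypothetical, columns are taken as 0-based string indices, and treating both the straight and the curly closing quotation mark as "a double quotation mark" is an assumption.

```python
from typing import Optional

# Characters that, when they precede a blank, cause the blank to be kept (FIG. 8).
KEEP_IF_BEFORE = {".", ",", '"', "\u201d"}


def delete_blanks(text: str) -> str:
    """Delete every blank that does not meet one of the FIG. 8 keep conditions."""
    out = []
    for pos, ch in enumerate(text):
        if ch != " ":
            out.append(ch)
            continue
        before = text[pos - 1] if pos > 0 else ""
        after = text[pos + 1:]
        keep = (
            after.startswith("&")
            or after.startswith("and")
            or after.startswith("(")
            or before in KEEP_IF_BEFORE
        )
        if keep:
            out.append(ch)
    return "".join(out)


def reference_number_end(text: str, n: int = 15) -> Optional[int]:
    """Return the index of the end column of the reference number (FIG. 9),
    or None if extraction is unsuccessful."""
    if not text:
        return None
    first = text[0]
    if not (first in "[(" or first.isdigit()):
        return None                      # step 901 fails -> step 906
    for pos in range(min(n, len(text))):
        if text[pos] in "]).":
            return pos                   # step 905: end column found
    return None                          # column n reached -> step 906


entry = delete_blanks('[12] J. Smith and K. Lee, "Parsing," 1986.')
print(entry, reference_number_end(entry))
```

Note that, read literally, the keep conditions also remove blanks between ordinary words and after "and" or "&"; the sketch reproduces that behavior rather than guessing at refinements the text does not state.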

[Effects of the Invention]

According to the present invention, even for documents in which the notation of author names and affiliation names varies from document to document, bibliographic items can be extracted from the printed page and variants of them can be generated, which prevents retrieval omissions in the document database. In addition, the title, author names, publication year and month, and other items can be extracted automatically from a document's references, so relationships between documents can be registered automatically and the registration work is reduced.
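
As one concrete case of this automatic extraction, the publication-year rule described for FIG. 15 can be sketched as a column scan. The function name `find_publication_year`, the 0-based indexing, and the acceptance of a curly apostrophe alongside the straight one are assumptions for illustration. The delimiter set follows the text literally, so a year followed directly by ")" is not matched, because ")" is not among the listed trailing characters.

```python
from typing import Optional

# Characters allowed immediately before and after a four-digit year (FIG. 15).
DELIMITERS = set(" ,(.")


def find_publication_year(text: str) -> Optional[tuple[int, int]]:
    """Return (start, end) columns of the publication year, or None."""
    for i in range(1, len(text)):
        # Four-digit form: "19dd" with a listed delimiter on both sides.
        j = i + 3
        if (
            j + 1 < len(text)
            and text[i:i + 2] == "19"
            and text[i + 2:j + 1].isdigit()
            and text[i - 1] in DELIMITERS
            and text[j + 1] in DELIMITERS
        ):
            return i, j
        # Two-digit form: apostrophe, a digit from 4 to 9, then any digit.
        j = i + 1
        if (
            j < len(text)
            and text[i - 1] in "'\u2019"
            and text[i] in "456789"
            and text[j].isdigit()
        ):
            return i, j
    return None


for sample in ["J. ACM, 1986, pp.1-9", "Proc. ICPR '86", "no year here"]:
    print(sample, find_publication_year(sample))
```

The scan returns the first position that satisfies either form, so a reference containing several candidate years would need an additional tie-breaking rule that the text does not give.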

[Brief description of drawings]

FIG. 1 is a configuration diagram of an embodiment of the present invention; FIG. 2 is an explanatory diagram of the procedure for creating a search command for document retrieval; FIG. 3 is an explanatory diagram of variant generation for an affiliation name; FIG. 4 is an explanatory diagram of variant generation for an author name; FIG. 5 is a flowchart of registering relationships between documents; FIG. 6 is a flowchart showing the search procedure of the document database; FIG. 7 is a flowchart of extracting bibliographic items from references; FIG. 8 is a flowchart of the blank deletion processing; FIG. 9 is a flowchart of reference number extraction; FIG. 10 is a flowchart of author name extraction; FIG. 11 is a flowchart of title extraction; FIG. 12 is a flowchart of journal name extraction; FIG. 13 is a flowchart of volume/issue number extraction; FIG. 14 is a flowchart of page number extraction; FIG. 15 is a flowchart of publication year extraction; FIG. 16 is a flowchart of publication month extraction; FIG. 17 is an explanatory diagram of the results of analyzing references; FIG. 18 is an explanatory diagram of the screen processing procedure of the document database; and FIG. 19 is a flowchart of variant generation.

Explanation of reference numerals

100: image input unit; 101: bibliographic item extraction unit; 103: document file; 201: variant generation step; 501: character extraction and character recognition step; 700: reference section input step; 702: reference number extraction step; 703: author name extraction step; 704: title extraction step; 705: journal name extraction step; 706: volume/issue number extraction step; 707: page number extraction step; 708: publication year extraction step; 709: publication month extraction step; 1700: reference data; 1701: extracted bibliographic item data.

Continuation of the front page

(72) Inventor: Hiromichi Fujisawa, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo

(56) References cited:
JP H01-191258 A (JP, A)
JP S62-11932 A (JP, A)
JP S62-216074 A (JP, A)
JP S63-153630 A (JP, A)
Proceedings of the 41st National Convention of the Information Processing Society of Japan (second half of 1990), pp. 4-168 to 169
Operations Research, vol. 31, No. 3 (1986), pp. 152-157
Tetsuro Ito, "Information Retrieval" (August 10, 1986), Shokodo, pp. 15-54

Claims

[Claims]

1. A document data registration method in which characters in the reference section of a printed document are recognized, the character strings of the reference section are analyzed, and bibliographic items of each cited document, including its author name, title, journal name, and volume/issue number, are extracted and registered, the method characterized by:
defining in advance, on the basis of the notation rules of the reference section, separation rules for dividing a run of characters in the reference section according to the appearance of specific characters, symbols, and character types;
inputting the reference section;
detecting blank characters in the input reference section and, where no delimiter is present before or after a blank character, accepting that blank character as input;
extracting a reference number whose start position is a leading bracket symbol or digit and whose end position is a bracket symbol or delimiter;
extracting an author name whose start position is the character nearest to the right of the end position of the extracted reference number and whose end position is the character to the left of a bracket symbol or delimiter;
extracting a title whose start position is either the character to the right of a delimiter located to the right of the extracted author name or an uppercase letter to the right of the author name, and whose end position is the character to the left of a delimiter;
extracting a journal name whose start position is the character to the right of the extracted title and to the right of a delimiter, and whose end position is the character to the left of a bracket symbol or delimiter;
extracting a volume/issue number whose start position is an initial letter indicating a volume/issue number, or a digit, to the right of the extracted journal name, and whose end position is the character to the left of a bracket symbol or delimiter;
extracting a page number between a start position and an end position, the end position being the character to the left of a bracket symbol or delimiter;
detecting a digit string that begins with a preset number and extracting it as the publication year;
detecting a month-name character string located to the left of the publication year and extracting it as the publication month;
registering the extracted reference number, author name, title, journal name, volume/issue number, page number, publication year, and publication month as bibliographic items; and
applying, to the portion of the bibliographic items that runs from the character nearest to the right of the end position of the reference number to the character to the left of a bracket symbol or delimiter, at least one transformation selected from rearranging the order of the names and generating an abbreviation, thereby generating variant forms of the author name and registering the author name in a plurality of notations.
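The delimiter-driven extraction recited in the claim can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the entry layout `[n] authors: title, journal, Vol.N, pp.A-B, Mon. YYYY`, the use of `:` and `,` as delimiters, the year prefix `19`, the function name `parse_reference`, and the sample entry are all invented for this example. Regular expressions stand in for the position-by-position scanning that the claim describes.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"


def parse_reference(ref: str) -> dict:
    """Split one reference entry into bibliographic items (assumed layout)."""
    fields = {}

    # Reference number: starts at a bracket or digit, ends at a bracket or delimiter.
    m = re.match(r"\s*[\[(]?(\d+)[\]).]?\s*", ref)
    fields["number"] = m.group(1) if m else None
    rest = ref[m.end():] if m else ref

    # Publication year: a digit string beginning with a preset prefix ("19" assumed).
    year = re.search(r"\b(19\d{2})\b", rest)
    fields["year"] = year.group(1) if year else None

    # Publication month: a month name located directly to the left of the year.
    month = (re.search(rf"\b({MONTHS})[a-z]*\.?\s*$", rest[:year.start()])
             if year else None)
    fields["month"] = month.group(1) if month else None

    # Author names: from just after the reference number up to the assumed ':' delimiter.
    authors, _, rest = rest.partition(":")
    fields["authors"] = [a.strip() for a in authors.split(",") if a.strip()]

    # Title, then journal name: each ends at the next ',' delimiter.
    parts = [p.strip() for p in rest.split(",")]
    fields["title"] = parts[0] if len(parts) > 0 else None
    fields["journal"] = parts[1] if len(parts) > 1 else None

    # Volume: a leading "Vol" marker followed by digits.
    vol = re.search(r"\bVol\.?\s*(\d+)", rest)
    fields["volume"] = vol.group(1) if vol else None

    # Page numbers: a leading "p"/"pp" marker followed by a digit range.
    pages = re.search(r"\bpp?\.?\s*(\d+)\s*-\s*(\d+)", rest)
    fields["pages"] = (pages.group(1), pages.group(2)) if pages else None
    return fields


# Invented sample entry, not a real citation.
sample = "[3] A. Author, B. Writer: Sample title, Sample Journal, Vol.12, pp.34-56, Mar. 1990"
print(parse_reference(sample))
```

A real reference section mixes several notations, which is why the claim has the separation rules defined in advance from the notation rules rather than fixed to one layout as in this sketch.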