JPH02297151A

JPH02297151A - Document editing device

Info

Publication number: JPH02297151A
Application number: JP1036283A
Authority: JP
Inventors: Yosuke Mori; 庸輔森; Mitsuo Takei; 三雄武井; Yukio Funyu; 舟生　幸雄; Shigeru Ogawa; 茂小川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1989-02-17
Filing date: 1989-02-17
Publication date: 1990-12-07

Abstract

PURPOSE: To eliminate the need to designate keywords one by one, and so reduce the burden on the user, by providing a keyword extraction part that automatically extracts keywords from the contents of a document data buffer and a keyword buffer that holds the extracted keywords. CONSTITUTION: The keyword extraction part 11 takes the given document data and the dictionaries 5, 6 and 7 used for kana-kanji conversion, and automatically extracts from the document data the words and phrases registered in those dictionaries. The dictionaries used for extraction are the dictionary 6, in which the user can register words and phrases, and the dictionary 7, in which technical terms are registered in advance. This improves the efficiency of the extraction processing and prevents general words and phrases, which rarely serve as keywords, from being extracted. The extracted keywords are held in the keyword buffer 12. Keywords contained in the document data are thus extracted automatically, without being designated at every input, and can be used to generate an index of the document and to retrieve the document.

Description

[Detailed Description of the Invention] [Industrial Application Field] The present invention relates to an electronic document editing device such as a word processor, and in particular to a document editing device that can automatically extract, from a created document, words and phrases that may serve as keywords.

[Conventional technology]

As described in Japanese Patent Application Laid-Open No. Sho 61-131163, a conventional document editing device extracted the words and phrases that serve as keywords for the table of contents and index by having the user designate them in the document with an input device.

[Problem to be solved by the invention]

The above prior art gives no consideration to simplifying input operations or, in particular, to automating keyword extraction. The device was made to recognize a keyword by having the user designate, at document input time, each word or phrase to be listed in the table of contents or index. A designation operation was therefore required during input, which complicated document entry. Moreover, when entering a document in which the same keyword appears in several places, the designation had to be repeated every time the word was entered, which increased the burden of input.

An object of the present invention is to provide a document editing device that reduces the burden of operation by automatically extracting, from a document, words and phrases that may serve as keywords.

[Means to solve the problem]

The above object is achieved by a document editing device comprising a data input section that accepts input from a keyboard or mouse, a kana-kanji conversion section that converts the input data into kanji using various dictionaries, a document editing section that edits Japanese text containing both kana and kanji, a document data buffer that holds the created text, and a display control section that displays the text on a display, the device being further provided with a keyword extraction section that automatically extracts keywords from the contents of the document data buffer and a keyword buffer that holds the extracted keywords.

The object is also achieved by a document editing device in which extraction efficiency is raised by having the keyword extraction section use the dictionaries employed for kana-kanji conversion to extract keywords, and by having it extract as keywords character strings that match a specific rule, such as consecutive katakana characters.

[Effect]

In the above document editing device, the keyword extraction section uses the given document data and the dictionaries used for kana-kanji conversion to automatically extract, from the document data, the words and phrases registered in the dictionaries. The dictionaries used here are the dictionary in which the user can register words and phrases (the user dictionary) and the dictionary in which technical terms are registered in advance (the technical term dictionary). This raises the efficiency of the extraction process and avoids extracting general words that are unlikely to be keywords.

The extracted keywords are held in the keyword buffer, which is used to recognize a later occurrence of the same word in the document as a keyword without consulting the dictionaries. A character string that matches a specific rule, such as consecutive katakana characters, is extracted as a keyword automatically, again without consulting the dictionaries.

Consequently, the keywords contained in the given document data are extracted automatically without being designated at each input, the operation does not become cumbersome even when the same word appears several times, and the extracted keywords can be used to create an index of the document and as keywords for document retrieval.

[Example]

Embodiments of the present invention are described below with reference to FIGS. 1 to 3.

FIG. 1 is a functional block diagram showing one embodiment of the document editing device according to the present invention. In FIG. 1, 1 is a keyboard, 2 is an input control section, 3 is a document editing section, 4 is a kana-kanji conversion section, 5 is a basic dictionary, 6 is a user dictionary, 7 is a technical term dictionary, 8 is a display control section, 9 is a display, 10 is a document data buffer, 11 is a keyword extraction section, and 12 is a keyword buffer.

In the above configuration, a document is input from the keyboard 1 and processed by the document editing section 3 via the input control section 2. The data normally input consists of alphanumeric or kana character codes, and word processors and document editing devices currently on the market use kana-kanji conversion to generate Japanese text containing both kana and kanji from these code strings. The kana-kanji conversion section 4 and the various dictionaries (the basic dictionary 5, the user dictionary 6, and the technical term dictionary 7) serve to generate such Japanese text from the kana character code string. The generated Japanese text data is displayed on the display 9 via the display control section 8, and is also stored and held in the document data buffer 10.

Next, the keyword extraction section 11 reads the Japanese text data from the document data buffer 10 and, by checking this character string data against the user dictionary 6 and the technical term dictionary 7, extracts the words and phrases that match entries registered in the dictionaries and registers them in the keyword buffer 12. Once a word has been registered as a keyword, a later occurrence of the same word in the document can be judged by whether it exists in the keyword buffer 12, without consulting the dictionaries. Character strings that match a specific rule, such as consecutive katakana characters, can be extracted as keywords without consulting the dictionaries at all. The processing executed by the keyword extraction section 11 can therefore be summarized as follows.

(1) A character string that matches a specific rule (such as a run of consecutive katakana characters) is extracted as a keyword.

(2) If (1) does not apply and the word in question is already registered in the keyword buffer 12, it is also extracted as a keyword.

(3) If (2) does not apply and the word in question is in the user dictionary 6 or the technical term dictionary 7, it is extracted as a keyword.
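The three checks above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the text does not say how candidate phrases are delimited, so the sketch assumes the document has already been split into phrases, and the names `classify`, `extract_keywords`, and `KATAKANA_RUN`, as well as the minimum run length of two characters, are hypothetical.

```python
import re

# Rule (1): a run of katakana. The minimum length of 2 is an assumption;
# the text only says "consecutive katakana characters".
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]{2,}")


def classify(phrase, keyword_buffer, user_dict, technical_dict):
    """Return why `phrase` counts as a keyword, or None if it does not."""
    if KATAKANA_RUN.fullmatch(phrase):           # step (1): rule match
        return "rule"
    if phrase in keyword_buffer:                 # step (2): already registered
        return "buffer"
    if phrase in user_dict or phrase in technical_dict:   # step (3): dictionaries 6, 7
        return "dictionary"
    return None


def extract_keywords(phrases, user_dict, technical_dict):
    """Apply the three checks to pre-segmented phrases in document order."""
    keyword_buffer = set()   # stands in for the keyword buffer 12
    hits = []
    for phrase in phrases:
        reason = classify(phrase, keyword_buffer, user_dict, technical_dict)
        if reason is not None:
            keyword_buffer.add(phrase)
            hits.append((phrase, reason))
    return hits
```

Calling `extract_keywords(["サンプリング", "した", "標本値", "標本値"], {"標本値"}, set())` reports the katakana run by rule, the first 「標本値」 by dictionary lookup, and the second 「標本値」 from the buffer, so the dictionaries are consulted only once per distinct word.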

As noted in (2) above, when the same word appears in several places in a document, each occurrence must be tracked correctly in the keyword buffer 12.

This is because, when the keywords registered in the keyword buffer 12 are used to generate an index, the device must list and display on which page of the document, and where on that page, each word appeared.

The number of places in which the same word appears within one document varies from case to case, and this is managed by the data structure shown in FIG. 2.

FIG. 2 is an explanatory diagram of the data structure of the keyword buffer 12 of FIG. 1. In FIG. 2, the character string of a keyword extracted by the keyword extraction section 11 is stored in the keyword holding section 12a of the keyword buffer 12, and is paired with a pointer 12b that points to a data unit 12c. The data unit 12c records the position in the document at which the word appeared: the page number, the paragraph number (which paragraph on that page), the line number (which line of that paragraph), and the column number (which character on that line). The data unit 12c in turn has a pointer 12d that points to the next data unit 12e, which has the same structure. When the same word appears several times in the document, the data units 12c, 12e, and so on that give its positions can thus be chained one after another by means of the pointers 12b and 12d.
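The structure of FIG. 2 amounts to one singly linked list of positions per keyword. A minimal Python sketch follows, assuming a dictionary keyed by keyword text to locate entries (the text does not say how entries are looked up); the class and method names are hypothetical, and the reference numerals in the comments map each field back to the description.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DataUnit:
    """One occurrence of a keyword (data units 12c, 12e, ...)."""
    page: int
    paragraph: int   # which paragraph on the page
    line: int        # which line of the paragraph
    column: int      # which character on the line
    next_unit: Optional["DataUnit"] = None   # pointer 12d


@dataclass
class KeywordEntry:
    text: str                             # keyword holding section 12a
    first: Optional[DataUnit] = None      # pointer 12b


class KeywordBuffer:
    """Sketch of the keyword buffer 12."""

    def __init__(self):
        self._entries = {}   # keyword text -> KeywordEntry (lookup method assumed)

    def __contains__(self, text):
        return text in self._entries

    def add(self, text, page, paragraph, line, column):
        """Record one occurrence, appending to the end of the keyword's chain."""
        unit = DataUnit(page, paragraph, line, column)
        entry = self._entries.get(text)
        if entry is None:
            self._entries[text] = KeywordEntry(text, unit)
            return
        node = entry.first
        while node.next_unit is not None:
            node = node.next_unit
        node.next_unit = unit

    def occurrences(self, text) -> List[Tuple[int, int, int, int]]:
        """Walk the chain and list every recorded position of `text`."""
        positions = []
        entry = self._entries.get(text)
        node = entry.first if entry else None
        while node is not None:
            positions.append((node.page, node.paragraph, node.line, node.column))
            node = node.next_unit
        return positions
```

Walking the chain with `occurrences` yields the page and in-page position of every appearance, which is the information an index needs.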

FIG. 3 is a functional block diagram showing another embodiment of the document editing device according to the present invention. Whereas the embodiment of FIG. 1 first creates the document data and then extracts keywords from it, the embodiment of FIG. 3 closely couples the kana-kanji conversion process with the keyword extraction process. The purpose and function of each block bearing the same reference numeral as in FIG. 1 are the same as in that embodiment, except that the keyword extraction section 11 refers to none of the document data buffer 10, the basic dictionary 5, the user dictionary 6, and the technical term dictionary 7. Instead, when the kana-kanji conversion section 4 confirms the conversion of a word and that conversion was made using the user dictionary 6 or the technical term dictionary 7, it passes the converted word and its position in the document to the keyword extraction section 11. Processing that does not use a dictionary, such as extraction by a specific rule, applies exactly as in FIG. 1.
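The coupling described for FIG. 3 can be sketched as a callback that the conversion step invokes each time a conversion is confirmed. This is a hedged illustration only: the `Source` enum, the function name `on_conversion_confirmed`, the tuple used for the position, and the plain dictionary standing in for the keyword buffer 12 are all assumptions, not details given in the text.

```python
from enum import Enum


class Source(Enum):
    """Which dictionary supplied a confirmed conversion (numerals from FIG. 1)."""
    BASIC = 5
    USER = 6
    TECHNICAL = 7


def on_conversion_confirmed(phrase, source, position, keyword_buffer):
    """Hypothetical hook called by the kana-kanji conversion step.

    The phrase is registered as a keyword only when the conversion came
    from the user dictionary or the technical term dictionary, so the
    extraction step never has to search the dictionaries itself.
    """
    if source in (Source.USER, Source.TECHNICAL):
        keyword_buffer.setdefault(phrase, []).append(position)
        return True
    return False
```

Because the conversion step already knows which dictionary produced each word, this arrangement avoids a second dictionary lookup after the document is complete.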

The operation is explained using an example in which the following sentence is input in the above configuration.

「サンプリングした標本値を最小二乗法によって……」 ("... the sampled values by the least squares method ...")

To create such a sentence, the user generally types it in kana from the keyboard 1 as follows.

「サンプリングしたひょうほんちをさいしょうじじょうほうによって……」

The kana-kanji conversion section 4 then performs the conversion, and the correct result shown first is obtained.

At this point, because the first six characters, 「サンプリング」 ("sampling"), are consecutive katakana characters, the keyword extraction section 11 first extracts them as a keyword. Next, if 「標本値」 ("sample value") or 「最小二乗法」 ("least squares method") is registered in the user dictionary 6 or the technical term dictionary 7 and was converted using that dictionary, the keyword extraction section 11 judges it to be a keyword. Words that become keywords of a document are generally terms with a special meaning, so they are more often found in the technical term dictionary 7, or in the user dictionary 6 in which the user has registered words independently, than in the basic dictionary 5.

This is why the keyword extraction of the present invention reuses the dictionaries, in particular the technical term dictionary 7 and the user dictionary 6.

[Effect of the invention]

According to the present invention, the words and phrases that serve as keywords can be extracted automatically, either after the document is created or together with the kana-kanji conversion process during document creation, without the user designating each one. This reduces the burden on the user, and because existing dictionaries are reused, it also keeps new costs to a minimum.

[Brief explanation of the drawing]

FIG. 1 is a functional block diagram showing one embodiment of the document editing device according to the present invention, FIG. 2 is an explanatory diagram of the data structure of the keyword buffer of FIG. 1, and FIG. 3 is a functional block diagram showing another embodiment of the document editing device according to the present invention.

1: keyboard; 2: input control section; 3: document editing section; 4: kana-kanji conversion section; 5: basic dictionary; 6: user dictionary; 7: technical term dictionary; 8: display control section; 9: display; 10: document data buffer; 11: keyword extraction section; 12: keyword buffer.

Claims

[Claims]

1. A document editing device comprising a data input section that accepts input from a keyboard or mouse, a kana-kanji conversion section that converts the input data into kanji using various dictionaries, a document editing section that edits Japanese text containing both kana and kanji, a document data buffer that holds the created text, and a display control section that displays the text on a display, characterized by further comprising a keyword extraction section that automatically extracts keywords from the contents of the document data buffer, and a keyword buffer that holds the extracted keywords.

2. The document editing device according to claim 1, characterized in that the keyword extraction section uses the dictionaries employed for kana-kanji conversion to extract keywords, thereby increasing extraction efficiency without requiring the user to create a new dictionary for keyword extraction.

3. The document editing device according to claim 1, wherein the keyword extraction section extracts, as a keyword, a character string that matches a specific rule, such as consecutive katakana characters.