JPH08221440A - Method and device for extracting keyword from network news article

Info

Publication number: JPH08221440A
Application number: JP7026881A
Authority: JP
Inventors: Toshihiko Shirokaze; 敏彦城風; Hiromi Haniyuda; 博美羽生田; Tetsuo Kinoshita; 哲男木下
Original assignee: GIJUTSU KENKYU KUMIAI SHINJOHO; GIJUTSU KENKYU KUMIAI SHINJOHO SHIYORI KAIHATSU KIKO; Oki Electric Industry Co Ltd
Current assignee: GIJUTSU KENKYU KUMIAI SHINJOHO; GIJUTSU KENKYU KUMIAI SHINJOHO SHIYORI KAIHATSU KIKO; Oki Electric Industry Co Ltd
Priority date: 1995-02-15
Filing date: 1995-02-15
Publication date: 1996-08-30

Abstract

PURPOSE: To provide a method by which index keywords suitable for retrieval can be extracted from a network news article. CONSTITUTION: The header and the signature are removed from the news article (101-104). Quoted parts of the news article are identified and flagged. Line breaks in the news article are analyzed to determine the true sentence ends. A morphological analysis (107) that emphasizes character-type boundaries is then executed to extract character strings from the news article. The extracted character strings are used as index keywords.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a method for extracting index keywords from news articles handled on a network, and to a device suitable for carrying out that method.

[0002]

[Prior Art] To make it possible to search by keyword for a desired article among the news articles flowing through a network such as the Internet or a personal computer communication network, index keywords must be extracted from the articles in advance and the articles indexed. One conceivable method is to extract index keyword candidates by dictionary-based morphological analysis, as developed in natural language processing, and then to select the optimal index keywords from these candidates by the minimum matching method, the maximum likelihood evaluation method, or the like.

[0003]

[Problems to Be Solved by the Invention] However, simply applying a morphological analysis technique such as the one described above gives rise to the following problems.

[0004] (1) First, a news article flowing through a network has a characteristic structure consisting of a header, a signature, a quoted part, and a body. The header is the part that indicates the group name of the article, who sent it, and so on. The signature is the part that describes the sender. The quoted part is the part that marks material quoted from, for example, a third party's article. The body is the information the sender wants to convey. When index keywords are extracted from such a news article, it is better to exclude the header and the signature from the extraction targets for the body, and it is also desirable to mark off the quoted part so that it can be recognized as such. Doing so allows index keywords suited to body-text search to be extracted from the news article. Conventionally, however, there has been no concrete method for achieving this.

[0005] (2) In addition, because of the constraints of text-based communication, it is currently recommended that a news article be written with each line broken at 30 full-width characters even when the sentence is not yet complete. As a result, an index keyword that should be extracted may be split by a line break, and a meaningless index keyword may be extracted instead. To avoid this, one could ignore line-break codes entirely, treat everything up to a punctuation mark as one sentence, and extract index keywords from that. In that case, however, for an article without punctuation, such as one written as an itemized list, keywords that should be separate are joined together, and meaningless keywords are again extracted. A countermeasure for line-break symbols in network news articles is therefore desired.

[0006] (3) Furthermore, an extraction method using the morphological analysis technique described above first splits a sentence into keyword candidates by referring to a word dictionary and then extracts the optimal keywords by the minimum matching method or the like. For a compound word formed by concatenating kanji, such as 「電子化辞書」 (electronic dictionary), however, splitting should be avoided as far as possible, both semantically and for retrieval efficiency. With 「電子化辞書」, if neither 「電子化」 nor 「電子化辞書」 is registered in the dictionary and only 「電子」 is, then 「電子」 and 「化辞書」 become index keyword candidates. The meaningless index keyword 「化辞書」 is thus generated, which in turn leads to meaningless searches against it. This could be avoided by registering compound words in the dictionary as they appear, but registering the newly coined words that arise every day is not realistic.

[0007]

[Means for Solving the Problems] According to the method for extracting keywords from a network news article of the first invention of this application, when index keywords are extracted from a network news article, at least the following processes are first carried out, each based on predetermined rules: a process of extracting the header from the news article, a process of extracting the signature from the news article, a process of interpreting the quoted part of the news article, and a process of interpreting the line breaks in the news article. Next, a morphological analysis that emphasizes character-type boundaries is performed on the news article that has undergone these processes, and character strings (including single characters; the same applies hereinafter) are extracted from it. These character strings (referred to as primary keywords) are then used as index keywords.

[0008] In carrying out the first invention, it is preferable to further split the extracted character strings using a dictionary and predetermined statistical splitting rules, and to use the character strings obtained by this splitting as index keywords (referred to as secondary keywords). Here, "further split" means that a string is split only when it satisfies the statistical splitting rules (the same applies to the second invention below). Primary keywords that are not split can be used as index keywords as they are (the same applies to the second invention below).

[0009] The device for extracting keywords from a network news article of the second invention of this application (hereinafter sometimes abbreviated as the extraction device) is characterized by comprising: a header extraction unit that extracts the header from a news article; a signature extraction unit that extracts the signature from the news article; a quoted-part interpretation unit that interprets the quoted part of the news article; a line-break interpretation unit that interprets the line breaks in the news article; and a morphological analysis unit that performs a morphological analysis emphasizing character-type boundaries on a news article that has undergone at least the processing by the header extraction unit, the signature extraction unit, the quoted-part interpretation unit, and the line-break interpretation unit.

[0010]

[Operation] According to the first and second inventions of this application, the header and the signature are separated from a network news article, and the state of the quoted parts and line breaks in the article can be determined. Character strings are then extracted, in units of runs of a single character type, from the news article whose header and signature have been separated and whose quoted parts and line breaks have been determined.

[0011]

[Embodiments] Embodiments of the method and the device for extracting keywords from a network news article of this application are described together below with reference to the drawings. The drawings used in the description are only schematic, shown to the extent needed to understand the inventions.

[0012] FIG. 1 shows the configuration of the extraction device 100 of the embodiment. FIG. 2 shows an example of a news article. FIGS. 3 to 9 are explanatory diagrams of the individual processes: header extraction, signature extraction, quoted-part interpretation, line-break interpretation, morphological analysis, secondary keyword extraction, and so on.

[0013] The extraction device 100 of the embodiment comprises a primary keyword extraction unit 100a and a secondary keyword extraction unit 100b. In this embodiment, the primary keyword extraction unit 100a comprises a header extraction unit 101, a header item index base 102, a signature extraction unit 103, a signature base 104, a quoted-part interpretation unit 105, a line-break interpretation unit 106, a morphological analysis unit 107, an automaton 108, and a primary keyword base 109. The secondary keyword extraction unit 100b comprises, in this embodiment, a keyword splitting unit 110, a dictionary 111, a splitting rule storage unit 112, and a secondary keyword base 113.

[0014] 1. Extraction of primary keywords

The header extraction unit 101 of the primary keyword extraction unit 100a extracts the header 120a from the news article 120. In this embodiment, it extracts the header 120a from the news article 120 (see FIG. 2) and removes it from the article. It also extracts the contents of the header 120a, such as the sender, subject, and posting time of the article, and stores them in the header item index base 102. This series of processes can be carried out, for example, by the procedure described below with reference to FIG. 3. In the news article 120, the header 120a and the body 120d are normally separated by a blank line, and each header item in the header 120a is normally separated from its content by a ":" symbol. First, one line of data is read from the article targeted for keyword extraction, which is held in a file (not shown) storing news articles (on any medium, for example a floppy disk; the same applies hereinafter) (step S1 in FIG. 3). Next, it is determined whether the line begins with an English word followed by a ":" symbol (step S2 in FIG. 3). If so, the line belongs to the header 120a, and its data is registered in the header item index base (also called the header item dictionary) (step S3 in FIG. 3). Processing then returns to step S1 to read the next line, and step S2 and the subsequent steps are repeated. If the determination in step S2 is negative, specifically when a blank line is reached, the header extraction process ends (step S4 in FIG. 3). Such a header extraction unit 101 can be built with known computer or circuit technology, and the header item index base 102 can be built with known storage means. Although this embodiment removes the header 120a from the news article 120, the header extraction unit may instead leave the header in the article and set a flag on the corresponding part of the article data to mark it as header. In that case, the flag can be used in the later morphological analysis, for example to exclude the header from analysis.
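Steps S1 to S4 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `extract_header`, the regular expression used for "an English word followed by a colon", and the dictionary standing in for the header item index base 102 are all assumptions made for illustration.

```python
import re

# Assumption: a header line is an ASCII word (letters, digits, hyphens)
# followed by ":" and then the item's content.
HEADER_LINE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.*)$")


def extract_header(lines):
    """Split an article (list of lines) into header items and remaining lines."""
    header_items = {}  # stands in for the header item index base 102
    index = 0
    while index < len(lines):                      # S1: read one line
        match = HEADER_LINE.match(lines[index])    # S2: word + ":" at line start?
        if match is None:
            break                                  # S4: header has ended
        header_items[match.group(1)] = match.group(2)  # S3: register the item
        index += 1
    # Drop the blank line that separates header and body, if present.
    if index < len(lines) and lines[index].strip() == "":
        index += 1
    return header_items, lines[index:]
```

The sketch ignores folded (continued) header lines, which real news articles may contain.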

[0015] The signature extraction unit 103 extracts the signature 120b from the news article 120 (see FIG. 2). In this embodiment, it extracts the signature 120b from the news article 120, removes it from the article, and stores the signature, that is, data such as the sender, subject, and posting time of the news article, in the signature base 104. Whether an article contains a signature can be determined as follows: when the network carries two or more articles from the same sender whose final portions are identical, that identical portion is regarded as the signature. Specifically, signature extraction can be carried out by the procedure described below with reference to FIG. 4. First, the buffer used for signature extraction is cleared (step S11 in FIG. 4). Next, it is determined whether the signature base 104 (see FIG. 1) contains a signature file (step S12 in FIG. 4). This can be done, for example, by searching the signature base with the sender data extracted during header extraction. If a signature file exists, one line of data is read from the end of the article targeted for keyword extraction and called A, and one line of data is read from the end of the signature file and called B (step S13 in FIG. 4). It is then determined whether A and B match (step S14 in FIG. 4). If they match, A is added to the head of the buffer (step S15 in FIG. 4), processing returns to step S13, and steps S14, S15, and S13 are repeated. When A and B do not match in step S14, the data accumulated in the buffer so far is taken to be the sender's signature and is stored in the signature base 104 (step S18 in FIG. 4). If it is determined in step S12 that no signature file exists, the sender has only this one article, so the full text of the news article is copied to the buffer (step S16 in FIG. 4) and stored in the signature base 104 (step S17 in FIG. 4). An article stored in the signature base in this way is nevertheless treated as body text at keyword extraction time. Such a signature extraction unit 103 can be built with known computer or circuit technology. Although this embodiment removes the signature 120b from the news article 120, the signature extraction unit may instead leave the signature in the article and set a flag on the corresponding part of the article data to mark it as signature. In that case, the flag can be used in the later morphological analysis, for example to exclude the signature from analysis.
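The comparison loop of steps S11 to S18 can be sketched in Python as follows. This is a sketch under stated assumptions, not the patent's implementation: the function name `extract_signature` is hypothetical, and the stored signature file for the sender is passed in as a plain list of lines (empty when the signature base has no entry for that sender).

```python
def extract_signature(article_lines, stored_lines):
    """Return (body_lines, lines_to_store_in_signature_base).

    stored_lines is what the signature base currently holds for this sender
    (an assumption: a list of lines, empty if nothing is stored yet).
    """
    buffer = []                                    # S11: clear the buffer
    if not stored_lines:                           # S12: no signature file yet
        # S16-S17: keep the whole article as the provisional signature file;
        # the article itself is still treated as body text.
        return list(article_lines), list(article_lines)

    i = len(article_lines) - 1
    j = len(stored_lines) - 1
    # S13-S15: compare lines from the end, moving one line up per pass.
    while i >= 0 and j >= 0 and article_lines[i] == stored_lines[j]:
        buffer.insert(0, article_lines[i])         # S15: prepend the matching line
        i -= 1
        j -= 1
    # S18: the buffered lines are taken to be the sender's signature.
    return list(article_lines[: i + 1]), buffer
```

On the first article from a sender the whole article is stored, so the signature only becomes separable from the second article onward, which matches the "two or more articles" condition in the text.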

[0016] The quoted-part interpretation unit 105 interprets the quoted part 120c of the news article 120. In this embodiment, it identifies the quoted part 120c in the news article 120. This can be done, for example, by the procedure described below with reference to FIG. 5. A quoted part generally has a symbol marking it as a quotation at the start of each line: for example, a space followed by a predetermined symbol, a ">" symbol, a ":" symbol, or a "≫" symbol. The news article 120 in FIG. 2 shows an example in which the quotation symbol is "≫". First, one line of data is read from the article targeted for keyword extraction (step S21 in FIG. 5). Next, it is determined whether the line begins with a quotation symbol (step S22 in FIG. 5). If so, the quotation symbol is deleted, a flag marking the line's data as quoted is set, and the data is saved as quoted text, for example in a predetermined memory (not shown) (step S23 in FIG. 5). Processing then returns to step S21 to read the next line, and step S22 and the subsequent steps are repeated. If the determination in step S22 is negative, specifically when a blank line is reached, the quoted-part interpretation process ends (step S24 in FIG. 5). Such a quoted-part interpretation unit 105 can be built with known computer and circuit technology. The quoted part is also treated as text from which index keywords are extracted, and the line-break interpretation and morphological analysis described later are applied to it just as they are to the body. If index keywords are not to be extracted from the quoted part, these processes need not be applied to it.
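A minimal Python sketch of the flagging in steps S21 to S23 follows. It makes several assumptions: the name `flag_quotes` and the marker tuple are illustrative, the "space followed by a predetermined symbol" variant is left out, and, unlike the flowchart as described, the sketch scans every line rather than stopping at the first line that is not quoted.

```python
# Assumed set of line-initial quotation symbols, taken from the examples in the text.
QUOTE_MARKERS = ("≫", "＞", ">", ":")


def flag_quotes(lines, markers=QUOTE_MARKERS):
    """Return a list of (text, is_quoted) pairs, one per input line."""
    flagged = []
    for line in lines:                              # S21: read one line
        for marker in markers:
            if line.startswith(marker):             # S22: quotation symbol at line start?
                # S23: delete the symbol and set the quotation flag.
                flagged.append((line[len(marker):].lstrip(), True))
                break
        else:
            flagged.append((line, False))           # not quoted: keep the line as is
    return flagged
```

Keeping the flag next to each line lets a later stage decide whether quoted lines are analysed together with the body or skipped.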

[0017] The line-break interpretation unit 106 interprets the line breaks in an article. In this embodiment, its purpose is to determine the true sentence ends, compensating for the problem that a string that should become an index keyword is split because, under the constraints of current text communication, lines are broken at a certain number of characters even when a sentence is not complete. This can be done, for example, by the procedure described below with reference to FIG. 6. Normally, body text is written with a blank line inserted after every line. First, the full body text of the news article targeted for keyword extraction is read (step S31 in FIG. 6). If header extraction, signature extraction, and quoted-part interpretation have already been completed, reading the body in step S31 is straightforward. Next, it is determined whether a line break is inserted after every line (that is, whether every line is followed by a blank line) (step S32 in FIG. 6). If so, for each line of the body the line break at the end of the line is ignored, a single blank line is also ignored, and a run of consecutive line breaks is treated as a sentence end (step S33 in FIG. 6). If not, for each line of the body the line break at the end of the line is ignored and a line consisting only of a line break is treated as a sentence end (step S34 in FIG. 6). The line-break interpretation process then ends (step S35 in FIG. 6). Such a line-break interpretation unit can be built with known computer and circuit technology.
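The decision in steps S32 to S34 can be sketched in Python as follows. The names `is_double_spaced` and `join_wrapped_lines` are hypothetical, and the test for "every line is followed by a blank line" is one possible reading of step S32, not something the text specifies in detail.

```python
def is_double_spaced(lines):
    """S32 (assumed reading): every text line except the last is followed by a blank line."""
    text_positions = [i for i, line in enumerate(lines) if line.strip()]
    if len(text_positions) < 2:
        return False
    return all(lines[i + 1].strip() == "" for i in text_positions[:-1])


def join_wrapped_lines(lines):
    """Join wrapped body lines into units that end at the inferred sentence ends."""
    # S33: in double-spaced text a single blank line is ignored and only a run of
    # blank lines ends a sentence. S34: otherwise one blank line ends a sentence.
    blanks_needed = 2 if is_double_spaced(lines) else 1
    sentences, current, blanks = [], "", 0
    for line in lines:
        if line.strip() == "":
            blanks += 1
            continue
        if current and blanks >= blanks_needed:
            sentences.append(current)
            current = ""
        current += line.strip()   # the line-end break itself is always ignored
        blanks = 0
    if current:
        sentences.append(current)
    return sentences
```

Lines are concatenated without a separator, which suits Japanese text; wrapped English text would need a space inserted at each join.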

[0018] The header extraction, signature extraction, quoted-part interpretation, and line-break interpretation processes may in principle be carried out in any order as long as each is completed before the morphological analysis. For efficiency, however, it is preferable to carry them out in the order listed.

[0019] The morphological analysis unit 107 performs morphological analysis, relying mainly on character-type boundaries, on a news article that has undergone header extraction, signature extraction, quoted-part interpretation, and line-break interpretation. This processing can be carried out with an automaton, for example (but not limited to) the one shown in FIG. 7, by the method described below. First, as preprocessing, full-width alphanumeric characters in the body of the news article are converted to half-width. The body is then split into character strings by an automaton that makes state transitions as in FIG. 7, treating each character-type boundary as a string boundary. That is, when the automaton is in a given state (kanji, hiragana, and so on in FIG. 7) and a character appears that is not of the type handled by that state, the automaton returns to the state None and outputs the string accumulated up to that point. The body can thus be split at character-type boundaries. The strings obtained are used as index keywords (also called primary keywords in relation to the secondary keywords described below) and are stored in the primary keyword base 109. If an excessively long string is produced, it is judged to be an anomaly, for example itemized lines that were joined together, and postprocessing is performed. One example of such postprocessing is to redo the line-break interpretation so as to split the string. Strings split by postprocessing are again subjected to the morphological analysis based mainly on character-type boundaries.

[0020] The primary keywords extracted from the body 120d of the news article 120 shown in FIG. 2 are as follows.

フリーソフトｆｔｐｆｔｐ．ｏｋｉ．ｃｏ．ｊｐ／ｐｕｂ／ｍａｃｅｕｄｏｒａ−１．２．２Ｊインターネット入場合手渡

The primary keywords extracted by processing the quoted part 120c in the same way as the body are as follows.

Ｍａｃｉｎｔｏｓｈインターネットメール読Ｅｕｄｏｒａソフトどうやって手入

If the file contains other news articles, the same processes are applied to them and primary keywords are extracted from each.

[0021] The primary keywords obtained by the processing so far are extracted from a body (and in some cases also a quoted part) from which the influence of the header, signature, quoted part, and line-break symbols has been removed, and by a morphological analysis that relies mainly on character-type boundaries. They are therefore index keywords that are unaffected by the header and the like and whose strings are not split more than necessary (for example, 「電子化辞書」 is kept whole rather than split).

[0022] 2. Secondary keywords

As described above, the primary keywords can serve as index keywords. To improve retrieval efficiency further, however, secondary keywords may additionally be extracted from the primary keywords, as described below.

[0023] The keyword splitting unit 110 of the secondary keyword extraction unit 100b (see FIG. 1) further splits each primary keyword stored in the primary keyword base 109, using the dictionary 111 and the predetermined statistical splitting rules stored in the splitting rule storage unit 112. The strings obtained by this splitting become secondary keywords, which are used as index keywords. In this embodiment, the predetermined statistical splitting rules consist of the following conditions (1) to (4).

(1) The strings stored as primary keywords in the primary keyword base are sorted in kanji code order. If the sorted string group contains a portion in which N or more consecutive primary keywords begin with a particular string s of two or more characters (called a continuous-occurrence portion), that portion is taken out of the string group. Here N is a predetermined positive integer, called the occurrence-count threshold.

(2) Each string in the group taken out is divided into the s part and the remainder, and the number of remainders found in neither the primary keyword base 109 nor the dictionary 111 is counted. If the count m is greater than or equal to a predetermined number M, the s part is not removed (no split is made); otherwise the s part is removed (the split is made). M is called the unregistered-count threshold.

(3) If there are other continuous-occurrence portions, they are processed in the same way.

(4) Each string stored in the primary keyword base is then reversed (for example, 「文字列」 becomes 「列字文」).

Secondary keyword extraction using these splitting rules is described more concretely below with reference to FIGS. 8 and 9.

【００２４】先ず、出現数しきい値Ｎおよび非登録数し
きい値Ｍをそれぞれ設定する。ここではＮをたとえば１
００、Ｍをたとえば１とする。なお、説明の都合上一次
キーワードベース１０９に格納してある一次キーワード
群をＡと称し、辞書中の単語群をＢと称する（図８のス
テップＳ４１）。First, the appearance count threshold N and the unregistered count threshold M are set. Here, for example, N is set to 100 and M to 1. For convenience of explanation, the primary keyword group stored in the primary keyword base 109 is referred to as A, and the word group in the dictionary is referred to as B (step S41 in FIG. 8).

【００２５】次に、一次キーワード群Ａを漢字コード順
にソートする（図８のステップＳ４２）。Next, the primary keyword group A is sorted in Kanji code order (step S42 in FIG. 8).
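As a minimal sketch of step S42, the sort could be written in Python as shown here. It assumes that "Kanji code order" can be approximated by comparing the byte sequences of an EUC-JP encoding; the text does not name the encoding, so the `euc_jp` codec and the function name `sort_by_kanji_code` are assumptions made only for illustration.

```python
def sort_by_kanji_code(keywords, encoding="euc_jp"):
    """Sort keywords by their encoded bytes (an assumed stand-in for Kanji code order)."""
    return sorted(keywords, key=lambda k: k.encode(encoding))


group_a = ["電子会議", "通信", "電子", "会議", "電子通信"]
sorted_a = sort_by_kanji_code(group_a)
# After sorting, keywords that share a head string such as "電子" are adjacent,
# which is what makes the run detection of step S43 possible.
```

Because the comparison is byte-wise, a keyword that is a prefix of another (here 「電子」) always sorts immediately before the keywords that extend it.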

【００２６】次に、ソートが済んだ一次キーワード群Ａ
の中に２文字以上の特定の文字列ｓを語頭に有したキー
ワードが連続してＮ以上出現している部分がある場合は
その部分を取り出す（図８のステップＳ４３）。これに
ついて具体例で説明する。ソートが済んだ一次キーワー
ド群のなかに特定の文字列ｓとしての例えば「電子」と
いう文字列を語頭に有したキーワードが１００以上あっ
た場合はそれらは取り出される。取り出されたキーワー
ド群の一部を以下に示した。なお、語頭が「電子」であ
るこれらキーワード群を説明の都合上Ｃと称する。電子電子化辞書電子会議：：電子通信次に、電子で始まるキーワード群Ｃそれぞれから「電
子」の部分を取り除く。「電子」の文字列を取り除いた
文字列群をＤと称する（図８のステップＳ４４）。この
文字列群Ｄは次の様なものとなる。化辞書会議：：通信次に、「電子」の文字列を取り除いた文字列群Ｄの各文
字列のなかで一次キーワード群Ａか辞書の単語群Ｂに含
まれないものの数を計数しｍを求める（図８のステップ
Ｓ４５）。この例の場合は文字列群Ｄのなかで一次キー
ワード群Ａか辞書の単語群Ｂに含まれないものは「化辞
書」であるので、ｍ＝１となる。Next, if the sorted primary keyword group A contains a portion in which N or more keywords having a specific character string s of two or more characters at their head appear consecutively, that portion is extracted (step S43 in FIG. 8). This will be described with a specific example. If the sorted primary keyword group contains 100 or more keywords that have, for example, the character string 「電子」 ("electronic") at their head as the specific character string s, those keywords are extracted. Part of the extracted keyword group is shown below. For convenience of explanation, this group of keywords beginning with 「電子」 is referred to as C.
電子
電子化辞書 (electronic dictionary)
電子会議 (electronic conference)
：
：
電子通信 (electronic communication)
Next, the 「電子」 portion is removed from each keyword in group C. The resulting character string group is referred to as D (step S44 in FIG. 8). Character string group D is as follows.
化辞書
会議
：
：
通信
Next, the number m of character strings in group D that are contained in neither the primary keyword group A nor the dictionary word group B is counted (step S45 in FIG. 8). In this example, the only character string in group D contained in neither A nor B is 「化辞書」, so m = 1.
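Steps S43 to S45 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment: the head string s is passed in as a parameter (the text does not say how candidate strings are found), a keyword equal to s itself is assumed to contribute no remainder to D, and the function name `split_candidates`, the toy data, and the small threshold N = 3 are hypothetical.

```python
def split_candidates(sorted_a, dictionary_b, s, n_threshold):
    """Steps S43-S45 for a given head string s; returns (C, D, m) or None."""
    # C: keywords beginning with s (these are contiguous in the sorted group A).
    group_c = [k for k in sorted_a if k.startswith(s)]
    if len(s) < 2 or len(group_c) < n_threshold:
        return None  # no continuous appearance portion for this s
    # D: remainders after removing s (assumption: the bare keyword s is skipped).
    group_d = [k[len(s):] for k in group_c if k != s]
    # m: remainders found in neither the primary keyword group A nor the dictionary B.
    known = set(sorted_a) | set(dictionary_b)
    m = sum(1 for rest in group_d if rest not in known)
    return group_c, group_d, m


group_a = sorted(
    ["会議", "通信", "電子", "電子会議", "電子化辞書", "電子通信"],
    key=lambda k: k.encode("euc_jp"),  # assumed Kanji code order
)
group_b = {"辞書"}
group_c, group_d, m = split_candidates(group_a, group_b, "電子", n_threshold=3)
print(m)  # prints 1: only "化辞書" is in neither A nor B
```

With this toy data the result matches the example in the text: 「会議」 and 「通信」 are already primary keywords, while 「化辞書」 is unregistered, so m = 1.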

【００２７】次に、このｍと上記設定した非登録数しき
い値Ｍとを比較する（図８のステップＳ４６）。ここで
は、Ｍ＝１でありまたｍ＝１であるので、処理はステッ
プＳ４８に飛ぶ。すなわち、一次キーワード群Ａはその
ままとされるのである。したがって、「電子化辞書」な
どはそのまま一次キーワード群に含まれるので、「化辞
書」のような不要なキーコードは抽出されずに済む。な
お、もしここで「電子化辞書」が一次キーコード群にな
かった場合はｍ＝０となるのでｍ≧Ｍを満足しないので
処理はステップＳ４７に飛ぶ。すなわち、「電子」と、
電子の文字列を取り除いた文字列群Ｄ（「辞書」、「会
議」、、・・・、「通信」）とが、電子で始まるキーワ
ード群Ｃの代わりに一次キーコード群Ａに入れられ、Ａ
に残っている他の一次キーワード群と合わされてこれが
Ａとされ、その後、このＡを漢字コード順にソートする
（図８のステップＳ４７）。後者の例の場合は、元のキ
ーワード群Ｃ（電子会議などを含んでいたもの）に比
べ、各キーワードの文字数は少なくなるから、たとえば
キーワードを格納するメモリ容量を低減できるなどの効
果が得られる。また、その場合でも例えば「電子」と
「会議」とからキーワード「電子会議」はえられるので
ある。また、検索効率の良いキーワードが得られる。Next, m is compared with the unregistered count threshold M set above (step S46 in FIG. 8). Here M = 1 and m = 1, so m ≥ M is satisfied and the process jumps to step S48; that is, the primary keyword group A is left as it is. Keywords such as 「電子化辞書」 therefore remain in the primary keyword group unchanged, and unnecessary keywords such as 「化辞書」 are not extracted. If, on the other hand, 「電子化辞書」 were not in the primary keyword group, m would be 0 and m ≥ M would not be satisfied, so the process would proceed to step S47. In that case, 「電子」 and the character string group D obtained by removing 「電子」 (「辞書」, 「会議」, ..., 「通信」) are put into the primary keyword group A in place of the keyword group C beginning with 「電子」 and merged with the other primary keywords remaining in A to form a new A, which is then sorted in Kanji code order (step S47 in FIG. 8). In this latter case, each keyword has fewer characters than in the original keyword group C (which contained 「電子会議」 and the like), so that, for example, the memory capacity needed to store the keywords can be reduced. Even then, the keyword 「電子会議」 can still be obtained from 「電子」 and 「会議」. Keywords with good search efficiency are also obtained.
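The decision of steps S46 and S47 can be sketched as follows. This is a hedged illustration only: the function name `apply_division` is hypothetical, the keyword base is assumed to hold each keyword once (so a set is used for merging), and EUC-JP byte order again stands in for Kanji code order.

```python
def apply_division(group_a, group_c, group_d, s, m, m_threshold, encoding="euc_jp"):
    """Steps S46-S47: divide only when m < M; otherwise leave A unchanged."""
    if m >= m_threshold:
        return list(group_a)  # S46 -> S48: A is left as it is
    # S47: replace the keywords of C by s and the remainders D, then re-sort A.
    removed = set(group_c)
    merged = {k for k in group_a if k not in removed} | {s} | set(group_d)
    return sorted(merged, key=lambda k: k.encode(encoding))


# The case from the text in which "電子化辞書" is absent, so m = 0 and M = 1.
a = ["会議", "通信", "電子", "電子会議", "電子通信"]
c = ["電子", "電子会議", "電子通信"]
d = ["会議", "通信"]
new_a = apply_division(a, c, d, "電子", m=0, m_threshold=1)
# new_a now holds only "電子", "会議" and "通信": the compound keywords were divided.
```

Calling the same function with m = 1 returns A unchanged, which corresponds to the branch that skips to step S48.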

【００２８】次に、一次キーコード群Ａの各キーコード
それぞれの文字の並び順を逆順にする（図９のステップ
Ｓ４８）。文字列の末尾側に来る確立の高いキーワード
を抽出するための準備を行なうためである。次に、この
各キーコードそれぞれが逆順とされた一次キーコード群
を漢字コード順にソートする（ステップＳ４９）。次
に、上記語頭の場合と同様な手順で連続出現部分を検出
してキーワード群Ｃを得、その各キーワード群の各キー
ワードから特定文字ｓを取り除き文字列群Ｄを得、この
文字列群Ｄからｍを求め、このｍとＭとの比較をし、こ
の比較結果に応じた処理をする（図９のステップＳ５
０）。このＳ４８〜Ｓ５０の一連の処理では、文字列の
末尾側に来る確立の高いキーワードを抽出出来る。Next, the character order of each keyword in the primary keyword group A is reversed (step S48 in FIG. 9). This prepares for extracting keywords that are likely to appear at the end of a character string. Next, the primary keyword group whose keywords have been reversed is sorted in Kanji code order (step S49). Next, by the same procedure as for the word head described above, a continuous appearance portion is detected to obtain a keyword group C, the specific character string s is removed from each keyword of group C to obtain a character string group D, m is obtained from group D and compared with M, and processing is performed according to the result of the comparison (step S50 in FIG. 9). Through this series of steps S48 to S50, keywords that are likely to appear at the end of a character string can be extracted.
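The reversal trick of steps S48 and S49 can be sketched as follows. The sketch assumes that reversing by Unicode code point is adequate for the keywords involved and that EUC-JP byte order approximates Kanji code order; the function names and sample keywords are hypothetical.

```python
def reverse_each(keywords):
    """Reverse the character order of every keyword (steps S48 and S51)."""
    return [k[::-1] for k in keywords]


def sort_reversed(keywords, encoding="euc_jp"):
    """Reverse each keyword, then sort in the assumed Kanji code order (S48-S49)."""
    return sorted(reverse_each(keywords), key=lambda k: k.encode(encoding))


keywords = ["電子会議", "通信", "国際会議", "編集会議"]
reversed_sorted = sort_reversed(keywords)
# Keywords ending in "会議" now all begin with "議会" and sit next to each other,
# so the head-string procedure can be reused unchanged to find common tails.
restored = reverse_each(reversed_sorted)  # S51: back to the original spelling
```

The design point is that no separate suffix-handling logic is needed: a common tail becomes a common head after reversal, and a second reversal restores the keywords.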

【００２９】次に、ステップＳ４８で逆順とした各キー
ワードそれぞれの文字の並びを再び逆順にし各キーワー
ドの文字の並びを元に戻す（図９のステップＳ５１）。Next, the character arrangement of each keyword, which has been reversed in step S48, is reversed again to restore the original character arrangement of each keyword (step S51 in FIG. 9).

【００３０】次に、分割対象があるか否かを判定する
（図９のステップＳ５２）。この判定結果に応じステッ
プＳ４２若しくはステップＳ５３のいずれかに飛ぶ。Next, it is determined whether or not there is a division target (step S52 in FIG. 9). Depending on the determination result, the process jumps to either step S42 or step S53.

【００３１】この２次キーワード抽出処理によれば、一
次キーワードにおいて語頭にきやすい品詞であって完全
に分割できる品詞を分割でき、また、一次キーワードに
おいて語尾にきやすい品詞であって完全に分割できる品
詞を分割でき、これら分割したものを二次キーワードと
して二次キーワードベース１０９に格納出来る。然も、
専門用語などの複合語は分割されることなく二次キーワ
ードを抽出出来る。With this secondary keyword extraction processing, parts of speech that tend to appear at the head of a primary keyword and can be completely separated from it can be split off, as can parts of speech that tend to appear at the end of a primary keyword, and the resulting pieces can be stored in the secondary keyword base 109 as secondary keywords. Moreover, secondary keywords can be extracted without dividing compound words such as technical terms.

【００３２】図２のニュース記事１２０の本文から得ら
れた一次キーワードに対し上述の二次キーワード抽出処
理を行った後に得られる二次キーワードは「場合」、
「手渡」となる。したがって、図２のニュース記事１２
０の本文から抽出される索引キーコードは、以下の様な
ものとなる。すなわち、「場合手渡」から「場合」、
「手渡」が追加されることになる。フリーソフトｆｔｐｆｔｐ．ｏｋｉ．ｃｏ．ｊｐ
／ｐｕｂ／ｍａｃｅｕｄｏｒａ−１．２．２Ｊイン
ターネット入場合手渡。The secondary keywords obtained by applying the above secondary keyword extraction processing to the primary keywords obtained from the body of news article 120 in FIG. 2 are 「場合」 ("case") and 「手渡」 ("hand over"). The index keywords extracted from the body of news article 120 in FIG. 2 are therefore as follows; that is, 「場合」 and 「手渡」, obtained from 「場合手渡」, are added.
フリーソフト, ftp, ftp.oki.co.jp, /pub/mac, eudora-1.2.2J, インターネット, 入, 場合, 手渡

【００３３】上述においてはこの出願のネットワークニ
ュース記事からのキーワード抽出方法および装置の実施
例について説明したがこれらの発明は詳述の実施例に限
られない。例えば、ヘッダ抽出処理、シグネーチャ抽出
処理、引用部解釈処理および改行解釈処理の各処理をの
手順は上述の例に限られずこの発明の目的に即した他の
手順でも良い。また、この発明はプログラム合成、エキ
スパートシステムのための知識獲得システムなどに応用
可能である。Although embodiments of the method and apparatus of this application for extracting keywords from network news articles have been described above, these inventions are not limited to the embodiments described. For example, the procedures of the header extraction processing, the signature extraction processing, the quoted-part interpretation processing, and the line-break interpretation processing are not limited to the above examples, and other procedures consistent with the object of the present invention may be used. Further, the present invention can be applied to program synthesis, knowledge acquisition systems for expert systems, and the like.

【００３４】[0034]

【発明の効果】上述した説明か明らかな様に第一発明の
ネットワークニュース記事からのキーワード抽出方法に
よれば、ニュース記事に対しヘッダ抽出処理、シグネー
チャ抽出処理、引用部解釈処理および改行解釈処理の各
処理を少なくとも実施した後に、字種の区切りを重視し
た形態素解析を実施して該ニュース記事から文字列を抽
出し、該文字列を索引キーワードとするので、ヘッダお
よびシグネーチャが分離されたニュース記事であって引
用部および改行の状態が把握されたニュース記事から、
字種の固まり単位で文字列が抽出される。このため、ヘ
ッダ、シグネーチャ、引用部および改行の影響を防止出
来、かつ、無意味に分割された検索キーワードが抽出さ
れることを防止できるので、複合語をはじめニュース記
事にふさわしい索引キーワードを抽出することができ
る。As is apparent from the above description, according to the method of the first invention for extracting keywords from network news articles, at least the header extraction, signature extraction, quoted-part interpretation, and line-break interpretation processes are performed on a news article, after which a morphological analysis that emphasizes character-type boundaries is performed to extract character strings from the news article, and these character strings are used as index keywords. Character strings are thus extracted in units of same-character-type runs from a news article whose header and signature have been separated and whose quoted parts and line breaks have been identified. This prevents the header, signature, quoted parts, and line breaks from affecting the result and prevents meaninglessly divided search keywords from being extracted, so that index keywords suitable for news articles, including compound words, can be extracted.

【００３５】また、第二発明のネットワークニュース記
事からのキーワード抽出装置によれば、第一発明の実施
を容易にする。Further, the keyword extracting device from the network news article of the second invention facilitates the implementation of the first invention.

[Brief description of drawings]

【図１】実施例の抽出装置の構成を示した図である。FIG. 1 is a diagram showing a configuration of an extraction device according to an embodiment.

【図２】ニュース記事の説明図である。FIG. 2 is an explanatory diagram of a news article.

【図３】ヘッダ抽出処理の説明図である。FIG. 3 is an explanatory diagram of header extraction processing.

【図４】シグネーチャ抽出処理の説明図である。FIG. 4 is an explanatory diagram of signature extraction processing.

【図５】引用部解釈処理の説明図である。FIG. 5 is an explanatory diagram of a quotation portion interpretation process.

【図６】改行解釈処理の説明図である。FIG. 6 is an explanatory diagram of line feed interpretation processing.

【図７】形態素解析の説明図であり、それに用いたオー
トマトンの一例の説明図である。FIG. 7 is an explanatory diagram of a morpheme analysis, and is an explanatory diagram of an example of an automaton used for it.

【図８】二次キーワード抽出処理の説明図である。FIG. 8 is an explanatory diagram of secondary keyword extraction processing.

【図９】二次キーワード抽出処理の図８に続く説明図で
ある。FIG. 9 is an explanatory diagram of the secondary keyword extraction processing, continued from FIG. 8.

[Explanation of symbols]

１００：実施例の抽出装置
１００ａ：一次キーワード抽出部
１００ｂ：二次キーワード抽出部
１２０：ニュース記事
１２０ａ：ヘッダ
１２０ｂ：シグネーチャ
１２０ｃ：引用部
１２０ｄ：本文
100: Extraction device of the embodiment
100a: Primary keyword extraction unit
100b: Secondary keyword extraction unit
120: News article
120a: Header
120b: Signature
120c: Quoted part
120d: Body

フロントページの続き (72)発明者 羽生田博美 東京都港区虎ノ門１丁目７番12号 沖電気工業株式会社内 (72)発明者 木下哲男 東京都港区虎ノ門１丁目７番12号 沖電気工業株式会社内
Continuation of the front page
(72) Inventor: Hiromi Hanyuda, c/o Oki Electric Industry Co., Ltd., 1-7-12 Toranomon, Minato-ku, Tokyo
(72) Inventor: Tetsuo Kinoshita, c/o Oki Electric Industry Co., Ltd., 1-7-12 Toranomon, Minato-ku, Tokyo

Claims

[Claims]

1. A method for extracting keywords from a network news article, characterized in that, when index keywords are extracted from a network news article, at least a process of extracting a header from the news article, a process of extracting a signature from the news article, a process of interpreting quoted parts in the news article, and a process of interpreting line breaks in the news article are performed; a morphological analysis that emphasizes character-type boundaries is performed on the news article thus processed to extract character strings from the news article; and the character strings are used as index keywords.

2. The method for extracting keywords from a network news article according to claim 1, characterized in that the extracted character strings are further divided using a dictionary and predetermined statistical division rules, and the character strings obtained by the division are used as index keywords.

3. The method for extracting keywords from a network news article according to claim 2, characterized in that the division processing using the dictionary and the statistical division rules sorts the extracted character string group in Kanji code order, and divides each character string of a character string group in which character strings having a specific character string s of two or more characters at their head continue for N or more in the sorted character string group into the character string s and the remaining portion (provided that the division is performed only when m is smaller than a predetermined value M, where m is the number of remaining portions that exist in neither the extracted character string group nor the dictionary, and N and M are predetermined positive integers).

4. The method for extracting keywords from a network news article according to claim 3, characterized in that a character string group is generated in which the character order of each character string of the character string group is reversed, and the division processing is further performed on this character string group.

5. An apparatus for extracting index keywords from a network news article, characterized by comprising: a header extraction unit that extracts a header from a news article; a signature extraction unit that extracts a signature from the news article; a quoted-part interpretation unit that interprets quoted parts in the news article; a line-break interpretation unit that interprets line breaks in the news article; and a morphological analysis unit that performs a morphological analysis emphasizing character-type boundaries on the news article processed by at least the header extraction unit, the signature extraction unit, the quoted-part interpretation unit, and the line-break interpretation unit.

6. The apparatus for extracting keywords from a network news article according to claim 5, characterized by further comprising a secondary keyword extraction unit that divides the character strings separated from the news article by the morphological analysis unit, using a dictionary and predetermined statistical division rules, to obtain secondary keywords as index keywords.

7. The apparatus for extracting keywords from a network news article according to claim 6, characterized in that the secondary keyword extraction unit sorts the character string group separated by the morphological analysis unit in Kanji code order, divides each character string of a character string group in which character strings having a specific character string s of two or more characters at their head continue for N or more in the sorted character string group into the character string s and the remaining portion, and uses the divided character strings as secondary keywords (provided that the division is performed only when m is smaller than a predetermined value M, where m is the number of remaining portions that exist in neither the extracted character string group nor the dictionary, and N and M are predetermined positive integers).

8. The apparatus for extracting keywords from a network news article according to claim 7, characterized in that the secondary keyword extraction unit generates a character string group in which the character order of each character string of the character string group is reversed, and further performs the division processing on this character string group.