JP2010231272A

JP2010231272A - Keyword string detection method and apparatus

Info

Publication number: JP2010231272A
Application number: JP2009075053A
Authority: JP
Inventors: Tetsuro Sato (佐藤 哲朗); Fuminori Kawaguchi (河口 文法)
Original assignee: NODC KK
Current assignee: NODC KK
Priority date: 2009-03-25
Filing date: 2009-03-25
Publication date: 2010-10-14

Abstract

PROBLEM TO BE SOLVED: To check, with a comparatively simple configuration, whether any of a large number of registered keyword strings is included in an input stream, even when many keyword strings are to be detected.
SOLUTION: A first-layer keyword detection unit 20 detects keywords (KW) in an input stream STM and outputs their identifiers KW. A second-layer keyword string detection unit 30 then detects keyword strings (KWA), each composed of a plurality of keywords (KW), and outputs their identifiers KWA. A third-layer W/G/B comprehensive determination unit 40 then determines whether one or more detected keyword string identifiers KWA belonging to the same group are, taken as a whole, white W, gray G, or black B, and outputs the result.
COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a keyword string detection method and apparatus.

The Internet makes it possible to obtain a wide range of information and to communicate with many people. On the other hand, when text that slanders individuals or organizations, or that offends public order and morals, is posted on community sites such as social networking services (SNS) and blogs, it becomes a social problem. Once unwanted postings spread a bad reputation about a company, that image is not easy to dispel, and the survival of the company can be at stake. For this reason, web patrol, which monitors postings on web pages, is carried out as a business.

Performing a web patrol by entering keywords into a search engine and having a person click the links in the result list to view web pages is inefficient. Web patrol can also be automated by software, but it is difficult to process a large amount of information at high speed.

If keyword search is performed using regular expressions, a single keyword can cover a large number of simple keywords. A regular expression search can be executed using a nondeterministic finite automaton (NFA).

Patent Document 1 below discloses a state machine architecture capable of operating thousands of NFAs in parallel on a single chip. It also discloses arranging the NFAs in a hierarchical structure.

Because this technique allows high-speed processing, data on a network can be filtered before it is input to a computer.
Patent Document 1: JP 2005-537550 A

However, monitoring postings on web pages with the state machine architecture of Patent Document 1 gives rise to the following problems.

In the embodiment of that document, the number of state transition evaluation symbols (the number of characters) of one NFA is 32, and one symbol is 1 byte. If this state machine architecture is applied to a language such as Japanese, in which the character set has several thousand characters and one character occupies 2 bytes, each NFA becomes large. Many chips must then be coupled and controlled as a whole, which complicates the configuration.

Meanwhile, an NFA can in theory be converted into a DFA (deterministic finite automaton). However, suppose, for example, that each of a first to a third keyword covers 100 simple keywords and that one checks whether the first to third keywords appear in a sentence in this order. With a DFA, the state machine must be constructed to allow for 100 × 100 × 100 = 1,000,000 state transitions for this single state transition evaluation keyword string (pattern) alone. Consequently, if the number of state transition evaluation keyword strings is, for example, 100,000, attempting to detect the keyword strings with a DFA causes a state explosion and is not practical.

Such problems arise not only for inappropriate postings on Web pages but also for favorable postings, market needs surveys on the Internet, classifying and organizing documents, and text mining. The same problems also arise when the keywords constituting a keyword string are composed of codes other than characters.

In view of the above problems, an object of the present invention is to provide a keyword string detection method and apparatus capable of checking, with a relatively simple configuration, whether any of a large number of registered keyword strings is included in an input stream, even when the number of keyword strings to be detected is large.

In a first aspect of the present invention, a keyword string detection apparatus that is supplied with an input stream, checks which of a plurality of registered keyword strings are included in the input stream, and outputs their identifiers comprises:
keyword detection means for checking which of a plurality of registered keywords have matches in the input stream and sequentially outputting the identifiers of the matching keywords;
conversion information storage means storing, for each of the plurality of keyword identifiers, keyword string identifier / in-string position information indicating which of the plurality of keyword strings contain the keyword and at which position;
keyword string detection state storage means storing a current state indicating up to which keyword each of the plurality of keyword strings has been detected; and
state transition control means for reading from the conversion information storage means the keyword string identifier / in-string position information corresponding to a keyword identifier detected by the keyword detection means, reading the corresponding current state from the keyword string detection state storage means on the basis of the read information, advancing the current state to the next state if the current state corresponds to the in-string position information, and outputting the keyword string identifier if the current state after the transition is an output state.

With the configuration of the first aspect, the first-layer keyword detection means detects the keywords included in the input stream and sequentially outputs their identifiers. In the second layer, the keyword string identifier / in-string position information corresponding to each keyword identifier is read from the conversion information storage means; the corresponding current state is read from the keyword string detection state storage means on the basis of that information; the current state is advanced to the next state if it corresponds to the in-string position information; and the keyword string identifier is output if the current state after the transition is an output state. As a result, even when the number of keyword strings to be detected is large, whether any of the many registered keyword strings is included in the input stream can be checked with a relatively simple configuration.

Other objects, configurations, and effects of the present invention will become apparent from the following description.

A keyword identifier is denoted by KW and a keyword string identifier by KWA. An index is appended to each KW and KWA to distinguish them, and the object that an identifier refers to is written as (identifier). For example, suppose that the keyword string (KWA3) means that one sentence contains the keywords (KW5), (KWm), and (KW3) in this order, and that zero or more arbitrary characters may lie between the keywords. Using a regular expression, this can be written as
(KWA3) = (KW5).*(KWm).*(KW3) ... (1)
Here "." is any single character including a line break, and "*" means zero or more repetitions of the immediately preceding character. Alternatively, "." may follow the usual definition of any single character other than a line break, with a line break regarded as the end of one sentence.
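As a minimal sketch, expression (1) can be tried with Python's `re` module. The keyword literals below are hypothetical placeholders standing in for (KW5), (KWm), and (KW3); they do not come from the specification. The `re.DOTALL` flag makes "." match a line break as well, which corresponds to the first definition of "." given above, while the default behavior corresponds to the alternative definition.

```python
import re

# Hypothetical stand-ins for the keywords (KW5), (KWm), (KW3).
KW5, KWM, KW3 = "alpha", "beta", "gamma"

# Expression (1): the three keywords in this order, with zero or more
# arbitrary characters between them.
pattern = ".*".join(re.escape(k) for k in (KW5, KWM, KW3))

# "." also matches a line break.
kwa3_any = re.compile(pattern, re.DOTALL)
# "." does not match a line break, so a match cannot span two lines.
kwa3_line = re.compile(pattern)

def contains(regex, text):
    """Return True if the keyword string occurs anywhere in the text."""
    return regex.search(text) is not None
```

With these definitions, a text that contains the three placeholders in a different order does not match, and a text in which they are split across lines matches only the `re.DOTALL` variant.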

FIG. 1 is a schematic block diagram of an inappropriate posting detection apparatus 10 according to Embodiment 1 of the present invention.

The apparatus 10 automatically detects inappropriate postings on Web pages such as bulletin boards and blogs in three layers. The first-layer keyword detection unit 20 detects keywords (KW) in the input stream STM and outputs their identifiers KW. The second-layer keyword string detection unit 30 then detects keyword strings (KWA), each consisting of a plurality of keywords (KW), and outputs their identifiers KWA. The third-layer W/G/B comprehensive determination unit 40 then determines whether the one or more detected keyword string identifiers KWA belonging to the same group are, as a whole, white W (no problem), gray G (human judgment required), or black B (inappropriate expression), and outputs the result.

For example, even if the black B keyword string (KWA1) = "morphine".*"want to buy" is detected, the result as a whole is judged to be white W if the white W keyword string (KWA2) = "pain treatment".*"morphine" is also detected and KWA1 and KWA2 are defined as belonging to the same group.

Posting data of Web pages, obtained via the Internet and stored in the memory or auxiliary storage device of a computer, is supplied to the keyword detection unit 20 as the input stream STM.

The keyword detection unit 20 includes a finite state automaton storage device 21, implemented for example in hardware. The input stream STM is supplied to the finite state automaton storage device 21 sequentially as input symbols, for example one byte at a time, in synchronization with a clock, and the process of reading the next state on the basis of the input symbol and the current state is repeated. When the current state indicates an output state, that is, when the input stream STM is determined to contain something matching one of the keywords (KW) registered in advance, the address of the memory location in which this output state is stored is output from the keyword detection unit 20 as the keyword identifier KW and supplied to the keyword string detection unit 30.

The finite state automaton storage device 21 can detect, for example, 100,000 keywords. A keyword may be a simple keyword containing no regular expression or a regular expression keyword. For a regular expression keyword, the NFA that detects it is expanded into a DFA. In addition, all the DFAs are merged into a single DFA by the AC method (Aho-Corasick algorithm), so that keywords can be detected by supplying the input stream STM to the keyword detection unit 20 only once.
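The single-pass detection described above can be illustrated in software with a small Aho-Corasick automaton. This is only a sketch under stated assumptions: it handles simple keywords (no regular expression keywords), it iterates over characters rather than bytes, and it uses the index of each keyword in the input list as the keyword identifier KW instead of the memory address of the output state. The function names `build_automaton` and `detect_keywords` are hypothetical.

```python
from collections import deque

def build_automaton(keywords):
    """Build goto/fail/output tables for a list of simple keywords.

    Assumption: the keyword identifier KW is the index in `keywords`.
    """
    goto = [{}]      # goto[state][symbol] -> next state
    fail = [0]       # failure link of each state
    out = [[]]       # keyword identifiers reported at each state
    for kw_id, word in enumerate(keywords):
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(kw_id)
    # Breadth-first pass that fills in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state also reports keywords that end at its failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def detect_keywords(stream, automaton):
    """Scan the stream once and yield keyword identifiers in detection order."""
    goto, fail, out = automaton
    state = 0
    for ch in stream:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        yield from out[state]
```

For example, `list(detect_keywords("ushers", build_automaton(["he", "she", "hers"])))` reports the identifiers of "she", "he", and "hers" after a single scan of the text.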

The keyword string detection unit 30 is the characteristic part of this embodiment and consists of a conversion information storage unit 31, a keyword string detection state storage unit 32, and a state transition control unit 33.

The conversion information storage unit 31 stores, for every keyword detectable by the keyword detection unit 20, information indicating which of the many keyword strings (KWA) registered in advance contain that keyword (KW) and at which position POS.

FIG. 2 shows a KW / KWA·POS matrix table that represents this information visually. Each cell of the table holds the value POS indicating the position at which the keyword (KW) of that row appears in the keyword string (KWA) of that column.

It would also be possible to store the in-string keyword position POS in the conversion information storage unit 31 using the keyword identifier KW as the row address and the keyword string identifier KWA as the column address. However, this would waste a large area corresponding to the blank cells of FIG. 2, and for each keyword identifier KW every column address (keyword string identifier KWA) would have to be checked for a stored position POS, which lengthens the processing time.

Therefore, the conversion information storage unit 31 stores an index table 311 and keyword string identifier / in-string position information 312, as shown in FIG. 3. Word address ADDR = i of the index table 311 holds the keyword identifier KWi, the start address SADRi of the related keyword string identifier / in-string position information 312, and the number of words WNi from that start address. For KW2, for example, each of the four words starting at word address ADDR2 of the information 312 holds a keyword string identifier KWA related to KW2 and the in-string keyword position POS.
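The two-table layout can be sketched as follows. The registered keyword strings are hypothetical, POS is written as a 0-based position for readability, and the function names are illustrative only; the sketch simply shows how an index entry (KW, SADR, WN) selects a contiguous run of (KWA, POS) words so that no blank cells are stored.

```python
# Hypothetical registrations: each keyword string lists its keyword IDs in order.
keyword_strings = {
    1: [2, 7],        # (KWA1) = (KW2).*(KW7)
    2: [4, 2],        # (KWA2) = (KW4).*(KW2)
    3: [5, 9, 3],     # (KWA3) = (KW5).*(KW9).*(KW3)
}

def build_conversion_info(keyword_strings):
    """Return (index_table, info) in the spirit of tables 311 and 312.

    index_table: list of (KW, SADR, WN), sorted by KW
    info:        flat list of (KWA, POS) words
    """
    by_kw = {}
    for kwa, kws in sorted(keyword_strings.items()):
        for pos, kw in enumerate(kws):
            by_kw.setdefault(kw, []).append((kwa, pos))
    index_table, info = [], []
    for kw in sorted(by_kw):
        # SADR is where this keyword's run starts, WN is its length.
        index_table.append((kw, len(info), len(by_kw[kw])))
        info.extend(by_kw[kw])
    return index_table, info

def lookup(index_table, info, kw):
    """Return the (KWA, POS) words for one keyword identifier."""
    for entry_kw, sadr, wn in index_table:
        if entry_kw == kw:
            return info[sadr:sadr + wn]
    return []

index_table, info = build_conversion_info(keyword_strings)
```

Here the storage needed is one word per (keyword, keyword string) pair, 7 words in this example, rather than one cell per combination of keyword and keyword string.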

As shown in FIG. 4(A), the keyword string detection state storage unit 32 of FIG. 1 stores a current state array 321 and an initial state array 322. In the current state array 321, the current state (current detection state) ST of each registered keyword string is stored at the address corresponding to the value of its identifier KWA (for example, equal to that value or shifted from it by a constant); the initial state array 322 stores the initial states ST0. In FIG. 4(A), the current state of the i-th keyword string identifier KWAi is denoted by STi and its initial state by ST0i. For each value of i, the address of the current state STi equals the address of the initial state ST0i plus a constant.

Returning to FIG. 1, when the input stream STM indicates the end of a sentence, the reset signal RST supplied to the state transition control unit 33 becomes active. In response, the state transition control unit 33 executes the initialization process 332 shown in FIG. 4(B) on the keyword string detection state storage unit 32. In what follows, the codes in parentheses are the step identifiers in the figures.

(S0) Set the initialization-in-progress flag F.

(S1) In FIG. 4(A), assign the initial state ST0i to the current state STi for each of i = 1 to n. This may be done in software via a CPU or by a DMAC (direct memory access controller).

As shown in FIG. 1, the state transition control unit 33 includes a KW queue 331. When the finite state automaton storage device 21 reaches an output state as described above, the corresponding address is output from the keyword detection unit 20 as the keyword identifier KW and entered into the KW queue 331 of the state transition control unit 33. The keyword identifier KW may instead be this address shifted by a predetermined value, or a value obtained by closing the gaps between addresses by means of a CAM (content-addressable memory).

FIG. 5 is a flowchart of the keyword string state transition process 333 performed by the state transition control unit 33. This process 333 may be implemented in software or in hardware.

(S10) Read one keyword identifier KW from the KW queue 331.

(S11) If the KW queue 331 is empty (the keyword identifier KW is NULL), return to step S10; otherwise proceed to step S12.

(S12) Find the word address ADDR of the index table 311 at which the keyword identifier KWi read in step S10 is stored. This can be found, for example, by binary search. Alternatively, the index table 311 may be reflected in the memory of a finite automaton device so that the finite automaton device performs this search, or a CAM may be used to find the word address ADDR in a single access.

(S13) Read the start address SADRi and the number of words WNi from the word address ADDR that was found, and assign them to the variables ADR and CNT, respectively. The processing of steps S14 to S23 is then repeated WNi times.

(S14) Address the keyword string identifier / in-string position information 312 with ADR and read the keyword string identifier KWA and the in-string keyword position POS from it.

(S15) If F = '1', that is, if the initialization process 332 of FIG. 4(B) is in progress, wait for it to finish and then proceed to step S16; otherwise proceed directly to step S16.

(S16) Read the current state STi from the address given by the keyword string identifier KWAi in the keyword string detection state storage unit 32.

(S17) If the current state STi that was read corresponds to the value of the in-string keyword position POSi read in step S14, proceed to step S18; otherwise proceed to step S21.

A concrete example of how the current state ST and the in-string keyword position POS are represented is described here with reference to FIG. 6(A).

Assume that one keyword string (KWA) contains at most eight keywords, and that the current state ST and the in-string keyword position POS are each represented by 1 byte. The keyword string identifier KWA corresponds to the address of this byte. For any keyword string (KWA), the last keyword, which is keyword (KW3) in the case of (KWA3) in expression (1), is assigned to the most significant bit (MSB) of the byte, and the keyword j places before it is assigned to bit (7 − j). For example, only the assigned bit is set to '1' and all other bits are '0'. Thus the initial state ST03 of the current state ST3 of KWA3 is '00100000'.

(S18) Shift the current state ST by 1 bit toward the MSB.

In the above example, as shown in FIG. 6(B), if keyword (KW5) is detected and ST3 = POS, the current state ST is shifted by 1 bit toward the MSB, giving ST3 = '01000000'. If keyword (KWm) is detected while ST3 has this value, the current state is shifted by 1 bit toward the MSB, giving ST3 = '10000000'. If keyword (KW3) is detected while ST3 has this value, the current state is again shifted by 1 bit toward the MSB; the '1' is shifted out of the byte, giving ST3 = '00000000', and the carry C becomes '1'.
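The one-hot encoding and the shift with carry can be checked with a short sketch. The helper names `pos_byte` and `shift_state` are hypothetical, and positions are counted from 0 for the first keyword of a string; the bit assignment itself (last keyword at the MSB, the keyword j places earlier at bit 7 − j) follows the description above.

```python
def pos_byte(num_keywords, position):
    """One-hot byte for the keyword at 0-based `position` in a keyword
    string of `num_keywords` keywords; the last keyword maps to bit 7."""
    if not 1 <= num_keywords <= 8 or not 0 <= position < num_keywords:
        raise ValueError("a keyword string holds 1 to 8 keywords")
    j = (num_keywords - 1) - position   # distance from the last keyword
    return 1 << (7 - j)

def shift_state(state):
    """Shift an 8-bit state one bit toward the MSB; return (state, carry)."""
    carry = (state >> 7) & 1
    return (state << 1) & 0xFF, carry

# Trace for a three-keyword string such as (KWA3): three matching
# keywords arrive in order, so the state is shifted three times.
st = pos_byte(3, 0)                     # initial state, bit 5 set
trace = []
for _ in range(3):
    st, carry = shift_state(st)
    trace.append((format(st, "08b"), carry))
```

In this trace the carry is 0 after the first two shifts and becomes 1 on the third, which is the condition tested in step S19.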

(S19) If the carry C is '1', proceed to step S20; otherwise proceed to step S21.

(S20) To indicate that the keyword string (KWA) corresponding to the keyword string identifier KWA has been detected, output the keyword string identifier KWA and supply it to the W/G/B comprehensive determination unit 40.

(S21) Decrement the counter CNT by 1.

(S22) If CNT < 0, return to step S10; otherwise proceed to step S23.

(S23) Increment the word address ADR by 1 and return to step S14.

In this way, it can be detected with a relatively simple configuration that the input stream STM contains a keyword string matching one of the many registered keyword strings (KWA).
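The steps S10 to S23 can be summarized in a compact software sketch. It is not the apparatus itself: the class name `KeywordStringDetector` is hypothetical, Python dictionaries stand in for the index table, the position information, and the state arrays, and a plain loop over the entries of one keyword replaces the CNT/ADR bookkeeping. The sketch also assumes that a keyword appears at most once in any one keyword string.

```python
class KeywordStringDetector:
    """Second-layer detector driven by keyword identifiers (software sketch)."""

    def __init__(self, keyword_strings):
        # keyword_strings: {KWA: [KW, KW, ...]}, at most 8 keywords each,
        # each keyword assumed to occur at most once per keyword string.
        self.conversion = {}   # KW -> list of (KWA, POS byte)
        self.initial = {}      # KWA -> initial state ST0
        for kwa, kws in keyword_strings.items():
            n = len(kws)
            if not 1 <= n <= 8:
                raise ValueError("1 to 8 keywords per keyword string")
            for position, kw in enumerate(kws):
                pos = 1 << (8 - n + position)   # last keyword maps to the MSB
                self.conversion.setdefault(kw, []).append((kwa, pos))
            self.initial[kwa] = 1 << (8 - n)
        self.reset()

    def reset(self):
        """Initialization (S1): copy every initial state to the current state."""
        self.current = dict(self.initial)

    def feed(self, kw):
        """Process one keyword identifier; return the KWAs it completes."""
        detected = []
        for kwa, pos in self.conversion.get(kw, []):   # S12 to S14
            state = self.current[kwa]                  # S16
            if state != pos:                           # S17
                continue
            shifted = state << 1                       # S18
            self.current[kwa] = shifted & 0xFF
            if shifted & 0x100:                        # S19: carry C is 1
                detected.append(kwa)                   # S20
        return detected
```

A typical use is to call `feed` for each identifier produced by the first layer and `reset` at the end of each sentence; a keyword that arrives out of order leaves the state of the affected keyword string unchanged.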

The W/G/B comprehensive determination unit 40 holds in its storage unit, for each registered keyword string (KWA), the keyword string identifier KWA, information indicating whether it is white W, gray G, or black B, and the group ID to which the keyword string identifier KWA belongs, stored in association with one another. Based on the keyword string identifiers KWA having the same group ID, the unit makes an overall determination of whether the detected keyword string identifiers KWA are W, G, or B. For example, even if the detected keyword string identifiers KWA in one group include a B keyword string identifier, the group is judged to be W as a whole if it also includes a W keyword string identifier.
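A possible form of this group-wise determination is sketched below. The text states only that W overrides B within a group; how G combines with B is not stated, so the ordering used here (B outranks G when no W is present) is an assumption made for illustration. The registry contents and the function name `judge` are likewise hypothetical.

```python
# Hypothetical registration: KWA -> (class, group ID).
registry = {
    1: ("B", 10),   # e.g. "morphine" ... "want to buy"
    2: ("W", 10),   # e.g. "pain treatment" ... "morphine"
    3: ("G", 20),
}

def judge(detected_kwas, registry):
    """Return {group ID: 'W' | 'G' | 'B'} for the detected keyword strings."""
    groups = {}
    for kwa in detected_kwas:
        cls, group = registry[kwa]
        groups.setdefault(group, set()).add(cls)
    result = {}
    for group, classes in groups.items():
        if "W" in classes:        # stated in the text: W overrides B
            result[group] = "W"
        elif "B" in classes:      # assumption: B outranks G
            result[group] = "B"
        else:
            result[group] = "G"
    return result
```

With this registry, detecting KWA1 alone yields B for group 10, while detecting KWA1 together with KWA2 yields W, matching the morphine example given earlier.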

When the postings on a Web page are those of a bulletin board, each sentence is generally short, so the reset signal RST is activated frequently. If the number of keyword string identifiers KWA is large, for example 100,000, the initialization process 332 of FIG. 4(B) therefore accounts for a relatively large share of the execution time. At the same time, in such a case a relatively large proportion of the current states can be expected to be still in the initial state immediately before a reset, so much of the initialization process 332 is wasted.

Therefore, in Embodiment 2 of the present invention, the keyword string detection state storage unit 32 of FIG. 4(A) is further provided with an initialized bitmap 323 as shown in FIG. 8, and in response to activation of the reset signal RST only this initialized bitmap 323 is initialized, for example cleared to zero.

Let A0 be the first byte address of the bitmap 323. Bit j of the byte at address A = A0 + i of the bitmap 323 corresponds to keyword string identifier KWA(8*i + j) of the current state array 321, and if this bit j is '0', only the current state ST(8*i + j) is initialized. That is, the process shown in FIG. 7 is executed in place of step S15 of FIG. 5.

(S150) Read the bit corresponding to the keyword string identifier KWA from the initialized bitmap 323, as described above.

(S151) If this value is '0', proceed to step S152; otherwise proceed to step S16 of FIG. 5.

(S152) Read the initial state ST0 from the address obtained by adding a predetermined offset to the keyword string identifier KWA, assign it to the current state ST at address KWA, and proceed to step S16.

In other respects, this embodiment is the same as Embodiment 1.
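The lazy initialization of steps S150 to S152 can be sketched as follows. The class name `LazyStates` is hypothetical, keyword string identifiers are taken as 0-based indexes, and the sketch assumes that the bitmap bit is set to '1' once a state has been initialized, which the name "initialized bitmap" suggests but the steps above do not spell out.

```python
class LazyStates:
    """Current-state array with an 'initialized' bitmap (sketch)."""

    def __init__(self, initial_states):
        self.initial = bytes(initial_states)             # initial state array
        self.current = bytearray(len(initial_states))    # current state array
        self.bitmap = bytearray((len(initial_states) + 7) // 8)

    def reset(self):
        """On reset signal RST, clear only the bitmap."""
        for a in range(len(self.bitmap)):
            self.bitmap[a] = 0

    def read(self, kwa):
        """Return the current state of KWA, initializing it on first access."""
        i, j = divmod(kwa, 8)                    # bit j of byte A0 + i
        if not (self.bitmap[i] >> j) & 1:        # S151: not yet initialized
            self.current[kwa] = self.initial[kwa]    # S152
            self.bitmap[i] |= 1 << j             # assumption: mark as done
        return self.current[kwa]

    def write(self, kwa, state):
        """Store a new current state, initializing the slot first if needed."""
        self.read(kwa)
        self.current[kwa] = state
```

A reset then costs one pass over the bitmap, one eighth of the size of the state array, and only the states that are actually touched afterwards are copied from the initial state array.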

In Embodiments 1 and 2 above, the counter for the current state ST is a shift counter, but it can also be a binary counter. If the maximum number of keywords in one keyword string (KWA) is, for example, 7, the counter can be made 3 bits wide and two counters can be stored in 1 byte. Accessing in units of bytes would waste 2 bits per byte, so 64 bits are accessed as one word instead. In this case 3 × 21 = 63, so only 1 bit is left over, and this bit is the MSB.

FIG. 9 shows the current state array 321A of the keyword string detection state storage unit 32A configured in this way. In the current state array 321A of FIG. 9, the hatching in the MSB cells marks this leftover bit, and counts are shown only for the counters that have left the initial state. The address of a 3-bit counter is specified by the address of the word containing the counter and the address of the counter's first bit within the word. The initial state array 322A has a configuration corresponding to that of the current state array 321A.

In step S18 of FIG. 5, the current state ST is incremented by 1 instead of being shifted by 1 bit. The current state ST is initialized in units of 64-bit words, in the same manner as in Embodiment 1 or Embodiment 2.
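The packing of 21 three-bit counters into one 64-bit word can be sketched as below. The order of the counters inside a word (counter 0 in the lowest three bits) and the 0-based indexing of keyword strings are assumptions of this sketch, as are the function names; the description fixes only that each word holds 21 counters and that the MSB is the spare bit.

```python
COUNTERS_PER_WORD = 21          # 3 bits x 21 = 63 bits; bit 63 (MSB) is spare
MASK = 0b111

def locate(kwa):
    """Map a 0-based keyword string index to (word index, bit offset)."""
    word, slot = divmod(kwa, COUNTERS_PER_WORD)
    return word, 3 * slot

def get_counter(words, kwa):
    """Read the 3-bit counter of one keyword string."""
    word, shift = locate(kwa)
    return (words[word] >> shift) & MASK

def increment_counter(words, kwa):
    """Binary-counter form of step S18: add 1 to the 3-bit field of this KWA."""
    word, shift = locate(kwa)
    value = (get_counter(words, kwa) + 1) & MASK
    words[word] = (words[word] & ~(MASK << shift)) | (value << shift)
    return value
```

Compared with one byte per keyword string, this layout stores 21 states in 8 bytes, so 100,000 keyword strings need 4,762 words (about 38 KB) instead of 100,000 bytes.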

According to this Embodiment 3, the storage capacity of the keyword string detection state storage unit 32A can be reduced, which is particularly suitable when the number of keyword string identifiers KWA is large.

In Embodiment 2, one bit of the initialized bitmap 323 corresponds to one keyword string identifier KWA. In Embodiment 3, however, 21 states correspond to one 64-bit word, so when the initialized bitmap 323 of Embodiment 2 is applied to Embodiment 3 it is more efficient to initialize the current states word by word.

Therefore, in Embodiment 4 of the present invention, one bit of the initialized bitmap 323A shown in FIG. 10 is assigned to each such word, that is, to each set of 21 current states ST. In response to activation of the reset signal RST, only the initialized bitmap 323A is initialized, and processing similar to that of FIG. 7 is performed word by word in place of step S15 of FIG. 5.

The MSB of each word of the initialized bitmap 323A is a valid bit VB. VB = '0' means that the remaining 63 bits of the word, excluding the valid bit VB, are all '0'; VB = '1' means that at least one of these 63 bits is '1'. The value of the valid bit VB therefore gives a first, coarse indication of whether initialization is needed. The keyword string detection state storage unit 32B consists of this initialized bitmap 323A together with arrays having the same configuration as the initial state array 322A and the current state array 321A of FIG. 9.

The k-th bit of the initialized bitmap 323A, not counting the MSBs, corresponds to the 21 current states ST in the k-th word from the top of the current state array 321A. Let WA0 and WA1 denote the head word addresses of the bitmap 323A and the current state array 321A, respectively. Then the j-th bit of the word at address (WA0 + i) of the bitmap 323A corresponds to the 21 current states ST in the word at address (WA1 + i*63 + j) of the current state array 321A.

If this j-th bit is '0' in step S151 of FIG. 7, then in step S151 the value at word address (WA2 + i*63 + j) of the initial state array 322A is assigned to word address (WA1 + i*63 + j) of the current state array 321A, so that the 21 current states ST are initialized at once. Here, WA2 is the head word address of the initial state array 322A.
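The word-wise lazy initialization of Embodiment 4 can be sketched as below. This is an illustrative model under stated assumptions: the class name `LazyStateStore`, the use of Python lists with WA0 = WA1 = WA2 = 0, and a 0-based j counted from the least significant bit are all hypothetical choices, not details taken from the figures.

```python
BITS_PER_MAP_WORD = 63            # usable bits per bitmap word; the MSB is VB
VB = 1 << 63                      # valid bit: set once any of the 63 bits is '1'


class LazyStateStore:
    """Hypothetical model of storage unit 32B (bitmap 323A plus two arrays)."""

    def __init__(self, initial_words):
        self.initial = list(initial_words)        # stands in for array 322A
        self.current = [0] * len(self.initial)    # stands in for array 321A
        n_map = -(-len(self.initial) // BITS_PER_MAP_WORD)  # ceiling division
        self.initialized = [0] * n_map            # stands in for bitmap 323A

    def reset(self):
        """On RST, clear only the bitmap; words with VB = 0 are already clear."""
        for i, word in enumerate(self.initialized):
            if word & VB:
                self.initialized[i] = 0

    def load(self, word_index):
        """Return a current-state word, initializing it first if its bit is '0'."""
        i, j = divmod(word_index, BITS_PER_MAP_WORD)   # word_index = i*63 + j
        if not (self.initialized[i] >> j) & 1:
            self.current[word_index] = self.initial[word_index]
            self.initialized[i] |= (1 << j) | VB
        return self.current[word_index]
```

Under this model a reset costs one write per bitmap word that was actually used, while each group of 21 current states is copied from the initial state array only on first access after the reset.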

In addition, keyword string identifier / in-string position information 312A as shown in FIG. 11 is used in place of the keyword string identifier / in-string position information 312 of FIG. 3. For example, the information block 312-2 of FIG. 3 corresponds to the information block 312A-2 of FIG. 11. This information block 312A-2 consists of a valid POS bitmap 313 and a POS word group 314.

The bitmap 313 has formally the same structure as the bitmap 323A. However, the bitmap 313 has nothing to do with initialization: a bit '1' means that the POS word (21 POS values) corresponding to the word of the current state array 321A at that bit position is included in the POS word group 314.

The POS word group 314 contains only the POS words corresponding to '1' bits of the bitmap 313, excluding the MSBs. Counting only the '1' bits of the valid POS bitmap 313, the k-th '1' corresponds to the k-th word of the POS word group 314.

More specifically, let WA3 and WA4 denote the head word addresses of the bitmap 313 and the POS word group 314, respectively. If the j-th bit of the word at address (WA3 + i) is the k-th '1' when only the '1' bits of the bitmap 313 are counted (excluding any '1' in an MSB), then the POS word at word address (WA4 + k - 1) of the POS word group 314 corresponds to the 21 current states ST at word address (WA1 + i*63 + j) of the current state array 321A. Within this POS word, only POS values other than 0 are valid, and each valid POS is compared with the corresponding current state ST in step S17 of FIG. 5.
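The mapping from a '1' bit of the valid POS bitmap 313 to its POS word amounts to a rank (population count) query. The sketch below assumes list-based storage with WA3 = WA4 = 0, a 0-based j counted from the least significant bit, and a hypothetical function name `pos_word_for`; the 0-based `rank` in the code equals k - 1 in the text.

```python
LOW63 = (1 << 63) - 1             # masks off the MSB of a bitmap word


def pos_word_for(bitmap_words, pos_words, state_word_index):
    """Return the POS word paired with a current-state word, or None.

    bitmap_words stands in for bitmap 313 and pos_words for POS word
    group 314; state_word_index = i*63 + j as in the text (assumed 0-based).
    """
    i, j = divmod(state_word_index, 63)
    if i >= len(bitmap_words) or not (bitmap_words[i] >> j) & 1:
        return None               # no POS word stored for this state word
    # Count the '1' bits that precede this one, ignoring every MSB.
    rank = sum(bin(w & LOW63).count("1") for w in bitmap_words[:i])
    rank += bin(bitmap_words[i] & ((1 << j) - 1)).count("1")
    return pos_words[rank]        # address WA4 + k - 1 with k = rank + 1
```

Because the POS word group stores only the words that are actually present, its length equals the number of '1' bits in the bitmap, which is why a separate word count is not strictly needed.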

Because the word count WN of FIG. 3 can be derived from the bitmap 313, the word count WN of FIG. 3 may be omitted.

In Embodiment 4, when the keyword string identifiers KWA corresponding to one head address SADR of the index table 311 are relatively numerous and close to one another, many POS values can be processed with a single word of the POS word group 314. High-speed processing therefore becomes possible, particularly when this processing is performed in parallel in hardware. Even when the processing is done in software, it can be accelerated because the number of memory accesses is reduced and more of the work can be done in CPU registers.
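As one way to picture how a single word can carry the comparison of 21 POS values at once, the sketch below uses a word-parallel (SWAR) bit trick. It is an illustration only: the text does not specify this technique, and the sketch assumes that a POS "corresponds" to a current state ST when the two 3-bit fields are equal, with fields laid out at bits 3k to 3k+2. All names are hypothetical.

```python
FIELDS = 21
L = sum(1 << (3 * k) for k in range(FIELDS))   # 0b001 in every 3-bit field
LOW2 = L * 3                                   # 0b011 in every 3-bit field
H = L << 2                                     # 0b100 in every 3-bit field


def nonzero_flags(x: int) -> int:
    """Set the top bit of each 3-bit field of x whose value is non-zero."""
    # (low two bits + 0b011) carries into the top bit iff they are non-zero;
    # the sum is at most 0b110, so it never spills into the next field.
    return (((x & LOW2) + LOW2) | x) & H


def match_flags(st_word: int, pos_word: int) -> int:
    """Flag fields where POS is valid (non-zero) and equal to ST (assumed rule)."""
    equal = ~nonzero_flags(st_word ^ pos_word) & H
    return equal & nonzero_flags(pos_word)


def matching_fields(st_word: int, pos_word: int):
    """List the indices of the fields flagged by match_flags."""
    flags = match_flags(st_word, pos_word)
    return [k for k in range(FIELDS) if (flags >> (3 * k + 2)) & 1]
```

Here a handful of register operations replaces 21 separate comparisons, which is the kind of saving the paragraph above attributes to processing whole POS words.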

Preferred embodiments of the present invention have been described above, but the present invention also encompasses various other modifications. Other combinations of the components described in the above embodiments, implementations that realize the function of each component with a different configuration, and other configurations that a person skilled in the art would conceive from these configurations or functions are all included in the present invention. For example, the present invention includes the following modifications.

For example, a configuration may be provided with both the keyword string identifier / in-string position information 312 and keyword string detection state storage unit 32 of Embodiment 1 or 2 and the keyword string identifier / in-string position information 312A and keyword string detection state storage unit 32B of Embodiment 4. If the number of keyword strings (KWA) containing a given keyword (KW) is at or below a set value, the information 312 and storage unit 32 of Embodiment 1 or 2 are used; if it exceeds the set value, the information 312A and storage unit 32B of Embodiment 4 are used.

The current state ST may also use a down counter instead of an up counter, with a predetermined value, for example 0, serving as the output state.

Furthermore, one of the current state ST and the position POS may be a shift counter and the other a binary counter, with the control unit determining whether the two substantially match (correspond).

Each of the above embodiments has been described for the case where the present invention is applied to the inappropriate writing detection device 10. The invention can also be applied to devices that detect favorable postings, devices that classify and organize texts and documents, devices that detect expressions of requests in order to survey market needs, text mining devices, and the like. The present invention can further be applied to binary data other than text.

FIG. 1 is a schematic block diagram of an inappropriate writing detection device according to Embodiment 1 of the present invention.
FIG. 2 shows a keyword identifier KW / keyword string identifier KWA / in-string position POS matrix table that visually represents the information logically stored in the conversion information storage unit of FIG. 1.
FIG. 3 illustrates the structure of the information stored in the conversion information storage unit of FIG. 1.
FIG. 4(A) illustrates the structure of the information stored in the keyword string detection state storage unit of FIG. 1, and FIG. 4(B) is a flowchart of the current state initialization processing performed by the state transition control unit of FIG. 1.
FIG. 5 is a flowchart of the keyword string state transition processing performed by the state transition control unit of FIG. 1.
FIG. 6(A) illustrates a specific example of how the current state ST and the in-string keyword position POS are represented, and FIG. 6(B) illustrates the transition of the current state ST according to the in-string keyword position POS.
FIG. 7 is a partial flowchart of the current state initialization processing according to Embodiment 2 of the present invention.
FIG. 8 illustrates the initialized bitmap used in this current state initialization processing.
FIG. 9 illustrates the structure of the information stored in the keyword string detection state storage unit according to Embodiment 3 of the present invention.
FIG. 10 illustrates the structure of the information stored in the keyword string detection state storage unit according to Embodiment 4 of the present invention.
FIG. 11 illustrates the structure of the information stored in the keyword string identifier / in-string position information storage unit according to Embodiment 4 of the present invention.

Description of symbols
10: inappropriate writing detection device
20: keyword detection unit
21: finite state automaton storage device
30: keyword string detection unit
31: conversion information storage unit
311: index table
312, 312A: keyword string identifier / in-string position information storage unit
312-2, 312A-2: information block
313: valid POS bitmap
314: POS word group
32, 32A, 32B: keyword string detection state storage unit
321, 321A: current state array
322, 322A: initial state array
323, 323A: initialized bitmap
33: state transition control unit
331: KW queue
332: initialization processing
333: keyword string state transition processing
40: W/G/B integrated determination unit
F: initialization-in-progress flag
SADR, SADRi: head address
WN, WNi: number of words
ADDR: word address
C: carry
KW, KWi: keyword identifier
KWA, KWAi: keyword string identifier
POS, POSi: in-string keyword position
ST, STi: current state
ST0, ST0i: initial state
VB: valid bit

Claims

1. A keyword string detection device that receives an input stream, checks which of a plurality of registered keyword strings are included in the input stream, and outputs the identifiers thereof, the device comprising:
keyword detection means for checking which of a plurality of registered keywords are included in the input stream and sequentially outputting the identifiers of the matching keywords;
conversion information storage means storing, for each of the plurality of keyword identifiers, keyword string identifier / in-string position information indicating at which position of which of the plurality of keyword strings the keyword is included;
keyword string detection state storage means for storing, for each of the plurality of keyword strings, a current state indicating up to which keyword the keyword string has been detected; and
state transition control means for reading, from the conversion information storage means, the keyword string identifier / in-string position information corresponding to the keyword identifier detected by the keyword detection means, reading the corresponding current state in the keyword string detection state storage means on the basis of the read information, transitioning this current state to the next state if it corresponds to the in-string position information, and outputting the keyword string identifier if the current state after the transition is an output state.

2. The keyword string detection device according to claim 1, wherein
the conversion information storage means includes an index table storage unit and a keyword string identifier / in-string position information storage unit,
the index table storage unit stores, for each of the keyword identifiers, the head address and word count information of an information block in the keyword string identifier / in-string position information storage unit, and
the information block holds, in the number of words indicated by the word count information, the keyword string identifier / in-string position information indicating at which position of which keyword strings the keyword identified by the corresponding keyword identifier is included.

3. The keyword string detection device according to claim 1 or 2, wherein the keyword string detection state storage means has a current state array that stores, as the current state, at an address corresponding to each keyword string identifier, a counter indicating up to which keyword of the keyword string identified by that keyword string identifier has been detected.

4. The keyword string detection device according to claim 3, wherein the address corresponding to each keyword string identifier is represented by a word address and the position of the counter within the word.

5. The keyword string detection device according to claim 3 or 4, wherein
the keyword string detection state storage means further has an initial state array corresponding to the current state array, and
the state transition control means assigns part or all of the contents of the initial state array to the corresponding part or all of the current state array at a predetermined timing and, when the contents of the counter are changed in order to transition to the next state, determines that the output state has been reached if the counter takes a predetermined value or a carry from the counter is set.

6. The keyword string detection device according to any one of claims 1 to 5, further comprising an initialization necessity determination bitmap in which each word containing one or more of the current states corresponds to one bit, wherein
the state transition control means initializes the initialization necessity determination bitmap in response to activation of a supplied reset signal and, when reading the contents of a word containing the one or more current states, if the corresponding bit of the initialization necessity determination bitmap still holds its initialized value, initializes the word with the corresponding initial-state word and inverts the bit.

Each information block
A current state word group including one or more counter words that include a plurality of counters and can determine whether the counter is valid or invalid by a count value;
A multiple counter word presence / absence bitmap indicating that each of the multiple counter words corresponds to 1 bit, and the bit corresponding to each of the multiple counter words included in the current state word group is valid;
6. The keyword string detection device according to claim 3, wherein the keyword string detection device includes:

8. A keyword string detection method in which an input stream is supplied, it is checked which of a plurality of registered keyword strings are included in the input stream, and the identifiers thereof are output, the method comprising:
causing keyword detection means to check which of a plurality of registered keywords are included in the input stream and to sequentially output the identifiers of the matching keywords; and
causing state transition control means to
read, from conversion information storage means storing, for each of the plurality of keyword identifiers, keyword string identifier / in-string position information indicating at which position of which of the plurality of keyword strings the keyword is included, the keyword string identifier / in-string position information corresponding to the keyword identifier detected by the keyword detection means, and
read, on the basis of the read keyword string identifier / in-string position information, the corresponding current state in keyword string detection state storage means that stores, for each of the plurality of keyword strings, a current state indicating up to which keyword the keyword string has been detected, transition this current state to the next state if it corresponds to the in-string position information, and output the keyword string identifier if the current state after the transition is an output state.

9. The keyword string detection method according to claim 8, wherein
the keyword string detection state storage means has a current state array that stores, as the current state, at an address corresponding to each keyword string identifier, a counter indicating up to which keyword of the keyword string identified by that keyword string identifier has been detected, and
the state transition control means is caused to assign part or all of the contents of an initial state array corresponding to the current state array to the corresponding part or all of the current state array at a predetermined timing and, when the contents of the counter are changed in order to transition to the next state, to determine that the output state has been reached if the counter takes a predetermined value or a carry from the counter is set.

10. The keyword string detection method according to claim 8 or 9, wherein
an initialization necessity determination bitmap is further provided in which each word containing one or more of the current states corresponds to one bit, and
the state transition control means is caused to initialize the initialization necessity determination bitmap in response to activation of a supplied reset signal and, when reading the contents of a word containing the one or more current states, if the corresponding bit of the initialization necessity determination bitmap still holds its initialized value, to initialize the word with the corresponding initial-state word and invert the bit.