JPH06301725A

JPH06301725A - Retrieval device for character-string of hierarchized document

Info

Publication number: JPH06301725A
Application number: JP5110900A
Authority: JP
Inventors: Motoyoshi Sawatani; 元喜澤谷
Original assignee: Nippon Steel Corp
Current assignee: Nippon Steel Corp
Priority date: 1993-04-13
Filing date: 1993-04-13
Publication date: 1994-10-28
Anticipated expiration: 2017-08-19
Also published as: JP3315755B2

Abstract

PURPOSE: To obtain, in a single search, not only the search result for each document but also the search result for each stage, by displaying the degree of coincidence of each document and the degree of coincidence of each character-string set of each stage either stepwise or simultaneously. CONSTITUTION: A client machine 4 accesses a server 1, and a search unit 11 of the server 1 searches all documents. Because the search only requires generating autocorrelation information for the search string and collating it with a map, it can be executed at high speed. The result from the search unit 11 is sent as is to an aggregation processing unit 12. For each document, the aggregation processing unit 12 aggregates the degrees of coincidence for the character-string sets of each stage, takes the highest degree of coincidence as the degree of coincidence of that stage, and takes the degree of coincidence of the topmost stage of each document as the degree of coincidence of that document. The result is sent to the client machine 4, and the documents whose degree of coincidence exceeds a set threshold are displayed together on its display.

Description

Detailed Description of the Invention

[0001]

[Industrial field of application] The present invention relates to a character-string search device for hierarchical documents. In particular, it relates to a device for a storage device that stores one or more hierarchical documents of one or more stages, in which the character-string set of each stage is composed of one or more character-string sets on the stage below. The device performs a fuzzy search, including incomplete matches of the target character string, against each document and against each character-string set of each stage in each document.

[0002]

[Prior art] Conventionally, when checking whether a particular character string is contained in each document in a storage device that stores a large number of documents, an exact-match search that checks only whether the entire string is contained has normally been performed. However, particularly when searching for long loanwords written in katakana, the search could fail because of subtle differences in notation. There was also the problem that the search becomes markedly slower when the total volume of documents to be searched is large, for example when there are many documents.

[0003] Japanese Patent Application Laid-Open No. 4-326164, filed by the same applicant as the present application, therefore discloses a search system in which autocorrelation information for each character (code) is stored for each document at the time the document is stored, and in which, at search time, the autocorrelation information of each character of the search string is obtained and its presence or absence is detected. With this structure, not only the presence or absence of the search string in each target document but also its degree of coincidence can be checked easily and at high speed.

[0004] The above system speeds up the search for a specific character string in each document. However, when a single document is very large and is a hierarchical document divided into items such as "Title", "Preface", "Body 1", "Body 2", and "Afterword", later processing may become easier if it is known which item contains the desired string. In addition, when nothing has a high degree of coincidence, knowing which item contains a string that matches the specific string, and to what degree, may be important in deciding whether to end the search.

[0005]

[Problem to be solved by the invention] The present invention has been made in view of the problems of the prior art described above. Its main object is to provide a character-string search device for hierarchical documents that not only determines whether a specific character string exists in a document, but can also find, easily and at high speed, which item of each target document contains a string that matches the specific string, and to what degree.

[0006]

[Means for solving the problem] According to the present invention, the above object is achieved by providing a character-string search device for hierarchical documents that performs a fuzzy search, including incomplete matches of a specific character string, against each document and each character-string set in each document in a storage device storing one or more hierarchical documents, each hierarchical document consisting of character-string sets hierarchized into one or more stages in which the character-string set of each stage is composed of one or more character-string sets on the stage below. The device comprises a search unit that searches for the specific character string across all characters of each document and determines the degree of coincidence, and an aggregation processing unit that, for each document, aggregates the degrees of coincidence for the character-string sets of each stage starting from the lowest stage, takes the highest degree of coincidence as the degree of coincidence of that stage, and takes the degree of coincidence of the topmost stage of each document as the degree of coincidence of that document. The degree of coincidence of each document and the degree of coincidence of each character-string set of each stage of that document are displayed stepwise or simultaneously.

[0007]

[Operation] In this way, a fuzzy search is performed over all searchable characters of a plurality of hierarchical documents divided, for example, by item. The results are aggregated starting from the character-string sets of the lowest stage to obtain the search result of each stage and then the search result of each document. A single search therefore yields the search result for each stage together with the search result for each document.

[0008]

[Embodiment] A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

[0009] FIG. 1 is a block diagram showing the system configuration of a server-client type workstation system to which the present invention is applied. The system has a server 1 with a mass storage device 2, and a plurality of client machines 4 connected to the server 1 via a known network 3.

[0010] A large number of documents are stored in the storage device 2. As shown in FIG. 2, each document is a hierarchical document consisting of character-string sets hierarchized into one or more stages, in which the character-string set of each stage is composed of one or more character-string sets on the stage below. In this embodiment, it is assumed that a large number of documents of the type "Application documents" are stored. An "Application documents" document consists of "Request", "Specification", and "Abstract". "Request" in turn consists of "Document name", "Reference number", "Inventor", "Patent applicant", "Agent", and so on; "Patent applicant", for example, consists of "Identification number", "Postal code", "Address or residence", "Name", "Representative", and so on. "Specification" consists of "Document name", "Title of the invention", "Claims", "Detailed description of the invention", "Brief description of the drawings", and so on; "Detailed description of the invention", for example, consists of "Industrial field of application", "Prior art", "Problem to be solved by the invention", "Means for solving the problem", "Operation", "Embodiment", and "Effect of the invention". For each document stored in the storage device 2, autocorrelation information for each character (code) is created as a map at the time of storage, stored together with the document as a kind of index, and managed by the server 1.
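The hierarchy described for FIG. 2 can be pictured as a tree whose bottom-stage nodes hold the text. The following is only an illustrative sketch: the class name `StringSet`, the helper `section`, and the path-tuple convention are assumptions made for this example and do not come from the publication, and the item lists are the partial ones named in the text.

```python
from dataclasses import dataclass, field


@dataclass
class StringSet:
    """One character-string set; `children` are the sets one stage below."""
    name: str
    text: str = ""                                # held only by bottom-stage sets
    children: list = field(default_factory=list)

    def leaves(self, path=()):
        """Yield (path, node) for every bottom-stage set under this node."""
        here = path + (self.name,)
        if not self.children:
            yield here, self
        for child in self.children:
            yield from child.leaves(here)


def section(name, *children):
    """Hypothetical helper that keeps the tree literal below compact."""
    return StringSet(name, "", list(children))


# Items named in the embodiment (the text says "and so on", so this is partial).
doc = section(
    "Application documents",
    section("Request",
            section("Document name"), section("Reference number"),
            section("Inventor"),
            section("Patent applicant",
                    section("Identification number"), section("Postal code"),
                    section("Address or residence"), section("Name"),
                    section("Representative")),
            section("Agent")),
    section("Specification",
            section("Document name"), section("Title of the invention"),
            section("Claims"),
            section("Detailed description of the invention",
                    section("Industrial field of application"),
                    section("Prior art"),
                    section("Problem to be solved by the invention"),
                    section("Means for solving the problem"),
                    section("Operation"), section("Embodiment"),
                    section("Effect of the invention")),
            section("Brief description of the drawings")),
    section("Abstract"),
)

leaf_paths = [path for path, _ in doc.leaves()]
```

Paths are used instead of bare names because the same item name ("Document name") occurs under both "Request" and "Specification".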

[0011] As shown in FIG. 3, the server 1 is provided with a search unit 11 and an aggregation processing unit 12. The search unit 11 searches for a specific character string using the autocorrelation information described above and determines the degree of coincidence. From the search result of the search unit 11, the aggregation processing unit 12, for each document, aggregates the degrees of coincidence for the character-string sets of each stage starting from the lowest stage, takes the highest degree of coincidence as the degree of coincidence of that stage, and takes the degree of coincidence of the topmost stage of each document as the degree of coincidence of that document.
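The bottom-up aggregation performed by the aggregation processing unit 12 can be sketched as a recursive maximum over a tree. This is a minimal sketch under stated assumptions: the nested-dict tree layout, the function name `aggregate`, and the sample scores are all invented for illustration, and the publication does not say how the unit is implemented.

```python
def aggregate(name, subtree, leaf_scores):
    """Return {path: degree of coincidence} for every node of one document.

    `subtree` maps child names to their own subtrees; an empty dict marks a
    bottom-stage set. `leaf_scores` maps bottom-stage paths to the degree of
    coincidence reported by the search (missing paths count as 0).
    """
    scores = {}

    def visit(node_name, node_subtree, path):
        here = path + (node_name,)
        if node_subtree:
            # An upper-stage set takes the highest score among the sets below it.
            score = max(visit(n, s, here) for n, s in node_subtree.items())
        else:
            score = leaf_scores.get(here, 0)
        scores[here] = score
        return score

    visit(name, subtree, ())
    return scores


# Hypothetical document and scores, for illustration only.
tree = {
    "Request": {"Inventor": {}, "Agent": {}},
    "Specification": {
        "Claims": {},
        "Detailed description of the invention": {"Operation": {}, "Embodiment": {}},
    },
    "Abstract": {},
}
D = "Detailed description of the invention"
leaf_scores = {
    ("Doc", "Specification", D, "Operation"): 100,
    ("Doc", "Specification", D, "Embodiment"): 86,
    ("Doc", "Abstract"): 40,
}
scores = aggregate("Doc", tree, leaf_scores)
```

With these sample values the topmost entry `scores[("Doc",)]` is the document's degree of coincidence, so one pass over the bottom-stage results yields every stage's value.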

[0012] An outline of the operation of this embodiment is described below. When a specific character string, for example "フィードフォワード" ("feedforward"), is to be searched for from a client machine 4 across all documents stored in the storage device 2, the string is entered on the client machine 4 as the "search key", and the threshold for the degree of coincidence described later is set to, for example, 70% or more. The client machine 4 then accesses the server 1, and the search unit 11 of the server 1 searches all documents. Because the autocorrelation information of each document has been created and stored in advance as a map, as described above, a high-speed search is possible simply by creating autocorrelation information for the string "フィードフォワード" as well and collating it against the map. The speed of this search depends hardly at all on the total volume of the documents; it depends on the length of the search string.
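The publication does not define the autocorrelation information or the map format, so the following is only a stand-in to illustrate why collating a precomputed map makes the cost depend on the key length rather than on the document volume. It assumes, purely for illustration, that the map is the set of adjacent character pairs in the stored text and that the degree of coincidence is the percentage of the key's pairs found in the map; the function names are hypothetical.

```python
def build_map(text):
    """Stand-in index: the set of adjacent character pairs in `text`.

    Assumption for illustration only; the real autocorrelation map is not
    specified in this publication.
    """
    return {text[i:i + 2] for i in range(len(text) - 1)}


def degree_of_coincidence(key, char_map):
    """Percentage of the key's character pairs that occur in the map."""
    pairs = build_map(key)
    if not pairs:
        return 0.0
    hits = sum(1 for pair in pairs if pair in char_map)
    return 100.0 * hits / len(pairs)


# The map is built once, when the text is stored.
exact_map = build_map("本装置はフィードフォワード制御を行う")
variant_map = build_map("本装置はフィードフォワド制御を行う")
other_map = build_map("本装置は浮上支持を行う")

key = "フィードフォワード"
exact = degree_of_coincidence(key, exact_map)
variant = degree_of_coincidence(key, variant_map)
other = degree_of_coincidence(key, other_map)
```

Each lookup touches only the pairs of the key, so the work per text is bounded by the key length. Under this stand-in, a spelling variant still scores above a 70% threshold while unrelated text scores zero.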

[0013] The search result from the search unit 11 is sent as is to the aggregation processing unit 12. From that result, the aggregation processing unit 12, for each document, aggregates the degrees of coincidence for the character-string sets of each stage starting from the lowest stage, takes the highest degree of coincidence as the degree of coincidence of that stage, and takes the degree of coincidence of the topmost stage of each document as the degree of coincidence of that document. The result is then sent to the client machine 4, whose display, as shown in FIG. 4, first lists together the documents whose degree of coincidence is at or above the set threshold. When the operator clicks, for example, the "Sort" key in FIG. 4 with a pointing device such as a mouse, the documents are rearranged and displayed in descending order of degree of coincidence. The operator then selects one of the displayed documents, for example "Levitation support device". As shown in FIG. 5(a), the degree of coincidence for each of "Request", "Specification", and "Abstract" is then sent from the server 1 to the client machine 4 and displayed. Next, when "Specification" is selected, for example, the degree of coincidence for each of "Document name", "Title of the invention", "Claims", "Detailed description of the invention", and "Brief description of the drawings" is sent from the server 1 to the client machine 4 and displayed, as shown in FIG. 5(b). Further, when "Detailed description of the invention" is selected, for example, the degree of coincidence for each of "Industrial field of application", "Prior art", "Problem to be solved by the invention", "Means for solving the problem", "Operation", "Embodiment", and "Effect of the invention" is sent from the server 1 to the client machine 4 and displayed, as shown in FIG. 5(c). In this way, the operator can gradually narrow down the part that contains the search string "フィードフォワード". This is useful, for example, when the notation is inconsistent within one document and needs to be corrected: the "Means for solving the problem" and "Operation" parts contain the string "フィードフォワード", but the "Embodiment" part contains "フィードフォワド" and the "Effect of the invention" part contains "フィードホワード".
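The display sequence above (threshold list, optional sort, then stage-by-stage selection) can be sketched as follows. This is a minimal sketch, not the client machine's actual logic: the function names `hit_list` and `drill_down`, the path-tuple keys, and all titles and scores other than "Levitation support device" are hypothetical.

```python
def hit_list(doc_scores, threshold=70, sort=False):
    """Documents at or above the threshold, optionally in descending order."""
    hits = [(title, s) for title, s in doc_scores.items() if s >= threshold]
    if sort:  # corresponds to clicking the "Sort" key
        hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


def drill_down(stage_scores, path):
    """Scores of the character-string sets directly below `path`."""
    depth = len(path)
    return {
        p[-1]: s
        for p, s in stage_scores.items()
        if len(p) == depth + 1 and p[:depth] == path
    }


# Hypothetical per-document scores (only the first title appears in the text).
doc_scores = {
    "Levitation support device": 86,
    "Hypothetical document A": 100,
    "Hypothetical document B": 55,
}

# Hypothetical per-stage scores for the selected document.
doc = ("Levitation support device",)
stage_scores = {
    doc: 86,
    doc + ("Request",): 0,
    doc + ("Specification",): 86,
    doc + ("Abstract",): 40,
    doc + ("Specification", "Claims"): 71,
    doc + ("Specification", "Detailed description of the invention"): 86,
}

listed = hit_list(doc_scores, threshold=70, sort=True)
first_stage = drill_down(stage_scores, doc)
second_stage = drill_down(stage_scores, doc + ("Specification",))
```

Because every stage's score was already computed by the single search, each selection only looks up stored values; no further search is needed to narrow down.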

[0014] In this embodiment, the degree of coincidence of each document and the degree of coincidence of each character-string set of each stage of that document are displayed stepwise. Needless to say, they may instead be displayed simultaneously if the display allows it.

[0015] It is also easy to enter a plurality of specific character strings (search keys) on a screen such as that shown in FIG. 4 and perform a compound search under AND, OR, and AND NOT conditions. For example, suppose the strings "微分" ("differential"), "フィードフォワード" ("feedforward"), and "制御" ("control") are searched under an AND condition, and the "Prior art" part contains only "微分", the "Operation" part contains only "フィードフォワード", and the "Problem to be solved by the invention" part contains only "制御". The three strings are each searched separately, and for each bottom-stage part the degrees of coincidence are added together and divided by the number of search strings (3 in this case); the result (33% in this case) is the search result under the AND condition. For the set one stage above, "Detailed description of the invention", the degrees of coincidence of the search strings obtained from the bottom-stage character-string sets are added together and divided by the number of search strings (3 in this case); the result (100% in this case) is the search result under the AND condition. In other words, even for a compound search under AND, OR, and AND NOT conditions, only the degree of coincidence of each search string in the bottom-stage character-string sets needs to be obtained, and a single map of autocorrelation information suffices as the index. Needless to say, the way of producing compound search results under AND, OR, and AND NOT conditions is not limited to the above, and various methods are possible depending on the application. For example, the degree of coincidence may be made the same or different between the case where a further AND, OR, or AND NOT search is applied to the result of an earlier search and the case where all conditions are entered at once.
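The AND example above (33% for a single bottom-stage part, 100% for "Detailed description of the invention") can be reproduced with a short calculation. This sketch assumes that each key's score at an upper stage is the highest of its bottom-stage scores, as in the single-key aggregation, and that a key which is present scores 100 and an absent one 0; the function names and English key labels are hypothetical. Only the AND rule is shown because the text gives no formula for OR or AND NOT.

```python
KEYS = ["differential", "feedforward", "control"]

# Per-key degree of coincidence in each bottom-stage part (100 = present).
leaf_scores = {
    "Prior art": {"differential": 100, "feedforward": 0, "control": 0},
    "Operation": {"differential": 0, "feedforward": 100, "control": 0},
    "Problem to be solved by the invention":
        {"differential": 0, "feedforward": 0, "control": 100},
}


def and_score(per_key):
    """AND result: sum of the per-key scores divided by the number of keys."""
    return sum(per_key[k] for k in KEYS) / len(KEYS)


def upper_stage_and_score(leaves):
    """AND result one stage up: each key first takes its best bottom-stage score."""
    best = {k: max(scores[k] for scores in leaves.values()) for k in KEYS}
    return and_score(best)


prior_art = and_score(leaf_scores["Prior art"])        # one key out of three
detailed = upper_stage_and_score(leaf_scores)          # all three keys found below
```

Each key is searched only once against the bottom-stage sets; the compound result at every stage is derived from those per-key scores, which is why a single map suffices.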

[0016]

[Effect of the invention] As is clear from the above description, the character-string search device for hierarchical documents according to the present invention searches for a specific character string across all characters of the hierarchical documents and determines the degree of coincidence. For each document, it aggregates the degrees of coincidence for the character-string sets of each stage starting from the lowest stage, takes the highest degree of coincidence as the degree of coincidence of that stage, and takes the degree of coincidence of the topmost stage of each document as the degree of coincidence of that document. By displaying the degree of coincidence of each document and the degree of coincidence of each character-string set of each stage of that document stepwise or simultaneously, a single search yields the search result of each stage together with the search result of each document. It thus becomes possible to find, easily and at high speed, which item of each target document contains a string that matches the specific string, and to what degree.

[Brief description of drawings]

[FIG. 1] A block diagram showing the system configuration of a server-client type workstation system to which the present invention is applied.

[FIG. 2] An explanatory diagram showing the structure of a hierarchical document stored in the storage device.

[FIG. 3] A block diagram showing part of the functional configuration of the server and the client machine in the server-client type workstation system to which the present invention is applied.

[FIG. 4] An explanatory diagram showing the display state of the display screen of the client machine.

[FIG. 5] (a) to (c) are explanatory diagrams showing only the main part of FIG. 4.

[Explanation of symbols]

1 Server
2 Storage device
3 Network
4 Client machine
11 Search unit
12 Aggregation processing unit

Claims

[Claims]

1. A character-string search device for hierarchical documents, for performing a fuzzy search, including incomplete matches of a specific character string, against each document and each character-string set in each document in a storage device storing one or more hierarchical documents, each hierarchical document consisting of character-string sets hierarchized into one or more stages in which the character-string set of each stage is composed of one or more character-string sets on the stage below, the device comprising: a search unit that searches for the specific character string across all characters of each document and determines the degree of coincidence; and an aggregation processing unit that, for each document, aggregates the degrees of coincidence for the character-string sets of each stage starting from the lowest stage, takes the highest degree of coincidence as the degree of coincidence of that stage, and takes the degree of coincidence of the topmost stage of each document as the degree of coincidence of that document, wherein the degree of coincidence of each document and the degree of coincidence of each character-string set of each stage of that document are displayed stepwise or simultaneously.