WO2003046765A1

WO2003046765A1 - Method for automatically extracting related words

Info

Publication number: WO2003046765A1
Application number: PCT/JP2002/012504
Authority: WO
Inventors: Genichiro Sueki; Hiroaki Fujiki; Naoko Yoshino; Kazuko Adachi
Original assignee: Mitsubishi Space Software Co., Ltd.
Priority date: 2001-11-30
Filing date: 2002-11-29
Publication date: 2003-06-05
Also published as: JP3553543B2; JP2003167894A

Abstract

A document group in a field specified by a user is stored in a database (1). An important word analysis unit (2) selects important words of high importance from the document group in the database (1). A count unit (3) creates a count list as statistical information on an important word or an important word pair. Based on this count list, a related word extraction unit (4) judges the degree of relevance between important words.

Description

Specification

Related word automatic extraction method

The present invention relates to an automatic related word extraction method for automatically extracting words closely related to a word specified by a user, based on statistical information on words included in a database. More specifically, it relates to an automatic related word extraction method and a related word automatic extraction device that enable extraction of technical terms, new words, and buzzwords that appear in a specific field designated by the user and are not listed in general existing thesaurus dictionaries.

Background art

A conventional related word automatic extraction device has an existing thesaurus dictionary as an internal component; it simply looks up the word specified by the user in the thesaurus dictionary and displays the result as the related word extraction result. Such devices therefore have the disadvantage that technical terms, new words, and buzzwords not listed in existing thesaurus dictionaries cannot be extracted, regardless of their importance.

Also, when related words in multiple fields were required, it was necessary to prepare a thesaurus individually for each field, which was wasteful in terms of cost.

In addition, conventional methods that automatically extract related words from statistical data, without using an existing thesaurus dictionary, commonly use only the appearance frequency of words that appear alone.

Therefore, even when a document database containing technical terms, new words, and buzzwords is used, the extraction accuracy of such methods is limited, and it has been difficult to extract exactly the related words the user wants.

Disclosure of the invention

The present invention has been made to solve the above-mentioned problems of the prior art. Its purpose is to provide a related word automatic extraction method and a related word automatic extraction device that can automatically extract technical terms, new words, and buzzwords that appear in a specific field specified by a user and are not listed in general existing thesaurus dictionaries, and that can accurately extract important words closely related to the words specified by the user.

In order to solve this problem, the first invention is characterized by using an automatic related word extraction method that uses a group of documents in a field designated by a user as a database, selects important words, which are words of high importance, from the documents in the database, and calculates the degree of relevance between important words using statistical information on important words or on pairs of important words. Here, importance refers to the degree to which a word represents the characteristics of the content of a document, or the characteristics of the genre of the document.

According to this, it becomes possible to provide a method for automatically extracting technical terms, new words, and buzzwords that appear in a specific field specified by a user and are not listed in a general thesaurus, and an apparatus using the method.

According to a second aspect of the present invention, in addition to the configuration of the first aspect, when document groups of a plurality of fields are stored in the database, the related words of each field can be automatically extracted.

According to this, in addition to the effect of claim 1, related words specific to each field can be extracted; for example, for the same word, a given word may be a related word in one field but not in another. In addition, because users can define their own fields independently of the fields covered by existing thesaurus dictionaries, related words can be extracted according to the level of the field that was set.

According to a third invention, in addition to the configuration of the first or second invention, the database can be updated or added to at any time, and the difference data is successively reflected when related words are automatically extracted.

According to this, in addition to the effects of the first or second invention, it is possible to always extract the latest related words including new words and buzzwords that reflect the latest database information.

According to a fourth aspect of the present invention, in addition to the configuration of any one of claims 1 to 3, whether documents in the database are the same document is determined using one piece of document header information, and when the same document is included more than once, one copy is kept and the other copies are removed.

According to this, in addition to the effect of any one of claims 1 to 3, the unnecessary bias in the statistical information that arises when a specific document has many identical copies can be removed, and as a result related word extraction accuracy can be improved.

According to a fifth aspect of the present invention, in addition to the configuration of any one of claims 1 to 4, the important words are compound words created from the morphemes obtained by dividing the documents in the database into parts of speech.

According to this, in addition to the effect of any one of claims 1 to 4, it is possible to avoid word abstraction due to division, and to improve the accuracy of related words that are finally extracted.

According to a sixth aspect of the present invention, in addition to any one of the first to fifth aspects, the important words are words whose part of speech is expected to represent the characteristics of each document in the database.

According to this, in addition to the effect of any one of the first to fifth aspects, omission of important words to be extracted can be reduced.

In the seventh invention, in addition to any one of claims 1 to 6, words to be excluded from the important words are held as an exclusion list, and after the important words are extracted, the words in the exclusion list are excluded from the important words.

According to this, in addition to the effect of any one of claims 1 to 6, unnecessary words can be eliminated.

According to an eighth aspect of the present invention, in addition to the configuration of any one of claims 1 to 7, important words having the same meaning are held as a same word list, and when the important words are extracted, the statistical information on the words in the same word list is stored together.

According to this, in addition to the effect of any one of claims 1 to 7, the extraction accuracy of important words can be improved.

According to a ninth invention, in addition to the configuration according to any one of claims 1 to 8, the statistical information is the total number of appearances of an important word in the database and the ratio of the number of documents in the database that contain the important word.

According to this, in addition to the effect of any one of claims 1 to 8, the extraction accuracy can be improved.

According to a tenth aspect of the present invention, in addition to the configuration of claim 9, the statistical information uses not only the frequency with which an important word appears alone in the documents in the database but also the frequency with which a plurality of important words appear within a certain range.

According to this, in addition to the effect of the ninth aspect, the meaning can be more accurately determined by a plurality of pairs of important words, and as a result, related word extraction accuracy can be improved.

In the eleventh invention, in addition to the configuration of claim 9, surface expressions included in the documents in the database are automatically extracted, and, in addition to the statistical information, the upper and lower hierarchical relationship of important words automatically constructed from the surface expressions is used.

According to this, in addition to the effect of claim 9, noise caused by a plurality of unrelated important words appearing together by chance can be removed, and as a result the extraction accuracy can be improved.

According to a twelfth aspect of the present invention, in addition to the configuration of any one of claims 1 to 11, when calculating the statistical information, a plurality of different search condition expressions are created and set separately on the processors of a massively parallel computer having a plurality of different processors; a full-text search of the document group stored in the database is performed simultaneously and in parallel with the plurality of different search condition expressions; and the results that match the search condition expressions are used.

According to this, in addition to the effect of any one of claims 1 to 11, because the statistical information is calculated with a plurality of different search condition expressions each time the related word automatic extraction method is applied, accurate statistical information corresponding to the latest database can be used, and as a result related word extraction accuracy can be improved.

A thirteenth aspect of the present invention is a related word automatic extraction device comprising: a database section that stores a document group in a field designated by a user; an important word analysis unit that extracts and selects important words included in the database section; a counting unit that obtains statistical information on the important words selected by the important word analysis unit and information on the hierarchical relationship of the important words; and a related word extraction unit that calculates the degree of relevance between important words using a count list generated by the counting unit. It is characterized in that this series of processes uses the related word automatic extraction method according to claim 1.

According to this, the user can accurately extract related words desired by the user, such as technical terms, new words, and buzzwords, without being aware of the internal structure of the related word automatic extraction device.

The fourteenth invention is a program for extracting a plurality of important words, which automatically extracts a plurality of important words using not only the number of appearances of each important word alone in the documents in the database but also the number of appearances of a plurality of important words within a certain range. The documents in the database are read one by one; important words are searched for in the document; and a search is made as to whether another important word exists within a predetermined range from each important word found. When an important word existing within the range is found, the important word pair is stored in the count list as follows: the pair is searched for in the count list created so far; if the same pair is already in the count list, the count list is updated by adding 1 to its occurrence count; if it is not in the count list, the count of the pair is set to 1 and the pair is saved in the count list. These processes are performed for a plurality of documents specified in advance in the database, and the importance of each pair of important words is determined based on the resulting count list.

According to this, meaning can be judged on a reasonable basis using pairs of important words, and as a result the accuracy of related word extraction can be improved.

A fifteenth invention is a program for extracting the upper and lower hierarchical relationship of important words, which automatically extracts surface expressions included in the documents in a database and uses the upper and lower hierarchical relationship of important words automatically constructed from the surface expressions. The documents in the database are read one by one; the surface expressions listed in a surface expression list created in advance are extracted from the document; and a search is made as to whether the important words extracted by the important word analysis unit 2 are included in the upper word part and the lower word part of each extracted surface expression. When important words are found in both the upper word part and the lower word part, the pair of upper and lower important words found is stored in the count list as follows: if the same pair already exists in the count list, the count list is updated by adding 1 to its occurrence count; if not, the count of the pair is set to 1 and the pair is saved in the count list. These processes are performed for a plurality of documents specified in advance in the database, and the upper and lower hierarchical relationship of the important words is constructed based on the resulting count list.

According to this, noise caused by a plurality of unrelated important words appearing together by chance can be removed on a reasonable basis.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a block diagram of a related word automatic extraction device according to an embodiment of the present invention.

FIG. 2 is a conceptual diagram of an important word list used in the related word automatic extraction device according to the embodiment.

FIG. 3 is a conceptual diagram of a count list used in the automatic related word extracting apparatus according to the embodiment.

FIG. 4 is a conceptual diagram of a relevance judgment list created based on the count list of FIG. 3 and the important word list of FIG. 2.

FIG. 5 is a flowchart showing a procedure for extracting a plurality of important words within a certain range in the method for automatically extracting related words according to the embodiment.

FIG. 6 is a flowchart showing a procedure for extracting upper and lower hierarchical relationships of important words in the related word automatic extraction method according to the embodiment.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, the present invention will be described in detail based on the illustrated embodiment.

As shown in FIG. 1, the related word automatic extraction device includes a database section 1 that stores documents in a field designated by a user, an important word analysis section 2 that extracts and selects important words contained in the database section 1, a counting unit 3 that obtains statistical information on the important words selected by the important word analysis unit 2 and hierarchical relationship information on the important words, and a related word extraction unit 4 that calculates the degree of relevance between important words using the count list generated by the counting unit 3. The device selects important words, which are words of high importance, from the documents in the database section 1, and calculates the degree of relevance between important words using statistical information on important words or on pairs of important words.

The database unit 1 includes a same document determination function unit 11, which identifies identical documents in the input document group and, when the same document is included more than once, keeps one copy and removes the other copies, and a database 12, which stores the documents from which duplicates have been removed by the same document determination function unit 11.

Hereinafter, the same document determination function unit 11 will be described in detail.

For example, assuming that the documents in the database 12 are patent documents, the "name of the applicant", "title of the invention", and "names of the inventors" are extracted from the header of each patent document, and the following conditions are checked:

(1) The names of the applicants are the same.
(2) The titles of the inventions are the same.
(3) The number of inventors is the same, and the names of the inventors are all the same (in any order).

All documents that meet all of the above conditions (1) to (3) are regarded as the same document.
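The three conditions above can be sketched as a key comparison. This is a minimal illustration, not the device's implementation: the `Header` record, its field names, and `remove_same_documents` are hypothetical names introduced here.

```python
from typing import NamedTuple


class Header(NamedTuple):
    # Hypothetical header record; field names are assumptions for illustration.
    applicant: str
    title: str
    inventors: tuple


def header_key(header):
    # Sorting the inventor names makes condition (3) order-independent,
    # while the tuple length still enforces "same number of inventors".
    return (header.applicant, header.title, tuple(sorted(header.inventors)))


def remove_same_documents(headers):
    """Keep the first document of each group of identical documents."""
    seen = set()
    kept = []
    for header in headers:
        key = header_key(header)
        if key not in seen:
            seen.add(key)
            kept.append(header)
    return kept
```

Documents whose headers differ only in the order of inventor names map to the same key, so only the first one is kept.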

The important word analysis unit 2 includes a morphological analysis unit 21 and an important word extraction unit 22.

The morphological analysis unit 21 divides the documents in the database into morphemes by morphological analysis and acquires part-of-speech information.

The important word extraction unit 22 creates compound words by applying compound word processing, for example combining consecutive nouns, to the morphemes that the morphological analysis unit 21 has divided by part of speech. Each compound word is stored as an important word in the important word list together with its part-of-speech information and statistical information. By creating compound words, it is possible to avoid the abstraction of words caused by division and to improve the accuracy of the related words finally extracted.

Important words are not limited to compound words created by the above method; words whose part of speech is considered to characterize the content of each document in the database 12, such as common nouns other than compound words, proper nouns, and undefined words, may also be used.
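As an illustration of combining consecutive nouns, the sketch below assumes the morphological analyzer returns `(surface, part_of_speech)` pairs and that nouns are tagged `"noun"`. Both the input format and the function name are assumptions, and no particular analyzer is implied.

```python
def build_compound_words(morphemes, sep=""):
    """Join each run of two or more consecutive nouns into one compound word.

    morphemes: list of (surface, part_of_speech) pairs (assumed format).
    sep: separator used when joining; "" suits Japanese, " " suits English.
    """
    compounds = []
    run = []
    for surface, pos in morphemes:
        if pos == "noun":
            run.append(surface)
            continue
        if len(run) >= 2:
            compounds.append(sep.join(run))
        run = []
    # Flush a noun run that reaches the end of the document.
    if len(run) >= 2:
        compounds.append(sep.join(run))
    return compounds
```

Single nouns are not returned by this sketch; they would be handled by the separate part-of-speech rule for common nouns and proper nouns.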

Also, a function may be added as necessary that holds the words to be excluded from the important words as an exclusion list and, after the important words are extracted, excludes the words in the exclusion list from the important words. Specifically, for each genre of documents in the database, words that cannot characterize the content of a document, such as "inventor" and "comparative example" for patent documents, could be registered in the exclusion list.

In addition to words that are excluded only when they match completely, morpheme by morpheme, this exclusion list may include words that are excluded when they match partially. In addition, important words with the same meaning may be stored in a same word list, and when the important words are extracted, the statistical information on the words in this same word list is saved together, which can improve the extraction accuracy of important words.
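The exclusion list and the same word list can be sketched as a filtering and merging step over word counts. The function name, the split into exact and partial exclusion sets, and the mapping from variant to representative word are all assumptions made for this example.

```python
from collections import Counter


def apply_word_lists(counts, exact_exclusions, partial_exclusions, same_words):
    """Drop excluded words and pool the counts of words with the same meaning.

    counts: mapping of important word -> occurrence count.
    exact_exclusions: words dropped only on a complete match.
    partial_exclusions: strings that drop any word containing them.
    same_words: mapping of variant -> representative word (assumed format).
    """
    merged = Counter()
    for word, count in counts.items():
        if word in exact_exclusions:
            continue
        if any(part in word for part in partial_exclusions):
            continue
        # Words in the same word list share one entry for their statistics.
        merged[same_words.get(word, word)] += count
    return merged
```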

Figure 2 is a conceptual diagram of the important word list.

Here, the "statistical information" stored in the important word list consists of the total number of occurrences 25 of the important word 23 in the database and the ratio 24 of the number of documents in the database that contain the important word. These form the basis of the various statistics used later in the counting unit 3 and the related word extraction unit 41.
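The two statistics kept per important word can be computed as follows. This is a small sketch that assumes each document has already been reduced to a list of words; the function name is hypothetical.

```python
def important_word_statistics(documents, word):
    """Return (total occurrences, ratio of documents containing the word).

    documents: list of documents, each a list of words (assumed format).
    """
    total_occurrences = sum(doc.count(word) for doc in documents)
    containing = sum(1 for doc in documents if word in doc)
    ratio = containing / len(documents) if documents else 0.0
    return total_occurrences, ratio
```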

In order to obtain the number of documents in the database 12 that contain an important word, a plurality of different search condition expressions corresponding to the respective important words are created and set separately on the processors of a massively parallel computer having a plurality of different processors. A full-text search of the document group stored in the database 12 is then performed simultaneously and in parallel with the plurality of different search condition expressions, and the results obtained can be used. Here, the number of results that match each search condition expression is the number of documents in the database 12 that contain the corresponding important word. The accuracy of the statistical information can be maintained by performing this full-text search each time the important word analysis unit 2 performs its processing.

The massively parallel computer incorporates thousands to tens of thousands of processors (hereinafter collectively referred to as a pipeline), so that a plurality of different search condition expressions can be set in the pipeline simultaneously. A full-text search is performed by operating this large number of processors simultaneously and matching the plurality of search conditions against the database. When the matching finds a document that satisfies a search condition, the document is regarded as a hit.
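The idea of running one search condition per processor can be imitated in software with a worker pool. The sketch below is only an analogy: it swaps the dedicated hardware for Python threads, treats each search condition as a plain substring match, and uses hypothetical function names.

```python
from concurrent.futures import ThreadPoolExecutor


def count_hits(documents, word):
    # One "search condition": a document is a hit if it contains the word.
    return sum(1 for text in documents if word in text)


def document_counts(documents, words, max_workers=4):
    """Evaluate every search condition concurrently; return word -> hit count."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        hits = pool.map(lambda word: count_hits(documents, word), words)
        return dict(zip(words, hits))
```

Each hit count is the number of documents containing the corresponding important word, which is the quantity the full-text search is used to obtain.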

The massively parallel computer is desirably a device such as a full-text search engine manufactured by Paracel Corporation (for example, FDF (registered trademark) 4 TextFinder).

The counting unit 3 includes an extracting unit 31 for extracting a plurality of important words within a certain range, and an extracting unit 32 for extracting the upper and lower hierarchical relationship between important words.

In the related word automatic extraction method, the user selects in advance either the extraction unit 31 for a plurality of important words within a certain range or the extraction unit 32 for the upper and lower hierarchical relationship of important words, and only the selected processing is performed.

The extraction unit 31 for a plurality of important words within a certain range takes each important word extracted by the important word analysis unit 2 as a reference, and when another important word exists within a predefined range from the reference, defines the two as a plurality of important words. The number of occurrences of each such plurality of important words is counted and saved as a count list. The procedure for extracting a plurality of important words is shown in the flowchart of FIG. 5, and the details will be described later.

The extraction unit 32 for the upper and lower hierarchical relationship of important words defines in advance surface expressions in which the relationship between upper and lower words is clearly expressed, and extracts those surface expressions that include the important words extracted by the important word analysis unit 2. The important words in an extracted surface expression are defined as upper and lower important words, and the count of their occurrences is stored as a count list. The procedure for extracting the hierarchical relationship of important words is shown in the flowchart of FIG. 6, and the details will be described later.

In FIG. 1, the related word extracting unit 4 includes a related word extracting unit 41. The related word extraction unit 41 performs related word determination based on the count list created by the counting unit 3. For example, to determine the dissimilarity between two words, judgment indices such as Information Radius (Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, The MIT Press (ISBN 0-262-13360-1)) can be used. Alternatively, when the extraction unit 31 for a plurality of important words within a certain range is selected, pairs of important words that share a common important word within a certain range can be determined to be related words; when the extraction unit 32 for the upper and lower hierarchical relationship of important words is selected, pairs of important words that share a common lower important word can be determined to be related words.

FIG. 3 is a conceptual diagram of the count list, in which the ID 33 of important word 1, the ID 34 of important word 2, and the number of occurrences 35 of the pair of important word 1 and important word 2 are the list items.

The words A, B, C, D, ... arranged in the columns of FIG. 4 are the related word judgment target words (important words) 42 that are subjected to related word judgment, and the words a, b, c, d, ... arranged in the rows are the related word judgment use words (important words) 43 that are used in the related word judgment. Basically, every column and every row is an important word extracted by the important word analysis unit 2; for each important word pair extracted by the counting unit 3, one word is placed in a column and the other in a row. For example, for important word pairs that exist within a certain range (FIG. 5), important word A is placed in a column and important word B in a row. For upper and lower important word pairs (FIG. 6), the upper important words are placed in columns and the lower important words in rows. In the relevance judgment list of FIG. 4, the number in each cell indicates an appearance probability. For example, the cell in row c, column A holds the "probability that important word A and important word c appear within a certain range" or the "probability that important word A is an upper word and important word c is a lower word".

Hereinafter, as an example of related word determination, a case will be described in which the Information Radius judgment index is used to determine the dissimilarity between two words.

The statistic is the "dissimilarity between two words" calculated from these appearance probabilities, and it is calculated for all pairs of the uppercase letters of the columns (A and B, A and C, A and D, ..., B and C, B and D, ..., C and D, ...). Taking the determination of the degree of relevance between important words A and D as an example, the dissimilarity is calculated from the appearance probabilities of a, b, c, d, ... for A and the appearance probabilities of a, b, c, d, ... for D. When the appearance probabilities are the same in all rows (row a, column A = row a, column D; row b, column A = row b, column D; row c, column A = row c, column D; row d, column A = row d, column D; ...), the dissimilarity is 0; that is, the similarity between A and D is at its maximum, and therefore the relevance between the important words A and D is at its maximum. Conversely, when there is no word among a, b, c, d, ... for which both appearance probabilities are non-zero, the dissimilarity is at its maximum; that is, the relevance is at its minimum. As described above, the statistic is calculated for all pairs of uppercase letters, and only pairs whose statistic is below a certain threshold are judged to be related words.
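The column-by-column comparison can be sketched with the Information Radius as defined by Manning and Schütze, IRad(p, q) = D(p || m) + D(q || m) with m = (p + q) / 2. The sketch assumes each column of the relevance judgment list has been normalized so that its probabilities sum to 1; the function names, the dictionary layout, and the threshold value are illustrative assumptions.

```python
import math
from itertools import combinations


def information_radius(p, q):
    """Information Radius in bits between two columns.

    p, q: dicts mapping row words to appearance probabilities (assumed layout).
    Returns 0 for identical columns and 2 when no row word is shared.
    """
    total = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mean = (pw + qw) / 2
        if pw > 0:
            total += pw * math.log2(pw / mean)
        if qw > 0:
            total += qw * math.log2(qw / mean)
    return total


def related_pairs(columns, threshold):
    """Return the column-word pairs whose dissimilarity is below the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(columns), 2)
        if information_radius(columns[a], columns[b]) < threshold
    ]
```

With base-2 logarithms, identical columns give 0 and columns with no common row word give the maximum of 2, which matches the two extreme cases described above.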

FIG. 5 is a flowchart showing a procedure for counting the number of simultaneous appearances of a plurality of important words existing within a certain range in the related word automatic extraction method according to the embodiment of the present invention.

First, the documents in the database are read one by one (step S1), and the key words extracted by the key word analysis unit 2 are searched from the documents (step S2).

The important words to be searched here are not limited to those extracted by the important word analysis unit 2; in some cases they may be words included in a user-defined important word list defined by the user. In addition to words whose search condition is a perfect match, the user-defined important word list may include words that are searched for when they match partially.

Furthermore, as criteria for judging the importance of the words to be searched, filters based on the total number of occurrences in the database, the ratio of the number of documents in the database that contain the important word, and the number of characters may be applied to the important words to be searched, as necessary. By applying these various filters, the important words can be narrowed down further, and as a result the accuracy of the related words finally extracted can be improved.

When an important word is found (YES in step S3), a search is made as to whether another important word (referred to as important word B) exists within a predetermined range from the important word found (referred to as important word A) (step S4).

The term "within a certain range" means, for example, within one sentence (the range from the beginning of a sentence to the period "."), or within a neighborhood of two before and after the word; however, it is not limited to these, and a range that is expected to represent the characteristics of each document is specified.

When an important word B existing within a certain range from the important word A is found (YES in step S5), the pair of the important word A and the important word B is stored in the count list as follows.

The key word A and the key word B are searched for from the already created count list (step S6), and when the same pair already exists in the count list (when YES is determined in step S7) Then, the count list is updated by adding one to the count of the number of appearances (step S8).

If it does not exist in the count list (NO in step S7), the count of the pair of the important word A and the important word B is set to 1 and the pair is newly stored in the count list (step S9).

As described above, the processing from step S1 to step S9 is performed for a plurality of documents designated in advance in the database (step S10).
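Steps S1 to S10 can be sketched as follows, taking "one sentence" as the range and substring matching as the search. Counting each unordered pair once per sentence, splitting sentences on ".", and the function name are simplifying assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations


def count_pairs_within_sentence(documents, important_words):
    """Build the count list of important word pairs that share a sentence."""
    count_list = Counter()
    for document in documents:                      # S1: read one document
        for sentence in document.split("."):        # range = one sentence
            found = sorted(                         # S2-S5: words in range
                word for word in important_words if word in sentence
            )
            for pair in combinations(found, 2):
                # S6-S9: Counter starts a new pair at 1 and otherwise adds 1.
                count_list[pair] += 1
    return count_list                               # S10: all documents done
```

Sorting the words found gives every pair a canonical order, so (A, B) and (B, A) are counted as the same entry.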

After that, based on the count list created in steps S1 to S10 and the statistical information in the important word list, the importance of each pair of important word A and important word B is determined (step S11). In step S11, for example, the Dice coefficient or mutual information can be used.
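For step S11, the Dice coefficient and pointwise mutual information can be written directly from the counts. The sketch assumes the pair count, the single-word counts, and the total are all measured over the same unit (for example, sentences); the function and parameter names are hypothetical.

```python
import math


def dice_coefficient(pair_count, count_a, count_b):
    """Dice coefficient: 2 * f(A, B) / (f(A) + f(B))."""
    if count_a + count_b == 0:
        return 0.0
    return 2 * pair_count / (count_a + count_b)


def pointwise_mutual_information(pair_count, count_a, count_b, total):
    """PMI in bits: log2( P(A, B) / (P(A) * P(B)) ), from raw counts."""
    if pair_count == 0 or count_a == 0 or count_b == 0:
        return float("-inf")
    return math.log2(pair_count * total / (count_a * count_b))
```

For example, a pair seen together twice, with each word seen four times out of sixteen units, gives a Dice coefficient of 0.5 and a PMI of 1 bit.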

FIG. 6 is a flowchart showing a procedure for extracting upper and lower hierarchical relationships of important words in the related word automatic extraction method according to the embodiment of the present invention.

First, the documents in the database are read one by one (step S21), and the surface expressions listed in a surface expression list created in advance are extracted from the document (step S22).

Here, the surface expressions listed in the surface expression list are those in which the relationship between a broader (upper) word and narrower (lower) words is clearly expressed. For example, in the expression "D such as A, B, C" (where A to D are important words), the upper word is D and the lower words are A, B, and C.

Next, when a surface expression has been extracted (YES in step S23), a search is made as to whether the important words extracted by the important word analysis unit 2 are included in the upper word part and the lower word part of the surface expression extracted in step S22 (step S24).

Here, the important words to be searched are not limited to those extracted by the important word analysis unit 2, and may be words included in a user-defined important word list defined in advance by a user in some cases. In addition, the user-defined important word list may include words that are to be searched if they partially match, in addition to words for which a perfect match is a search condition.

When an important word is found in both the upper word part and the lower word part by this search (when YES is determined in step S25), the searched upper and lower important word pairs are sequentially stored in the count list. I do. At this time, as a judgment scale of the importance of the upper and lower key words, a comparison of the ratio of the number of documents containing the upper and lower keywords in the database 12, a comparison of the morphemes of the upper and lower keywords, The upper and lower key word pairs that are always excluded are retained as upper and lower key word pair exclusion lists. The function of excluding upper and lower key word pairs in the upper and lower key word pair exclusion list may be applied as necessary.

Good.

The upper and lower important word pair is searched for in the already created count list (step S26), and if the same pair already exists in the count list (when YES is determined in step S27), the count list is updated by adding 1 to the occurrence count of that pair (step S28).

If it does not exist in the count list (when it is determined as NO in step S27), the count of the upper and lower important word pairs is set to 1 and is newly stored in the count list (step S29).

As described above, the processing from step S21 to step S29 is performed for a plurality of documents specified in advance in the database (step S30).
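The counting loop of steps S21 to S30 can be summarized in a short sketch. The function name and the `extract_pairs` callable are hypothetical stand-ins for the surface expression extraction and important word search of steps S22 to S25; only the count list update follows the procedure described above.

```python
def build_count_list(documents, extract_pairs):
    """Count how often each (upper, lower) important word pair occurs.

    documents: iterable of document strings, read one by one (step S21).
    extract_pairs: hypothetical callable returning the (upper, lower) pairs
                   found in one document (steps S22 to S25).
    """
    count_list = {}
    for document in documents:                 # S21, repeated until S30 is done
        for pair in extract_pairs(document):
            if pair in count_list:             # S26 and S27: pair already stored?
                count_list[pair] += 1          # S28: add 1 to the occurrence count
            else:
                count_list[pair] = 1           # S29: store the new pair with count 1
    return count_list
```

A plain dictionary is used here so that the two branches mirror steps S28 and S29 directly; a `collections.Counter` would give the same result with less code.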

Thereafter, an upper/lower hierarchical relationship of the important words is constructed based on the statistical information in the count list and the important word list created in steps S21 to S30 (step S31).

Specifically, for example, suppose that upper important words A and B having a common lower important word C are extracted, and that the pair of A (upper) and B (lower) is extracted as well. In this case, the only pairs in a direct hierarchical relationship are the A (upper) to B (lower) pair and the B (upper) to C (lower) pair; the A (upper) to C (lower) pair is merely redundant. Therefore, when constructing the hierarchical relationship of the important words, the redundant A to C pair is excluded.
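The exclusion of the redundant A to C pair can be sketched as below. The function name is hypothetical, and the sketch only detects redundancy through a single intermediate word, as in the A, B, C example; it assumes the pairs contain no cycles, and longer chains would require a full transitive reduction.

```python
def remove_redundant_pairs(pairs):
    """Drop every (upper, lower) pair that is implied by a two-step chain.

    A pair (a, c) is treated as redundant when some b exists such that
    (a, b) and (b, c) are both present.
    """
    pairs = set(pairs)
    lowers_of = {}
    for upper, lower in pairs:
        lowers_of.setdefault(upper, set()).add(lower)
    redundant = {
        (a, c)
        for a, b in pairs
        for c in lowers_of.get(b, ())
        if (a, c) in pairs
    }
    return pairs - redundant


# A is above B and C, and B is above C, so the A to C pair is dropped.
print(sorted(remove_redundant_pairs([("A", "B"), ("B", "C"), ("A", "C")])))
```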

In constructing the upper/lower hierarchical relationship, a threshold may be set on the total number of occurrences of each upper and lower important word pair in the database, and pairs whose occurrence count is below the threshold may be excluded as necessary.

Industrial applicability

According to the present invention, in a related word automatic extraction method for automatically extracting words closely related to a word specified by a user based on statistical information of words included in a database, technical terms appearing in a specific field designated by the user, as well as new words and buzzwords, can be extracted even when they are not described in general existing thesaurus dictionaries. The invention can therefore be used effectively as a related word automatic extraction method and as a related word automatic extraction device that implements the method.

Claims

1. A related word automatic extraction method characterized by using a group of documents in the field specified by the user as a database, selecting important words, which are words of high importance, from the documents in the database, calculating the degree of relevance between important words using statistical information on the important words or pairs of important words included in the database, and extracting related words.

2. The related word automatic extraction method according to claim 1, wherein, when a document group in a plurality of fields is stored in the database, a related word for each field can be automatically extracted.

3. The related word automatic extraction method according to claim 1, wherein the database can be updated and added to at any time, and the difference data is reflected sequentially in the automatic related word extraction.

4. The related word automatic extraction method according to claim 1, wherein, for the group of documents in the database, header information of each document is used to determine whether the same text is recorded, and when a plurality of identical documents are included, one document is kept and the other identical documents are deleted.

5. The related word automatic extraction method according to any one of claims 1 to 4, wherein the important word is a compound word created from the morphemes obtained by dividing a document in the database by part of speech.

6. The related word automatic extraction method according to any one of claims 1 to 5, wherein the important word is a part of speech that is predicted to represent a feature for each document in the database.

7. The related word automatic extraction method wherein words to be excluded from the important words are retained as an exclusion list, and, after the important words are extracted, the words in the exclusion list are excluded from the important words.

8. The related word automatic extraction method according to any one of claims 1 to 7, wherein important words having the same meaning are retained as a same-word list, and, when the important words are extracted, statistical information on the words in the same-word list is stored collectively.

9. The related word automatic extraction method according to any one of claims 1 to 8, wherein the statistical information is a total number of occurrences in the database and a ratio of the number of documents including the important word in the database.

10. The related word automatic extraction method according to claim 9, wherein the statistical information uses the number of appearances of a plurality of important words within a certain range in addition to the number of single appearances of an important word included in the documents in the database.

11. The related word automatic extraction method wherein, in addition to the statistical information, surface expressions included in the documents in the database are automatically extracted, and the upper and lower hierarchical relationships of important words automatically constructed from the surface expressions are used.

12. The related word automatic extraction method according to any one of claims 1 to 11, wherein, in calculating the statistical information, a plurality of different search condition expressions are created and set separately on the different processors of a massively parallel computer having a plurality of different processors, the document group stored in the database is full-text searched simultaneously and in parallel with the plurality of different search condition expressions, and the results matching the search condition expressions are used.

13. An automatic related word extraction apparatus comprising: a database unit that stores a document group in a field designated by a user; an important word analysis unit that extracts and selects important words included in the database; a counting unit that obtains statistical information on the important words selected by the important word analysis unit and upper and lower hierarchical relation information of the important words; and a related word extraction unit that calculates a degree of association between the important words using the count list generated by the counting unit, wherein the series of processes uses the related word automatic extraction method according to claim 1.

14. A multiple important word extraction program that automatically extracts multiple important words using the number of appearances of a plurality of important words within a certain range in addition to the number of single appearances of important words included in the documents in the database, wherein:

the documents in the database are read one by one; important words are searched for in each document; a search is made as to whether another important word exists within a predetermined range defined from each found important word; when an important word existing within the range is found, the important word pair is sequentially stored in the count list; the important word pair is searched for in the already created count list; if the pair exists in the count list, the count list is updated by adding 1 to its occurrence count; if the pair does not exist in the count list, the count of the important word pair is set to 1 and the pair is newly stored in the count list; these processes are performed on a plurality of documents specified in advance in the database; and the importance of each important word pair is determined based on the created count list.

15. An important word upper and lower hierarchical relationship extraction program that automatically extracts the surface expressions contained in the documents in the database and uses the upper and lower hierarchical relationships of the important words automatically constructed from the surface expressions, wherein:

the documents in the database are read one by one; the surface expressions written in a surface expression list created beforehand are extracted from the documents; a search is made as to whether the important words extracted by the important word analysis unit 2 are included in the upper word part and the lower word part of each extracted surface expression; when important words are found in both the upper word part and the lower word part, the found upper and lower important word pairs are sequentially stored in the count list; if the same important word pair already exists in the count list, the count list is updated by adding 1 to its occurrence count; if the pair does not exist in the count list, the count of the important word pair is set to 1 and the pair is saved in the count list; these processes are performed for a plurality of documents specified in advance in the database; and the upper and lower hierarchical relationship of the important words is constructed based on the created count list.