WO1999014690A1 - Keyword adding method using link information

Info

Publication number: WO1999014690A1
Application number: PCT/JP1997/003280
Authority: WO
Inventors: Hisao Mase; Hiroshi Tsuji
Original assignee: Hitachi, Ltd.
Priority date: 1997-09-17
Filing date: 1997-09-17
Publication date: 1999-03-25

Abstract

Keyword candidates are extracted from a target document and from the documents linked to it, and the candidates are integrated to determine the keywords of the target document. The target document is classified into a category by comparing these keywords with the keywords held for classification. When a document is displayed, the keywords of each document linked to it, or the access frequency of each linked document (or an object corresponding to it), are displayed alongside it.

Description

Keyword assignment method using link information

The present invention relates to a method that automatically extracts keywords characterizing the content of a document from that document and from other related documents, classifies the document by content based on the extracted keywords, and displays the content of a retrieved document. In particular, it relates to a method for extracting appropriate keywords from document information scattered on a network.

Background art

As a method of classifying document information into categories according to its content, (1) a method of extracting keywords from a document and determining an appropriate category based on their tendency of appearance is commonly used.

In addition, methods for efficiently accessing desired information include: (2) a method of linking related documents to a document and following those links as appropriate;

(3) a method using a document search system, in which the user inputs search conditions such as keywords, dates, and creators, and a list of information meeting the conditions is displayed; and (4) a genre search method, in which each document is categorized in advance according to its content and the user navigates a genre system to narrow down the documents. As ways of navigating the genre system in (4), the genre system may be displayed level by level from the top so that the user descends top-down, or the user may be allowed to select a genre directly.

In the above-described conventional technology, the following problems exist.

(1) Document classification based on keywords requires that the document contain a certain amount of text. Appropriate keywords cannot be extracted from documents that lack text, so such documents cannot be classified by content. In addition, since keyword extraction accuracy greatly affects classification accuracy, more accurate extraction and classification can be expected when keywords are selected from as many angles and judgment factors as possible.

(2) When a user follows links to search a group of documents related to a given document, the only clue for judging whether the linked document contains the information the user wants (or lies on the path to it) is the anchor at the link source (a phrase indicating that the document refers to another document). As a result, the user often follows a link only to find that the displayed document does not contain the needed information. The resulting loss of search efficiency through trial and error raises costs, such as line usage fees, for users who access information over a telephone line or the like.

(3) When a user accesses certain information for the first time, it is difficult for the user to specify the keyword because the user does not know the content or structure of the information. Also, depending on the description of the search condition, a large amount of search results may be displayed, and it is necessary to perform the search many times while changing the search condition.

(4) To narrow down by genre, the user must first determine, by some means, which genre in the genre system the desired document belongs to. The desired information may therefore not be included in the selected genre. In addition, when information related to the desired information exists in a genre that was not selected, there is no clue for reaching that related information in another genre by following links from the desired information. A trial-and-error search is then required, and the same problem as in (2) above occurs.

An object of the present invention is to provide a method and apparatus for assigning keywords to a document that extract appropriate keywords and classify the document with high accuracy based on those keywords.

Disclosure of the invention

According to the present invention, keywords are extracted from a keyword-assignment target document stored in a storage device and from the documents associated with that document, and the extracted keywords are stored in the storage device in association with the target document, thereby solving problem (1) above.

Further, in the present invention, keywords are extracted from a document to be classified and from the documents associated with it, and these keywords are compared with the keywords in classification knowledge stored in a storage device, which describes a plurality of categories and the keyword group corresponding to each category. A similarity is calculated for each category, and the document is classified based on the similarity, thereby solving problem (1) above.

Further, according to the present invention, keywords are extracted from the classification target document and from the documents associated with it and are compared with the keywords in classification knowledge stored in the storage device, which describes user identifiers and the keyword group corresponding to each user identifier. It is thereby determined whether the classification target document is one requested by each user, and the user is notified of the contents or address information of the document, which solves problem (1) above.

Further, in the present invention, one or more types of keywords are extracted from a document stored in the storage device and from each document associated with it and are stored in the storage device, and when the document is displayed via the output unit, the keywords are arranged and displayed so as to correspond to the associated documents, thereby solving problems (2), (3), and (4) above.

Further, according to the present invention, the number of accesses to each document associated with a document stored in the storage device is held, and when the document is displayed via the output unit, the number of accesses, or an object corresponding to it, is arranged and displayed together with the displayed document in one-to-one correspondence with each associated document, thereby solving problems (2), (3), and (4) above.

Brief description of the figures

FIG. 1 is a diagram showing an outline of the keyword assignment described in this embodiment. FIG. 2 is a diagram showing an outline of the system described in this embodiment. FIG. 3 is a diagram showing an example of a group of documents having a link structure. FIG. 4 is a diagram showing an example of a description of the document of FIG. 3 in the HTML language. FIG. 5 is a diagram showing an example of the configuration of the word weighting rule 17. FIG. 6 is a diagram showing a processing procedure of the document analysis processing unit 6. FIG. 7 is a diagram showing an example of the configuration of the word table 15. FIG. 8 is a diagram showing an example of the configuration of the keyword extraction rule 18. FIG. 9 is a diagram showing an example of keywords recognized from a document and the word table 15. FIG. 10 is a diagram showing a processing procedure of the keyword recognition processing unit 7. FIG. 11 is a diagram showing an example of the definition of the classification knowledge base 20. FIG. 13 is a diagram showing an example of the configuration of the document information database 22. FIG. 14 is a diagram showing a processing procedure of the link information insertion processing unit 12. FIG. 15 is a diagram showing an example of a description in the HTML language after the link information insertion processing. FIG. 16 is a diagram showing an example of a document display result after the link information insertion processing.

Best mode for carrying out the invention

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

This embodiment describes a system that collects documents scattered on a network, automatically recognizes the keywords that characterize each document, classifies the documents by content based on those keywords, and displays to a user the documents that match the user's search request. The documents to be searched are assumed to have a hyperlink structure in which a document can link to related text. In particular, this embodiment uses documents written in HTML (Hypertext Markup Language), which can be accessed with a WWW (World Wide Web) browser. In HTML, character decoration, title information, and link information are described using various tags, so various information can be extracted by analyzing the types and ranges of these tags. HTML documents can also include image, video, and audio information. The contents described in this embodiment can also be applied to the sorting and organizing of document information by individual end users.

FIG. 1 is a schematic diagram showing the features of the present embodiment, and is a diagram for assisting understanding of the system described in detail in FIG. 2 and thereafter.

In FIG. 1, the document 01 to be processed (keyword extraction, classification, display, etc.) is linked to other documents (002, 003, 004), and these documents are also linked to one another. In this embodiment, keyword candidates 106 are first extracted from each document by a method described later. The keyword candidates are then evaluated comprehensively to determine the keywords 07 of the document 01 to be processed. In other words, even if the document to be processed does not contain enough text, keywords that are contained in the documents linked to it and that satisfy certain conditions (described later) are used as keywords of the document to be processed. One effect of the present invention is that highly accurate keyword information can be output in this way.

In addition, in FIG. 1, some documents include audio, image, and video information in addition to text. In this case, text information can be extracted from these pieces of information by performing speech recognition, image processing, and image / speech recognition processing in video, so that it can be handled in the same way as text after extraction.

FIG. 2 is a diagram showing an outline of a system described in this embodiment.

The system shown in FIG. 2 consists of an external network 1, such as the Internet, on which document information is scattered; a document management server 2 that collects and manages document information from the external network 1; clients that request searches from the document management server 2 and display the search results in a browser 28; and a network 4 (a LAN (Local Area Network), a telephone line, or the like) that connects the document management server to the group of clients. Of course, the documents collected and managed by the document management server 2 may include those on the LAN 4.

The system described in this embodiment has the following six functions.

(1) Collect document information.

(2) From the collected documents and the documents related to them, identify the keywords that characterize each collected document.

(3) Automatically classify each document according to its content based on the keyword and store it in the document information database.

(4) Search the document DB according to the search request requested by the end user, and return the search result to the end user.

(5) If an end user has specified in advance the keywords related to information of interest, and the collected documents contain information matching that interest, notify the end user of the address of the information.

(6) When an end user requests access to and display of a certain document, attach data on the documents referenced by that document to the acquired document information and display it to the end user.

Among them, the most important function is the keyword recognition of (2), which corresponds to the part surrounded by the dotted line in Fig. 2.

First, in the present system, the document collection processing unit 5 collects document information scattered on the external network 1. Each document has its own address information. In the WWW, address information called a URL (Uniform Resource Locator) is defined. The URL also contains the name of the server where the information is stored.

Generally, document collection starts from one page and proceeds by following the links on that page to the pages linked from it. Since algorithms for document collection are already known, they are not described here. Documents may be collected automatically, or the document creator may store a document in a specific location on the document management server 2. The documents collected by the document collection processing unit 5 are temporarily stored in the document data 13.
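As a rough illustration of collection that starts from one page and follows its links, the following minimal sketch performs a breadth-first traversal. The text does not specify the traversal order, so breadth-first search is an assumption, and the function name `collect_documents`, the `fetch_links` callback, and the `LINKS` table (an in-memory stand-in for pages on the network) are all hypothetical.

```python
from collections import deque


def collect_documents(start_url, fetch_links, max_pages=100):
    """Collect document IDs breadth-first, starting from one page.

    fetch_links is a hypothetical callback that returns the link
    destinations found on a page; it is not part of the described system.
    """
    collected = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        collected.append(url)
        for linked in fetch_links(url):
            if linked not in seen:  # visit each page only once
                seen.add(linked)
                queue.append(linked)
    return collected


# Hypothetical link structure standing in for pages on the network.
LINKS = {
    "doc1": ["doc2", "doc3"],
    "doc2": ["doc1"],
    "doc3": ["doc4"],
    "doc4": [],
}

order = collect_documents("doc1", lambda url: LINKS.get(url, []))
print(order)
```

The `seen` set keeps mutually linked pages (such as doc1 and doc2 here) from being collected twice.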

Next, the document analysis processing unit 6 extracts words constituting the document from the text portion of the document collected and stored in the document data 13. If the document is not text (voice, image, video), it is necessary to extract text information by applying a program that recognizes text information from each information. For speech recognition and image recognition (especially character recognition), a system with a certain level of accuracy has already been realized.

In the document analysis processing unit 6, the word dictionary 16, which stores vocabulary information such as headwords and parts of speech, is referred to in order to divide the obtained text into words. Since word segmentation algorithms are already known, as shown for example in the IPSJ 44th National Convention Lecture Papers (3), page 18, they are not described here.

Further, the document analysis processing unit 6 distributes weights to the respective words by referring to the word weighting rules 17 in order to determine how important each word is in the document. In the word weighting rule 17 in the present embodiment, it is possible to define rules for the following seven types of parameters relating to the description of a sentence.

(1) Title of the document (not shown explicitly in the displayed HTML page; written between the tags <TITLE> and </TITLE>)

(2) Character size

(3) Text color

(4) Text style (gothic, italic, underline, etc.)

(5) Appearance frequency

(6) Words appearing within the first N characters

(7) Anchor information indicating links to other documents

Each rule defines the weight to be added to words that satisfy the rule. The document analysis processing unit 6 assigns weights to words using at least one of the above seven types of parameters. The result is stored in the word table 15 for each document. In the present embodiment, only the nouns extracted from the document are stored in the word table 15, and the rest are discarded.

Further, the document analysis processing unit 6 recognizes the IDs (URLs) of other documents linked from the document by identifying the anchors in the document. In HTML, anchor information is written in the form <A HREF="link destination address">anchor</A>. Using this form as a clue, the link destination address (URL) and the anchor character string can be obtained easily, and they are stored as pairs in the link information table 14.

Next, the keyword recognition processing unit 7 comprehensively determines the keywords of a specific document from the words (nouns) extracted from that document and from the group of documents linked to it. That is, keyword recognition is based not only on the words appearing in the document but also on the tendency of words appearing in adjacent (related) documents. As a result, even when the document contains almost no text, or contains text but no suitable keywords, appropriate keywords can be assigned to it by comprehensively referring to the keywords of the related documents.

Keyword recognition is performed by referring to the keyword extraction rule 18. In this embodiment, keyword recognition is performed according to the following three types of parameters.
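The extraction of (link destination address, anchor character string) pairs can be sketched with Python's standard `html.parser` module. This is only an illustration of the idea, not the procedure of the document analysis processing unit 6; the class name `AnchorExtractor`, the function `extract_links`, and the sample URL are assumptions.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <A HREF="...">...</A>."""

    def __init__(self):
        super().__init__()
        self.links = []      # stands in for the link information table
        self._href = None    # href of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample; the URL is made up for illustration.
sample = '<P><A HREF="http://example.com/profile.html">Company Profile</A></P>'
pairs = extract_links(sample)
print(pairs)
```

Each pair corresponds to one record that would be stored in the link information table 14, together with the ID of the link source document.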

(1) A word extracted from each document has a weight value equal to or greater than a certain threshold.

(2) Of the words extracted from each document, words that exist in a certain percentage or more of the documents (documents linked to the document to be classified).

(3) Of the words extracted from each document, words that exist in a certain percentage or less of documents (documents linked to the documents to be classified).

Threshold and percentage values can be specified in the rules. A word that satisfies at least one of the rules defined for these three parameters is recognized as a keyword of the document and stored in the keyword table 19. (Of course, if the system is used only to extract keywords from documents, the keywords stored in the keyword table 19 are the final output, and processing ends here.)

Next, the classification processing unit 8 classifies each document to be classified into at least one of the predefined categories. Categories are described in the classification knowledge base 20. The classification knowledge base 20 in the present embodiment consists of sets of three elements: a category name, a keyword corresponding to the category, and a weight indicating the importance of the keyword. The classification knowledge base 20 may be created manually by defining keywords and their weights for each category, or it may be generated (semi-)automatically by extracting keywords and their weights from sample text data corresponding to each category.

The classification processing unit 8 compares the keywords of the document to be classified, stored in the keyword table 19, with the keywords described in the classification knowledge base 20, and calculates a similarity for each category. This will be described in detail later. After the similarity has been calculated for each category, the categories are sorted by similarity, and each category whose similarity is higher than a predetermined threshold is assigned to the document. Instead of using a threshold, the number of categories to assign may be fixed; alternatively, a maximum number of categories N may be set and the top N categories among those whose similarity is equal to or higher than the threshold may be assigned. The categories assigned as the classification result are stored in the classification table 21.
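The three parameters can be read as three optional tests, any one of which is sufficient. The sketch below is one possible reading, not the rule format of the keyword extraction rule 18: the function name, the dictionary layout of the word tables, the choice to count the target document together with its linked documents when computing the percentage, and the sample weights other than 22 and 16 are all assumptions.

```python
def recognize_keywords(target_id, linked_ids, word_tables,
                       min_weight=None, min_ratio=None, max_ratio=None):
    """Return words satisfying at least one of the enabled rules.

    word_tables maps a document ID to {word: weight} (assumed layout).
    A rule whose threshold is None is treated as not defined.
    """
    doc_ids = [target_id] + list(linked_ids)
    candidates = set()
    for doc_id in doc_ids:
        candidates.update(word_tables[doc_id])

    keywords = []
    for word in sorted(candidates):
        # Highest weight the word received in any of the documents.
        best_weight = max(word_tables[d].get(word, 0) for d in doc_ids)
        # Fraction of the documents in which the word appears.
        ratio = sum(word in word_tables[d] for d in doc_ids) / len(doc_ids)
        satisfied = (
            (min_weight is not None and best_weight >= min_weight)
            or (min_ratio is not None and ratio >= min_ratio)
            or (max_ratio is not None and ratio <= max_ratio)
        )
        if satisfied:
            keywords.append(word)
    return keywords


# Illustrative word tables; most weights are made up.
word_tables = {
    "doc1": {"Tsurugame Denki": 22, "profile": 4},
    "doc2": {"profile": 6, "PC": 5},
    "doc3": {"PC": 16, "price": 2},
}
result = recognize_keywords("doc1", ["doc2", "doc3"], word_tables,
                            min_weight=10, min_ratio=0.6)
print(result)
```

In this example "profile" has a low weight everywhere but appears in two of the three documents, so it is picked up by the percentage rule even though the weight rule rejects it.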
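The selection step (sort by similarity, apply a threshold, optionally keep only the top N) can be sketched as follows. The similarity formula is not given in this part of the text, so the sum of products of matching keyword weights used here is purely an assumption, as are the function names, the category names, and the sample weights.

```python
def category_similarities(doc_keywords, knowledge_base):
    """Assumed similarity: sum of (document weight x category weight)
    over keywords shared by the document and the category."""
    return {
        category: sum(weight * cat_keywords[word]
                      for word, weight in doc_keywords.items()
                      if word in cat_keywords)
        for category, cat_keywords in knowledge_base.items()
    }


def assign_categories(similarities, threshold, max_categories=None):
    """Sort categories by similarity, keep those at or above the
    threshold, and optionally keep only the top N of them."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    selected = [category for category, sim in ranked if sim >= threshold]
    if max_categories is not None:
        selected = selected[:max_categories]
    return selected


# Hypothetical classification knowledge: category -> {keyword: weight}.
knowledge_base = {
    "computers": {"PC": 3, "printer": 2},
    "company": {"profile": 2},
    "sports": {"soccer": 3},
}
doc_keywords = {"PC": 16, "profile": 6}

sims = category_similarities(doc_keywords, knowledge_base)
print(assign_categories(sims, threshold=10))
print(assign_categories(sims, threshold=10, max_categories=1))
```

Keeping the similarity calculation separate from the selection makes it easy to swap in whatever similarity measure is actually used without touching the threshold and top-N logic.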

The analyzed documents and their attribute information (categories, etc.) are all stored in the document information database 22. In addition to the document ID, the document information database stores the update date (registration date), the categories assigned by the classification processing unit 8, the keywords recognized by the keyword recognition processing unit 7, the IDs of linked documents, the frequency of access to the document's URL, and the body text. The access frequency information is updated by update requests from the link information insertion processing unit 12, described later.

The document management server 2 according to the present embodiment is a client-server system (CSS) that receives search-related requests from a plurality of clients 3 and returns the processing results to the clients. Since methods of implementing a CSS are already known, they are not described here. For simplicity, the requests from the client 3 described in this embodiment are limited to the following three types; in practice there will be other requests.

(1) A search execution instruction 26 for searching the document information database 22 for a document that satisfies certain conditions and obtaining the result.

(2) Document access instruction to obtain document information corresponding to a certain address 28.

(3) Definition and update of the contents of the classification knowledge base 20.

When a search execution instruction 26 is issued from the client 3, the search conditions needed for the search are passed to the document search processing unit 10 of the document management server 2 via the network 4. Search conditions are generally written by the user as keywords and categories combined with logical operators (AND / OR). The document search processing unit 10 extracts from the document information database 22 the document information that satisfies the search conditions passed from the client 3. Methods of retrieving documents based on logical expressions are well known, for example from the IPSJ 45th National Convention Proceedings (3), pages 239 to 244, and are not described here.

A list of the document ID and the update date (registration date) of the retrieved document information is temporarily stored in the search result 23. This list information is returned via the network 4 to the client 3 that has made the search request. Client 3 displays the returned document ID information in the browser.

When an instruction 27 to access document information at a certain address is issued from the client 3, the address character string input by the user is passed to the link information insertion processing unit 12 of the document management server 2 via the network 4. The link information insertion processing unit 12 obtains the information corresponding to the address from the document information database 22 or a cache directory, or from the internal network 4 or the external network 1. Methods of acquiring document information on a network by specifying a URL are already in use for obtaining WWW information, so they are not described in detail here.

When the obtained document information is present in the document information database 22, the IDs of the documents linked from the document and the keywords of those linked documents are obtained from the document information database 22. In the document to be accessed, the anchor information indicating which document is linked from where can be identified from the specific tags, as described above. Immediately after each anchor, the keywords corresponding to the linked document are inserted. Instead of the keyword information, the access frequency of the linked document stored in the document information database may be inserted.

The document into which link information such as keywords or access frequency has been inserted is temporarily stored in the document with link information 25. This data is passed over the network 4 to the client that requested access. Each time there is an access request, the link information insertion processing unit 12 requests the document information database 22 to increment the access frequency of the document by one.

If the requested document does not exist in the document information database 22, the above link information is not displayed. In that case the document ID is temporarily stored and passed to the document collection processing unit 5, the document is collected together with the documents linked to it, and keywords are extracted and the documents classified, so that link information can be attached from the next access onward.
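Inserting link information immediately after each anchor can be sketched with a regular expression over the HTML text. The output format (link information in square brackets after the closing </A>) is an assumption, since the display format of FIG. 15 and FIG. 16 is not reproduced here; the function name and sample data are likewise hypothetical.

```python
import re

# Matches <A HREF="...">...</A>; group 1 is the link destination address.
ANCHOR = re.compile(r'<A\s+HREF\s*=\s*"([^"]*)"\s*>(.*?)</A>',
                    re.IGNORECASE | re.DOTALL)


def insert_link_info(html, link_info):
    """Append link information right after each anchor whose destination
    is known. link_info maps a destination address to the text to show,
    for example keywords or an access count (assumed layout)."""
    def annotate(match):
        info = link_info.get(match.group(1))
        if info is None:
            return match.group(0)  # unknown destination: leave unchanged
        return f"{match.group(0)} [{info}]"
    return ANCHOR.sub(annotate, html)


# Hypothetical page and link information.
page = ('<A HREF="doc2.html">Company Profile</A><BR>'
        '<A HREF="doc9.html">News</A>')
annotated = insert_link_info(page, {"doc2.html": "history, offices"})
print(annotated)
```

Anchors whose destination is not in the database are left as they are, which mirrors the case where no link information is available yet.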

This embodiment can also be extended so that the system notifies a user of document information matching his or her interests. That is, the user defines keywords (and their importance) related to topics of interest. When the system collects a new document or the contents of an already collected document are updated, the keywords extracted from that document by the method above are matched against the keywords defined by each user to calculate a similarity, and the address information of the document is sent by e-mail or the like to each user whose keywords yield a similarity at or above a certain value. In this case, the classification processing unit 8 performs both the similarity calculation for each category and the similarity calculation for each user. The keywords and weights defined by a user are stored in the classification knowledge base 20 in association with the user ID. The calculation of the similarity between a user and a document will be described later.

When the documents have been classified and the results temporarily stored in the classification table 21, the document distribution processing unit 11 creates, from the document IDs stored in the classification table 21 and the IDs of the users to be notified, a list of the document IDs to report to each user. The list is then sent to the user by e-mail or the like. For entries that have been sent, the corresponding contents of the classification table 21 are deleted.

As described above, according to the system of the present embodiment, the keywords of a collected document can be recognized comprehensively from the document itself and from the documents related to it. Using these keywords, documents can be classified, and supplementary link information (keyword information, access frequency information) can be presented to the user when a document is displayed.
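The creation of per-user document ID lists can be sketched as a simple grouping step. The layout of the classification table as (document ID, user ID) pairs is an assumption, the actual sending by e-mail is omitted, and the function name is hypothetical.

```python
from collections import defaultdict


def build_notification_lists(classification_table):
    """Group (document ID, user ID) entries into one list per user.

    The entries are removed from the table afterwards, mirroring the
    deletion of items that have been sent. Sending itself is not shown.
    """
    per_user = defaultdict(list)
    for doc_id, user_id in classification_table:
        per_user[user_id].append(doc_id)
    classification_table.clear()
    return dict(per_user)


# Hypothetical entries: documents judged to match each user's interests.
table = [("doc1", "user_a"), ("doc3", "user_b"), ("doc4", "user_a")]
lists = build_notification_lists(table)
print(lists)
```

Grouping before sending means each user receives a single list rather than one message per matching document.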

Hereinafter, the details of the processing in FIG. 2 will be described using a specific example.

FIG. 3 is a diagram illustrating an example of a document group forming a link structure.

Figure 3 shows a structure in which five types of documents are linked to each other by links. In Fig. 3, the underlined characters (anchors) indicate links to other documents. In a WWW browser, the linked document 2 can be displayed by selecting the character string "Company Profile" of the document 1 with the mouse. The size, style, etc., of the characters that make up each document can be changed by using specific tags.

FIG. 4 is a diagram showing an example of the document of FIG. 3 written in the HTML language. In HTML, tags enclosed in angle brackets (<, >) are used to modify the characters they surround and to describe link information to other documents. Each tag corresponds uniquely to a specific function. An HTML document has a part that holds bibliographic information and a part that holds the body text; the former is enclosed by the tag HEAD and the latter by the tag BODY. The bibliographic information can include the document title (this information is not displayed in the browser). There are also tags that indicate the size of the displayed characters (H1, H2, ...), tags that indicate line breaks (P, BR), tags that indicate references to other documents (A HREF), and so on. A tag remains in effect until the same tag prefixed with a slash (/) appears, and some tags can be nested. When keywords are extracted, the HTML document is analyzed with reference to this tag information.

FIG. 5 is a diagram showing an example of the configuration of the word weighting rule 17.

As described above, whether a word is important is determined, in principle, using the tag information attached to the word. In this embodiment, therefore, the word weighting rule 17 can define a weight for each parameter relating to the presence or absence of tag information.

For example, the first record in FIG. 5 indicates that when a word appears within the range of the tag TITLE (document title information), 10 is added to the weight of the word. If a word is within the range of multiple tags, the weights corresponding to all of them are added. Note that "Frequency" in the last record of FIG. 5 is not a tag but a rule on the frequency of occurrence of words in the document: the weight is added to words whose frequency of occurrence is at or above a lower limit. Instead of setting a lower limit, a relative setting is also possible, such as words whose frequency of occurrence is in the top N% in the document. This threshold information may be stored additionally in the word weighting rule 17, stored in another location, or embedded in the processing program.

FIG. 6 is a diagram showing a processing procedure of the document analysis processing unit 6.

The document analysis processing unit 6 repeats the following processing until the end of the HTML document is reached (step 6001). First, one line of character string is read from the HTML document (step 6002) and divided into tag information and text information (step 6003). Next, for the tag information, whether the tag is effective is determined by whether a slash (/) immediately precedes the tag character string (step 6004). If the tag is effective, the tag information is held (step 6005). It is also determined whether the tag is the tag "A HREF", which indicates a link (step 6006). If it is, the character string enclosed in double quotation marks immediately after "HREF" following the tag "A" is taken as the link destination document ID and stored in the link information table 14 together with the link source document ID (step 6007). Text other than tags is divided into words with reference to the word dictionary 16, which stores headwords, parts of speech, and conjugation information (step 6008). Several word segmentation algorithms are known, such as the longest match method and the minimum cost method, and any of them can be applied. Next, only nouns are extracted from the divided words and stored temporarily in a work area (step 6009). Then, with reference to the word weighting rule 17, if a tag specified in the rule 17 is among the currently effective tags (step 6010), the weight defined for that tag in the rule 17 is assigned to the words within the effective range of the tag (step 6011).

After the document has been analyzed to its end, the frequency of occurrence of each word stored in the work area is counted (step 6012). Then, for each word (step 6013), it is determined whether its frequency of occurrence is equal to or greater than a threshold (step 6014). If it is, the weight defined in the word weighting rule 17 (weight 3, corresponding to the frequency item in FIG. 5) is added to the weight of the word (step 6015). For a given word, however, the weight corresponding to a specific tag is given only once. For example, even if the word A appears twice in italics in the document, the weight added to word A for the italic tag is 3 (not 6).

As weighting based on information other than tags, in addition to the frequency of occurrence, a rule may be used that assigns a certain weight to words appearing within the first N characters of the text. A rule that assigns a certain weight to words accompanied by some other specified feature may also be used. The words (nouns) obtained by the processing up to this point are sorted in descending order of weight, and the result is stored in the word table 15 (step 6016). When words are stored in the word table 15 in step 6016, a word that belongs to a group of words specified in advance may be left out. This makes it possible to remove words that clearly cannot be keywords (for example, the Japanese equivalents of "if" or "when").

FIG. 7 is a diagram showing an example of the configuration of the word table 15. It shows the words (nouns) extracted from each of the documents in FIG. 3 together with examples of their weights (the top-ranked words with the highest weights). As shown in FIG. 4, the word "Tsurugame Denki" in Document 1 appears in the TITLE, is in large characters (tag "H1"), and is in bold (tag "B"), so its weight is 10 + 5 + 7 = 22 when rule 17 in FIG. 5 is used. Similarly, the word "PC" in Document 3 appears in bold (tag "B"), is in large characters (tag "H2"), and forms part of an anchor character string that indicates a link to Document 4, so its weight is 5 + 3 + 8 = 16.
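
The weighting of steps 6010 to 6016 can be sketched as follows. This is only an illustration under stated assumptions: FIG. 5 is not reproduced here, so the per-tag values in `TAG_WEIGHTS` are placeholders chosen to be consistent with both totals quoted above (22 and 16) and with the italic weight of 3. The frequency threshold, the function name `weigh_words`, and the input format are likewise hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for word weighting rule 17; FIG. 5 is not reproduced
# here, so these values are assumptions chosen to match the totals in the text.
TAG_WEIGHTS = {"TITLE": 10, "H1": 7, "H2": 3, "B": 5, "A": 8, "I": 3}
FREQUENCY_WEIGHT = 3         # weight of the frequency item, as stated in the text
FREQUENCY_THRESHOLD = 3      # assumed; the text does not give the threshold
STOP_WORDS = {"if", "when"}  # words that clearly cannot be keywords


def weigh_words(occurrences):
    """Return (word, weight) pairs sorted by descending weight.

    occurrences: one (noun, set_of_effective_tags) pair per appearance.
    """
    weights = {}
    counts = Counter()
    credited = set()  # (word, tag) pairs that have already been weighted
    for word, tags in occurrences:
        counts[word] += 1
        weights.setdefault(word, 0)
        for tag in tags:
            if tag in TAG_WEIGHTS and (word, tag) not in credited:
                credited.add((word, tag))  # each tag counts once per word
                weights[word] += TAG_WEIGHTS[tag]
    # Add the frequency weight to words that appear often enough.
    for word, n in counts.items():
        if n >= FREQUENCY_THRESHOLD:
            weights[word] += FREQUENCY_WEIGHT
    kept = [(w, wt) for w, wt in weights.items() if w not in STOP_WORDS]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Because each (word, tag) pair is credited only once, a word that appears twice in italics still receives the italic weight a single time, which matches the "3 (not 6)" behavior described for word A.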

FIG. 8 is a diagram showing an example of the configuration of the keyword extraction rule 18.

In this embodiment, the keywords of each document are recognized using the following conditions.

(1) For a word extracted from the document, is the weight of the word equal to or greater than a certain threshold?

(2) For a word extracted from the document and having a weight equal to or greater than a certain threshold, is the number of documents in which the word appears, among the document and the documents linked from it, equal to or greater than (or equal to or less than) a certain threshold?

(3) For a word extracted from a document linked from the document, is the weight of the word equal to or greater than a certain threshold?

(4) For a word extracted from a document linked from the document and having a weight equal to or greater than a certain threshold, is the number of documents in which the word appears, among the document and the documents linked from it, equal to or greater than (or equal to or less than) a certain threshold?

A word that satisfies any of the above conditions is recognized as a keyword of the document. FIG. 8 defines the threshold information for these conditions. It shows that 10 is defined as the weight threshold of condition (1). For condition (2), the weight threshold is defined as 5 and the threshold on the number of documents is defined as "60% or more, or 1 or less". For condition (3), 15 is defined as the weight threshold. For condition (4), the weight threshold is 5 and the threshold on the number of documents is defined as "60% or more, or 1 or less".

FIG. 9 is a diagram showing an example of the keywords of each document recognized from the word table 15 in FIG. 7 on the basis of the keyword extraction rule 18 in FIG. 8. Taking Document 1 as an example, the keywords "Tsurugame Electric" and "Homepage" are first extracted by condition 1 in FIG. 8. Condition 2 is examined next, but there is no word in Document 1 with a weight of 10 or more and a proportion of appearing documents of 60% or more, or a count of 1 or less. (For example, "Company" also appears in linked Document 2, but the proportion of appearing documents is 50% (2 out of 4), which is less than 60%.) Next comes condition 3. Three documents are linked from Document 1: Documents 2, 3, and 4. Among their words, those with a weight of 15 or more are "Company" and "Overview"; "Latest" and "News" in Document 3; and "Products", "Information", "Internet", "Correspondence", and "PC" in Document 4. These therefore become keywords of Document 1. Finally, under condition 4, the words with a weight of 5 or more and a proportion of appearing documents of 60% or more, or a count of 1 or less, are "hard disk" and "printer", so these also become keywords of Document 1. In the end, in this example, 13 keywords are recognized for Document 1: "Tsurugame Electric", "Homepage", "Company", "Overview", "Latest", "News", "Products", "Information", "Internet", "Correspondence", "PC", "hard disk", and "printer". Note that "PC" and "Internet" are keywords that do not appear in Document 1 itself. As mentioned above, words such as "Homepage", "Overview", "Latest", "Correspondence", and "Information" are not considered appropriate as keywords characterizing the content of a document, so a list of such words can be prepared in advance and used to remove them.

FIG. 10 is a diagram showing a processing procedure of the keyword recognition processing unit 7.

First, the IDs of the documents linked from the keyword recognition target document are acquired from the link information table 14 and stored in the work area (step 7001). Next, all the words and their weights corresponding to the documents acquired in step 7001 are acquired from the word table 15 and stored in the work area (step 7002). Next, for each word stored in the work area (step 7003), the number of documents containing the word is counted, and its proportion to the number of documents stored in the work area is calculated and held (step 7004). Next, with reference to the keyword extraction rule 18, it is determined whether the condition COND1 is defined (step 7005). If it is defined, then for each word of the keyword recognition target document stored in the work area (step 7006), it is determined whether its weight is equal to or greater than the threshold described in COND1 (step 7007). If so, the word is stored in the keyword table 19, together with the document ID and the weight of the word, as a keyword of the keyword recognition target document (step 7008). Similarly, it is determined whether the condition COND2 is defined (step 7009). If it is defined, then for each word of the keyword recognition target document stored in the work area (step 7010), it is determined, with reference to the values calculated in step 7004, whether the number of documents in which the word appears and its ratio to the whole satisfy the range described in COND2 (step 7011). If so, the word is stored in the keyword table 19, together with the document ID and the weight of the word, as a keyword of the keyword recognition target document (step 7012). Next, the same processing as in steps 7005 to 7012 is performed on the words of the documents linked from the keyword recognition target document (steps 7013 to 7020), with COND3 applied instead of COND1 and COND4 applied instead of COND2. Four keyword extraction rules are used in this embodiment, but they are only examples, and other rules can be defined in the same manner.
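
The recognition procedure can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name `recognize_keywords`, the dictionary layouts standing in for the link information table 14, the word table 15, and the keyword extraction rule 18, and the choice to keep the largest weight when a word qualifies more than once are all assumptions.

```python
from collections import Counter


def recognize_keywords(target_id, link_table, word_table, rules):
    """Return {keyword: weight} for the target document.

    link_table: {doc_id: [linked doc_ids]}   (stand-in for table 14)
    word_table: {doc_id: {word: weight}}     (stand-in for table 15)
    rules:      {"COND1": {...}, ...}        (stand-in for rule 18)
    """
    doc_ids = [target_id] + list(link_table.get(target_id, []))
    # Step 7004: in how many of the collected documents does each word appear?
    doc_count = Counter(w for d in doc_ids for w in word_table.get(d, {}))
    total = len(doc_ids)
    keywords = {}

    def apply(rule_name, source_ids, use_count):
        rule = rules.get(rule_name)
        if rule is None:  # the condition is not defined
            return
        for d in source_ids:
            for word, weight in word_table.get(d, {}).items():
                if weight < rule["min_weight"]:
                    continue
                if use_count:
                    n = doc_count[word]
                    in_range = (n / total >= rule["min_ratio"]
                                or n <= rule["max_docs"])
                    if not in_range:
                        continue
                # Assumption: keep the largest weight seen for a word.
                keywords[word] = max(weight, keywords.get(word, 0))

    apply("COND1", [target_id], use_count=False)
    apply("COND2", [target_id], use_count=True)
    apply("COND3", doc_ids[1:], use_count=False)
    apply("COND4", doc_ids[1:], use_count=True)
    return keywords
```

With thresholds such as those described for FIG. 8 (weight 10 for COND1, weight 5 and "60% or more, or 1 or less" for COND2 and COND4, weight 15 for COND3), a word that appears in 2 of 4 documents fails the document-count test, as in the "Company" example, while a word found only in one linked document can still qualify under COND4.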

FIG. 11 is a diagram showing an example of the definition of the classification knowledge base 20.

The classification knowledge base 20 is composed of two types of tables for different purposes: a category classification table for classifying documents into categories, and a user classification table for associating documents with users who are interested in their contents. As shown in FIG. 11, the former consists of three fields (category name, keyword, and weight), and the latter consists of three fields (user ID, keyword, and weight). The two have the same configuration except that the category name is replaced by the user ID.

The category classification table can be defined manually by the administrator of this system, or it can be defined (semi-)automatically by collecting texts corresponding to each category and extracting keywords from those texts with a method such as the one described in the present embodiment. Either method can be used; what is essential is that a category classification table is defined.

The user classification table is defined by each user using an editor or the like. In this case, however, the phrases specified by the user need to be divided into words with reference to the word dictionary 16 so that the classification processing unit 8 can match them against keywords. If a word specified by the user does not exist in the word dictionary 16, it is divided as appropriate.

For the weights described in the classification knowledge base 20, a higher numerical value indicates greater importance. The value may be written as a relative value (for example, between 0 and 1) or as an absolute value (for example, 30 or 200). In FIG. 11, the former is adopted.

FIG. 12 is a diagram showing a processing procedure of the classification processing unit 8.

After the array elements that store the similarity value for each category (or user) are initialized to 0 (step 8001), the following value is calculated for each keyword of the document to be classified stored in the keyword table 19 (step 8002). With reference to the classification knowledge base 20, the value is calculated for each category (or user ID) having that keyword and is added to the similarity of that category (step 8003).

$$W_j \times \frac{w_{ij}}{\sum w_j}$$

Here, $W_j$ is the weight of keyword $j$ of the document to be classified, $w_{ij}$ is the weight of keyword $j$ for a certain category $i$ in the classification knowledge base 20, and $\sum w_j$ is the sum of the weights of keyword $j$ over all categories.

According to the above equation, the similarity calculation has the following two properties.

(1) The greater the weight $W_j$ of a keyword of the document to be classified, the greater the similarity.

(2) The greater the relative proportion ($w_{ij} / \sum w_j$) of the weight of the keyword for a certain category $i$, the greater the similarity.

Instead of the above equation, the product of the weight of the keyword and the weight of the corresponding category may be used as the similarity. A unary operation (logarithm, exponentiation, factorial, etc.) may also be applied to these values to obtain the similarity.

Next, among the similarities calculated so far for each category (or user ID), each category whose similarity is greater than a certain threshold is stored in the classification table together with the document ID (step 8004).
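
Steps 8001 to 8004 can be sketched as follows. This is a minimal illustration under assumptions: the function name `classify` and the dictionary layouts for the keyword table 19 and the classification knowledge base 20 are hypothetical, and a guard against a zero weight total has been added that the text does not mention.

```python
from collections import defaultdict


def classify(doc_keywords, knowledge_base, threshold):
    """Return {category: similarity} for categories above the threshold.

    doc_keywords:   {keyword: W_j}              (stand-in for table 19)
    knowledge_base: {category: {keyword: w_ij}} (stand-in for knowledge base 20)
    """
    # Sum of the weights of each keyword over all categories.
    totals = defaultdict(float)
    for keyword_weights in knowledge_base.values():
        for keyword, w in keyword_weights.items():
            totals[keyword] += w

    similarity = {category: 0.0 for category in knowledge_base}   # step 8001
    for keyword, W in doc_keywords.items():                       # step 8002
        for category, keyword_weights in knowledge_base.items():
            if keyword in keyword_weights and totals[keyword] > 0:
                # step 8003: W_j * (w_ij / sum of w_j)
                similarity[category] += W * keyword_weights[keyword] / totals[keyword]
    # step 8004: keep only categories whose similarity exceeds the threshold
    return {c: s for c, s in similarity.items() if s > threshold}
```

Passing a table keyed by user ID instead of category name gives the user classification case without any change to the function, since both tables share the same layout.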

FIG. 13 is a diagram showing an example of the configuration of the document information database 22.

The document information storage processing unit 9 stores various data related to a document in the document information database 22. The contents of the document information database 22 are accessed via the document search processing unit 10 when requested by the user. The document information database 22 according to the present embodiment holds, for each document, a document ID, an update date, a category, keywords, an access frequency (initial value 0), a list of linked document IDs, and the text.
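
One record of the document information database 22 could be modeled as below. The class name `DocumentRecord`, the field names, and the field types are assumptions for illustration. Only the list of items and the initial access frequency of 0 come from the description above.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DocumentRecord:
    """Hypothetical layout of one entry in document information database 22."""
    document_id: str
    update_date: date
    categories: list = field(default_factory=list)
    keywords: dict = field(default_factory=dict)  # keyword -> weight
    access_frequency: int = 0                     # initial value is 0
    linked_document_ids: list = field(default_factory=list)
    text: str = ""
```

Mutable fields use `default_factory` so that two records never share the same list or dictionary.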

FIG. 14 is a diagram showing a processing procedure of the link information insertion processing unit 12.

When the ID of a document for which a user has made an access request is received, the document information is first collected (step 12001). The document information could be taken from the document information database 22. However, since the contents of the document may have been updated, it is obtained via the network from the server that stores the document. Next, it is determined whether data corresponding to the document ID exists in the document information database 22 (step 12002). If so, the IDs of the documents linked from the document are obtained, together with the keywords and access frequency information of those linked documents (step 12003). Next, the HTML file of the document is scanned, each anchor indicating a link to another document is located using the link tag as a clue, and the keyword group or access frequency information is inserted immediately after the anchor (step 12004). Then, 1 is added to the access frequency of the document in the document information database 22 (step 12005). If the corresponding data does not exist, the document is sent to the client as it is, without keyword information or access frequency information being inserted (step 12006). Keywords could of course be extracted by passing the document to the document analysis processing unit 6, but because the time required for the analysis would lengthen the access time, keyword and access frequency information are not inserted in this case. It is possible, however, to accumulate the IDs of documents that are not stored in the document information database 22 and register them in the database 22 later by batch processing. The document information into which the keyword information or access frequency information has been inserted is sent to the client that made the access request and is displayed in the browser.
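
Steps 12002 to 12006 can be illustrated with a small sketch. Everything about the data layout is an assumption: the function name `insert_link_info`, the use of the `href` value as the document ID, the in-memory dictionary standing in for the document information database 22, and the bracketed format of the inserted text. A regular expression is used only to keep the example short, and a real HTML parser would be more robust.

```python
import re

# Matches a whole anchor element and captures its href value.
ANCHOR = re.compile(r'(<a\s[^>]*href="([^"]+)"[^>]*>.*?</a>)',
                    re.IGNORECASE | re.DOTALL)


def insert_link_info(html, doc_id, database):
    """Insert keywords and access counts of linked documents after each anchor.

    database: {doc_id: {"keywords": [...], "access_frequency": int}}
    """
    record = database.get(doc_id)
    if record is None:
        return html  # step 12006: send the document as it is

    def annotate(match):
        linked = database.get(match.group(2))
        if linked is None:
            return match.group(1)  # nothing known about the linked document
        keywords = ", ".join(linked["keywords"])
        return f'{match.group(1)} [{keywords}; accesses: {linked["access_frequency"]}]'

    result = ANCHOR.sub(annotate, html)   # step 12004
    record["access_frequency"] += 1       # step 12005
    return result
```

Anchors whose target is not in the database are left untouched, which mirrors the choice described above of not analyzing unknown documents at access time.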

FIG. 15 is a diagram showing an example of a description in the HTML language after the link information insertion processing for the document of FIG. Immediately after the anchor character string "Latest News", the keywords of Document 3, which is linked by the anchor, and information on the access frequency of Document 3 are inserted. Whether or not to insert such information can be specified by the user.

FIG. 16 is a diagram showing an example of a document display result after the link information insertion process for the document in FIG. Each anchor is displayed with keywords and information indicating the access frequency added to it. This allows the user to know which anchor to follow next to reach the desired information. Also, when the keywords alone are not enough to decide which link to follow, the user can refer to the access frequency information and try the documents that other users access more frequently. Regarding the display of keywords, the words that make up an anchor may overlap with the keywords of the document linked by the anchor. In this case, the link information insertion processing unit 12 may remove the duplicate keywords.

Industrial applicability

According to the present invention, when keywords are extracted from a document or when a document is classified, not only the information in the document itself but also keywords extracted from documents associated with it are used. Therefore, even if the document itself contains no appropriate keyword, its keywords can be recognized accurately and the document can be classified with high accuracy. Further, according to the present invention, when the contents of a document are displayed, the display can be accompanied by keyword information on the documents linked from that document or by information on the access frequency of those documents.

As a result, a document desired by the user can be efficiently accessed, and costs such as search time and search cost can be reduced.

Claims

1. A keyword assignment method using link information, characterized in that keywords are extracted from a keyword assignment target document stored in a storage device and from a document associated with the keyword assignment target document, and the extracted keywords are stored in the storage device in correspondence with the keyword assignment target document.

2. A computer-readable recording medium characterized in that keywords extracted from a keyword assignment target document and from a document associated with the keyword assignment target document are recorded in correspondence with the keyword assignment target document.

3. The keyword assignment method using link information according to claim 1, wherein the keyword assignment target document includes at least one of audio data, video data, image data, and text data.

4. The keyword assignment method using link information according to claim 1, wherein, from each of the keyword assignment target document and the document associated with the keyword assignment target document, a phrase that satisfies at least one of the following phrase extraction conditions is set as a keyword candidate corresponding to that document: (1) a phrase constituting the title of the document, (2) a phrase whose characters are larger than other characters, (3) a phrase whose display color differs from that of other phrases, (4) a phrase whose character style differs from that of other phrases, (5) a phrase that appears frequently, (6) a phrase that appears at a position satisfying a specific condition, and (7) a phrase constituting an element (anchor) indicating a link to another document.

5. The keyword assignment method using link information according to claim 4, wherein a weight is defined in advance for each of the phrase extraction conditions, the weight corresponding to an extraction condition is added to a phrase that satisfies that extraction condition, and a phrase having a weight equal to or greater than a predetermined threshold is set as a keyword candidate corresponding to the document.

6. The keyword assignment method using link information according to claim 1, wherein, when a keyword corresponding to the keyword assignment target document is recognized from the keyword candidates extracted from each of the documents, a phrase that satisfies at least one of the following keyword recognition conditions is recognized as a keyword and assigned to the keyword assignment target document: (1) a phrase having a weight equal to or greater than a certain threshold, (2) a phrase, among the extracted phrases, that is present in at least a specified proportion of the documents, and (3) a phrase, among the extracted phrases, that is present only in a predetermined proportion of the documents or fewer.

7. A document classification method characterized by comparing keywords extracted from a classification target document and from a document associated with the classification target document with the keywords included in classification knowledge, stored in a storage device, in which keywords are classified by category, calculating a similarity for each category, and associating one or more categories having a high similarity with the classification target document.

8. A computer-readable recording medium characterized in that keywords extracted from a classification target document and from a document associated with the classification target document are matched against the keywords in classification knowledge, stored in a storage device, in which keywords are classified by category, a similarity is calculated for each category, and one or more categories having a high similarity are recorded in the storage device in association with the classification target document.

9. The document classification method according to claim 7, wherein the document includes at least one of audio data, video data, image data, and text data.

10. The document classification method according to claim 7, wherein, from each of the classification target document and the document associated with the classification target document, a phrase that satisfies at least one of the following phrase extraction conditions is set as a keyword candidate corresponding to that document: (1) a phrase constituting the title of the document, (2) a phrase whose characters are larger than other characters, (3) a phrase whose display color differs from that of other phrases, (4) a phrase whose character style differs from that of other phrases, (5) a phrase that appears frequently, (6) a phrase that appears at a position satisfying a specific condition, and (7) a phrase constituting an element (anchor) indicating a link to another document.

11. The document classification method according to claim 10, wherein a weight is defined in advance for each of the phrase extraction conditions, the weight corresponding to an extraction condition is added to a phrase that satisfies that extraction condition, and a phrase having a weight equal to or greater than a predetermined threshold is set as a keyword candidate corresponding to the document.

12. The document classification method according to claim 4, wherein, when a keyword corresponding to the target document is recognized from the keyword candidates extracted from each of the documents, a phrase that satisfies at least one of the following conditions is recognized as a keyword and associated with the classification target document: (1) a phrase having a weight equal to or greater than a certain threshold, (2) a phrase, among the extracted phrases, that is present in at least a specified proportion of the documents, and (3) a phrase, among the extracted phrases, that is present only in a predetermined proportion of the documents or fewer.

13. A document classification method characterized by comparing keywords extracted from a classification target document and from documents associated with the classification target document with the keywords in classification knowledge, stored in a storage device, in which keywords are classified by user identifier, determining whether the classification target document is a document desired by each user, and, if so, notifying the user of the contents or address information of the classification target document.

14. A document display method characterized in that one or more keywords are extracted from each of a document stored in a storage device and a document associated with that document and stored in the storage device, and, when the document is displayed via an output unit, the keywords are arranged and displayed so as to correspond to the associated document.

15. The document display method according to claim 14, wherein, when the keywords are extracted from each of the associated documents, a phrase that satisfies at least one of the following phrase extraction conditions is used as a keyword: (1) a phrase constituting the title of the document, (2) a phrase whose characters are larger than other characters, (3) a phrase whose display color differs from that of other phrases, (4) a phrase whose character style differs from that of other phrases, (5) a phrase that appears frequently, (6) a phrase that appears at a position satisfying a specific condition, and (7) a phrase constituting an element (anchor) indicating a link to another document.

16. The document display method according to claim 14, wherein a weight is defined in advance for each of the phrase extraction conditions, the weight corresponding to an extraction condition is added to a phrase that satisfies that extraction condition, and a phrase having a weight equal to or greater than a predetermined threshold is set as a keyword of the document.

17. A document display method characterized by holding the number of times each document associated with a document stored in a storage device has been accessed, and, when the document is displayed via an output unit, arranging and displaying, together with the document to be displayed, the number of accesses or an object corresponding to the number of accesses in one-to-one correspondence with each associated document.