WO2004097678A1

WO2004097678A1 - Automatic document classification program, method and device thereof

Info

Publication number: WO2004097678A1
Application number: PCT/JP2003/005526
Authority: WO
Inventors: Shigehiro Mochizuki
Original assignee: Fujitsu Limited
Priority date: 2003-04-30
Filing date: 2003-04-30
Publication date: 2004-11-11

Abstract

A program for causing a computer to perform automatic classification of documents stored in a computer storage device. The program causes the computer to execute the function to cluster document groups, the function to classify documents by machine learning, and the function to specify whether the keywords used for clustering include a feature keyword contained in the classification rule obtained by the machine learning, when performing the clustering.

Description

TECHNICAL FIELD

The present invention relates to an automatic document classification program, method, and apparatus. More particularly, it relates to a technology for automatically classifying documents using machine learning and clustering.

BACKGROUND ART

Conventionally, methods for automatically classifying documents include a method based on machine learning and a method based on clustering processing.

A system that automatically classifies documents by machine learning generally performs the following processing.

(1) Define categories for classifying documents, and set the correct example documents that should belong to each category.

(2) For each category, statistically examine the frequency of occurrence of the words contained in the correct example documents, find the characteristic keywords of the category, learn score values corresponding to the number of appearances of those keywords, and generate classification rules.

(3) For each unclassified document, analyze the words in the document and check whether each is a feature keyword covered by the classification rules; if so, add up the corresponding score values to obtain a score for each category. The document is classified into the category whose score exceeds the threshold and is the highest among the obtained scores.

For an example of automatic classification of documents by machine learning, see "Koji Tsukamoto, Manabu Sasano: Text Classification Using AdaBoost and Active Learning. Information Processing Society of Japan, 2001" (hereinafter referred to as Non-Patent Document 1).
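As an illustration only, step (3) above can be sketched as follows. The rule table, its score values, and the names `RULES` and `classify` are hypothetical and not taken from the publication; the sketch simply adds up per-keyword scores for each category and applies a threshold.

```python
from typing import Dict, List, Optional

# Hypothetical classification rules: category -> {feature keyword: score per occurrence}.
# The keywords and score values are made up for illustration.
RULES: Dict[str, Dict[str, float]] = {
    "Forwarding": {"forward": 0.4, "address": 0.3},
    "Security": {"encrypt": 0.5, "virus": 0.4},
}


def classify(words: List[str],
             rules: Dict[str, Dict[str, float]],
             threshold: float) -> Optional[str]:
    """Return the best-scoring category, or None if no score exceeds the threshold."""
    if not rules:
        return None
    scores = {
        category: sum(keyword_scores.get(word, 0.0) for word in words)
        for category, keyword_scores in rules.items()
    }
    best = max(scores, key=scores.get)
    # A document whose best score does not exceed the threshold stays unclassified.
    return best if scores[best] > threshold else None
```

A document for which `classify` returns `None` corresponds to one that "cannot be classified anywhere" in the discussion that follows.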

In a system that clusters documents, the following processing is generally performed.

(1) Divide a large number of documents into words (morphological analysis), calculate a numerical value representing how characteristic each word is from its frequency of appearance, and select a fixed number of words from the top as feature keywords. (As this numerical value, TF·IDF (Term Frequency × Inverse Document Frequency) is used, for example: TF·IDF = (total number of occurrences of the word) × log(total number of documents ÷ number of documents in which the word appears).)

(2) Calculate the appearance probability of each feature keyword in each document and calculate the correlation coefficient between keywords.

(3) Select the combination with the highest correlation coefficient between keywords and create a group (cluster).

(4) Recalculate the correlation coefficient between the keywords in the group and other keywords using the average value of the members in the group.

(5) From the group or keyword combination, select the combination with the highest correlation coefficient again, and create a new group.

(6) Repeat (4) and (5) until the whole is in one group.
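Step (1) of the list above can be sketched as follows, assuming the documents have already been split into words. The function name `select_feature_keywords` and the alphabetical tie-break are assumptions made for the example; the score is the TF·IDF value described in step (1).

```python
import math
from collections import Counter
from typing import List


def select_feature_keywords(docs: List[List[str]], top_n: int) -> List[str]:
    """Rank words by (total occurrences) * log(N / document frequency); keep the top_n."""
    total_occurrences: Counter = Counter()
    document_frequency: Counter = Counter()
    for tokens in docs:
        total_occurrences.update(tokens)
        document_frequency.update(set(tokens))  # count each word once per document

    n_docs = len(docs)
    score = {
        word: total_occurrences[word] * math.log(n_docs / document_frequency[word])
        for word in total_occurrences
    }
    # Highest score first; ties are broken alphabetically (an assumption of this sketch).
    ranked = sorted(score, key=lambda word: (-score[word], word))
    return ranked[:top_n]
```

A word that appears in every document gets a score of zero, so it drops to the bottom of the ranking.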

In the above description, clusters are formed based on the correlation coefficients between keywords, but there is also a method of forming clusters based on the correlation coefficients between documents.
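Steps (2) to (6) of the clustering procedure above can be sketched as a small agglomerative loop. This is a minimal illustration under stated assumptions: the appearance probability is taken as (occurrences of the keyword) / (words in the document), the correlation coefficient is Pearson's, and the names `pearson` and `cluster_keywords` are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple

Group = Tuple[str, ...]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either vector has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cluster_keywords(docs: List[List[str]],
                     keywords: List[str]) -> List[Tuple[Group, Group]]:
    """Merge keywords/groups by highest correlation until one group remains.

    Returns the merges in the order they were made.
    """
    # Step (2): appearance probability of each keyword in each document.
    vectors: Dict[Group, List[float]] = {
        (k,): [tokens.count(k) / len(tokens) if tokens else 0.0 for tokens in docs]
        for k in keywords
    }
    merges: List[Tuple[Group, Group]] = []
    while len(vectors) > 1:
        # Steps (3) and (5): pick the pair with the highest correlation coefficient.
        a, b = max(combinations(vectors, 2),
                   key=lambda pair: pearson(vectors[pair[0]], vectors[pair[1]]))
        va, vb = vectors.pop(a), vectors.pop(b)
        # Step (4): represent the new group by the average of all its members.
        size = len(a) + len(b)
        vectors[a + b] = [(p * len(a) + q * len(b)) / size for p, q in zip(va, vb)]
        merges.append((a, b))
    return merges
```

Recording the merge order is one way to keep the tree structure, so that clusters of a chosen size can be cut out of it later.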

Conventionally, automatic classification by machine learning and clustering processing have not been closely linked; clustered document groups were used only as candidates when creating the initial categories for machine learning and selecting correct example documents. Also, for example, the document classification disclosed in Japanese Patent Laid-Open Publication No. 05-3242272 (hereinafter referred to as Patent Document 1) goes as far as indicating that a new category needs to be set when no appropriate category exists at the time of classification, but it does not describe functions such as support for creating the new category.

In automatic classification by machine learning, the distribution of classified documents is often biased, with the majority of documents concentrated in a specific category. In such a case, it is necessary to subdivide the category and even out the number of documents belonging to each category. In addition, it is necessary to find new categories in the group of documents judged "cannot be classified anywhere" because the score calculated from the classification rules did not reach a certain threshold, and to revise the category system and classification rules so that those documents are appropriately classified into the new categories.

In such cases, a new category must be created and correct example documents set while manually reading all the documents in the category to be subdivided, or all the documents that could not be classified anywhere, which requires a great deal of effort. Moreover, simply applying the conventional clustering process uses the feature keywords of the existing classification rules without distinction, so there was a problem that document groups (clusters) suitable for creating new categories different from the existing ones could not always be created.

The present invention has been made in view of the above problems, and its object is to provide a method for classifying documents automatically and more appropriately by closely linking document classification by machine learning with the clustering process. The two processes are linked as follows.

1) When creating a new category system in machine learning, instead of simply applying the conventional clustering process, apply a clustering process so that an appropriate new category system can be created.

2) Classification results by machine learning can be further clustered and classified more appropriately.

3) Enable correct example documents for machine learning categories to be registered based on the result of the clustering process.

Non-patent document 1

Koji Tsukamoto, Manabu Sasano: Text Classification Using AdaBoost and Active Learning. Japan Society for Information Processing. 1st 4th Natural Language Processing Workshop.

Patent Document 1

Japanese Patent Application Laid-Open No. 05-34242 2

Disclosure of the Invention

One embodiment of the present invention is a program for realizing, by a computer, a process of automatically classifying documents stored in a storage device of the computer, the program causing the computer to realize: a function of performing a clustering process on a group of documents; a function of classifying documents by machine learning; and a function of specifying, when performing the clustering process, whether or not the keywords used for the clustering process include the feature keywords contained in a classification rule already obtained by machine learning.

If the feature keywords of the classification rules already obtained by learning are not included in the keywords used for the clustering process, then the clustering is performed with the feature keywords of the existing classification rules excluded, both when a category in which many documents are concentrated as a result of classification by machine learning is subdivided by the clustering process and when new categories are created by the clustering process for a group of documents that could not be classified anywhere. This makes it easy to create new categories that differ from the existing ones.
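The effect of this option can be illustrated with a minimal sketch. The function name `clustering_keywords` and its parameters are hypothetical; the sketch only shows the filtering of candidate keywords before clustering.

```python
from typing import Iterable, List


def clustering_keywords(candidates: Iterable[str],
                        learned_keywords: Iterable[str],
                        include_learned: bool) -> List[str]:
    """Return the keywords to cluster on, optionally dropping already learned ones."""
    if include_learned:
        return list(candidates)
    learned = set(learned_keywords)
    # Keep only keywords that do not already drive an existing classification rule.
    return [keyword for keyword in candidates if keyword not in learned]
```

With `include_learned=False`, clusters can only form around keywords the existing categories do not use, which is what pushes the result toward new categories.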

One embodiment of the present invention is a program for realizing, by a computer, a process of automatically classifying documents stored in a storage device of the computer, the program causing the computer to realize: a function of performing a clustering process on a group of documents; a function of classifying documents by machine learning; and a function of selecting and specifying the target document group when performing the clustering process. The target document group can be one of three types: unclassified documents, documents belonging to a specific category as classified by machine learning, and documents that could not be classified by machine learning.

As a result, it is possible to select and specify only a category in which a large number of documents are concentrated as a result of classification by machine learning and reclassify its documents by the clustering process, or to select and specify the documents that could not be classified by machine learning and reclassify them by the clustering process. In this way, documents that could not be classified sufficiently by machine learning can be handled by the clustering process, and labor is saved significantly compared with reclassifying them manually.
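The three kinds of target document group can be sketched as a simple selection function. The data layout is an assumption of this example: `results` maps a document ID to its category, with `None` meaning the classifier ran but could not place the document, and documents absent from `results` have not been classified yet. The names `Target` and `select_target_documents` are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Optional


class Target(Enum):
    UNCLASSIFIED = "unclassified"          # not yet run through the classifier
    IN_CATEGORY = "in_category"            # classified into one specific category
    NOT_CLASSIFIABLE = "not_classifiable"  # classifier ran, no category fitted


def select_target_documents(target: Target,
                            all_docs: List[str],
                            results: Dict[str, Optional[str]],
                            category: Optional[str] = None) -> List[str]:
    """Pick the documents that the clustering process should work on."""
    if target is Target.UNCLASSIFIED:
        return [d for d in all_docs if d not in results]
    if target is Target.NOT_CLASSIFIABLE:
        return [d for d in all_docs if d in results and results[d] is None]
    if category is None:
        raise ValueError("a category must be given for Target.IN_CATEGORY")
    return [d for d in all_docs if results.get(d) == category]
```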

Further, one embodiment of the present invention is a program for realizing, by a computer, a process of automatically classifying documents stored in a storage device of the computer, the program causing the computer to realize: a function of performing a clustering process on a group of documents; a function of classifying documents by machine learning; a function of displaying, for a group corresponding to keywords obtained by the clustering process, documents closely related to the keywords of the group; and a function of registering the documents closely related to the keywords as correct example documents of a machine learning category.

As a result, many high-quality correct examples can be provided for the machine learning categories, which makes it easy to improve the accuracy of learning and classification in the machine learning process.

BRIEF DESCRIPTION OF THE FIGURES

The present invention will be more clearly understood from the following detailed description when read in conjunction with the accompanying drawings.

FIG. 1 is a diagram showing a system configuration of the present invention.

FIG. 2 is a diagram showing a hardware configuration of a computer constituting the system of the present invention.

FIG. 3 is a diagram illustrating an example of a machine learning operation screen when a category is created.

FIG. 4 is a diagram illustrating an example of a machine learning operation screen when a correct example document is registered.

FIG. 5 is a diagram illustrating an example of a machine learning operation screen when performing learning and classification processing.

FIG. 6 is a diagram illustrating an example of a machine learning operation screen when the classification result is confirmed and the correct answer example is corrected.

FIG. 7 is a diagram showing an example of a machine learning operation screen when confirming the category statistical information and characteristic keywords.

FIG. 8 is a diagram showing an example of a machine learning operation screen when an unnecessary word and a part of speech to be extracted are set.

FIG. 9 is a diagram illustrating a method of calling the clustering operation screen from the machine learning operation screen.

FIG. 10 is a diagram showing an example of the clustering operation screen.

FIG. 11 is a diagram illustrating an example of a clustering operation screen when a clustering result is displayed.

FIG. 12 is a diagram illustrating a clustering operation screen when a list of documents highly relevant to the cluster keyword is displayed.

FIG. 13 is a diagram showing a flow of learning in the machine learning / classification processing unit.

FIG. 14 is a diagram illustrating a flow in the case where the machine learning / classification processing unit performs a classification process.

FIG. 15 is a diagram showing the flow of the clustering processing unit.

FIG. 16 is a diagram for explaining the details of the “document analysis and feature keyword selection processing” in the clustering processing unit.

FIG. 17 is a diagram illustrating an example of a document to be classified.

FIG. 18 is a diagram showing an example of category definition data.

FIG. 19 is a diagram illustrating an example of the feature keywords for each category obtained by machine learning.

FIG. 20 is a diagram illustrating an example of data of classification rules generated as a result of learning.

FIG. 21 is a diagram illustrating a data example of the classification result.

FIG. 22 is a diagram showing an example of unnecessary word list data.

FIG. 23 is a diagram illustrating an example of the processing target document list.

FIG. 24 is a diagram illustrating an example of a computer-readable recording medium on which the control program is recorded.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a diagram showing a system configuration of the present invention.

This system consists of a machine learning operation screen 11, a clustering operation screen 13, a machine learning / classification processing unit 14, a clustering processing unit 17, and a repository 19 for correct example documents / classification target documents.

The machine learning operation screen 11 provides the user with a user interface for the operations needed to perform automatic classification by machine learning. Specifically, it provides a user interface for defining categories and registering the corresponding correct example documents. From this screen the user gives processing instructions to the machine learning / classification processing unit 14 (a), and can also call the clustering operation screen 13 (e). The category definitions and correct example document data defined on the machine learning operation screen 11 are stored in the database 12 for category definitions and correct example documents.

In response to a learning instruction from the machine learning operation screen 11, the machine learning / classification processing unit 14 reads the correct example documents from the storage 19 (g), detects the feature keywords of each category, performs learning, creates classification rules, and stores them in the database 15 for learning results. It also has a function of returning a list of the feature keywords included in the classification rules in response to an external request (f). In response to a classification instruction from the machine learning operation screen 11, it reads the documents to be classified from the storage 19 (g), classifies them according to the rules, and stores which category each document belongs to (or that it belongs to none) as a classification result in the database 16 for classification results. It also has a function of returning a list of the documents classified (or not classifiable) into each category in response to an external request (b, f).

The clustering operation screen 13 provides the user with a user interface for selecting which documents to target, specifying whether to include the feature keywords of the existing categories, and issuing instructions for the clustering process (c). It also has a function of displaying the result of the clustering process and passing it to the machine learning operation screen 11 as correct example documents (e).

In response to an instruction from the clustering operation screen 13, the clustering processing unit 17 communicates with the machine learning / classification processing unit 14 to acquire the target document group and the feature keywords of the existing categories (f), reads the target documents from the storage 19 (h), analyzes them to generate a clustering result, and stores it in the database 18 for clustering results. The generated clustering result is returned to the clustering operation screen 13 (d) or, on request, to the machine learning / classification processing unit 14 (f).

Although the system of the present invention shown in FIG. 1 is configured by a computer (information processing device), the entire system may be configured on a single computer, or on a plurality of computers connected via a network such as the Internet.

FIG. 2 shows the hardware configuration of a computer (information processing device) constituting the present invention. The computer shown in FIG. 2 includes a CPU 21, a RAM 22, a ROM 23, an HDD 24, an input unit 25, an output unit 26, and an external interface unit 27, which are interconnected via a bus 28 and can exchange data with one another under the control of the CPU 21.

The CPU (Central Processing Unit) 21 controls the operation of the entire computer; it controls the display of the machine learning operation screen 11 and the clustering operation screen 13 in Fig. 1, and functions as the machine learning / classification processing unit 14 and the clustering processing unit 17.

The RAM (Random Access Memory) 22 is used as a work memory when the CPU 21 executes various programs, and also as a main memory that serves as a temporary storage area for various data as needed.

The ROM (Read Only Memory) 23 is a memory in which a basic control program executed by the CPU 21 is stored in advance, and the CPU 21 executes the basic control program when the computer starts up. Basic control of the operation of the entire computer system is performed by the CPU 21.

The HDD (Hard Disk Drive) 24 functions as the databases that store the category definitions and correct example documents, the learning results (classification rules), the classification results, the clustering results, and the correct example documents and classification target documents. In the present invention, the part that stores these data is not limited to the HDD of a single computer; the data may be stored on the HDD of another computer connected via a network such as the Internet, or on the HDD of a web server connected via the network. The HDD 24 also stores the machine learning / classification processing program and the clustering processing program executed by the CPU 21.

The input unit 25 receives external input and passes its content to the CPU 21. The input unit 25 includes, for example, input devices such as a keyboard and a mouse that the user uses to instruct classification operations, and is configured to further include, as required, a reading device for portable recording media such as an FD (Flexible Disk), a CD-ROM (Compact Disc-ROM), a DVD-ROM (Digital Versatile Disc-ROM), or an MO (magneto-optical) disk.

The output unit 26 produces output according to instructions from the CPU 21; it is, for example, a display device such as a CRT (Cathode Ray Tube) or an LCD (Liquid Crystal Display) that displays various data, and includes a printer device as required.

The external I/F (interface) unit 27 manages communication when data is exchanged between computers. When the whole system shown in Fig. 1 is composed of multiple computers, it manages each computer's communication as data is exchanged between them.

As described above, the computer shown in FIG. 2 has a standard configuration as a computer.

Next, the system of the present invention shown in FIG. 1 will be described in detail in the following order.

First, 1) the user's operation method when classifying documents and how the classification results are obtained will be explained using examples of screens displayed on the display device of the computer facing the user, to clarify the gist of the present invention. That is, the machine learning operation screen 11 (1-1) and the clustering operation screen 13 (1-2) of Fig. 1 are used to explain what features this system has.

Next, 2) the details of classification and clustering processing by machine learning in this system will be described. That is, the processing contents of the machine learning / classification processing unit 14 (2-1) and the clustering processing unit 17 (2-2) of FIG. 1 will be described using a flowchart.

Finally, 3) the data required in the process of document classification by this system, the output data, the classification results, and so on will be shown using specific examples.

In 3) above, the explanation uses as classification targets data from published patent applications relating to "e-mail", with the "Title of the Invention" and "Abstract" of each publication saved as a single file on a server. The screen examples in 1) above are likewise taken from the system operating on the same classification targets.

1) Machine learning operation screen 11 and clustering operation screen 13:

1-1) Machine learning operation screen 11:

In classification by machine learning, the user creates an optimal automatic classification rule while repeating the work of defining the category system → learning processing → document classification processing → evaluation and feedback of the classification result several times. When it is determined that the optimal classification rule has been created, the procedure is to start automatic classification.

First, to define the category system, the user must create the categories into which documents will be classified. Figure 3 shows an example of a screen for creating a category. Pressing the "Create category" button 31 in the left frame displays the category creation screen on the right. In the right frame, the user enters the category ID, category name, and description, and presses the "Create" button 32 to create one category. In Fig. 3, the following category has been created: Category ID: 01, Category name: Forwarding / Address change, Description: E-mail forwarding, circulation, and technology for changing the address. In this way, two or more categories into which documents should be classified are created.

In addition, as part of defining the category system, the user must register documents as correct examples to be classified into the categories created earlier. Figure 4 shows an example of a screen when registering a correct example document. Here, the five defined categories are displayed in the left frame ("Forwarding / Address change", "Format conversion", "Efficiency", "Improved operability", "Security"), and a correct example is being registered in the "Forwarding / Address change" category. The user enters the URL of the correct example document in the correct example URL field at the bottom of the right frame and presses the register button to register it.

After the category definitions and the correct example documents have been registered, the user presses the "Save and start learning" button 51 in the upper frame shown in Fig. 5 to perform the learning process. The system analyzes the contents of the correct examples, extracts feature keywords for each category, and creates classification rules. This learning process is the same as in the conventional technology. When learning is complete, the user sets the location (collection destination URL) of the documents to be classified and presses the "Start document collection / classification" button 52 to perform the document classification process. The system determines the category of each collected document by referring to the classification rules, sorts the documents into the appropriate categories, and creates the classification results.

After the documents have been classified, the user must check whether they were classified into appropriate categories by the learned rules. Documents that were not classified into an appropriate category must be fed back by having the user newly register them as correct examples of the appropriate category. FIG. 6 shows an example of the machine learning operation screen when the classification result is confirmed and the correct examples are corrected. The URLs indicating the locations of the documents classified into the "Forwarding / Address change" category are displayed in a list as the classification result. The number "Confidence" displayed at the left of the result list is a decimal between 0 and 1: "1" means that the document certainly belongs to this category and "0" means that it does not belong to this category at all. A confidence of around 0.5 means that it is uncertain whether the document belongs to this category, and the user can use this value as a guide when checking that documents have not been misclassified. In Fig. 6, the document shown at the bottom of the classification result was classified incorrectly, so the user checks the selection check box in the column to the right of the confidence and selects, in the left frame, a category other than "Forwarding / Address change", that is, one of "Format conversion", "Efficiency", "Improved operability", and "Security". By repeating learning and classification in this way, the accuracy of document classification by machine learning can be increased.
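One way to use the confidence value as a review guide is sketched below. The band from 0.3 to 0.7 and the function name `needs_review` are assumptions made for the example; the publication only states that values around 0.5 are ambiguous.

```python
from typing import Dict, List


def needs_review(confidences: Dict[str, float],
                 low: float = 0.3,
                 high: float = 0.7) -> List[str]:
    """List document URLs whose confidence is ambiguous, most ambiguous first.

    The [low, high] band is a hypothetical choice, not a value from the source.
    """
    ambiguous = [url for url, c in confidences.items() if low <= c <= high]
    # Documents closest to 0.5 are the least certain, so they are listed first.
    return sorted(ambiguous, key=lambda url: abs(confidences[url] - 0.5))
```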

When starting to use the system, the user defines the category system roughly ("let's classify it more or less like this") and proceeds with machine learning, but the initially defined categories must then be reviewed, divided, and consolidated to build an optimal category system. For this purpose, the machine learning operation screen 11 of this system provides a function to display a screen for confirming category statistical information and feature keywords (Fig. 7) and a screen for setting the unnecessary words and the parts of speech to be extracted, which determine the feature keywords (Fig. 8).

Figure 7 shows an example of a screen for confirming category statistical information and feature keywords. Here, the user can check the number of documents classified into the "Forwarding / Address change" category, its occupancy rate among all documents, the maximum confidence, and the feature keywords of the category. In Figure 7 the "Evaluation" is "No problem". If, for example, documents are too concentrated in a certain category and its occupancy rate is too high, the category is evaluated as "Consider division". Conversely, if almost no documents are classified into it, it is evaluated as "Consider abolition / integration". In addition, if the maximum confidence is too low, that is, if there is no document that can be confidently judged to belong to this category, it is evaluated as "There are not enough correct examples, or there are many features similar to other categories". The user can divide and consolidate categories and build an optimal category system while referring to these evaluations. If any of the words in the list of feature keywords displayed at the bottom does not suit the criteria for the category, the user selects it and presses the "Settings" button 71 to remove it from the feature keywords. Pressing the "Unnecessary word list" button 72 switches to the setting screen for unnecessary words and parts of speech.
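The occupancy-based part of this evaluation could be sketched as follows. The thresholds (50% and 2%), the returned label strings, and the function name `evaluate_category` are hypothetical; the publication does not state the actual criteria.

```python
def evaluate_category(n_in_category: int,
                      n_total: int,
                      high: float = 0.5,
                      low: float = 0.02) -> str:
    """Evaluate a category from its share of all classified documents.

    `high` and `low` are illustrative thresholds, not values from the source.
    """
    occupancy = n_in_category / n_total if n_total else 0.0
    if occupancy > high:
        return "Consider division"            # too many documents concentrated here
    if occupancy < low:
        return "Consider abolition / integration"  # almost nothing classified here
    return "No problem"
```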

Fig. 8 shows an example of the setting screen for the unnecessary words and the parts of speech to be extracted. The unnecessary words that have been set are displayed on the left side of the right frame. Here, "system", "mail", "invention", "device", and "e-mail" are set as unnecessary words. A word can be deleted from the unnecessary word list by pressing the "Delete" button 81 at the bottom. On the right side of the right frame, the parts of speech to be extracted as feature keywords can be set. Here, common keywords, common nouns, personal nouns, proper nouns, place names, personal names, unregistered words, katakana unknown words, alphanumeric unknown words, and so on are set to be extracted as feature keywords.

1-2) Clustering operation screen 13:
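The combined effect of the unnecessary word list and the part-of-speech setting can be sketched as a filter over tagged words. The input format (word, part-of-speech pairs from a morphological analyzer), the tag names, and the function name `filter_candidates` are assumptions of this example.

```python
from typing import Iterable, List, Tuple


def filter_candidates(tagged_tokens: Iterable[Tuple[str, str]],
                      stop_words: Iterable[str],
                      allowed_pos: Iterable[str]) -> List[str]:
    """Keep words whose part of speech is allowed and that are not unnecessary words."""
    stop = set(stop_words)
    allowed = set(allowed_pos)
    return [word for word, pos in tagged_tokens
            if pos in allowed and word not in stop]
```

For example, with "mail" in the unnecessary word list and only noun tags allowed, a verb such as "send" and the word "mail" are both dropped before feature keywords are chosen.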

In the system of the present invention, a category in which a large number of documents are concentrated as a result of classification by machine learning can be subdivided by the clustering process, and new categories can be created by the clustering process for a group of documents that could not be classified anywhere by machine learning. In addition, documents closely related to the keywords obtained by the clustering process can be registered as correct example documents of a machine learning category. For this purpose, as shown in Fig. 9, the clustering operation screen can be called from the machine learning operation screen, so that the user can easily use the clustering process in conjunction with machine learning. That is, when the "Clustering" button 91 of the machine learning operation screen shown in FIG. 9 is pressed, the clustering operation screen is displayed in a separate window.

Figure 10 shows an example of the called clustering operation screen. "Existing clustering results" are displayed at the top of the window (here, No. 2 to No. 4), but in the initial state, when the clustering operation screen is opened for the first time, nothing is displayed there. After a clustering process has been executed using the "New clustering" input form at the bottom, its process No. and content memo are listed under "Existing clustering results". Selecting one of the "Existing clustering results" and pressing the "Cluster display" button 101 moves to the clustering result display screen (Fig. 11).

In the "New clustering" form at the bottom of the window, the user selects the required items and presses the "Start clustering" button 102 to execute the clustering process under the specified conditions. This clustering operation screen is linked with the machine learning operation screen. If "Documents that could not be classified" is selected as the "processing target", documents that could not be classified by machine learning are targeted. If "Documents in the category selected below" is selected and a category currently defined in the classification by machine learning (and containing classification results) is chosen in the list box in the figure, the documents of that category are clustered. In other words, the clustering process can be performed on any of the following: unclassified documents (corresponding to the case where the evaluation subset documents are selected in Fig. 10), documents that could not be classified by machine learning, and documents belonging to a specific category as classified by machine learning. Checking or clearing the "Include already learned feature keywords" check box 103 specifies whether the feature keywords extracted by machine learning are included among the feature keywords of the clustering process. Furthermore, checking the "Enable the unnecessary word and extraction part-of-speech settings" check box 104 specifies whether the machine learning settings for unnecessary words and parts of speech to be extracted are also applied to the clustering process.

Fig. 11 shows an example of the clustering result display screen displayed when the "Cluster display" button 101 is pressed. The URL at the top indicates the location of the group of documents subjected to the clustering process. Here the number of target documents is 2,311 and the number of feature keywords is 50,000. If the user specifies the cluster size (the maximum number of keywords cut out as one cluster) and presses the "Cluster display" button 111, the clusters are cut out and displayed as a tree. The "keyword frequency index" input field sets a threshold that determines which keywords are regarded as important and displayed in a different color from the others. In Fig. 11, the threshold is specified as the keyword's TF·IDF value as a percentage of the number of documents to be processed. The clustering result is displayed at the bottom of FIG. 11. In this way, a group of documents that cannot be appropriately classified by machine learning can be classified by the clustering process. Pressing the "Related Document List" button 112 of a cluster moves to the screen listing the documents highly relevant to the cluster's keywords (Fig. 12).

Figure 12 shows an example of the screen that lists documents highly relevant to the cluster keywords. On this screen, the appearance probabilities of the keywords of the cluster displayed on the clustering result display screen are totaled for each document, and the documents with high scores are listed. For example, up to 20 documents are listed by URL and title (or a summary consisting of the first 10 characters of the document). In some cases, the documents displayed on this screen may not form a coherent group from a human point of view. In such a case, return to the clustering result display screen and change the cluster size or select another cluster to obtain an appropriate document group. If the documents are properly organized, select them individually with the check box to the left of each document, or press the “Select All” button 121 at the bottom to select all of them, and then press the “Register to Correct Answer” button 122. The machine learning operation screen is then displayed in the foreground, and clicking the destination category registers the selected documents as correct answer documents of that category.

The functions provided by this system have been described above by following the user operations on the machine learning operation screen 11 and the clustering operation screen 13. In this system, the clustering operation screen can be called from the machine learning operation screen, and the called clustering operation screen inherits the categories, classification rules, feature keywords, classification results, and so on from machine learning so that they can be used in the subsequent clustering process. Another feature is that the result of the called clustering process can be registered so that it is reflected in the classification process by machine learning.
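The ranking behind the related document list can be sketched as follows. This is only an illustrative sketch, not the implementation of the system: the function name `rank_related_documents`, the input layout (a mapping from URL to an already tokenized word list), and the definition of appearance probability as a keyword's count divided by the document's word count are all assumptions made for the example.

```python
from collections import Counter


def rank_related_documents(docs, cluster_keywords, limit=20):
    """Return up to `limit` URLs ordered by relevance to the cluster keywords.

    docs: mapping url -> list of words (hypothetical layout, already tokenized).
    """
    scored = []
    for url, words in docs.items():
        if not words:
            continue
        counts = Counter(words)
        # Assumed definition: appearance probability = keyword count / word count.
        score = sum(counts[k] / len(words) for k in cluster_keywords)
        if score > 0:
            scored.append((score, url))
    # Highest score first; ties are broken by URL so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored[:limit]]
```

The default limit of 20 mirrors the "up to 20 documents" mentioned in the text; documents that contain none of the cluster keywords are left out of the list in this sketch.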

Next, the details of the classification process by machine learning and of the clustering process in the system of the present invention will be described.

2) Details of processing of machine learning / classification processing unit 14 and clustering processing unit 17:

2-1) Machine learning / classification processing section 14:

The machine learning / classification processing unit 14 in the system of the present invention does not differ greatly from classification based on machine learning in the prior art.

FIG. 13 shows the flow of the learning process performed by the machine learning / classification processing unit 14. Learning starts once the category definitions and the correct example documents described with reference to Fig. 3 and Fig. 4 in 1) have been registered. In S131, the correct example documents registered in each category are subjected to morphological analysis, and only the words corresponding to the parts of speech specified for extraction are extracted. The part of speech to be extracted is set as shown in FIG. Next, in S132, the number of appearances of each word in each document and its total number of appearances in all documents are counted. In S133, the feature level of each word (the ratio of its probability of occurrence in the correct example documents of a given category to its probability of occurrence in all the correct example documents) is calculated. In S134, a fixed number of words are extracted as feature keywords in descending order of feature level (excluding those included in the unnecessary word setting (see Fig. 8)). Then, in S135, the degree (score) to which each feature keyword and its number of occurrences contribute to the determination of the category is calculated, a classification rule is created, and the classification rule is stored in the database 15, which is the storage location for classification rules. The score is calculated using the formula described in Non-Patent Document 1.
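Steps S132 to S134 can be illustrated with the short sketch below. It is a simplified sketch under stated assumptions, not the learning routine of the system: morphological analysis (S131) is assumed to have been done already, the probability of occurrence is assumed to be a word's count divided by the total word count, and the names `select_feature_keywords`, `correct_docs`, and `top_n` are hypothetical. The score calculation of S135, which follows Non-Patent Document 1, is not reproduced.

```python
from collections import Counter


def select_feature_keywords(correct_docs, top_n, unnecessary_words=frozenset()):
    """correct_docs: mapping category -> list of documents, each a list of words."""
    # S132: count appearances per category and over all correct example documents.
    per_category = {
        cat: Counter(w for doc in docs for w in doc)
        for cat, docs in correct_docs.items()
    }
    overall = Counter()
    for counts in per_category.values():
        overall.update(counts)
    overall_total = sum(overall.values())

    keywords = {}
    for cat, counts in per_category.items():
        cat_total = sum(counts.values())
        # S133: feature level = P(word | category) / P(word | all correct examples).
        levels = {
            w: (c / cat_total) / (overall[w] / overall_total)
            for w, c in counts.items()
            if w not in unnecessary_words
        }
        # S134: keep a fixed number of words in descending order of feature level.
        ranked = sorted(levels, key=lambda w: (-levels[w], w))
        keywords[cat] = ranked[:top_n]
    return keywords
```

A word that appears in one category only receives a high feature level, while a word spread evenly over all categories receives a level close to 1 and is unlikely to be selected.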

After learning as described above, the documents are classified. FIG. 14 shows the flow of the processing performed by the machine learning / classification processing unit 14 at the time of classification. First, in S140, one document is read from the set of documents to be classified (on a file server, a Web server, or the like). In S141, it is determined whether or not all the documents to be classified have been read. If all have been read (Y), every document to be classified has been classified, and the processing ends. If not (N), the process proceeds to S142, where the content of the document that was read is subjected to morphological analysis and the number of occurrences of each word is counted. In S143, one word is taken from the counted words. In S144, it is determined whether or not all the counted words have been processed. If all have been processed (Y), the process proceeds to S147. If not (N), it is determined in S145 whether or not the word exists in the classification rule. If the word does not exist in the classification rule (N), the process returns to S144. If the word exists in the classification rule (Y), the process proceeds to S146, where the score values of the feature keyword specified for that word in the classification rule are added up for each category, and the process returns to S143. When the process reaches S147, the score value of each category has been obtained for the document that was read. In S147, the category with the maximum score value is extracted, and the process proceeds to S148. In S148, it is determined whether or not the extracted score value is equal to or greater than the threshold value. If it is not (N), the process proceeds to S149, and the document is stored in the classification result database 16 as “could not be classified”. If it is (Y), the process proceeds to S150, and the document is classified into the category with the highest score value. After S149 or S150, the process returns to S140, a new document is read from the set of documents to be classified, and the same classification processing is performed.
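The per-document part of this flow (S143 to S150) can be sketched as below. The sketch makes simplifying assumptions that are not taken from the source: each keyword in the hypothetical `rules` mapping carries a single score per category (the P and N columns that depend on the number of appearances are ignored here), and `None` stands for the "could not be classified" result.

```python
from collections import Counter

UNCLASSIFIED = None  # stands for "could not be classified" (S149)


def classify(words, rules, threshold):
    """words: tokens of one document; rules: keyword -> {category: score}."""
    totals = Counter()
    for word in set(words):              # S143/S144: visit each counted word once
        scores = rules.get(word)         # S145: is the word in the classification rule?
        if scores is None:
            continue
        for category, score in scores.items():
            totals[category] += score    # S146: add up the scores per category
    if not totals:
        return UNCLASSIFIED
    best = max(totals, key=totals.get)   # S147: category with the maximum score
    # S148 to S150: accept the category only if its score reaches the threshold.
    return best if totals[best] >= threshold else UNCLASSIFIED
```

Calling `classify` once per document and storing the returned category corresponds to the outer loop of S140 and S141.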

2-2) Clustering processing section 17:

Next, the clustering processing unit 17 will be described. FIG. 15 shows the flow of the clustering processing unit 17. The process after S159 in FIG. 15 is the same as the conventional clustering process, but the rest is unique to the present invention.

First, in S151, it is determined whether or not the feature keywords of the existing categories in the classification by machine learning are to be included in the feature keywords of the clustering process. If they are not to be included (N), the feature keywords are obtained from the machine learning / classification processing unit 14 in S152, an unnecessary word list is created, and the process proceeds to S153. If they are to be included (Y), the process proceeds directly to S153. S153 to S157 are the steps that determine which documents are subjected to the clustering process and acquire the processing targets. In S153, it is determined whether or not the target of the clustering process is the document group in a specific category of the classification by machine learning. If it is (Y), the process proceeds to S154, in which a list of the documents belonging to the specific category is obtained from the machine learning / classification processing unit 14 and a processing target document list is created. If it is not (N), the process proceeds to S155, and it is determined whether or not the target is the group of documents that could not be classified by machine learning. If it is (Y), the process proceeds to S156, in which a list of the documents that could not be classified is obtained from the machine learning / classification processing unit 14 and a processing target document list is created. If it is not (N), the process proceeds to S157, in which a list of all the documents to be classified is obtained and a processing target document list is created. In S158, the documents are analyzed and the feature keywords are selected. Details will be described with reference to FIG. Then, in S159, the probability of occurrence of each feature keyword in each document is calculated; in S160, the correlation coefficients between the feature keywords are calculated; and in S161, clusters are created by combining keywords in descending order of correlation coefficient, after which the processing ends. This is the flow of the clustering process of the system of the present invention. S151 corresponds to determining whether or not the check box 103 in FIG. 10 is checked. S153, S155, and S157 correspond to the selection of the “processing target” in FIG. 10, that is, to determining which of the following is processed: a group of documents belonging to a specific category in machine learning, a group of documents that could not be classified by machine learning, or a group of unclassified documents. The feature of the present invention is that these processes are added to the conventional clustering process.
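The steps added in front of the conventional clustering process (S151 to S157) can be sketched as two small functions. This is an illustrative sketch only: the function names, the string values `"category"` and `"unclassifiable"`, and the `results` mapping (URL to category, with `None` meaning "could not be classified") are assumptions introduced for the example and do not come from the source.

```python
def build_unnecessary_words(include_learned, learned_keywords, user_unnecessary=()):
    """S151/S152: build the unnecessary word list handed to the clustering process."""
    words = set(user_unnecessary)
    if not include_learned:
        # Learned feature keywords are excluded from clustering by listing them here.
        words.update(learned_keywords)
    return words


def select_target_documents(target, results, all_documents, category=None):
    """S153 to S157: build the processing target document list.

    results: mapping url -> category, or None for "could not be classified".
    """
    if target == "category":              # S153 -> S154
        if category is None:
            raise ValueError("a category must be given for target='category'")
        return [u for u in all_documents if results.get(u) == category]
    if target == "unclassifiable":        # S155 -> S156
        return [u for u in all_documents if u in results and results[u] is None]
    return list(all_documents)            # S157: all documents to be classified
```

Keeping the selection separate from the clustering itself reflects the statement that S159 and later are the same as conventional clustering.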

With reference to FIG. 16, the processing of “analysis of document and selection of characteristic keywords” in S158 will be described.

In S162, it is determined whether or not all the documents in the processing target document list have been read. If not all the documents have been read (N), the process proceeds to S163, where one document is read and morphologically analyzed, and the total number of occurrences of each word and the number of documents in which it appears are counted. The process then returns from S163 to S162. If it is determined in S162 that all the documents have been read (Y), the process proceeds to S164. In S164, the TF·IDF value of each word is calculated and the words are sorted in descending order of that value. From S165 onward, the sorted words are read. In S165, it is determined whether or not the number of words adopted has reached the maximum number of keywords set on the clustering processing screen in FIG. If it has (Y), the process ends. If it has not (N), the process proceeds to S166 and one word is read in sorted order. In S167, it is determined whether or not all the morphologically analyzed words have been read. If so (Y), the processing ends. If not (N), the process proceeds to S168, and it is determined whether or not the word that was read exists in the unnecessary word list. If it does (Y), the process returns to S165. If it does not (N), the process proceeds to S169, the word is adopted as a feature keyword, and the process returns to S165. In this way, the documents are analyzed and the feature keywords are selected.
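The keyword selection of S162 to S169 can be sketched as follows. The source does not give the TF·IDF formula, so the sketch assumes one common form: the total number of occurrences of a word multiplied by the logarithm of the number of documents divided by the number of documents containing the word. The function name and its parameters are hypothetical, and documents are assumed to be already tokenized.

```python
import math
from collections import Counter


def select_cluster_keywords(docs, max_keywords, unnecessary_words=frozenset()):
    """docs: list of documents, each a list of words."""
    total_count = Counter()  # total occurrences of each word (S163)
    doc_count = Counter()    # number of documents containing each word (S163)
    for words in docs:
        total_count.update(words)
        doc_count.update(set(words))
    n_docs = len(docs)
    # S164: TF-IDF per word (assumed formula), sorted in descending order.
    tfidf = {w: total_count[w] * math.log(n_docs / doc_count[w]) for w in total_count}
    ranked = sorted(tfidf, key=lambda w: (-tfidf[w], w))
    selected = []
    for word in ranked:                      # S166/S167: read words in sorted order
        if len(selected) >= max_keywords:    # S165: maximum number reached
            break
        if word in unnecessary_words:        # S168: skip unnecessary words
            continue
        selected.append(word)                # S169: adopt as a feature keyword
    return selected
```

Passing the learned feature keywords in `unnecessary_words` is what lets the clustering process concentrate on words that the existing categories do not already use.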

The machine learning / classification processing unit 14 and the clustering processing unit 17 of the present system have been described above with reference to FIGS. The system of the present invention is particularly characterized in that the clustering processing unit 17 can use the classification results, the classification rules, and the like of the machine learning / classification processing unit 14, and it has been shown that the system has the functions needed for this.

Next, a case where documents are classified by the system of the present invention will be described using a specific example in correspondence with user operations.

3) Specific examples of document classification by this system:

FIG. 17 shows an example of a document used as a classification target in the following description.

As described above, the classification target is patent application publication data concerning “e-mail”, and the “title of the invention” and “abstract” of each application are saved on the server in HTML format, one file per application. First, a category system for machine learning is defined in order to classify the documents by machine learning. That is, the categories shown in FIG. 3 are created and the correct example documents shown in FIG. 4 are registered. The registered categories and correct example documents are stored in the database 12 shown in Fig. 1. The structure of the data stored in the database is shown in Fig. 18. In the category system definition, at least two categories are defined, and any number (one or more) of correct answer documents that should belong to each category are set. In the example of Fig. 18, five categories, 01 to 05, are defined, and the correct examples corresponding to each category are stored in the form of codes (here, URLs) that identify their locations. Note that the title (name of the invention) is shown to the right of each URL in Fig. 18, but this is added only to make the data easier for people to read and is not essential data.

After the categories have been defined and the correct example documents registered, machine learning is performed (pressing the learning start button 51 in Fig. 5 causes the machine learning / classification processing unit 14 to start the machine learning process shown in Fig. 13), and the learning result is stored in the database 15 of FIG. Figure 19 shows examples of the feature keywords extracted for each category by learning. FIG. 20 shows an example of the data configuration of the classification rules that are the learning results.

Figure 19 shows the feature keywords for each category and their corresponding score values. The score value is the logarithm (log) of the ratio of the probability that the keyword appears in the correct answer document in that category to the probability of appearing in the entire correct answer document, and indicates the weight as a feature.

FIG. 20 shows the data configuration of the classification rules that are the learning results. For each keyword, the degree to which a document containing the keyword belongs to each category is quantified and stored as a score for each category. The “P” column of the score for each category indicates the score when the keyword in the left column appears in the document at least the threshold number of times. The “N” column indicates the score when the keyword appears fewer than the threshold number of times. For example, if the keyword “input” appears at least once (the P columns in the part enclosed by 201 in the figure, from left to right), the score for category 001 is 0.815, the score for category 002 is 0.541, the score for category 003 is 1.07, the score for category 004 is -0.074, and the score for category 005 is -1.082. If the keyword “input” appears less than once (the N columns in the part enclosed by 201 in the figure, from left to right), the score for category 001 is 0.484, the score for category 002 is -0.183, the score for category 003 is 0.16, the score for category 004 is 0.072, and the score for category 005 is 0.135. Note that the score differs depending on how many times the word “input” appears in the document to be classified. FIG. 20 shows the keyword “input” for the case where the threshold of the number of appearances is one (the part enclosed by 201) and the case where it is two (the part enclosed by 202). For example, if the keyword “input” appears once in a document, the score for category 001 is 0.815 [P column of category 001 enclosed by 201] + (-0.487) [N column of category 001 enclosed by 202] = 0.328. If the keyword “input” appears twice, the score for category 001 is 0.815 [P column of category 001 enclosed by 201] + 0.945 [P column of category 001 enclosed by 202] = 1.760.
When classifying a document, it is determined whether or not each word in the document corresponds to a feature keyword in the classification rules. If a word corresponds to a feature keyword, the number of occurrences of the word is counted, the scores shown in FIG. 20 are added for each category, and the score value for each category is obtained.
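The worked example for the keyword “input” and category 001 can be reproduced with a few lines of code. The sketch below uses only the four values quoted in the text for FIG. 20; the tuple layout and the names `INPUT_RULES_CATEGORY_001` and `keyword_score` are assumptions made for illustration, and the P column is taken to apply when the keyword appears at least the threshold number of times, as the worked example implies.

```python
# Scores for the keyword "input", category 001, as quoted in the text for FIG. 20.
# Each entry: (appearance threshold, score in the P column, score in the N column).
INPUT_RULES_CATEGORY_001 = [
    (1, 0.815, 0.484),   # part enclosed by 201 (threshold of one appearance)
    (2, 0.945, -0.487),  # part enclosed by 202 (threshold of two appearances)
]


def keyword_score(count, rules):
    """Sum the P or N score of every threshold entry for a keyword seen `count` times."""
    # P applies when the keyword appears at least `threshold` times, N otherwise.
    return sum(p if count >= threshold else n for threshold, p, n in rules)


one_appearance = keyword_score(1, INPUT_RULES_CATEGORY_001)   # 0.815 + (-0.487)
two_appearances = keyword_score(2, INPUT_RULES_CATEGORY_001)  # 0.815 + 0.945
```

Up to floating-point rounding, the two results match the values 0.328 and 1.760 given in the text. Repeating the same sum for every feature keyword found in a document, category by category, gives the per-category score values used for classification.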

The category with the maximum value among the score values obtained in this manner is extracted, and it is determined whether or not that score value is equal to or greater than a set threshold value. If the score value is equal to or greater than the threshold value, the document is classified into that category. If it is less than the threshold value, the document is judged not to have been classified. In either case, the result is stored in the database 16, which is the storage location of the classification results shown in FIG. Figure 21 shows an example of the data structure of the classification result. Each classified document is recorded under its category by a code (here, a URL) that uniquely identifies its location, together with the certainty factor and the title. As shown in Figure 6, the certainty factor is a numerical indication of the certainty with which the document is classified into the category, and can be obtained from the obtained score value and the threshold value.

The example of the data structure in the learning and the classification in the machine learning of the system of the present invention has been described above. Next, an example of data passed to the clustering processing unit when the clustering processing is called from the processing by machine learning is shown.

First, Figure 22 shows an example of unnecessary word list data. This is an example of the list of words excluded from the clustering process, consisting of the feature keywords obtained by machine learning and the unnecessary words specified during learning, for the case where the “Include already learned feature keywords” check box 103 shown in Figure 10 is turned off and the “Enable the settings of unnecessary words and parts of speech to be extracted” check box 104 is turned on. In this way, the unnecessary word list is created by the machine learning / classification processing unit 14 according to the user's specification and passed to the clustering processing unit 17. Next, FIG. 23 shows an example of the data of the list of documents to be subjected to the clustering process. When the user specifies “Documents that could not be classified” or “Documents in the category selected below” as the “processing target” shown in Fig. 10, the document list of the corresponding category is extracted from the results of classification by machine learning and passed from the machine learning / classification processing unit 14 to the clustering processing unit 17. FIG. 23 shows the document list for “Documents that could not be classified”.

As described above, the required data is passed from the machine learning / classification processing unit 14 to the clustering processing unit 17 according to the user's designation, and the clustering process is performed. The result of the clustering process is stored in the database 18 for storing the clustering result shown in FIG. Since the clustering result has the same data structure as general clustering results, it is not specifically shown here. In addition, a list of related documents is stored in the database for each cluster. When the user selects documents from that list with the check boxes in the leftmost column of the screen example shown in Fig. 12 and presses the “Register to correct answer” button at the bottom of Fig. 12, the selected documents are passed, via the machine learning operation screen 11, to the database 12 in which the category definitions and correct example documents are registered.

The data structure and the data flow in the system of the present invention have been described above with reference to FIGS.

As described above, the system of the present invention shown in FIG. 1 has been described in detail in the order of 1) to 3), and the details of the system of the present invention have been clarified.

The system of the present invention has been described as being configured by a computer (information processing device). The present invention can also be implemented by recording, on a computer-readable recording medium, a control program that causes a computer to perform the various processes shown in the figures, and by having the computer read the control program from the recording medium and execute it.

Fig. 24 shows an example of a recording medium from which a computer can read the recorded control program. As shown in the figure, the recording medium may be, for example, a memory 242 such as a RAM, a ROM, or a hard disk device provided as an internal or external accessory device of the computer 241, or a portable recording medium such as a flexible disk, an MO (magneto-optical disk), a CD-ROM, or a DVD-ROM.

The recording medium may also be a storage device 246 provided in a computer functioning as a program server 245 that is connected to the computer 241 via a communication line 244. In this case, a transmission signal obtained by modulating a carrier wave with a data signal representing the control program is transmitted from the program server 245 through the communication line 244, which serves as the transmission medium. The computer 241 demodulates the received transmission signal and reproduces the control program, so that the control program can be executed.

In addition, the present invention is not limited to the above-described embodiment, and various modifications and changes can be made without departing from the gist of the present invention.

Industrial applicability

As described above in detail, according to the present invention, classification by machine learning and clustering processing are closely linked. As a result, when a category in which many documents have been concentrated by machine-learning classification is subdivided, or when a new category is created from documents that could not be classified into any category, much labor can be saved compared with creating new categories manually. It is also easy to specify that words absent from the feature keywords of the existing categories be used, so that new, distinctive categories can be created. In addition, the result of the clustering process can be reflected in machine learning. That is, documents with similar contents can easily be collected and registered as correct answer documents of a category, which makes it easy to prepare a large number of high-quality correct answer documents and to improve the accuracy of machine learning and classification.

Claims

The scope of the claims

1. A program for realizing, by a computer, a process of automatically classifying documents stored in a storage device of the computer,

A function of performing clustering processing on the document group;

A function of classifying the document by machine learning;

When performing the clustering process, a function of specifying whether or not to include, in the keywords used for the clustering process, a feature keyword in a classification rule already obtained by machine learning;

A computer-implemented automatic classification program for documents.

2. A program for realizing, by a computer, a process of automatically classifying documents stored in a storage device of the computer,

A function of performing clustering processing on the document group;

A function of classifying the document by machine learning;

A function of selecting and specifying a target document group when performing the clustering process;

A computer-implemented program for automatically classifying documents.

3. The program according to claim 2, wherein the target document group is:

Unclassified documents,

A group of documents belonging to a specific category classified by the machine learning,

A group of documents that cannot be classified by the classification based on the machine learning,

The program being characterized in that the target document group is one of the above three types.

4. A program for performing, by a computer, a process of automatically classifying documents stored in a storage device of the computer,

A function of performing clustering processing on the document group;

A function of classifying the document by machine learning;

For a group corresponding to a keyword obtained by performing the clustering process, a function of displaying a document closely related to the keyword in the group;

A function of registering a document closely related to the keyword as a correct example document in the machine learning category;

A computer-implemented automatic classification program for documents.

5. A method for automatically classifying documents stored in a storage device of a computer by a computer,

Clustering the group of documents;

Classifying the document by machine learning;

When performing the clustering process, a step of designating whether or not to include, in the keywords used for the clustering process, a feature keyword in a classification rule already obtained by machine learning;

A method for automatically classifying documents, comprising:

6. A method for automatically classifying documents stored in a storage device of a computer by a computer,

Clustering the group of documents;

Classifying the document by machine learning;

Selecting and specifying a target document group when performing the clustering process;

A method for automatically classifying documents, comprising:

7. A method for automatically classifying documents stored in a storage device of a computer by a computer,

Clustering the group of documents;

Classifying the documents by machine learning;

Displaying, for a group corresponding to the keyword obtained by the clustering process, a document closely related to a keyword in the group;

Registering the document closely related to the keyword as a correct example document of the machine learning category;

A method for automatically classifying documents, comprising:

8. A device for automatically classifying documents stored in a storage device of a computer,

Means for clustering the document group;

Means for classifying the document by machine learning;

Means for designating, when performing the clustering process, whether or not to include, in the keywords used for the clustering process, a feature keyword in a classification rule already obtained by machine learning;

An automatic document classification apparatus, comprising:

9. A device for automatically classifying documents stored in a storage device of a computer,

Means for clustering the document group;

Means for classifying the document by machine learning;

Means for selecting and specifying a target document group when performing the clustering process;

An automatic document classification apparatus, comprising:

10. A device for automatically classifying documents stored in a storage device of a computer,

Means for clustering the document group;

Means for classifying the document by machine learning;

Means for displaying, for a group corresponding to the keyword obtained by performing the clustering process, a document closely related to a keyword in the group;

Means for registering a document closely related to the keyword in a correct example document of the machine learning category;

An automatic document classification apparatus, comprising:

11. A computer-readable recording medium on which a program for realizing a process of automatically classifying documents stored in a storage device of a computer is recorded.

A function of performing clustering processing on the document group;

A function of classifying the document by machine learning;

When performing the clustering process, a function of specifying whether or not to include, in the keywords used for the clustering process, a feature keyword in a classification rule already obtained by machine learning,

A computer-readable recording medium that stores a program for causing a computer to realize the above.

12. A computer-readable recording medium on which a program for realizing a process of automatically classifying documents stored in a storage device of a computer is recorded.

A function of performing a clustering process on the documents in the document group;

A function of classifying the document by machine learning;

A function for selecting and specifying a target document group when performing the clustering process;

13. A computer-readable recording medium on which a program for realizing a process of automatically classifying documents stored in a storage device of a computer is recorded.

A function of performing clustering processing on the document group;

A function of classifying the document by machine learning;

For a group corresponding to a keyword obtained by performing the clustering process, a function of displaying a document closely related to a keyword in the group;