WO2014002212A1

WO2014002212A1 - Document linking method, document searching method, document linking apparatus, document searching apparatus, and program therefor

Info

Publication number: WO2014002212A1
Application number: PCT/JP2012/066348
Authority: WO
Inventors: 義行小林
Original assignee: 株式会社日立製作所
Priority date: 2012-06-27
Filing date: 2012-06-27
Publication date: 2014-01-03
Also published as: JP5894273B2; JPWO2014002212A1

Abstract

Provided is a work procedure document searching method which can retrieve appropriate work procedure documents by just a query about a work procedure and does not require input of a detailed work procedure. A document linking method comprises a step of inputting a plurality of document files, a step of extracting descriptions about the work procedures from the document files, a step of evaluating similarities among the work procedures, and a step of classifying the documents into a tree structure according to similarities among the work procedures. Also, a document searching method retrieves documents that were linked by this method according to the similarities among the work procedures.

Description

Document association method, document retrieval method, document association apparatus, document retrieval apparatus, and program therefor

The present invention relates to a method and an apparatus for searching documents such as work procedure manuals used in businesses where work is performed according to predetermined procedures, such as the manufacture, maintenance, and inspection of products.

Many kinds of work must be carried out according to a fixed procedure. For such work, a work procedure manual describing how the work is to proceed is prepared in advance, and the worker is required to carry out the work in the appropriate order according to that manual. It must therefore be easy to find the work procedure manual suited to the work the worker is about to perform.

However, since the types of terms used to describe the work are limited, there is a problem that it is difficult to efficiently find an appropriate work procedure manual in a general keyword search.

Patent Document 1 discloses an invention that supports efficient search of work procedure manuals. That invention is intended to help a maintenance worker for networks of computers and the like find a work procedure manual that can serve as a model when creating a new one. Work procedure manuals are searched based on the similarity of their work procedures, and the similarity between the network to be constructed and the network described in each manual is also evaluated, so that manuals can be searched efficiently. A sequence matching technique can be used to calculate the similarity of work procedures; Non-Patent Document 1 describes such a technique.

Patent Document 1: JP 2009-181170 A

The invention of Patent Document 1 combines two methods: searching for documents by the similarity of work procedures, and narrowing down the search results by the similarity of the network to be constructed. The latter method cannot be applied to work that does not involve network construction. The former method can be applied to searching various work procedure manuals, but it assumes that the worker can input a sufficiently detailed work procedure. Inputting a detailed work procedure is more burdensome than inputting a keyword, so the worker can be expected to search with a simple work procedure instead. In that case, merely searching for manuals with similar work procedures returns results that are similar from various viewpoints, which makes it difficult to select the appropriate manual. Patent Document 1 assumes that the search uses the similarity to the work procedure written in the manual being created. A search with a simple work procedure, by contrast, can be regarded as an attempt to obtain a description of a detailed work procedure from a simple one.

An object of the present invention is to provide a document search method and apparatus that can find an appropriate document using only a work procedure as the query, without requiring a detailed work procedure to be input as that query.

In order to solve the above problems, the present invention adopts the configuration described in the claims.

An example of the document association method of the present invention includes a step of inputting a plurality of document files, a step of extracting a description of a work procedure from each of the document files, a step of evaluating the similarity of the work procedures, and a step of classifying the documents into a tree structure according to the similarity of the work procedures.

The document search method of the present invention includes a step of inputting a search query in the form of a work procedure against the classification database created using the document association method, and a step of searching for a document by evaluating the similarity between the search query and the work procedure of each classification in the classification database.

An example of the document association apparatus of the present invention includes a document file input unit that inputs a plurality of document files, a work procedure extraction unit that extracts a description of a work procedure from each of the document files, a work procedure similarity evaluation unit that evaluates the similarity of the work procedures, and a document classification unit that classifies the documents into a tree structure according to the similarity of the work procedures.

The document search apparatus of the present invention includes a classification database created using the document association apparatus, a search query reception unit that receives input of a search query in the form of a work procedure, and a document search unit that searches for a document by evaluating the similarity between the search query and the work procedure of each classification in the classification database.

An example of the program of the present invention is a program for causing a computer to function as a document association apparatus that includes a document file input unit that inputs a plurality of document files, a work procedure extraction unit that extracts a description of a work procedure from each of the document files, a work procedure similarity evaluation unit that evaluates the similarity of the work procedures, and a document classification unit that classifies the documents into a tree structure according to the similarity of the work procedures.

According to the present invention, documents are searched after being classified according to the similarity of their work procedures. It is therefore only necessary to compare documents classification by classification and select an appropriate one, which reduces the effort of selecting a document.

FIG. 1 is a block configuration diagram of a document association apparatus and a document search apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram of the system of the embodiment.
FIG. 3 is a diagram showing the processing flow of the embodiment.
FIG. 4 is a diagram showing the data structure of a work procedure and a file identifier.
FIG. 5 is a diagram showing the flow of the work procedure manual classification process.
FIG. 6 is a diagram showing an example of the integration of work procedures.
FIG. 7 is a diagram showing the flow of the work procedure integration process.
FIG. 8 is a diagram showing a schematic example of a work procedure manual classification result.
FIG. 9 is a diagram showing a schematic example of the similarity of work procedures.
FIG. 10 is an example of the correspondence of character strings.
FIG. 11 is a table for character string similarity evaluation.
FIG. 12 is an example of the result of associating character strings.
FIG. 13 is a table of the work procedure manual database.
FIG. 14 is a table of the work procedure manual classification database.
FIG. 15 is a diagram showing the flow of the work procedure manual search process.

Embodiments of the present invention will be described with reference to a block diagram representing a system function and a specific system diagram.

FIG. 1 shows the block configuration of the embodiment of the present invention in terms of system functions. The configuration for classifying and searching work procedure manuals includes a work procedure manual file input unit 101, a work procedure manual file reading unit 102, a work procedure extraction unit 103, a work procedure manual classification unit 104, a work procedure similarity evaluation unit 105, a work procedure manual database 106, a work procedure manual classification database 107, a search query receiving unit 108, a work procedure manual search unit 109, and a work procedure manual file output unit 110.

Further, as shown in FIG. 2, this system operates on a computer with the following specific configuration: a central processing unit 201 that processes information by the stored-program method, a main storage device 202 including random access memory, an external storage device 203 for storing the documents to be processed and dictionaries of processing results, an input device 204 for inputting documents and the like, and an output device 205 for outputting information processing results such as a created dictionary. The central processing unit 201 may be connected to another information processing apparatus 207 via the network 206. The external storage device 203 includes a database 2031 and a dictionary 2032. The input device 204 includes a CD-ROM reader 2041, a DVD reader 2042, a keyboard 2043, and the like. The output device 205 includes a CD-ROM writing device 2051, a DVD writing device 2052, a display 2053, and the like. The system shown in FIG. 1 can be realized by reading the program into the main storage device 202 via the input device 204 or the network 206 and running it on the central processing unit 201.

Hereinafter, each configuration of FIG. 1 will be described in detail.

The work procedure manual file input unit 101 receives a document file input from the outside of the system in the form of a storage medium such as a DVD or a CD-ROM by the input device 204 and stores it in the external storage device 203. On the external storage device 203, a work procedure manual database 106 and a work procedure manual classification database 107 are constructed. It is assumed that the work procedure manual file is an electronic document created by a word processor or the like, and its contents are character-coded. The character code is not particularly limited. A unique symbol for identification is given to the work procedure manual file. Hereinafter, a unique symbol for identification is called a file identifier.

The work procedure manual file reading unit 102 reads the work procedure manual stored in the work procedure manual database 106 constructed on the external storage device 203 into the main storage device 202.

The work procedure extraction unit 103 extracts the contents of the work procedure from the work procedure manual file. The extracted work procedure is stored in the main storage device 202 in association with the file identifier. It is assumed here that the work procedure is explicitly marked, for example by tags in a structured document such as an XML file. Since a work procedure manual is a document whose purpose is to explain how to proceed with the work, this assumption is reasonable. The following description assumes that the work procedure is extracted using such explicit information. When no explicit information is available, a method such as the one described in Non-Patent Document 2 could be used.

The work procedure extracted from the work procedure manual is stored as a set of file identifier and work procedure as shown in FIG. 4 and used in the subsequent processing.
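As a rough illustration of extraction from a tagged document, the following Python sketch reads a manual and returns the pair of file identifier and work procedure. The markup is an assumption made for this example only: the element names `manual`, `procedure`, and `step` and the `id` attribute are hypothetical and do not come from the specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical manual markup; the tag and attribute names are assumptions.
SAMPLE_MANUAL = """
<manual id="WP-001">
  <title>Filter replacement</title>
  <procedure>
    <step>stop pump</step>
    <step>open cover</step>
    <step>replace filter</step>
  </procedure>
</manual>
"""


def extract_work_procedure(xml_text):
    """Return (file_identifier, [work names]) for one tagged manual."""
    root = ET.fromstring(xml_text)
    file_identifier = root.get("id")
    # Each <step> element is treated as one work, kept in document order.
    works = [step.text.strip() for step in root.iter("step")]
    return file_identifier, works


record = extract_work_procedure(SAMPLE_MANUAL)
print(record)
```

The returned pair mirrors the "file identifier and work procedure" set that the later classification steps consume.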

The work procedure manual classification unit 104 classifies the work procedure manuals registered in the work procedure manual database 106 according to the similarity of the work procedures. At this time, the process proceeds while evaluating the similarity of the work procedure extracted using the work procedure similarity evaluation unit 105. The result of classifying the work procedure manual is stored in the work procedure manual classification database 107.

The flow of the work procedure manual classification process is shown in FIG. 5. Here, classification is performed by hierarchical clustering. The classification algorithm is not limited to hierarchical clustering as long as classification is performed using similarity.
First, in S501, each work procedure manual is placed in its own classification. That is, as many classifications are created as there are work procedure manuals, and a different work procedure manual is put into each classification.
Next, in S502, the work procedure similarity evaluation unit 105 calculates the similarity between the classifications. In the first pass each classification contains only one work procedure, so the similarity is calculated using that work procedure. In the second and subsequent passes of the loop, the work procedures included in a classification have been integrated into one work procedure, and the similarity is calculated using this integrated work procedure.
After the similarities are calculated, the two classifications with the maximum similarity are selected and integrated in S503.
Further, in S504, the two work procedures included in the integrated classification are integrated into one work procedure. The works of the two work procedures have already been associated with each other during the calculation in the work procedure similarity evaluation unit 105, and this result is used to integrate the work procedures.

FIG. 6 shows an example of integration, and FIG. 7 shows the flow of the integration process.

First, in S701, one of the two work procedures is taken as A and the other as B. In S702, one work is read from the head of work procedure B. In S703, the result of associating the read work with the works of A is checked. The subsequent processing depends on this result (S704). If the corresponding works are the same (work 1 and work 5 in the example), nothing is done and the next work is checked. If a blank has been inserted in B opposite a work of A, nothing is done and the next work is processed. If no work of A corresponds to the work of B, the work of B is inserted at that position in A (work 4 in the example) (S705). If the corresponding works differ, the work of B is compared with the work of A and inserted at that position so that the two are arranged in dictionary order (work 3 and work 6 in the example) (S706).
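The four cases of the integration process can be sketched in Python as follows. This is a minimal sketch that assumes the association result is already available as a list of pairs `(work of A, work of B)`, with `None` standing for a blank. The function name `integrate_aligned` and the sample alignment are hypothetical and are not taken from FIG. 6.

```python
def integrate_aligned(alignment):
    """Merge two aligned work procedures into one integrated procedure.

    alignment: list of (a, b) pairs; None marks a blank on that side.
    """
    merged = []
    for a, b in alignment:
        if a is not None and b is not None:
            if a == b:
                merged.append(a)                # same work: keep one copy
            else:
                merged.extend(sorted([a, b]))   # different works: dictionary order (S706)
        elif a is not None:
            merged.append(a)                    # blank in B: the work of A stays as it is
        else:
            merged.append(b)                    # no work in A: insert the work of B (S705)
    return merged


# Hypothetical alignment covering all four cases.
alignment = [
    ("work 1", "work 1"),
    ("work 2", None),
    ("work 3", "work 6"),
    (None, "work 4"),
    ("work 5", "work 5"),
]
integrated = integrate_aligned(alignment)
print(integrated)
```

Keeping both works of a differing pair means the integrated procedure never loses a work that appears in either input, which is what lets it stand in for the whole classification in later similarity calculations.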

In FIG. 5, finally, the number of classifications is checked in S505. If it is 1, the process is terminated. If it is greater than 1, the above process is repeated.
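The loop S501 to S505 can be sketched as a small agglomerative clustering routine. The sketch below is not the implementation of the embodiment: it substitutes Python's standard `difflib.SequenceMatcher` for the DP matching of the work procedure similarity evaluation unit 105, and the names `similarity`, `merge_procedures`, and `classify`, as well as the sample manuals, are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(p, q):
    """Matched works divided by the length of the longer procedure."""
    if not p or not q:
        return 0.0
    matcher = SequenceMatcher(None, p, q, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(p), len(q))


def merge_procedures(p, q):
    """Integrate two procedures into one representative procedure."""
    merged = []
    matcher = SequenceMatcher(None, p, q, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" or tag == "delete":
            merged.extend(p[i1:i2])                      # shared works, or works only in p
        elif tag == "insert":
            merged.extend(q[j1:j2])                      # works only in q
        else:
            merged.extend(sorted(p[i1:i2] + q[j1:j2]))   # differing works, dictionary order
    return merged


def classify(procedures):
    """Build a classification tree from {file_identifier: [work names]}."""
    # S501: one classification per manual.
    clusters = [{"files": [fid], "procedure": proc, "children": []}
                for fid, proc in procedures.items()]
    while len(clusters) > 1:                             # S505: stop when one remains
        # S502 and S503: pick the most similar pair of classifications.
        _, i, j = max(
            ((similarity(a["procedure"], b["procedure"]), i, j)
             for i, a in enumerate(clusters)
             for j, b in enumerate(clusters) if i < j),
            key=lambda item: item[0],
        )
        a, b = clusters[i], clusters[j]
        # S504: integrate the two procedures into one.
        parent = {"files": a["files"] + b["files"],
                  "procedure": merge_procedures(a["procedure"], b["procedure"]),
                  "children": [a, b]}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [parent]
    return clusters[0]


root = classify({
    "A": ["stop pump", "open cover", "replace filter", "close cover"],
    "B": ["stop pump", "open cover", "replace filter", "start pump"],
    "C": ["check log", "restart server"],
})
print(root["files"])
```

In this example manuals A and B share three of four works, so they are merged first, and C joins only at the root.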

By the processing of the work procedure manual classification unit 104, a tree structure classification such as the schematic example shown in FIG. 8 is obtained. The classification structure is as follows. First, the classification that includes all work procedures is at the top. This classification is divided into a classification including work procedure manuals A to G and a classification including work procedure manuals H to L. The former classification is further divided into a classification including work procedure manuals A to C and a classification including work procedure manuals D to G. Furthermore, the classification including work procedure manuals A to C is divided into a classification including work procedure manuals A and B and a classification including work procedure manual C. The structure of other classifications is the same.

The work procedure similarity evaluation unit 105 evaluates the similarity of work procedures. The order in which works are arranged in each work procedure is compared by a sequence matching method, and the evaluation is based on the proportion of works that correspond between the work procedures.

The method for evaluating the similarity of work procedures will be described using the schematic example of FIG. 9.

Work n (n is a natural number) represents the name of each work, and work names arranged vertically represent a work procedure. For example, the left side of (a) represents a work procedure in which work 1, work 2, work 3, work 4, and work 5 are performed in that order. Works that correspond to each other are placed side by side horizontally, and a pair of corresponding works that are the same is joined by a straight line.
The left and right sides of (a) are the same work procedure. In this case every work has a corresponding work, and every corresponding pair consists of the same work. The similarity of the work procedures therefore takes its maximum value, 1.
The left and right sides of (b) are sequences composed of the same works, but in a different order. In this case only one corresponding pair consists of the same work, against a work procedure length of 5, so the similarity is 1/5 = 0.2.
The left and right sides of (c) are sequences composed of almost the same works. In this case four corresponding pairs consist of the same work, against a work procedure length of 5, so the similarity is 4/5 = 0.8.
The left and right sides of (d) are sequences composed of almost the same works, but the sequences differ in length. In this case the correspondence is made by inserting a blank into the shorter sequence. Four corresponding pairs consist of the same work. The similarity is calculated against the length of the longer work procedure, giving 4/5 = 0.8.
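The four cases (a) to (d) can be reproduced with a short Python sketch. It assumes that the alignment maximizes the number of identical corresponding pairs, which is computed here as a longest-common-subsequence table. The scoring actually used by the embodiment may differ, and the concrete sequences below (in particular the reversed order used for case (b)) are hypothetical stand-ins for FIG. 9.

```python
def matched_pairs(p, q):
    """Largest number of identical works that can be paired while keeping order."""
    table = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            if p[i - 1] == q[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(p)][len(q)]


def procedure_similarity(p, q):
    """Identical pairs divided by the length of the longer procedure."""
    longer = max(len(p), len(q))
    return matched_pairs(p, q) / longer if longer else 0.0


base = ["work 1", "work 2", "work 3", "work 4", "work 5"]
case_a = procedure_similarity(base, list(base))                    # identical procedures
case_b = procedure_similarity(base, list(reversed(base)))          # same works, other order
case_c = procedure_similarity(
    base, ["work 1", "work 2", "work 6", "work 4", "work 5"])      # one work differs
case_d = procedure_similarity(
    base, ["work 1", "work 2", "work 4", "work 5"])                # one work missing
print(case_a, case_b, case_c, case_d)
```

Dividing by the longer length is what keeps case (d) at 0.8 rather than 1.0, even though every work of the shorter procedure finds a partner.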

The similarity of the work procedure shown in FIG.

The correspondence of work procedures is the same as that of character strings. Therefore, the degree of correspondence of the work procedure is calculated using a method for determining the similarity of character strings using DP matching. Since this calculation method is disclosed in many books such as Non-Patent Document 3, it will not be described in detail here.

Here, the calculation method will be briefly described using the correspondence of the character strings shown in FIG. 10 as an example. For simplicity, two character strings, “a document creation method using a document part” and “a document creation method that reuses a document” are collated. When three or more character strings are collated, all combinations may be calculated.

The cost is a numerical value indicating how different the two character strings are. The calculation is performed using the number of operations necessary to transform one character string into the other character string. As operations, insertion, deletion, and replacement of characters are considered. A cost is assigned to each operation, and the costs are totaled for the necessary operations. Here, -2 points are given when a character is inserted, deleted, or replaced, and 2 points are given when they match.

In DP matching, as shown in FIG. 11, the two character strings to be compared are assigned to the rows and the columns of a two-dimensional table, and the scores are calculated cell by cell in that table. In the example of FIG. 11, “a document creation method using a document part” is assigned to the rows, and “a document creation method that reuses a document” is assigned to the columns. The position of a cell in the table is expressed by its row and column: the cell in the nth row and the mth column is written (n, m). Both rows and columns are numbered from 1. The score S(n, m) of the cell (n, m) is calculated by Equation 1. At this time, the cell that was used to calculate the score is stored.

For example, for S(12, 13), the twelfth character of the row is “saku” and the thirteenth character of the column is also “saku”. The first term is therefore 14, while the second and third terms are both 10, so the largest value, the first term, is selected and the score becomes 14. At this time, the fact that the score of cell (12, 13) was calculated from the value of cell (11, 12) is stored. However, when the score becomes 0, all of the stored contents are deleted.

The score of each cell in the score table represents the similarity of the character strings corresponding to the cells traced up to the point where that score was calculated. Once the scores have been calculated, the correspondence between the character strings can be obtained by following the stored cells in reverse order. By following the correspondences in order from the cell with the highest score, the correspondences can be obtained in descending order of similarity. At this time, a cell that has been traced once is not traced again, which prevents a portion containing the same character string from being extracted many times. In FIG. 11, the value of (15, 16) is 20, which is the maximum. Tracing back from this cell yields the correspondence. The result of associating the character strings is shown in FIG. 12.
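The table calculation and the traceback can be sketched in Python. Because Equation 1 is not reproduced in this text, the recurrence below is an assumption: each cell takes the largest of the diagonal score plus 2 for a match or minus 2 for a replacement, the score from above minus 2, and the score from the left minus 2, and any non-positive result resets the cell to 0 with no stored origin. The function name `dp_match` and the sample strings are hypothetical and are not the strings of FIG. 11.

```python
MATCH = 2      # points when the two characters match
PENALTY = -2   # points for an insertion, deletion, or replacement


def dp_match(row_text, col_text):
    """Return (best score, aligned character pairs) for the best-scoring region."""
    n, m = len(row_text), len(col_text)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    origin = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, None
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = MATCH if row_text[i - 1] == col_text[j - 1] else PENALTY
            candidates = [
                (score[i - 1][j - 1] + step, (i - 1, j - 1)),   # match or replacement
                (score[i - 1][j] + PENALTY, (i - 1, j)),        # blank on the column side
                (score[i][j - 1] + PENALTY, (i, j - 1)),        # blank on the row side
            ]
            value, source = max(candidates, key=lambda c: c[0])
            if value > 0:
                score[i][j], origin[i][j] = value, source
            # Otherwise the cell stays at 0 and nothing is stored for it.
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    # Follow the stored cells in reverse order from the highest-scoring cell.
    pairs = []
    cell = best_cell
    while cell is not None and score[cell[0]][cell[1]] > 0:
        i, j = cell
        prev_i, prev_j = origin[i][j]
        row_char = row_text[i - 1] if prev_i == i - 1 else None
        col_char = col_text[j - 1] if prev_j == j - 1 else None
        pairs.append((row_char, col_char))
        cell = (prev_i, prev_j)
    pairs.reverse()
    return best, pairs


best_score, aligned = dp_match("abcde", "xbcdy")
print(best_score, aligned)
```

With these assumed scores, a diagonal predecessor of 12 and a matching character give 14, while neighbors of 12 above and to the left give 10 each, which agrees with the S(12, 13) example in the text.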

The correspondence of work procedures using DP matching is calculated by applying Equation 2 recursively. The degree of correspondence m(S(1, n), T(1, m)) is calculated between work procedure S(1, n) = (s1, s2, …, sn) of length n and work procedure T(1, m) = (t1, t2, …, tm) of length m, and the work procedures are associated by outputting the correspondence with the largest degree of correspondence.

The character string similarity determination method using DP matching is also used to determine the similarity of work names. The character string similarity SM between the work name K(1, n) = (k1, k2, …, kn) and the work name L(1, m) = (l1, l2, …, lm) is calculated by Equation 3.

In this embodiment, whether work names are the same is determined using Equation 3, but a dictionary can also be used. With a dictionary, synonyms with completely different notations, such as “collation” and “matching”, can be determined to be the same.
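The dictionary-based comparison can be sketched as a lookup that maps each work name to a canonical form before comparing. The dictionary contents, the choice of “collation” as the canonical form, and the function names are assumptions made for this illustration.

```python
# Hypothetical synonym dictionary: each variant maps to one canonical work name.
SYNONYMS = {"matching": "collation"}


def canonical(work_name):
    """Normalize a work name and replace it by its canonical synonym, if any."""
    key = work_name.strip().lower()
    return SYNONYMS.get(key, key)


def same_work(a, b):
    """True when two work names denote the same work according to the dictionary."""
    return canonical(a) == canonical(b)


is_same = same_work("collation", "matching")
print(is_same)
```

A lookup of this kind could replace the equality test on work names inside the sequence matching, so that synonymous works count as identical pairs.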

The work procedure manual database 106 stores all work procedure manual files. This database can be constructed on the external storage device 203 using a program such as a relational database, an XML database, or a file server. In this embodiment, it is assumed that the table is constructed on the relational database. As shown in FIG. 13, the file ID, file identifier, and character string in the file used for management inside the database are stored in association with each other.

The work procedure manual classification database 107 stores the result of classifying the work procedure manuals in association with the file identifiers and the work procedures extracted from the work procedure manuals. This database can be constructed on the external storage device 203 using a program such as a relational database or an XML database. In this embodiment, it is assumed that the tables are constructed on a relational database. As shown in FIG. 14, the database consists of two tables. Table (a) stores the work procedure, the classification ID, and the parent classification ID in association with each other. The parent classification ID is the classification ID of the level one above in the classification hierarchy obtained by hierarchical clustering. The work procedure is the integrated work procedure obtained when two classifications are integrated during the calculation by the work procedure similarity evaluation unit 105. Table (b) stores the correspondence between classification IDs and file IDs.
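One possible relational layout for the tables of FIG. 13 and FIG. 14 is sketched below with Python's standard `sqlite3` module. The table names, column names, and the choice to store a work procedure as `|`-separated text are all assumptions, since the specification names only the stored items.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE manual (                 -- work procedure manual database (FIG. 13)
    file_id         INTEGER PRIMARY KEY,
    file_identifier TEXT UNIQUE NOT NULL,
    body            TEXT NOT NULL
);
CREATE TABLE classification (         -- classification table (a) of FIG. 14
    classification_id        INTEGER PRIMARY KEY,
    parent_classification_id INTEGER REFERENCES classification(classification_id),
    work_procedure           TEXT NOT NULL
);
CREATE TABLE classification_file (    -- classification table (b) of FIG. 14
    classification_id INTEGER NOT NULL REFERENCES classification(classification_id),
    file_id           INTEGER NOT NULL REFERENCES manual(file_id)
);
""")

# Hypothetical rows: one manual that belongs to classification 2 under root 1.
conn.execute("INSERT INTO manual VALUES (1, 'WP-001', 'full text of the manual')")
conn.execute("INSERT INTO classification VALUES (1, NULL, 'stop pump|open cover')")
conn.execute(
    "INSERT INTO classification VALUES (2, 1, 'stop pump|open cover|replace filter')")
conn.execute("INSERT INTO classification_file VALUES (2, 1)")

rows = conn.execute("""
    SELECT c.classification_id, c.parent_classification_id, m.file_identifier
    FROM classification AS c
    JOIN classification_file AS cf ON cf.classification_id = c.classification_id
    JOIN manual AS m ON m.file_id = cf.file_id
""").fetchall()
print(rows)
```

The self-referencing parent column is enough to rebuild the tree produced by hierarchical clustering, with the root being the only row whose parent is NULL.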

The search query receiving unit 108 receives a query input for searching for a work procedure manual. A query is input using an input device 204 such as a keyboard. The query is input in the form of a work procedure.

The work procedure manual search unit 109 searches for work procedure manuals by evaluating the similarity between the work procedure input as a search query and the work procedure of each classification stored in the work procedure manual classification database 107. The similarity between work procedures is evaluated using the work procedure similarity evaluation unit 105. The search result is represented by the classification ID and the file identifiers of the work procedure manual files.

The processing procedure is shown in FIG. 15. In S1501, the search query input through the search query receiving unit 108 is read. Subsequently, in S1502, one classification ID is read from the work procedure manual classification database 107, in order from the lowest hierarchy level. The work procedure associated with that classification is read along with it. Next, in S1503, the similarity between the work procedure associated with the classification and the search query is calculated. If the similarity is greater than 0 (that is, they are similar), the file identifiers of the work procedure manuals included in the classification are stored in S1505. This result is used by the work procedure manual file output unit 110. Subsequently, in S1506, the higher-level classification (the classification that includes the processed classification) is marked as checked so that it is treated as processed from then on. In S1507, if any classification has no check mark or has not yet been processed, the matching process against the search query is performed again.
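The search loop can be sketched in Python under a few stated assumptions: classifications are visited from the deepest level upward, a classification counts as a hit when its similarity to the query is greater than 0, and all ancestors of a hit are marked as checked and skipped. Whether ancestors are marked only after a hit is not explicit in the text, so that choice is an interpretation. The similarity function is the same hedged longest-common-subsequence ratio used for illustration earlier, and the classification rows are hypothetical.

```python
def similarity(p, q):
    """Identical ordered pairs divided by the longer procedure length (assumed measure)."""
    table = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i, a in enumerate(p, 1):
        for j, b in enumerate(q, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if a == b
                           else max(table[i - 1][j], table[i][j - 1]))
    longer = max(len(p), len(q))
    return table[len(p)][len(q)] / longer if longer else 0.0


# Hypothetical classification rows: id -> (parent id, integrated procedure, file identifiers).
CLASSIFICATIONS = {
    1: (None, ["check log", "close cover", "open cover", "stop pump"], ["A", "B", "C"]),
    2: (1, ["stop pump", "open cover", "replace filter", "close cover", "start pump"],
        ["A", "B"]),
    3: (1, ["check log", "restart server"], ["C"]),
    4: (2, ["stop pump", "open cover", "replace filter", "close cover"], ["A"]),
    5: (2, ["stop pump", "open cover", "replace filter", "start pump"], ["B"]),
}


def depth(classification_id):
    """Number of ancestors between a classification and the root."""
    level = 0
    while CLASSIFICATIONS[classification_id][0] is not None:
        classification_id = CLASSIFICATIONS[classification_id][0]
        level += 1
    return level


def search(query):
    """Return (similarity, classification id, file identifier) hits, best first."""
    checked = set()
    hits = []
    for cid in sorted(CLASSIFICATIONS, key=depth, reverse=True):    # S1502: lowest level first
        if cid in checked:
            continue                                                # treated as processed
        parent, procedure, files = CLASSIFICATIONS[cid]
        score = similarity(query, procedure)                        # S1503
        if score > 0:
            hits.extend((score, cid, name) for name in files)       # S1505
            while parent is not None:                               # S1506: mark ancestors
                checked.add(parent)
                parent = CLASSIFICATIONS[parent][0]
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


results = search(["stop pump", "open cover"])
print(results)
```

Skipping the ancestors of a hit keeps the result list at the most specific classifications that match, instead of repeating the same manuals at every level of the tree.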

The work procedure manual file output unit 110 reads the work procedure manual file from the work procedure manual database 106 using the stored file identifier, and outputs it to the output device 205 such as a display. At this time, by sorting and outputting in descending order of similarity, it is possible to display the work procedure manual that matches the search query at the top of the ranking.

FIG. 3 shows the processing flow of this embodiment corresponding to the block diagram of FIG. 1. In S301, the work procedure manual file is input and stored in the work procedure manual database 106. In S302, the work procedure manual file is read from the work procedure manual database 106. In S303, a description of the work procedure is extracted from the work procedure manual file. In S304, the similarity of the work procedures is evaluated. In S305, the documents are classified according to the similarity of the work procedures, and the work procedure manual classification database 107 is constructed. The process up to here corresponds to the work procedure association method.

Next, the work procedure manual is searched using the constructed work procedure manual classification database. In S306, input of a search query for searching the work procedure manuals is accepted. In S307, the similarity between the work procedure input as the search query and the work procedure of each classification stored in the work procedure manual classification database 107 is evaluated, and the work procedure manuals are searched. In S308, based on the search result, the work procedure manual file is read from the work procedure manual database 106 and output to the output device 205.

As described above, the apparatus and method shown in the drawings make it possible to associate work procedures and to search work procedure manuals efficiently.

DESCRIPTION OF SYMBOLS
101 Work procedure manual file input unit
102 Work procedure manual file reading unit
103 Work procedure extraction unit
104 Work procedure manual classification unit
105 Work procedure similarity evaluation unit
106 Work procedure manual database
107 Work procedure manual classification database
108 Search query receiving unit
109 Work procedure manual search unit
110 Work procedure manual file output unit
201 Central processing unit
202 Main storage device
203 External storage device
204 Input device
205 Output device
206 Network
207 Information processing apparatus

Claims

1. A document association method comprising:
a step of inputting a plurality of document files;
a step of extracting a description of a work procedure from each of the document files;
a step of evaluating the similarity of the work procedures; and
a step of classifying the documents into a tree structure according to the similarity of the work procedures.
2. The document association method according to claim 1,
wherein the step of evaluating the similarity of the work procedures evaluates the similarity by comparing the order of works using a sequence matching method.
3. The document association method according to claim 1 or 2,
wherein the step of classifying the documents classifies the documents into a tree structure using hierarchical clustering.
4. The document association method according to claim 2,
wherein, in the step of evaluating the similarity of the work procedures, the similarity of work names is evaluated using sequence matching when the work names constituting the work procedures are compared.
5. The document association method according to claim 2,
wherein, in the step of evaluating the similarity of the work procedures, the similarity of work names is evaluated using a dictionary when the work names constituting the work procedures are compared.
6. The document association method according to any one of claims 1 to 5,
wherein the documents are work procedure manuals.
7. A document search method comprising:
a step of inputting a search query in the form of a work procedure against a classification database created using the document association method according to any one of claims 1 to 6; and
a step of searching for a document by evaluating the similarity between the search query and the work procedure of each classification in the classification database.
8. A document association apparatus comprising:
a document file input unit that inputs a plurality of document files;
a work procedure extraction unit that extracts a description of a work procedure from each of the document files;
a work procedure similarity evaluation unit that evaluates the similarity of the work procedures; and
a document classification unit that classifies the documents into a tree structure according to the similarity of the work procedures.
9. The document association apparatus according to claim 8,
wherein the work procedure similarity evaluation unit evaluates the similarity of the work procedures by comparing the order of works using a sequence matching method.
10. The document association apparatus according to claim 8 or 9,
wherein the document classification unit classifies the documents into a tree structure using hierarchical clustering.
11. The document association apparatus according to claim 9,
wherein the work procedure similarity evaluation unit evaluates the similarity of work names using sequence matching when comparing the work names constituting the work procedures.
12. The document association apparatus according to claim 9,
wherein the work procedure similarity evaluation unit evaluates the similarity of work names using a dictionary when comparing the work names constituting the work procedures.
13. The document association apparatus according to any one of claims 8 to 12,
wherein the documents are work procedure manuals.
14. A document search apparatus comprising:
a classification database created using the document association apparatus according to any one of claims 8 to 13;
a search query reception unit that receives input of a search query in the form of a work procedure; and
a document search unit that searches for a document by evaluating the similarity between the search query and the work procedure of each classification in the classification database.
15. A program for causing a computer to function as a document association apparatus comprising:
a document file input unit that inputs a plurality of document files;
a work procedure extraction unit that extracts a description of a work procedure from each of the document files;
a work procedure similarity evaluation unit that evaluates the similarity of the work procedures; and
a document classification unit that classifies the documents into a tree structure according to the similarity of the work procedures.