
WO2019087593A1 - Document retrieval device and method

Info

Publication number: WO2019087593A1
Application number: PCT/JP2018/034358
Authority: WO
Inventors: 直之青田
Original assignee: アズビル株式会社
Priority date: 2017-11-06
Filing date: 2018-09-18
Publication date: 2019-05-09
Also published as: JP7053219B2; JP2019086934A

Abstract

This document retrieval device (1) includes: a document DB (41) in which a plurality of original text documents are stored; and a classification result DB (42) in which information which identifies an original text document to be retrieved, information which identifies text included in the original text document, and information which indicates a classification class that represents the attribute of the text are associated with each other. The document retrieval device (1) performs a full text retrieval on the plurality of original text documents stored in the document DB (41) on the basis of a retrieval condition input by a user, and extracts a plurality of documents which match the retrieval condition. The document retrieval device (1) performs a decision for preferentially outputting and displaying, as a retrieval result, a document which is stored in the classification result DB (42) and includes text associated with a classification class that represents a phenomenon, among the plurality of documents extracted through the full text retrieval.

Description

Document retrieval apparatus and method

The present invention relates to a document search apparatus and method, and more particularly to a document search technology using a full text search method.

Text data on events and phenomena occurring in various settings, such as records of inquiries to a call center or of responses to defects at a manufacturing site, together with their causes and countermeasures, has conventionally been accumulated. To make use of such text data, a full-text search method is used to find and refer to past cases similar to a phenomenon that is currently occurring. In particular, in a call center or at a manufacturing site, it may be necessary to respond urgently to problems.

For example, Patent Document 1 discloses a technique in which plural sentence types for identifying the content of a sentence, such as "Opinion" and "Recommendation", are set, and excerpt sentence data classified by sentence type in sentence units is created from an original text database storing original text document data. The excerpt sentence data is formed, for example, with conjunctions removed, and excerpt sentence data corresponding to a specified sentence type is extracted and displayed as a list. Patent Document 1 further discloses a technique of setting in advance priorities among patterns of specific sentences belonging to the sentence types, and adjusting the excerpt sentence data to be displayed based on those priorities.

Patent Document 1: Japanese Unexamined Patent Publication No. 8-255172

However, in the technique described in Patent Document 1, excerpt sentence data of a search result is displayed on the basis of a priority set individually in advance between patterns of specific sentences belonging to sentence types.

Therefore, when a problem occurs at a manufacturing site or the like and the user searches for an existing document containing information on a past phenomenon similar to the one that has occurred, the information the user needs is in some cases not preferentially displayed. In such cases, it takes time for the user to check the search results, and it may be difficult to deal with the problem urgently.

The present invention has been made to solve the above-described problem, and its object is to provide a document search apparatus capable of preferentially displaying an existing document that contains information on a past phenomenon similar to a phenomenon that has occurred.

In order to solve the problems described above, a document search device according to the present invention includes: a document database in which a plurality of documents are stored; a classification result database in which first information identifying each of the plurality of documents, second information identifying a sentence included in each of the plurality of documents, and third information indicating a classification class representing an attribute of the sentence are stored in association with each other; a search condition input unit into which a search condition for searching for a document regarding a certain phenomenon is input; an extraction unit that performs a full-text search on the plurality of documents stored in the document database based on the search condition and extracts documents matching the search condition; and a display order determination unit that, when a plurality of documents are extracted by the extraction unit, determines an order in which the plurality of extracted documents are output and displayed. The classification class includes at least a first classification class representing the phenomenon, and the display order determination unit refers to the classification result database and makes a decision to preferentially output and display, among the plurality of extracted documents, a document including a sentence associated with the first classification class.

Further, a document search method according to the present invention includes: a search condition input step in which a search condition for searching for a document concerning a certain phenomenon is input; an extraction step of performing a full-text search on a plurality of documents stored in a document database based on the search condition and extracting documents matching the search condition; and a display order determination step of determining, when a plurality of documents are extracted in the extraction step, an order in which the plurality of extracted documents are output and displayed. The display order determination step refers to a classification result database in which first information identifying each of the plurality of documents, second information identifying a sentence included in each of the plurality of documents, and third information indicating a classification class representing an attribute of the sentence are stored in association with each other, and makes a decision to preferentially output and display, among the plurality of extracted documents, a document including a sentence associated with a first classification class representing the phenomenon. The classification class includes at least the first classification class.

According to the present invention, among the documents extracted by the full-text search, a document including a sentence associated with the classification class representing the phenomenon is preferentially output and displayed. It is therefore possible to preferentially display documents containing information on past phenomena similar to the phenomenon that is occurring.

FIG. 1 is a functional block diagram of a document search apparatus according to the first embodiment of the present invention. FIG. 2 is a block diagram showing a configuration example of hardware for realizing the document search device according to the first embodiment of the present invention. FIG. 3 is a flowchart illustrating search processing according to the first embodiment of the present invention. FIG. 4 is a view showing an example of a display unit according to the first embodiment of the present invention. FIG. 5 is a functional block diagram of a document search device according to a second embodiment of the present invention. FIG. 6 is a flowchart illustrating classification processing according to the second embodiment of the present invention. FIG. 7 is a functional block diagram of a document search device according to a third embodiment of the present invention. FIG. 8 is a flowchart for explaining classification model construction processing according to the third embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to FIGS. 1 to 8. Components common to the respective drawings are denoted by the same reference numerals. In the following, "document" and "text" both mean text data. A "sentence" is text data consisting of a character string delimited by a full stop or period, and a "document" is a file of text data containing text composed of a plurality of sentences.
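The definition of a "sentence" above can be illustrated with a minimal sketch. It assumes that a sentence ends at a Japanese full stop ("。") or an ASCII period; the function name `split_sentences` is hypothetical and is not part of the described device.

```python
import re

# Assumption: a sentence ends at "。" or ".". Decimal points and
# abbreviations would also split here; real text needs more care.
_SENTENCE = re.compile(r"[^。.]+[。.]?")

def split_sentences(text):
    """Return (start, length, sentence) for each sentence found in text."""
    result = []
    for match in _SENTENCE.finditer(text):
        raw = match.group()
        sentence = raw.strip()
        if not sentence:
            continue
        # Skip leading whitespace so that start points at the first character.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        result.append((start, len(sentence), sentence))
    return result
```

Returning the start position and the number of characters together with each sentence makes it possible to refer back to the sentence in the original document by position alone.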

First Embodiment
As shown in FIG. 1, the document search device 1 according to the present embodiment searches for an existing document related to a “phenomenon” that has occurred, such as a defect that has occurred at a manufacturing site, for example. Documents containing information on phenomena that occurred in the past that are similar to phenomena currently occurring are preferentially displayed as search results. Then, the search result output and displayed with priority is referred to by the user and utilized for emergency response to the failure.

The document search device 1 has a document DB 41 in which a plurality of original text documents are stored, and a classification result DB 42 in which information identifying an original text document to be searched (first information), information identifying a sentence included in the original text document (second information), and information indicating a classification class representing an attribute of the sentence (third information) are associated with each other. The document search device 1 performs a full-text search on the plurality of original text documents stored in the document DB 41 based on the search condition input by the user, and extracts a plurality of documents matching the search condition. The classification class includes at least a classification class representing a phenomenon that has occurred (first classification class). Among the plurality of documents extracted by the full-text search (the plurality of extracted documents), the document search device 1 decides to preferentially output and display, as a search result, a document that includes a sentence associated in the classification result DB 42 with the classification class representing a phenomenon.

[Function block of document search device]
As shown in FIG. 1, the document search device 1 according to the first embodiment includes an input/output unit 2, a search unit 3, and a storage unit 4.

The input/output unit 2 includes a search condition input unit 21 and a display unit 22. The input/output unit 2 receives input from a user of the document search device 1, and outputs and displays search results. The input/output unit 2 takes the form of, for example, a web browser, but a dedicated application may be used instead. The input/output unit 2 may run on a computer separate from the other functional units of the document search device 1 and be connected to them over a network, or may run on the same computer.

The search condition input unit 21 receives, from the user, an input of a search condition for searching an existing document related to a phenomenon that has occurred, such as a defect at a manufacturing site. The search condition input unit 21 receives, for example, a word string or a query sentence representing a phenomenon occurring at present.

The display unit 22 displays the search result by the search unit 3 described later. Specifically, the display unit 22 highlights sentences belonging to a classification class such as a sentence representing a phenomenon that has occurred, and displays a plurality of documents extracted by the extraction unit 31 described later as a search result.

Further, the display unit 22 displays the sentence to be displayed in an emphasized manner so that the classification classes to which the sentence belongs can be distinguished from each other. For example, when a plurality of classification classes are adopted, it is assumed that the document of the search result includes a plurality of sentences belonging to different classification classes. In such a case, the display unit 22 highlights sentences belonging to different classification classes included in the same document, for example, in different colors in the original document. The details of the classification class will be described later.

The search unit 3 includes an extraction unit 31 and a display order determination unit 32. Search conditions such as a query from the user input to the search condition input unit 21 are input to the search unit 3. The search unit 3 searches a document matching the search condition with respect to the original text document registered in the document DB 41 described later, and determines the order of outputting and displaying the search result document.

The extraction unit 31 executes a full-text search on the plurality of original text documents registered in the document DB 41 based on the search condition input through the search condition input unit 21, and extracts documents matching the search condition. More specifically, the extraction unit 31 performs the full-text search with reference to the index DB 411, in which indexes of the plurality of original text documents included in the document DB 41 are registered, and extracts documents that match the search condition.

When a plurality of documents are extracted by the extraction unit 31, the display order determination unit 32 determines the order in which the plurality of documents are output and displayed. More specifically, the display order determination unit 32 refers to the classification result DB 42 described later and decides to preferentially output and display, among the plurality of documents extracted by the extraction unit 31, documents containing a sentence associated with the classification class "phenomenon", which represents the phenomenon that has occurred.

The storage unit 4 includes a document DB 41 and a classification result DB 42. The document DB 41 includes an original text DB 410 and an index DB 411.
The document DB 41 stores information on a plurality of original text documents (a plurality of documents) to be searched.

In the original text DB 410, a plurality of original text documents prepared in advance, or link information to the original text documents, are registered. The original text documents registered in the original text DB 410 are used when the display unit 22 displays the search results. More specifically, based on the information on the search result documents obtained by the search unit 3, the display unit 22 reads out the original text document corresponding to each search result document from the original text DB 410, processes the display content, and highlights it.

In the index DB 411, indexes corresponding to a plurality of original text documents registered in the original text DB 410 are registered. The index DB 411 is provided to speed up the search process when the extraction unit 31 executes the full text search.

More specifically, an index such as an inverted index (also called a transposed index) generated from the original text documents is registered in the index DB 411. The inverted index has, for example, a table-like data structure in which character strings, the positions of those character strings in a document, document identification information, and the like are registered in association with one another.
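As a rough illustration of such an index, the following sketch maps each character string (term) to the documents and positions in which it occurs. It assumes the documents have already been split into tokens; the function names and the dictionary layout are illustrative assumptions, not the structure actually used by the index DB 411.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> list of tokens.

    Returns term -> {doc_id: [positions of the term in that document]}.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index

def search_all_terms(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings) if postings else set()
```

Looking up a term in the index avoids scanning every original text document at query time, which is the speed-up the index DB 411 is provided for.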

The generation of the index is performed prior to the search processing by the search unit 3 and is also performed when the original text document is registered in the original text DB 410. Further, as a method of extracting a character string at the time of index generation, for example, morphological analysis is used. The index may be generated by a device installed outside the document search device 1 or may be generated by, for example, the control unit 102 in the document search device 1.

For example, if the original text DB 410 includes original text documents in a language such as Japanese, in which words are not separated by spaces, the text is segmented into words by morphological analysis. In addition, before or after the morphological analysis, it is desirable to normalize the character strings, for example by unifying mixed full-width and half-width characters and mixed upper-case and lower-case characters, and to delete special symbols.
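The normalization step can be sketched as follows. This is a minimal example under stated assumptions: Unicode NFKC normalization is used to unify full-width and half-width forms, and "special symbols" are taken to mean any character that is neither a word character nor whitespace. Morphological analysis itself is outside the scope of the sketch.

```python
import re
import unicodedata

def normalize(text):
    """Normalize a character string before or after tokenization."""
    # NFKC folds full-width alphanumerics and half-width katakana
    # into their standard-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Unify upper-case and lower-case letters.
    text = text.lower()
    # Assumption: treat everything except word characters and whitespace
    # as a special symbol and replace it with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace left behind by the previous step.
    return re.sub(r"\s+", " ", text).strip()
```

Applying the same normalization to both the indexed text and the search condition keeps the two comparable.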

The classification result DB 42 stores, in association with one another, information identifying each of the plurality of original text documents registered in the original text DB 410, information identifying a sentence included in each of the original text documents, and information indicating a classification class representing the attribute of the sentence. More specifically, the stored information may associate a classification class with a sentence itself (a sentence included in the original text document), or may associate a classification class with a position in the original text document (for example, a start position and a number of characters).
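One possible shape for a record of the classification result DB 42 is sketched below, using the position-based variant (start position and number of characters). The class name, field names, and helper functions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClassifiedSentence:
    doc_id: str   # first information: identifies the original text document
    start: int    # second information: start position of the sentence
    length: int   # second information: number of characters in the sentence
    label: str    # third information: classification class of the sentence

def docs_with_class(records, label):
    """Return the ids of documents that contain a sentence of the given class."""
    return {record.doc_id for record in records if record.label == label}

def sentence_text(record, original_text):
    """Recover the sentence from the original text using its stored position."""
    return original_text[record.start:record.start + record.length]
```

Storing positions rather than sentence text keeps the records small, at the cost of needing the original text document to display the sentence.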

In the present embodiment, a classification class is a set of sentences defined by attributes of the sentences, such as their meaning and content. Three classification classes are adopted: in addition to the classification class "phenomenon", the classification class "cause" (second classification class) indicating the cause of the phenomenon that occurred, and the classification class "action" (third classification class) indicating the countermeasure taken against the phenomenon that occurred. An example of a sentence belonging to the classification class "phenomenon" is "... occurrence of error"; an example of a sentence belonging to the classification class "cause" is "... considered as a factor"; and an example of a sentence belonging to the classification class "action" is "...".

[Hardware configuration of document search device]
As shown in FIG. 2, the document search device 1 can be realized by a computer including a control unit 102, a communication control device 103, a storage device 104, an input device 105, and a display device 106 connected via a bus 101, and by a program that controls these hardware resources.

The control unit 102 includes a CPU 102a and a main storage unit 102b. Programs for the CPU 102a to perform various controls and operations are stored in advance in the main storage unit 102b. The control unit 102 implements the functions of the document search apparatus 1 such as the extraction unit 31 and the display order determination unit 32 illustrated in FIG. 1.

The communication control device 103 is an input / output interface for connecting the document search device 1 and various devices. The communication control apparatus 103 may have a function as a control apparatus for connecting the document search apparatus 1 and various external electronic devices via a network. For example, the classification result of the document to be searched which is executed by an apparatus installed outside may be received via the communication control apparatus 103 and stored in the classification result DB 42.

The storage device 104 includes a readable and writable storage medium, and a drive device for reading and writing various information such as programs and data from and to the storage medium. A semiconductor memory such as a flash memory, or a hard disk, can be used as the storage medium of the storage device 104. The storage device 104 can have the document DB 41, the classification result DB 42, a program storage unit 104a, and other storage devices (not shown), for example, a storage device for backing up programs and data stored in the storage device 104.

The program storage unit 104a stores various programs for executing processing necessary for document search such as search processing in the present embodiment.

The input device 105 is realized by a keyboard, a mouse, a touch panel, and the like, and receives input and operations from the user. The input device 105 receives the input of the search condition from the user. The input device 105 functions as the search condition input unit 21 described with reference to FIG. 1.

As the display device 106, a liquid crystal display or the like is used. The display device 106 displays the result of input through the input device 105 and information on the search result documents. The display device 106 functions as the display unit 22 described with reference to FIG. 1.

[Operation of document search device]
The operation of the document search apparatus 1 having the above-described configuration will be described with reference to FIGS. 3 and 4. The following describes a case in which a defect (for example, an "error of the ΔΔ device") occurs at a manufacturing site and, based on a search condition input by the user, the document search device 1 retrieves existing documents related to an "error of the ΔΔ device" that occurred in the past and is similar to the "error of the ΔΔ device" currently occurring.

First, the search condition input unit 21 receives an input of a search condition by the user (step S1). The user's input accepted by the search condition input unit 21 is displayed in the area 220 of the display unit 22 as shown in the display example of FIG. 4. In the present embodiment, for example, character strings of “ΔΔ device”, “error”, and “occurrence” are accepted as search conditions.

Thereafter, as shown in FIG. 4, when the “search” button displayed on the display unit 22 is pressed by the operation of the user, a signal is input to the search unit 3. First, the extraction unit 31 executes a full text search to extract a document matching the search condition from the document DB 41 (step S2).

More specifically, the extraction unit 31 executes a full-text search with reference to the index DB 411. The extraction unit 31 uses the inverted index registered in the index DB 411 to extract a plurality of documents that include the search condition terms “ΔΔ device”, “error”, and “occurrence”.

The extraction unit 31 also calculates the degree of similarity of each of the plurality of extracted documents with the search condition. In the calculation of the degree of similarity, the extraction unit 31 may use a known method generally used in full-text search. The document extracted by the extraction unit 31 is temporarily stored in association with the degree of similarity. The document extracted by the extraction unit 31 may include a document having a content different from the content intended by the user even if the document matches the search condition.
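The text does not specify which similarity measure is used. As one example of a commonly used method, the sketch below computes the cosine similarity between term-frequency vectors of the search condition and a document. The choice of cosine similarity and the function name are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine_similarity(query_terms, doc_terms):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    query_counts = Counter(query_terms)
    doc_counts = Counter(doc_terms)
    # Counter returns 0 for terms missing from the document.
    dot = sum(query_counts[term] * doc_counts[term] for term in query_counts)
    norm = (math.sqrt(sum(v * v for v in query_counts.values()))
            * math.sqrt(sum(v * v for v in doc_counts.values())))
    return dot / norm if norm else 0.0
```

A production search engine would typically weight terms (for example by inverse document frequency) rather than use raw counts.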

Next, the display order determination unit 32 determines the order in which the plurality of documents extracted by the extraction unit 31 are output and displayed (step S3). More specifically, based on an index value indicating the degree of relationship between each of the extracted documents and the classification class "phenomenon", the display order determination unit 32 determines the order in which documents including a sentence belonging to the classification class representing the phenomenon that occurred (the first classification class) are displayed among the plurality of extracted documents.

Specifically, referring to the classification result DB 42, the display order determination unit 32 decides to preferentially output and display, among the plurality of documents extracted by the extraction unit 31 as matching the search condition, a document that includes a sentence classified into the classification class "phenomenon".

For example, the display order determination unit 32 calculates a display order index value by multiplying the similarity calculated for each document extracted by the extraction unit 31 by a predetermined coefficient. The predetermined coefficient is set so that a search result classified into the classification class "phenomenon" receives a higher display order index value than a search result classified into another classification class. The display order index value of the document i (i = 1, 2, ..., N) extracted by the extraction unit 31 is calculated by the following equation (1).

In the above equation (1), for example, when the value of the coefficient is 0, a document is returned to the input/output unit 2 only when a sentence classified into the classification class "phenomenon" matches the search condition.
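Equation (1) itself is not reproduced in this text. The following is a minimal sketch of one form consistent with the surrounding description, under the assumption that the similarity is used as is when a sentence classified as "phenomenon" matches the search condition and is multiplied by a coefficient between 0 and 1 otherwise. The function name, the tuple layout, and the default coefficient are hypothetical.

```python
def rank_documents(hits, coefficient=0.5):
    """Order search hits by an assumed display order index value.

    hits is a list of (doc_id, similarity, has_phenomenon_match), where
    has_phenomenon_match is True when a sentence classified as
    "phenomenon" in that document matches the search condition.
    """
    scored = []
    for doc_id, similarity, has_phenomenon_match in hits:
        # Assumed form: full similarity for "phenomenon" matches,
        # similarity scaled down by the coefficient otherwise.
        value = similarity if has_phenomenon_match else coefficient * similarity
        if value > 0:
            scored.append((doc_id, value))
    # Highest display order index value first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored
```

With a coefficient of 0, documents without a matching "phenomenon" sentence receive an index value of 0 and are dropped, which corresponds to the behavior described for that case. With a coefficient between 0 and 1 the preference is soft: a highly similar document without such a sentence can still outrank a weakly similar one that has it.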

Next, the display unit 22 processes the display content of the search result documents whose output display order has been determined by the display order determination unit 32 (step S4). For example, the display unit 22 highlights the sentences belonging to the classification classes “phenomenon”, “cause”, and “action” included in each of the documents displayed as search results.

More specifically, as shown in FIG. 4, the display unit 22 applies processing, such as adding HTML tags, to parts of the original text document corresponding to a search result document so that those parts can be distinguished on the display. Specifically, in the area 221 where the original text document corresponding to the search result document is displayed, the display unit 22 processes the regions 222a, 222b, and 222c in which the sentences classified into the classification classes “phenomenon”, “cause”, and “action” are displayed.

For example, the display unit 22 may surround the regions 222a, 222b, and 222c with tags (for example, div tags) that group each region 222 as an HTML block element, or may apply a style sheet such as Cascading Style Sheets (CSS).
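The tagging described above might look like the following sketch, which wraps each classified sentence in a div element carrying a class name derived from its classification class. The `cls-` class names, the colors in the style sheet, and the function name are illustrative assumptions; the spans are assumed not to overlap.

```python
import html

# Hypothetical style sheet: one highlight color per classification class.
CSS = """
.cls-phenomenon { background-color: #ffd6d6; }
.cls-cause      { background-color: #fff3c4; }
.cls-action     { background-color: #d6f5d6; }
"""

def highlight(original_text, spans, tag="div"):
    """Wrap classified sentences in HTML tags.

    spans is a list of (start, length, label) tuples that do not overlap.
    """
    parts = []
    cursor = 0
    for start, length, label in sorted(spans):
        # Text between classified sentences is escaped and left untagged.
        parts.append(html.escape(original_text[cursor:start]))
        sentence = html.escape(original_text[start:start + length])
        parts.append(f'<{tag} class="cls-{label}">{sentence}</{tag}>')
        cursor = start + length
    parts.append(html.escape(original_text[cursor:]))
    return "".join(parts)
```

Passing `tag="span"` keeps the highlighted sentence inline instead of starting a new block, which may be preferable when the surrounding text should flow unchanged.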

Thereafter, the display unit 22 displays the document of the search result in which the display content is processed (step S5). Specifically, the display unit 22 lists and displays the corresponding original text documents from the top of the display screen in accordance with the output display order of the search result documents determined in step S3. As shown in the display example of FIG. 4, the document “No. 1” displayed at the top of the display screen is a document for which the highest display order index value has been calculated.

In addition, when displaying the search result documents whose display content has been processed, the display unit 22 may, for example, give the regions 222a, 222b, and 222c different character colors or highlight colors so that the sentences belonging to the classification classes "phenomenon", "cause", and "action" can be distinguished from one another in each document.

As described above, according to the first embodiment, the document search device 1 preferentially displays, among a plurality of documents extracted by full-text search, a document including a sentence belonging to the classification class representing a phenomenon. Therefore, the document search device 1 can preferentially display an existing document containing information on a past phenomenon similar to a phenomenon that is currently occurring. As a result, the user can respond more quickly to a problem or the like that has occurred at a manufacturing site or the like.

Further, when displaying the search result documents, the document search device 1 highlights the sentences that are included in each document and belong to a classification class. Therefore, when checking the search results on the display screen, the user can more easily confirm whether a search result document is an existing document containing information similar to the phenomenon that is actually occurring.

In addition, because the document search device 1 uses the three classification classes “phenomenon”, “cause”, and “action”, it can output and display not only existing documents concerning the phenomenon currently occurring but also documents containing information that is more useful to the user, such as information for investigating the cause of the phenomenon and recovering from it.

Further, since the document search device 1 has the classification result DB 42, in which information on classification classes in sentence units is stored in advance for the documents to be searched, the calculation load on the document search device 1 can be reduced and the document search device 1 can have a simpler configuration.

Second Embodiment
Next, a second embodiment of the present invention will be described. In the following description, the same components as those in the first embodiment described above are denoted by the same reference numerals, and the description thereof is omitted.

In the first embodiment, the case has been described in which the classification result DB 42 stores in advance information on the classification class of each sentence included in the documents to be searched. In the second embodiment, by contrast, the document search device 1a further includes a classification execution unit 5 that classifies the sentences included in each of the plurality of documents into one of a plurality of classification classes representing the attributes of the sentences and stores the result in the classification result DB 42. The document search device 1a further includes a classification model storage unit 43.

As shown in FIG. 5, the document search device 1a classifies each of a plurality of original texts to be searched in sentence units and stores the classification result in the classification result DB 42. The document search device 1a performs a search based on the search condition input by the user thereafter.

The classification execution unit 5 classifies a plurality of original text documents registered in the original text DB 410 into classification classes in sentence units. More specifically, the classification execution unit 5 inputs the original text document registered in the original text DB 410 to be classified into the classification model stored in advance in the classification model storage unit 43. Then, the classification execution unit 5 classifies the sentences contained in each document into classification classes “phenomenon”, “cause”, and “action”, which are set in advance, and outputs classification results.

For example, when performing classification, the classification execution unit 5 may set a threshold for each classification class and decide, class by class, whether or not to classify a sentence into one of the classification classes “phenomenon”, “cause”, and “action”. In this case, the classification execution unit 5 may output, as a classification result, a sentence that is not classified into any classification class. The classification result output by the classification execution unit 5 is stored in the classification result DB 42.
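The threshold-based decision can be sketched as follows, assuming the classification model outputs a score per classification class for each sentence. The score format, the rule of picking the highest-scoring class among those that pass, and the function name are assumptions for illustration.

```python
def assign_class(scores, thresholds):
    """Pick a classification class for one sentence, or None.

    scores maps class name -> model score for the sentence.
    thresholds maps class name -> minimum score required for that class.
    """
    # Keep only the classes whose score reaches the class-specific threshold.
    candidates = {
        label: score
        for label, score in scores.items()
        if score >= thresholds[label]
    }
    if not candidates:
        # The sentence is not classified into any classification class.
        return None
    # Assumption: when several classes pass, take the highest-scoring one.
    return max(candidates, key=candidates.get)
```

Raising the threshold of a class trades recall for precision for that class only, which is the benefit of setting thresholds class by class.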

The classification result information stored in the classification result DB 42 is data in which information identifying an original text document, information identifying a sentence, and information indicating the classification class into which the sentence is classified are associated with one another. The classification result stored in the classification result DB 42 may associate a classification class with a sentence itself (a sentence included in the original text document), or may associate a classification class with a position in the original text document (for example, a start position and a number of characters).

The classification model storage unit 43 stores, for example, a classification model which is learned and constructed in advance by a device installed outside. The classification model is a model constructed by learning a classifier selected from known algorithms used in natural language processing, and the details will be described later.

Next, the classification process of the document search device 1a according to the present embodiment will be described using the flowchart of FIG.
First, prior to the search processing by the search unit 3, the classification execution unit 5 reads out the original text documents registered in the original text DB 410 of the document DB 41 and inputs them to the classification model stored in the classification model storage unit 43 (step S20).

Next, the classification execution unit 5 classifies sentences for each of a plurality of original text documents (step S21). More specifically, the classification execution unit 5 classifies each of the sentences included in the original text document into any one of predetermined classification classes “phenomenon”, “cause”, and “action”.

After classifying each of the plurality of original text documents, the classification execution unit 5 associates the information identifying the document, the information identifying the sentence, and the information on the classification class into which the sentence is classified with one another, and stores them in the classification result DB 42 (step S22). The information identifying each sentence may be the original sentence or the position of the sentence in the original document.
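Steps S20 to S22 can be sketched as a small pipeline. Everything concrete here is an assumption made for illustration: `toy_model` is a keyword-rule placeholder standing in for the stored classification model, the sentence splitter is naive, and a Python list stands in for the classification result DB 42.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[str, int, int, str]  # (doc_id, start position, length, class)


def split_sentences(text: str) -> List[Tuple[int, str]]:
    """Return (start position, sentence) pairs; a naive split on '.', '!', '?'."""
    return [(m.start(), m.group())
            for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]", text)]


def toy_model(sentence: str) -> Optional[str]:
    """Placeholder for the stored classification model (keyword rules only)."""
    s = sentence.lower()
    if "because" in s or "due to" in s:
        return "cause"
    if "replaced" in s or "restarted" in s:
        return "action"
    if "error" in s or "does not" in s:
        return "phenomenon"
    return None


def classify_documents(docs: Dict[str, str],
                       model: Callable[[str], Optional[str]]) -> List[Row]:
    rows: List[Row] = []
    for doc_id, text in docs.items():                  # step S20: read each document
        for start, sentence in split_sentences(text):
            label = model(sentence)                    # step S21: classify each sentence
            if label is not None:                      # step S22: store the association
                rows.append((doc_id, start, len(sentence), label))
    return rows


docs = {"D1": "The pump does not start. It failed because the fuse blew. "
              "We replaced the fuse."}
rows = classify_documents(docs, toy_model)
for row in rows:
    print(row)
```

Here each sentence is identified by its start position and number of characters, which is one of the two options the text allows.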

When the classification process by the classification execution unit 5 is completed, the extraction unit 31, as in the first embodiment, executes a full-text search with reference to the index DB 411 based on the search condition input to the search condition input unit 21, and extracts a plurality of documents that match the search condition.

Then, the display order determination unit 32 determines the order in which the plurality of extracted documents are output and displayed. The display order determination unit 32 determines the order in which the document is output and displayed using the classification result DB 42 in which the classification result by the classification execution unit 5 is stored.
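A minimal sketch of one way to give priority to documents containing a "phenomenon" sentence is shown below. The function name and the in-memory mapping `classes_by_doc`, which stands in for a lookup in the classification result DB 42, are assumptions; the actual ordering rule of the display order determination unit 32 may use further criteria.

```python
from typing import Dict, List, Set


def order_for_display(hits: List[str],
                      classes_by_doc: Dict[str, Set[str]]) -> List[str]:
    """Stable sort: documents with a "phenomenon" sentence come first.

    Documents within each group keep the order returned by the search.
    """
    return sorted(hits,
                  key=lambda d: "phenomenon" not in classes_by_doc.get(d, set()))


hits = ["D3", "D1", "D2"]  # full-text search hits, in the search engine's order
classes_by_doc = {"D1": {"phenomenon", "cause"}, "D2": {"action"}, "D3": set()}
print(order_for_display(hits, classes_by_doc))  # ['D1', 'D3', 'D2']
```

Because Python's `sorted` is stable, the original search order is preserved among documents of equal priority.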

When the order of outputting and displaying the search result document is determined, the display unit 22 processes the display content. For example, the display unit 22 highlights and displays, in the corresponding original text document, the sentences classified into the classification classes “phenomenon”, “cause” and “action” included in the document of the search result. Furthermore, the display unit 22 emphasizes and displays the sentences of each classification class so as to be distinguishable from each other.
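The class-specific emphasis can be illustrated with plain-text markers. The marker letters and the function name `highlight` are hypothetical, and the sketch assumes that the classified spans do not overlap; an actual display unit would more likely use colors or styles.

```python
from typing import List, Tuple

# Hypothetical marker per classification class.
MARKERS = {"phenomenon": "P", "cause": "C", "action": "A"}


def highlight(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Wrap each (start, length, class) span in a class-specific marker."""
    out = text
    # Work from the end so that earlier offsets stay valid after insertion.
    for start, length, label in sorted(spans, reverse=True):
        tag = MARKERS[label]
        end = start + length
        out = f"{out[:start]}[{tag}]{out[start:end]}[/{tag}]{out[end:]}"
    return out


text = "Pump stops. Fuse blew."
print(highlight(text, [(0, 11, "phenomenon"), (12, 10, "cause")]))
# [P]Pump stops.[/P] [C]Fuse blew.[/C]
```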

As described above, in the document search device 1a according to the second embodiment, the classification execution unit 5 classifies the original text documents into classes using the classification model stored in advance in the classification model storage unit 43. Thus, when a new original text document is registered in the original text DB 410, the document search device 1a can classify that document as well. The document search device 1a can therefore cope with updates to the original text documents to be searched.

Third Embodiment
Next, a third embodiment of the present invention will be described. In the following description, the same components as those in the first and second embodiments described above are designated by the same reference numerals, and the description thereof will be omitted.

In the second embodiment, the case has been described in which, prior to the search processing by the search unit 3, the classification execution unit 5 classifies the original text documents registered in the document DB 41 (original text DB 410) into the classification classes in sentence units, using the classification model stored in advance in the classification model storage unit 43. In the third embodiment, by contrast, the document search device 1b further includes a learning unit 6. The learning unit 6 trains a predetermined classifier and constructs the classification model used when the classification execution unit 5 executes the classification process.

As shown in FIG. 7, the learning unit 6 includes a teacher data setting unit 61 and a classification model learning unit 62. The classifier used by the learning unit 6 is selected from known algorithms used for document classification in natural language processing, for example, a support vector machine (hereinafter referred to as "SVM"), or a network combining "word2vec", which is a two-layer neural network, with a convolutional neural network. In the present embodiment, a classifier based on supervised learning is adopted, but a classifier based on unsupervised learning may be adopted to construct the classification model.

The teacher data setting unit 61 sets teacher data including a sentence and the classification class to which the sentence should belong. More specifically, the teacher data setting unit 61 prepares labeled teacher data, that is, sentences labeled as representing the classification class "phenomenon", sentences labeled as representing the classification class "cause", and sentences labeled as representing the classification class "action".

The classification model learning unit 62 inputs the teacher data set by the teacher data setting unit 61 into the classifier and trains a classifier such as an SVM to construct a classification model. More specifically, the classification model learning unit 62 first converts the sentences of the text data into vector representations. For example, the classification model learning unit 62 may use a sentence vector in which each appearing word is weighted using an algorithm such as the tf-idf method.
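A sentence vector weighted by tf-idf can be sketched in a few lines. The variant used here is an assumption, since the text does not fix one: term frequency is the relative frequency of the word within the sentence, and inverse document frequency is the unsmoothed log(N / df), where each sentence is treated as one document. Tokenization is a plain whitespace split.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(sentences: List[str]) -> List[Dict[str, float]]:
    """Return one sparse {word: weight} vector per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences in which each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({w: (c / len(tokens)) * math.log(n / df[w])
                        for w, c in counts.items()})
    return vectors


vectors = tfidf_vectors(["pump stops", "pump leaks", "fuse blew"])
print(vectors[0])
```

In this example "stops" receives a larger weight than "pump" because "pump" appears in two of the three sentences. With this unsmoothed variant, a word that appears in every sentence gets weight 0.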

The classification model learning unit 62 then trains a classifier such as an SVM on these sentence vectors to construct a classification model. The classification model constructed by the classification model learning unit 62 is stored in the classification model storage unit 43.

Next, classification model construction processing executed by the learning unit 6 will be described using the flowchart of FIG. In the present embodiment, the classification model construction process is performed prior to the classification process performed by the classification execution unit 5.

As shown in FIG. 8, the teacher data set by the teacher data setting unit 61 is input to a classifier such as an SVM (step S30). Next, the classification model learning unit 62 learns the classifier based on the input teacher data, and constructs a classification model (step S31). The classification model constructed by the classification model learning unit 62 is stored in the classification model storage unit 43.
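Steps S30 and S31 can be illustrated with a small one-vs-rest linear SVM trained by stochastic subgradient descent on the L2-regularized hinge loss. This is a sketch under stated assumptions, not the method of the embodiment: it uses no bias term, bag-of-words counts instead of tf-idf weights, invented hyperparameters, and made-up teacher data whose vocabulary is deliberately disjoint between classes so that the outcome is easy to verify.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Weights = Dict[str, float]


def features(sentence: str) -> Counter:
    """Bag-of-words counts; a tf-idf weighting could be substituted here."""
    return Counter(sentence.lower().split())


def train_one_vs_rest(data: List[Tuple[str, str]], epochs: int = 20,
                      lr: float = 0.1, lam: float = 0.01) -> Dict[str, Weights]:
    """Step S30: take labeled sentences. Step S31: fit one linear SVM per class."""
    model: Dict[str, Weights] = {}
    for label in sorted({y for _, y in data}):
        w: Weights = defaultdict(float)
        for _ in range(epochs):
            for sentence, y in data:
                x = features(sentence)
                target = 1.0 if y == label else -1.0
                margin = target * sum(w[t] * v for t, v in x.items())
                for t in list(w):          # shrink step from the L2 penalty
                    w[t] *= 1.0 - lr * lam
                if margin < 1.0:           # hinge-loss subgradient step
                    for t, v in x.items():
                        w[t] += lr * target * v
        model[label] = dict(w)
    return model


def predict(model: Dict[str, Weights], sentence: str) -> str:
    """Return the class whose linear score for the sentence is highest."""
    x = features(sentence)
    return max(model,
               key=lambda c: sum(model[c].get(t, 0.0) * v for t, v in x.items()))


# Made-up teacher data (sentence, classification class).
data = [
    ("pump stops suddenly", "phenomenon"), ("alarm lamp blinks", "phenomenon"),
    ("fuse was blown", "cause"), ("cable was loose", "cause"),
    ("replaced broken component", "action"), ("tightened every connector", "action"),
]
model = train_one_vs_rest(data)
print([predict(model, s) for s, _ in data])
```

The returned dictionary of weight vectors plays the role of the classification model that would be stored in the classification model storage unit 43.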

After the classification model is constructed, classification processing by the classification execution unit 5 is executed as in the second embodiment, and the document is classified using the classification model constructed by the learning unit 6. Furthermore, search processing by the search unit 3 is executed, and the order of output display of the plurality of extracted documents is determined. Then, the display unit 22 processes the display content in the original text document corresponding to the search result document, emphasizing and displaying the sentences belonging to each classification class so that they can be distinguished from each other.

As described above, according to the document search device 1b according to the third embodiment, the learning unit 6 learns a predetermined classifier to construct a classification model. As a result, the document search device 1b can locally update the classification model, reset the classification class, and the like as needed.

Although the embodiments of the document search apparatus and document search method of the present invention have been described above, the present invention is not limited to the described embodiments, and various modifications conceivable by a person skilled in the art can be made within the scope of the invention described in the claims.

For example, in the embodiments described above, the case has been described in which three classification classes are set in advance: "phenomenon" for a phenomenon that occurred, "cause" for the cause of the phenomenon, and "action" for the action taken against the phenomenon. However, the classification classes are not limited to these three; the classification class “phenomenon” may be used alone, or further classification classes may be added and used in combination.

DESCRIPTION OF SYMBOLS 1 ... Document search apparatus, 3 ... Search part, 4 ... Storage part, 21 ... Search condition input part, 31 ... Extraction part, 32 ... Display order determination part, 41 ... Document DB, 42 ... Classification result DB, 102 ... Control part, 102a ... CPU, 102b ... Main storage unit, 103 ... Communication control device, 104 ... Storage device, 104a ... Program storage unit, 105 ... Input device.

Claims

A document search apparatus comprising:
a document database in which a plurality of documents are stored;
a classification result database in which first information identifying each of the plurality of documents, second information identifying a sentence included in each of the plurality of documents, and third information indicating a classification class representing an attribute of the sentence are stored in association with each other;
a search condition input unit into which a search condition for searching for a document related to a certain phenomenon is input;
an extraction unit that executes a full-text search on the plurality of documents stored in the document database based on the search condition to extract documents matching the search condition; and
a display order determination unit that, when a plurality of documents are extracted by the extraction unit, determines the order in which the plurality of extracted documents are output and displayed,
wherein the classification class includes at least a first classification class representing the phenomenon, and
the display order determination unit refers to the classification result database and determines to preferentially output and display, among the plurality of extracted documents, a document including a sentence associated with the first classification class.
The document search apparatus according to claim 1,
further comprising a classification execution unit that classifies a sentence included in each of the plurality of documents into any one of a plurality of classification classes representing attributes of the sentence and stores the classification result in the classification result database.
The document search apparatus according to claim 2,
further comprising a learning unit that trains a classifier to construct a classification model defining the classification classes,
wherein the classification execution unit classifies a document using the classification model constructed by the learning unit.
The document search apparatus according to claim 3,
wherein the learning unit trains the classifier based on teacher data including a sentence and a classification class to which the sentence belongs, to construct the classification model.
The document search apparatus according to any one of claims 1 to 4,
wherein the classification class further includes a second classification class representing a cause of occurrence of the phenomenon and a third classification class representing a countermeasure against the phenomenon.
The document search apparatus according to any one of claims 1 to 5,
further comprising a display unit that displays the plurality of extracted documents with sentences belonging to the classification class emphasized.
The document search apparatus according to claim 6,
wherein the display unit displays the emphasized sentences such that the classification classes to which the sentences belong can be distinguished from one another.
The document search apparatus according to any one of claims 1 to 7,
wherein the display order determination unit determines the order in which documents including a sentence belonging to the first classification class, among the plurality of extracted documents, are output and displayed, based on an index value indicating the degree of relationship between each of the plurality of documents and the phenomenon.
A document search method comprising:
a search condition input step in which a search condition for searching for a document related to a certain phenomenon is input;
an extraction step of performing a full-text search on a plurality of documents stored in a document database based on the search condition to extract documents matching the search condition; and
a display order determination step of, when a plurality of documents are extracted in the extraction step, determining the order in which the plurality of extracted documents are output and displayed,
wherein the display order determination step refers to a classification result database in which first information identifying each of the plurality of documents, second information identifying a sentence included in each of the plurality of documents, and third information indicating a classification class representing an attribute of the sentence are stored in association with each other, and determines to preferentially output and display, among the plurality of extracted documents, a document including a sentence associated with a first classification class representing the phenomenon, and
the classification class includes at least the first classification class.