WO2016162961A1 - Text search device

Info

Publication number: WO2016162961A1
Application number: PCT/JP2015/060904
Authority: WO
Inventors: 貴元松井; 慶今沢
Original assignee: 株式会社日立製作所
Priority date: 2015-04-08
Filing date: 2015-04-08
Publication date: 2016-10-13

Abstract

The text search device according to the present invention stores, in advance, specification item tree structures, each for managing specification items in a tree structure, and also stores, in advance, topic probabilities and a representative specification item for each item of text in other selected documents. The text search device then performs: a process for acquiring at least one specification item, which serves as a search key, and a technical document; a process for calculating topic probabilities for each of the individual texts constituting the technical document, on the basis of the word distribution in the text; a process for determining, from a specification item tree structure, the representative specification item associated with the search specification item; a process for calculating a representative specification item certainty level for each text on the basis of the topic probabilities for the text; a process for calculating an intra-text word proximity level for each text on the basis of a word or words constituting the search specification item; a process for combining the certainty level and the proximity level to calculate a combined measure; and a process for extracting one or more texts relating to the search specification item for a new technical document, on the basis of the combined measure, and displaying the extracted one or more texts as extraction results. In this way the text search device improves the accuracy of text search results.

Description

Text search device

The present invention relates to a text search device that extracts a specific text from an electronic document.

As background art in this technical field, there is Japanese Patent No. 2885487 (Patent Document 1). This publication describes extracting word notations and semantic categories from a search sentence, calculating the similarity with the indexes of registered documents (indexes of word notations and semantic categories attached to headings and paragraphs), and performing search and display in order, starting from the chapter having the highest sum of paragraph similarities and, within the same chapter, from the paragraph having the highest similarity.

Japanese Patent No. 2885487

In the left column on the third page of the above publication, the word notations “format” and “format” are assigned the semantic category [FMT], the word notations “output” and “write” are assigned the semantic category [OUT], and the word notation “designation” is assigned the semantic category [SITE], and the similarity of each paragraph is calculated based on the presence or absence of the corresponding semantic categories. However, the order of search and display is determined by the total value alone. Consequently, even among paragraphs retrieved with the same similarity, one may contain the content the user wants to find while another contains content the user does not want to detect, and content the user does want to detect may receive a lower total similarity score. So-called erroneous search (false detection) is therefore likely to occur.

Also, depending on the document, a specific word may be omitted from the description. In that case the similarity is low, and as a result the relevant location is not retrieved.

An object of the present invention is to improve search accuracy.

This application includes a plurality of means for solving the above problem; a typical one is as follows.

A text search apparatus that stores a specification item tree structure for managing specification items in a tree structure, together with a topic probability and a representative specification item for each sentence of other documents, and that executes: a process of acquiring a specification item serving as a search key and a technical document; a process of calculating the topic probability of each sentence from the word distribution contained in each sentence of the document constituent units of the technical document; a process of obtaining the representative specification item of the specification item from the specification item tree structure; a process of calculating a representative specification item certainty factor for each sentence from the topic probability; a process of calculating an in-sentence search word proximity based on the words constituting the search specification item; a process of calculating a composite index of the certainty factor and the proximity; and a process of extracting, based on the composite index, the text related to each search specification item in a new technical document and displaying it as an extraction result.

According to the present invention, the search accuracy can be improved.

FIG. 1 is a diagram showing a configuration example of the text search apparatus.
FIG. 2 is a diagram showing a data definition example of the specification item tree structure.
FIG. 3 is a diagram showing a table example of the technical document results.
FIG. 4 is a diagram showing a table example of the sentence-specific topic probability results.
FIG. 5 is a diagram showing a table example of the sentence-specific representative specification item results.
FIG. 6 is a diagram showing an example formula of the sentence-specific topic probability estimation model.
FIG. 7 is a diagram showing an example formula of the representative specification item certainty factor estimation model.
FIG. 8 is a diagram showing the overall processing flow.
FIG. 9 is a diagram showing the sentence-specific topic probability calculation processing flow for a new technical document.
FIG. 10 is a diagram showing an example of a word list.
FIG. 11 is a diagram showing the update processing flow of the sentence-specific topic probability estimation model.
FIG. 12 is a diagram showing an example of the totaled word distribution by sentence.
FIG. 13 is a diagram showing the representative specification item certainty factor calculation processing flow.
FIG. 14 is a diagram showing the update processing flow of the representative specification item certainty factor estimation model.
FIG. 15 is a diagram showing an output screen.

Hereinafter, examples will be described with reference to the drawings. Note that components having the same function are denoted by the same reference symbols throughout the drawings for describing the embodiment, and the repetitive description thereof will be omitted.

(1) System Configuration

FIG. 1 shows a configuration example of a text search apparatus 100 that constitutes the system of the present embodiment.

The text search apparatus 100 includes a data storage unit 101, a model generation processing unit 102, a model storage unit 103, a text extraction processing unit 104, an interface unit 105, and a bus.

The interface unit 105 comprises input devices for inputting a technical document such as a requirement specification and output devices for outputting extracted sentences and the like, for example a keyboard, a mouse, a display, and a printer. In this system, a graphical user interface (GUI) is configured on the screen of the interface unit 105 based on the processing of the text extraction processing unit 104, and various types of information are displayed.

The data storage unit 101 is composed of known elements such as an HDD or MO, and stores a specification item tree structure 1011, technical document results 1012, sentence-specific topic probability results 1013, and sentence-specific representative specification item results 1014. These data, programs, and the like may instead be acquired or referenced from outside via a communication network.

The model generation processing unit 102 is composed of known elements such as a CPU, a RAM, and a ROM. The model generation processing unit 102 performs processing for realizing the characteristic functions of the present embodiment, and includes a topic-to-representative specification item linking engine 1021, a sentence-specific topic probability estimation model generation engine 1022, and a representative specification item certainty factor estimation model generation engine 1023. A representative specification item is a specification item that is defined in a node above the specification item to be searched in the specification item tree structure 1011 and that encompasses lower-level specification items. In the following embodiment, the case where the representative specification item is the specification item of the highest node is described as an example.

The model storage unit 103 is composed of known elements such as an HDD or MO, and stores a sentence-specific topic probability estimation model 1031 and a representative specification item certainty factor estimation model 1032. These data, programs, and the like may instead be acquired or referenced from outside via a communication network.

The text extraction processing unit 104 is composed of known elements such as a CPU, a RAM, and a ROM. The text extraction processing unit 104 performs processing for realizing the characteristic functions of the present embodiment, and includes a sentence-specific topic probability calculation engine 1041, a representative specification item certainty factor calculation engine 1042, an in-sentence search word proximity calculation engine 1043, and a composite index calculation engine 1044.

Although not shown, the apparatus 100 has known elements such as an OS, middleware, and applications, and in particular has an existing processing function for displaying a GUI screen in web page format on the interface unit 105, such as a display. Using this existing processing function, the text extraction processing unit 104 draws and displays predetermined screens and processes the data entered by the user on those screens.

(2) Specification Item Tree Structure

FIG. 2 is a data definition example of the specification item tree structure 1011. The specification item tree structure 1011 hierarchically defines the group of specification items that may be used to determine product specifications. “Specification item 3” 101103, “Specification item 2” 110102, and “Specification item 1” 110101 describe specification items in nodes (vertices), and the parent-child relationships in which they are connected to higher-level specifications by edges (connection lines) are defined in a tree structure.

(3) Technical Document Results

FIG. 3 is a table example of the technical document results 1012. On the vertical axis (rows) of the table are arranged sentence IDs, each uniquely identifying a sentence of a document constituent unit in the technical documents stored in the technical document results 1012. The horizontal axis (columns) stores, from the right side, information uniquely identifying the text having that sentence ID and the document constituent unit to which it belongs. In this embodiment, a file name is stored as the information identifying the text, and a chapter title and a section title are stored as the information identifying the document constituent unit. A document constituent unit is a unit that divides the text hierarchically. In addition to the “chapter” and “section” of this embodiment, it may be a “part”, a “subsection”, a “paragraph”, or the like.

(4) Topic Probability Results by Sentence

FIG. 4 is a table example of the sentence-specific topic probability results 1013. Sentence IDs are arranged on the vertical axis (rows). Topic IDs, each uniquely identifying a topic included in the sentence indicated by the sentence ID, are arranged on the horizontal axis (columns). Each cell where a sentence ID and a topic ID intersect stores the probability that the sentence with that sentence ID is about the topic with that topic ID. In the example of FIG. 4, the cell where “sentence A” and “topic 2” intersect contains “0.2”, which means that the probability that sentence A deals with topic 2 is 0.2.
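
As a minimal sketch, a table of this kind could be held as a nested mapping from sentence ID to topic ID to probability. Only the entry for “sentence A” and “topic 2” (0.2) comes from the example above; every other value, and the names `topic_probability_results` and `topic_probability`, are hypothetical.

```python
# Hypothetical in-memory form of the sentence-specific topic probability results 1013.
# Rows are sentence IDs and columns are topic IDs; all values except
# ("sentence A", "topic 2") are illustrative assumptions.
topic_probability_results = {
    "sentence A": {"topic 1": 0.7, "topic 2": 0.2, "topic 3": 0.1},
    "sentence B": {"topic 1": 0.1, "topic 2": 0.3, "topic 3": 0.6},
}


def topic_probability(sentence_id: str, topic_id: str) -> float:
    """Return the stored probability that the sentence deals with the topic."""
    return topic_probability_results[sentence_id][topic_id]
```

Each row is assumed to sum to 1, since the probabilities describe how one sentence is distributed over the topics.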

(5) Representative Specification Item Results by Sentence

FIG. 5 is a table example of the sentence-specific representative specification item results 1014. This table indicates which representative specification item the text of each document constituent unit relates to. Sentence IDs are arranged on the vertical axis (rows) of the table, and representative specification item IDs are arranged on the horizontal axis (columns). Each cell where a sentence ID and a representative specification item ID intersect stores flag data indicating the presence or absence of the corresponding word group in that sentence (in this example, “1” for present and “0” for absent). In the example of FIG. 5, the cell where “sentence 1” and “representative specification item B” intersect contains “1”, which means that the word group relating to representative specification item B appears in sentence 1.

(6) Sentence-specific topic probability estimation model

FIG. 6 is an example formula of the sentence-specific topic probability estimation model. This model takes the word distribution of each sentence of the document constituent units as input and estimates the topic probability of that sentence. One technique that can be used for this is Latent Dirichlet Allocation (LDA), and FIG. 6 is an example formula for the case where LDA is used.
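
The formula of FIG. 6 is not reproduced in this text. For orientation only, the standard formulation of LDA is sketched below; it is not necessarily the exact expression used in the figure. The symbols are assumptions introduced here: d indexes a sentence, k a topic, n a word position, and α and β are Dirichlet hyperparameters.

```latex
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), \qquad \phi_k \sim \mathrm{Dirichlet}(\beta),\\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), \qquad
w_{d,n} \mid z_{d,n}, \phi \sim \mathrm{Categorical}(\phi_{z_{d,n}}),\\
p(w, z, \theta, \phi \mid \alpha, \beta)
  &= \prod_{k=1}^{K} p(\phi_k \mid \beta)
     \prod_{d=1}^{D} p(\theta_d \mid \alpha)
     \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \phi_{z_{d,n}}).
\end{aligned}
```

Under this reading, the estimated vector θ_d plays the role of the topic probability of sentence d, that is, one row of the sentence-specific topic probability results.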

(7) Representative specification item certainty factor estimation model

FIG. 7 is an example formula of the representative specification item certainty factor estimation model. This model estimates which representative specification items are described in each sentence based on the topic probability of that sentence. The formula in FIG. 7 is an example for the case where a Support Vector Machine (SVM) is used.
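
The formula of FIG. 7 is likewise not reproduced in this text. As a hedged illustration, the usual kernel SVM decision function is shown below. The symbols are assumptions: x is the topic probability vector of a sentence, x_i and y_i ∈ {+1, −1} are training vectors and their labels (for example, derived from the presence flags of the representative specification item results), K is a kernel function, and α_i and b are learned parameters.

```latex
f(x) = \sum_{i=1}^{N} \alpha_i \, y_i \, K(x_i, x) + b,
\qquad
\hat{y} = \operatorname{sign}\bigl(f(x)\bigr)
```

How the certainty factor is derived from f(x) is not stated in this text; mapping the decision value to a probability would require an additional calibration step, which is an assumption beyond the source.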

(8) Overall Process Flow

FIG. 8 is a diagram showing the overall process flow.

In step S801, a requirement specification which is a technical document and a specification item created from the requirement specification are acquired as input information via the interface unit 105. Hereinafter, the requirement specification entered here is referred to as “new technical document”, and the specification item is referred to as “search specification item”.

In step S802, each sentence of the document constituent units (for example, chapters and sections) of the new technical document is extracted, each sentence is divided into words, and the number of occurrences of each word (hereinafter referred to as “word distribution”) is totaled. FIG. 12 shows an example of the totaled word distribution by sentence. The word distribution by sentence obtained here and the sentence-specific topic probability estimation model 1031 are input to the sentence-specific topic probability calculation engine 1041 to calculate the topic probability of each sentence. Details will be described later in (9) with reference to FIG. 9.

In step S803, the representative specification item defined in the parent node of the search specification item is acquired from the specification item tree structure 1011. If the search specification item has no parent node, the search specification item itself is used as the representative specification item. Hereinafter, this representative specification item is referred to as the “search representative specification item”. The specification item tree structure is generated by the engine 1021 from the specification items and the representative specification items.
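
The lookup in step S803 can be sketched as follows, assuming the tree is stored as a child-to-parent map. The item names, the `PARENT` map, and both function names are hypothetical; the second function illustrates the variant in which the representative specification item is the highest node.

```python
from typing import Dict, Optional

# Hypothetical child -> parent map for a specification item tree.
# The item names are illustrative and are not taken from FIG. 2.
PARENT: Dict[str, Optional[str]] = {
    "wiring": None,  # top-level node without a parent
    "wiring length": "wiring",
    "wiring length tolerance": "wiring length",
}


def search_representative_item(item: str, parent: Dict[str, Optional[str]]) -> str:
    """Return the parent node's item, or the item itself if it has no parent."""
    p = parent.get(item)
    return p if p is not None else item


def highest_node_item(item: str, parent: Dict[str, Optional[str]]) -> str:
    """Variant: climb the tree until a node without a parent is reached."""
    current = item
    while parent.get(current) is not None:
        current = parent[current]
    return current
```

For a two-level tree the two functions give the same answer; they differ only when the search specification item sits more than one level below the top.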

In step S804, the search representative specification item acquired in step S803, the topic probability of each sentence calculated in step S802, and the representative specification item certainty factor estimation model 1032 are input to the representative specification item certainty factor calculation engine 1042, which calculates a representative specification item certainty factor (hereinafter referred to as “certainty factor”), meaning the probability that each sentence of the document constituent units of the new technical document describes a topic related to the search representative specification item. Details of this processing will be described later in (11) and (12) with reference to FIG. 13 and FIG. 14.

In step S805, the words constituting the search specification item are input to the in-sentence search word proximity calculation engine 1043, which calculates an in-sentence search word proximity (hereinafter referred to as “proximity”), meaning the probability that each sentence of the document constituent units of the new technical document includes a description corresponding to the search specification item. The targets processed here may be limited to those having a certainty factor equal to or greater than a predetermined value.
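
This text does not give a formula for the proximity. Purely as an illustration, one possible measure is sketched below: the number of distinct search words divided by the length of the shortest token window that contains all of them. The function name, the measure itself, and the choice to return 0.0 when any search word is missing are all assumptions.

```python
from typing import Sequence


def search_word_proximity(tokens: Sequence[str], search_words: Sequence[str]) -> float:
    """Hypothetical proximity: distinct search words / shortest covering window.

    Returns 1.0 when the search words appear adjacent to each other and
    0.0 when at least one search word does not appear in the sentence.
    """
    targets = set(search_words)
    if not targets or not targets.issubset(tokens):
        return 0.0
    best = len(tokens)
    for start in range(len(tokens)):
        if tokens[start] not in targets:
            continue
        seen = set()
        # Extend the window to the right until every search word is covered.
        for end in range(start, len(tokens)):
            if tokens[end] in targets:
                seen.add(tokens[end])
                if seen == targets:
                    best = min(best, end - start + 1)
                    break
    return len(targets) / best
```

A measure of this kind scores a sentence with omitted words as 0.0, which is one reason a second signal such as the certainty factor is useful alongside it.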

In step S806, the certainty factor calculated in step S804 and the proximity calculated in step S805 are input to the composite index calculation engine 1044, which calculates a composite index that is high when a sentence of the new technical document has both a high probability of describing a topic related to the representative specification item and a high probability of including a description related to the search specification item.
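
The way the two values are combined is not specified in this text. One simple possibility, shown here only as a sketch, is a weighted sum controlled by a mixing ratio; the function name `composite_index`, the parameter `ratio`, and the linear form are assumptions.

```python
def composite_index(certainty: float, proximity: float, ratio: float = 0.5) -> float:
    """Hypothetical composite index: weighted sum of certainty and proximity.

    ratio = 1.0 uses only the certainty factor; ratio = 0.0 uses only the proximity.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return ratio * certainty + (1.0 - ratio) * proximity
```

A product of the two values would be another option if a sentence should score high only when both signals are high; which form is preferable depends on how the results are to be ranked.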

In step S807, sentences relating to each search specification item in the new technical document are extracted based on the composite index in step S806 and displayed as an extraction result. In the display, the preceding and following sentences are displayed, and the sentence relating to the search specification item is highlighted.
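
A minimal sketch of this extraction step is given below, assuming the sentences are held in document order with one composite index per sentence. The function name, the `top_k` parameter, and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Optional, Union

Result = Dict[str, Union[Optional[str], float]]


def extract_with_context(sentences: List[str], scores: List[float], top_k: int = 1) -> List[Result]:
    """Return the top_k sentences by score together with their neighbours.

    "hit" is the sentence to highlight; "before" and "after" are the
    preceding and following sentences shown as context (None at the edges).
    """
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k]
    results: List[Result] = []
    for i in ranked:
        results.append({
            "before": sentences[i - 1] if i > 0 else None,
            "hit": sentences[i],
            "after": sentences[i + 1] if i + 1 < len(sentences) else None,
            "score": scores[i],
        })
    return results
```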

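(9) Sentence-Specific Topic Probability Calculation Processing Flow for New Technical Document

FIG. 9 is a diagram showing the sentence-specific topic probability calculation processing flow for a new technical document.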

In step S8021, each sentence is extracted for each document constituent unit of the new technical document acquired in step S801, and morphological analysis is performed to create a word list in which each sentence is divided into words. FIG. 10 shows an example of a word list. In the example of FIG. 10, from the sentence of the new technical document 1001, “The arrangement of device A and device B has been changed, so please extend the wiring length.”, a word list 1002 is created in which the sentence is divided into individual words and symbols, including “Device A”, “Device B”, “arrangement”, “change”, “so”, “wiring”, “length”, “extension”, “please”, and “.”.

In step S8022, the part of speech of each word in the word list 1002 created in step S8021 is determined.

In step S8023, only words having a designated part of speech are selected based on the parts of speech determined in step S8022. In this embodiment, only nouns are extracted. As words expressing the meaning of the specification items of a product or service, nouns are more closely related than modifiers such as adjectives and adjective adverbs. Therefore, selecting only the nouns from all the words enables processing in a short time with almost no reduction in extraction accuracy.

In step S8024, for the words of the part of speech selected in step S8023, the appearance frequency of each word in each sentence (the word distribution) is totaled for each sentence ID.
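
Steps S8023 and S8024 can be sketched as follows, assuming the morphological analysis and part-of-speech determination of steps S8021 and S8022 have already produced (word, part of speech) pairs. The tag names, the sentence ID, and the function name are hypothetical, and no particular morphological analyzer is implied.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (surface form, part of speech): assumed output of a morphological analyzer.
Token = Tuple[str, str]


def word_distribution(
    tagged_sentences: Dict[str, List[Token]], keep_pos: str = "noun"
) -> Dict[str, Counter]:
    """Keep only words with the designated part of speech and count them per sentence ID."""
    return {
        sentence_id: Counter(word for word, pos in tokens if pos == keep_pos)
        for sentence_id, tokens in tagged_sentences.items()
    }


# Illustrative input modelled on the word list example; the tags are assumptions.
example = {
    "sentence 1": [
        ("device A", "noun"), ("and", "particle"), ("device B", "noun"),
        ("arrangement", "noun"), ("change", "noun"), ("wiring", "noun"),
        ("length", "noun"), ("extension", "noun"), ("please", "verb"),
    ]
}
distribution = word_distribution(example)
```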

In step S8025, it is confirmed whether the sentence-specific topic probability estimation model 1031 is up to date with respect to the sentences stored in the technical document results 1012. If the model does not need to be updated, the sentence-specific topic probability estimation model 1031 is acquired as is in step S8026. If the model needs to be updated, the sentence-specific topic probability estimation model 1031 is updated and then acquired in step S8027. Details of the update process of the sentence-specific topic probability estimation model will be described later with reference to FIG. 11.
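
How the model is judged to be up to date is not specified in this text. The sketch below assumes, hypothetically, that the model remembers the IDs of the sentences it was built from and is considered stale when the stored results contain IDs it has not seen. `TopicModel`, `acquire_model`, and `retrain` are illustrative names, and `retrain` is a stand-in for the real update process.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Set


@dataclass
class TopicModel:
    """Placeholder for the estimation model; only tracks what it was built from."""
    trained_ids: Set[str] = field(default_factory=set)


def retrain(model: TopicModel, new_ids: Set[str]) -> TopicModel:
    """Stand-in for the update step: returns a model that also covers new_ids."""
    return TopicModel(trained_ids=model.trained_ids | new_ids)


def acquire_model(
    model: TopicModel,
    stored_ids: Iterable[str],
    update: Callable[[TopicModel, Set[str]], TopicModel],
) -> TopicModel:
    """Return the model as is if it is up to date, otherwise update it first."""
    missing = set(stored_ids) - model.trained_ids
    if not missing:
        return model  # corresponds to acquiring the model without updating
    return update(model, missing)  # corresponds to updating and then acquiring
```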

In step S8028, the sentence-specific topic probability calculation engine 1041 calculates the topic probability of each sentence using the sentence-specific topic probability estimation model acquired in step S8026 or S8027.

(10) Sentence-specific topic probability estimation model update processing flow

FIG. 11 is a diagram showing the update processing flow of the sentence-specific topic probability estimation model.

In step S80271, a technical document that has not yet been subjected to this processing is acquired from the technical document results 1012. Note that the sentence-specific topic probability estimation model 1031 may also be re-created using technical documents that have already undergone this processing as processing targets.

In step S80272, as in step S8021, morphological analysis is performed on the technical document acquired in step S80271 to generate a word list.

In step S80273, as in step S8022, the part of speech of each word in the word list created in step S80272 is determined.

In step S80274, as in step S8023, only words having a designated part of speech are selected based on the parts of speech determined in step S80273.

In step S80275, as in step S8024, the appearance frequencies of the words of the part of speech selected in step S80274 are totaled.

In step S80276, the data processed in steps S80271 to S80275 and the sentence-specific topic probability estimation model 1031 are input to the sentence-specific topic probability estimation model generation engine 1022 to update the sentence-specific topic probability estimation model 1031.

(11) Representative specification item certainty factor calculation processing flow

FIG. 13 is a diagram showing the representative specification item certainty factor calculation processing flow.

In step S8041, it is confirmed whether the representative specification item certainty factor estimation model 1032 is up to date with respect to the sentence-specific representative specification item results 1014.

If the representative specification item certainty factor estimation model 1032 does not need to be updated, it is acquired as is in step S8042.

If the representative specification item certainty factor estimation model needs to be updated, it is updated and then acquired in step S8043. Details of the update process of the representative specification item certainty factor estimation model will be described later with reference to FIG. 14.

In step S8044, the topic probability of each sentence calculated in step S802 and the representative specification item certainty factor estimation model 1032 are input to the representative specification item certainty factor calculation engine 1042, and the representative specification item certainty factor of each sentence is calculated.

(12) Representative specification item certainty factor estimation model update processing flow

FIG. 14 is a diagram showing the update processing flow of the representative specification item certainty factor estimation model.

In step S80431, a sentence-specific topic probability record newly registered after the previous process is acquired from the sentence-specific topic probability record 1013.

In step S80432, the representative specification item result by sentence newly registered after the previous process is acquired from the representative specification item result by sentence 1014.

In step S80433, the data added to the sentence-specific topic probability results acquired in step S80431 and to the sentence-specific representative specification item results acquired in step S80432, together with the representative specification item certainty factor estimation model 1032, are input to the representative specification item certainty factor estimation model generation engine 1023, and the representative specification item certainty factor estimation model is updated.

(13) Output screen

FIG. 15 shows an output screen. An example of a GUI screen displayed on the interface unit 105 will be described with reference to FIG. 15.

The GUI screen mainly consists of a document structure display area 1501 for the new technical document, a specification hierarchical structure display area 1502 for the product or service related to the new technical document, and a text display area 1503 that displays the text of the document constituent units of the new technical document.

In the document structure display area 1501, chapter titles, section titles, and the like are displayed hierarchically based on the chapter attribute information stored in the technical document results 1012. The document structure highlight display 1504 indicates the position of the sentence corresponding to the specification item designated by the user. The document structure probability display 1505 indicates the probability (representative specification item certainty factor) that a topic related to the representative specification item defined in the parent node of the specification item designated by the user is described. In the example of FIG. 15, the probability is displayed as a five-level bar, but other forms of probability display may be used.

In the specification hierarchical structure display area 1502, the specification items related to the target product or service are displayed hierarchically based on the specification item tree structure 1011. Below each specification item, the extraction results obtained by the processing of the present invention are displayed in descending order of the probability that the extraction result is correct (in-sentence search word proximity). The specification hierarchical structure highlight display 1506 indicates the extraction result candidates for the specification item designated by the user. The specification hierarchical structure highlight display 1507 indicates the representative specification item defined in the parent node of the specification item designated by the user. The specification hierarchical structure probability display 1508 indicates the probability that the extraction result for the corresponding specification item is correct (in-sentence search word proximity). In the example of FIG. 15, the probability is displayed as a five-level bar, but other forms of probability display may be used.
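
The mapping from a probability to a five-level bar is not described in this text. One hypothetical mapping, rounding up to the next level, is sketched below; the function names and the text rendering with `#` and `-` are illustrative only.

```python
import math


def bar_level(probability: float, levels: int = 5) -> int:
    """Map a probability in [0, 1] to a bar level from 0 to `levels` (rounding up)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return min(levels, math.ceil(probability * levels))


def render_bar(probability: float, levels: int = 5) -> str:
    """Render the level as text, filled cells first."""
    filled = bar_level(probability, levels)
    return "#" * filled + "-" * (levels - filled)
```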

In the text display area 1503, the text of the new technical document is displayed as the search result. The in-text highlight display 1509 indicates the sentence corresponding to the specification hierarchical structure highlight display 1506 designated by the user.

This screen also has a composite index ratio setting 1510 for setting the ratio at which the representative specification item certainty factor and the in-sentence search word proximity are combined. When the user operates the setting lever, the probabilities displayed on the output screen are recalculated and the probability displays are updated dynamically.

Using this output screen, the user sequentially designates the extraction result candidates of the specification items to be checked in the new technical document and confirms where each candidate is located in the document structure, how probable it is, and how it is specifically described in the text. In this way, the user can quickly identify, in a technical document such as a requirement specification, the locations where information necessary for engineering work is described. This shortens the time required to read and understand the technical document and, in turn, the entire engineering work.

The invention made by the present inventors has been specifically described above based on the embodiment. Needless to say, however, the present invention is not limited to the embodiment, and various modifications can be made without departing from the scope of the invention.

DESCRIPTION OF SYMBOLS
100 ... Text search device
101 ... Data storage unit
102 ... Model generation processing unit
103 ... Model storage unit
104 ... Text extraction processing unit
105 ... Interface unit
1011 ... Specification item tree structure
1012 ... Technical document results
1013 ... Sentence-specific topic probability results
1014 ... Sentence-specific representative specification item results
1021 ... Topic-to-representative specification item linking engine
1022 ... Sentence-specific topic probability estimation model generation engine
1023 ... Representative specification item certainty factor estimation model generation engine
1031 ... Sentence-specific topic probability estimation model
1032 ... Representative specification item certainty factor estimation model
1041 ... Sentence-specific topic probability calculation engine
1042 ... Representative specification item certainty factor calculation engine
1043 ... In-sentence search word proximity calculation engine
1044 ... Composite index calculation engine
200 ... Output screen

Claims

A text search apparatus for searching a technical document to be searched for text relating to specification items of a product or service, comprising:
a storage unit and an arithmetic processing unit,
wherein the storage unit stores:
a specification item tree structure for managing specification items, including representative specification items, in a tree structure;
sentence-specific topic probability results that store a topic probability for each sentence of other documents; and
sentence-specific representative specification item results that store a representative specification item for each sentence, and
wherein the arithmetic processing unit executes:
a process of acquiring a specification item serving as a search key and the technical document;
a process of calculating the topic probability of each sentence from the word distribution contained in each sentence of the document constituent units of the technical document;
a process of obtaining the representative specification item of the specification item from the specification item tree structure;
a process of calculating, from the topic probability, a representative specification item certainty factor meaning the probability that each sentence describes a topic related to the search representative specification item;
a process of calculating, based on the words constituting the search specification item, an in-sentence search word proximity meaning the probability that each sentence includes a description corresponding to the search specification item;
a process of calculating, from the certainty factor and the proximity, a composite index meaning that a sentence has both a high probability of describing a topic related to the representative specification item and a high probability of including a description related to the search specification item; and
a process of extracting text relating to each search specification item in a new technical document based on the composite index and displaying the text as an extraction result.
The text search apparatus according to claim 1,
further comprising an engine for calculating the representative specification item certainty factor,
wherein a model used in the engine is updated.
The text search apparatus according to claim 1,
further comprising an engine for calculating the topic probability,
wherein a model used in the engine is updated.
The text search apparatus according to claim 1,
wherein the proximity is calculated only for those having a high certainty factor.
The text search apparatus according to claim 1,
wherein, if there is a parent node in the specification item tree structure, the parent node is used as the representative specification item, and
if there is no parent node in the specification item tree structure, the node itself is used as the representative specification item.