WO2016031055A1

WO2016031055A1 - Information retrieval apparatus, information retrieval method, and information retrieval program

Info

Publication number: WO2016031055A1
Application number: PCT/JP2014/072762
Authority: WO
Inventors: 関　峰伸; 義行小林
Original assignee: 株式会社日立製作所
Priority date: 2014-08-29
Filing date: 2014-08-29
Publication date: 2016-03-03

Abstract

This information retrieval apparatus executes: identification processing for tracing a path in search of a retrieval keyword group in tree-structured data constructed by hierarchizing, in the form of nodes, a title and paragraphs in a document on the basis of text information in the document to identify the positions of retrieval keywords along said path in the tree-structured data; calculation processing for calculating an evaluation value regarding the accuracy of identification of said retrieval keyword group on the basis of the positions of the retrieval keywords along the path in the tree-structured data, said positions being identified by the identification processing; and display processing for displaying the location of a description corresponding to the path in the document on the basis of the evaluation value calculated by the calculation processing.

Description

Information search device, information search method, and information search program

The present invention relates to an information search device, an information search method, and an information search program for searching for information.

Conventionally, when a search query includes a plurality of keywords, there is a method called neighborhood search in which search results are ranked using the distance between keywords appearing in a document.

Further, when ranking search results by the proximity of the keywords in an input search query during a search of a structured text set, the search result ranking apparatus of Patent Document 1 evaluates the distance between keywords using the meaning of the text structure. The search result ranking apparatus obtains the proximity between the keywords from this distance, and assesses the relevance of a text based on the number of occurrences of the query keywords across the entire text set and the number of occurrences of the keywords within that text. The search result ranking apparatus then calculates a document score from the proximity between keywords and the relevance of the text, and ranks the search results.

Patent Document 1: JP 2010-282480 A

However, the above-described conventional technique has the problem of erroneously extracting word sets that are not meaningful. That is, because the prior art does not grasp the hierarchical chapter structure of the document, it cannot extract a word set for each semantic unit of the document. For example, when the extraction range is narrower than the distance between the words of a word set, the word set cannot be extracted. Conversely, when the extraction range is much wider than the distance between words, words used in different senses are extracted together, so a word set that has no real meaning is erroneously extracted.

Further, the conventional technique calculates the distance between words when one word of a word set is in a section title and the other word is in the body of that section. However, because it does not grasp the hierarchical chapter structure of the document, it is unclear how the distance between words should be calculated when the words of a word set are in different chapters or sections of the document. Therefore, for example, when the words of a word set are described separately in the title of Chapter 1 and in a paragraph of Section 1.5, they cannot be extracted as a set.

The present invention aims to improve search accuracy.

In an information search apparatus, an information search method, and an information search program according to an aspect of the invention disclosed in the present application, a processor executes: a specifying process of searching tree structure data, in which the titles and paragraphs in a document are hierarchically structured as nodes based on character information indicating the characters in the document and the positions of those characters, for a path that traces a search keyword group, and thereby specifying the positions on the tree structure data of the search keywords included in the path; a calculation process of calculating an evaluation value relating to the specifying accuracy of the search keyword group based on the positions on the tree structure data of the search keywords included in the path specified by the specifying process; and a display process of displaying the description location corresponding to the path in the document based on the evaluation value calculated by the calculation process.

According to the representative embodiment of the present invention, it is possible to improve the search accuracy. Problems, configurations, and effects other than those described above will become apparent from the description of the following embodiments.

An explanatory diagram showing information search example 1 according to the present embodiment.
An explanatory diagram showing information search example 2 according to the present embodiment.
A block diagram showing a hardware configuration example of the information search device.
A block diagram showing a functional configuration example of the information search device according to the present embodiment.
An explanatory diagram showing an example of a chapter structure analysis result screen.
An explanatory diagram showing a search example of the tree structure data.
An explanatory diagram showing a specific example of the shortest path.
An explanatory diagram showing a specific example of the shortest path.
An explanatory diagram showing output screen example 1 generated by the display unit.
An explanatory diagram showing output screen example 2 generated by the display unit.
A flowchart showing an example of the information search processing procedure performed by the information search device.
A flowchart showing a detailed processing procedure example of the chapter structure analysis process (step S1002).
A flowchart showing a detailed processing procedure example of the specifying process (step S1004).

<Example of information search>
FIG. 1 is an explanatory diagram of an information search example 1 according to the present embodiment. The information retrieval target document 100 is an electronic document converted into text data, and examples thereof include a product instruction manual, a maintenance manual, and a required specification. In this embodiment, an infrastructure facility maintenance manual (hereinafter simply referred to as “maintenance manual”) will be described as an example. The information retrieval apparatus acquires the document 100 and performs chapter structure analysis on the acquired document 100. The chapter structure analysis is processing for analyzing the chapter structure.

The chapter structure is represented as tree structure data T1 indicating the logical hierarchical relationship among the chapters, sections, and subsections included in the document 100. In the tree structure data T1, the titles of chapters, sections, and subsections and the paragraphs are structured as hierarchical nodes starting from the root node. Specifically, in the tree structure data T1, a title is an intermediate node and a paragraph is a leaf node. When a paragraph contains a figure or table, the character string in the figure or table is placed one level below the paragraph and becomes a leaf node. A symbol N# (# is a number) denotes a node constituting the tree structure data T1.

In the maintenance manual 100 of FIG. 1, “1. Bolt”, “2. Piping”, and so on are titles of chapters. “1.1 Installation of pipe end”, “2.1 Water leakage from pipe”, and so on are titles of sections. “1.1.1 Phenomenon and Cause”, “2.1.2 Countermeasures”, and so on are titles of subsections. A sentence that does not begin with an item number, such as “Bolt has loosened due to water pressure and vibration and water leaked.”, is a paragraph.

The chapter structure analysis yields the tree structure data T1. When search keywords are given, the information search device searches with reference to the tree structure data T1. Here, as an example, it is assumed that “bolt” and “water leakage” are given as search keywords. There are two types of search keywords: input keywords given by a user's operation input, and set keywords registered in advance in a dictionary (for example, a specification item name dictionary 400 described later).

Referring to the tree structure data T1, (A1, B1) is specified as a set of “bolt” and “water leakage”. Because the nodes of (A1, B1) are in a serial hierarchical relationship, the text containing the water leakage B1 is presumed to be an explanation of “water leakage” caused by the “bolt”. In this way, when nodes are in a serial hierarchical relationship, the search assumes that the words are related.

In the document 100, for example, the bolt 11 and the water leakage 12 are close to each other, with few words between them, but their nodes A2 and B2 are not in a serial hierarchical relationship in the tree structure data T1. Therefore, the set (A2, B2) is not specified from the tree structure data T1, and the text containing the water leakage B2 (the title of section 2.1) is presumed not to describe “water leakage” caused by the “bolt”.

Here, the number of words between two words is called the distance between words related to the number of words. Specifically, for example, the distance between words related to the number of words is the number of words included between the words + 1. The inter-word distance regarding the number of words is an index value indicating the number of words between words, and the shorter the distance between two words is, the closer the two words are in the document 100.
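The definition above can be illustrated with a minimal sketch. The function name, the pre-tokenized word list, and the use of only the first occurrence of each word are assumptions made for illustration; the source does not specify how words are tokenized.

```python
def word_count_distance(tokens, word_a, word_b):
    """Number of words strictly between word_a and word_b, plus 1."""
    i = tokens.index(word_a)  # first occurrence only (simplification)
    j = tokens.index(word_b)
    if i == j:
        raise ValueError("the two words must be at different positions")
    words_between = abs(i - j) - 1
    return words_between + 1


# Hypothetical tokenization of a short sentence.
tokens = ["bolt", "has", "loosened", "and", "water", "leaked"]
print(word_count_distance(tokens, "bolt", "water"))    # 3 words between, so 4
print(word_count_distance(tokens, "water", "leaked"))  # adjacent words, so 1
```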

Also, as a feature of the present invention, the inter-word distance regarding the number of nodes is applied. The inter-word distance regarding the number of nodes is based on the number of nodes between two nodes having a serial hierarchical relationship in the tree structure data T1. Specifically, for example, it is the number of nodes (titles and paragraphs) lying between the node containing word A and the node containing word B on the path connecting those two nodes in the tree structure data, plus 1.

In the neighborhood search, since only the inter-word distance regarding the number of words is used, a set such as (A2, B2) that has no real meaning is erroneously retrieved. In contrast, searching with the tree structure data T1 suppresses the erroneous retrieval of such sets. Furthermore, the more plausible combinations can be ranked higher by using the inter-word distance regarding the number of nodes. This is because meaningful word sets are also close together on the tree structure.

FIG. 2 is an explanatory diagram of an information search example 2 according to the present embodiment. FIG. 2 is an example of information retrieval when there is an error in the structural analysis. The document 200 is a document in which the manual of the operation method and the maintenance content manual shown in FIG. 1 are combined as one file. Here, it is assumed that “valve”, “bolt”, and “water leakage” are given as search keywords.

In the document 200, “1. Bolt”, which is Chapter 1 of the maintenance manual, appears after “5. Valve”, which is Chapter 5 of the operation manual. In the tree structure data T2, the nodes N1, N11, N111, N111a, N112, N112a, and so on, from “1. Bolt” (Chapter 1 of the maintenance manual) onward, are placed below the node N512, which is “5.1.2 Procedure 2”. Correctly, “1. Bolt” should be parallel to “5. Valve”. However, because manuals and design documents have deep hierarchies and a new chapter structure may legitimately be nested within 5.1.2, the chapter structure analysis process may misjudge the hierarchy in this way.

(C, A3, B3) and (C, A1, B1) are specified as sets of “valve”, “bolt”, and “water leakage” by referring to the tree structure data T2. In both (C, A3, B3) and (C, A1, B1), the nodes are in a serial hierarchical relationship. Therefore, the water leakages B1 and B3 are both presumed to be locations describing “water leakage” caused by the “valve” and the “bolt”. However, (C, A3, B3) is ranked higher than (C, A1, B1) because its inter-word distance regarding the number of nodes is shorter.

In this way, even when the chapter structure analysis contains an error, the information search device follows the chapter structure while ranking candidates with a shorter inter-word distance regarding the number of nodes higher, and can thereby specify the location where the relevant description appears.

<Hardware configuration example>
FIG. 3 is a block diagram illustrating a hardware configuration example of the information search apparatus. The information search apparatus 300 includes a processor 301, a storage device 302, an input device 303, an output device 304, and a communication interface (communication IF) 305. The processor 301, the storage device 302, the input device 303, the output device 304, and the communication IF 305 are connected by a bus. The processor 301 controls the information search device 300. The storage device 302 serves as a work area for the processor 301. The storage device 302 is a non-transitory or transitory recording medium that stores various programs and data. Examples of the storage device 302 include a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), and a flash memory. The input device 303 inputs data. Examples of the input device 303 include a keyboard, a mouse, a touch panel, a numeric keypad, and a scanner. The output device 304 outputs data. Examples of the output device 304 include a display and a printer. The communication IF 305 is connected to a network and transmits and receives data.

<Functional configuration example>
FIG. 4 is a block diagram illustrating a functional configuration example of the information search apparatus 300 according to the present embodiment. The information search apparatus 300 includes a specification item name dictionary 400, an acquisition unit 401, an analysis unit 402, an input unit 403, a specifying unit 404, a calculation unit 405, and a display unit 406. Specifically, the specification item name dictionary 400 is realized, for example, as information stored in the storage device 302 shown in FIG. 3. The acquisition unit 401, the analysis unit 402, the input unit 403, the specifying unit 404, the calculation unit 405, and the display unit 406 realize their functions, for example, by causing the processor 301 to execute a program stored in the storage device 302 shown in FIG. 3.

The specification item name dictionary 400 is information that stores specification item names given to the specifying unit 404 as setting keywords. Specifically, for example, the specification item name dictionary 400 is a table in which category names and specification item names are associated with each other. The specification item name dictionary 400 is information given in advance when the documents 100 and 200 (hereinafter collectively referred to as the document 1) are specifications. Since the section title in the specification often includes the specification item name, the retrieval process as in this embodiment is effective. The category name is a name that identifies the category (type) to which the document 1 belongs.

The acquisition unit 401 acquires character information and graphic information from the document 1. The character information includes a character code of a character existing in the document 1 and position information of the character. The position information of the character includes a description page number of the character and four corner coordinates. The four corner coordinates are coordinate values of four vertices of a rectangle surrounding the character when the origin is the lower left corner of the page specified by the page number of the character. The character position can be specified by the character information. The graphic information includes graphic data such as image data and table data existing in the document 1 and position information of the graphic data. The position information of the graphic data includes a description page number of the graphic data and four corner coordinates. The four corner coordinates are coordinate values of four vertices of a rectangle surrounding the graphic data when the origin is the lower left corner of the page specified by the page number of the graphic data. When the graphic information includes a character, the acquisition unit 401 also acquires the character as character information. In this case, the position information of each character is calculated using the position information of the graphic data.

The analysis unit 402 analyzes the chapter structure of the document 1. Specifically, for example, the analysis unit 402 extracts character lines using the character position information acquired by the acquisition unit 401, and determines whether each character line is a title of a chapter, a section, a subsection, or the like. A character line is a character string obtained by concatenating, within a set of characters arranged in the horizontal direction, the characters whose distance to the adjacent character is within a predetermined distance. For example, when there is a number at the beginning of a character line, the analysis unit 402 determines that the character line is a title. The analysis unit 402 may also determine whether a character line is a title by using differences in font and character size from other character strings and the distance between character lines.
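As a rough sketch of the character line extraction and the leading-number title check described above, assuming that all characters are already known to lie on one row and that each character is given as a (character, left x, right x) tuple. The tuple layout, the function names, and the gap threshold are illustrative assumptions, not the format used by the acquisition unit 401.

```python
def extract_character_lines(chars, max_gap):
    """Group characters on one row into character lines.

    chars: list of (character, left_x, right_x) tuples (hypothetical layout).
    max_gap: largest horizontal gap still treated as the same character line.
    """
    lines = []
    current = ""
    prev_right = None
    for ch, left, right in sorted(chars, key=lambda c: c[1]):
        if prev_right is not None and left - prev_right > max_gap:
            lines.append(current)  # gap too wide: close the current line
            current = ""
        current += ch
        prev_right = right
    if current:
        lines.append(current)
    return lines


def looks_like_title(line):
    """Title heuristic from the text: the character line starts with a number."""
    return line[:1].isdigit()


# Hypothetical coordinates: "1.Bolt" followed by a distant fragment "p3".
row = [("1", 0, 5), (".", 5, 7), ("B", 9, 14), ("o", 14, 19),
       ("l", 19, 21), ("t", 21, 25), ("p", 60, 65), ("3", 65, 70)]
lines = extract_character_lines(row, max_gap=3)
print(lines)                                 # ['1.Bolt', 'p3']
print([looks_like_title(s) for s in lines])  # [True, False]
```

The font, character size, and line spacing cues mentioned in the text are not modeled here.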

Further, the analysis unit 402 identifies a paragraph consisting of a plurality of character lines between titles. A paragraph is a semantic unit of text. A paragraph can be identified because it is set apart from the title by indentation or by wider spacing between character lines. Further, the analysis unit 402 analyzes the hierarchical relationship between successive titles using the number of hierarchy levels of the title numbers and their continuity. The number of hierarchy levels is the depth of the hierarchy. For example, the chapter “1” has 1 level, the section “1.1” has 2 levels, and the subsection “1.1.1” has 3 levels. Continuity is a feature indicating that title numbers appear in ascending order: for example, the section “1.2” appears after the section “1.1”, and the section “1.3” or the chapter “2” appears after the section “1.2”. As a result, tree structure data T1, T2 (hereinafter collectively referred to as tree structure data T) is generated.
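The following is a minimal sketch of how tree structure data could be built from title numbers. It uses only the number of hierarchy levels and ignores the continuity check, and the `Node` class, `title_level`, and `build_tree` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    label: str
    level: int                       # 0 = root, 1 = chapter, 2 = section, ...
    is_title: bool = True
    children: list = field(default_factory=list)
    parent: "Node | None" = field(default=None, repr=False)


def title_level(line):
    """Return the hierarchy level of a title line, or None for a paragraph."""
    head = line.split(" ", 1)[0].rstrip(".")
    parts = head.split(".")
    if all(p.isdigit() for p in parts):
        return len(parts)
    return None


def build_tree(lines):
    root = Node("ROOT", 0)
    stack = [root]                   # chain of open titles from root downward
    for line in lines:
        level = title_level(line)
        if level is None:            # paragraph: leaf under the latest title
            leaf = Node(line, stack[-1].level + 1, is_title=False, parent=stack[-1])
            stack[-1].children.append(leaf)
            continue
        while stack[-1].level >= level:
            stack.pop()              # close titles at the same or deeper level
        node = Node(line, level, parent=stack[-1])
        stack[-1].children.append(node)
        stack.append(node)
    return root


lines = [
    "1. Bolt",
    "1.1 Installation of pipe end",
    "1.1.1 Phenomenon and Cause",
    "Bolt has loosened due to water pressure and vibration and water leaked.",
    "2. Piping",
    "2.1 Water leakage from pipe",
]
root = build_tree(lines)
print([child.label for child in root.children])  # ['1. Bolt', '2. Piping']
```

Because the sketch trusts the level count alone, a paragraph that happens to begin with a number would be misread as a title.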

FIG. 5 is an explanatory view showing an example of a chapter structure analysis result screen. The analysis result display screen 500 includes a first panel 501, a second panel 502, a keyword input field 503, a search button 504, and a slider 505. The tree structure data T is displayed in the index portion that is the first panel 501, and the document 1 is displayed in the document display portion that is the second panel 502. Further, by selecting a title of the tree structure data T from the input device 303, a sentence corresponding to the selected title is displayed on the document display unit. The keyword input field 503 is an area for receiving a character string input by operating the input device 303. When the search button 504 is pressed, a search is executed for the input word group. The slider 505 is an interface for setting the weights of evaluation values Pf, Ps, Pid, Pq, and IPFi described later. Moving the slider to the left decreases the weight, and moving it to the right increases the weight.

Returning to FIG. 4, the input unit 403 receives a character string input by operating the input device 303 from the keyword input field 503 of the analysis result display screen 500. The input unit 403 analyzes the input character string when the search button 504 is pressed. When the input character string is a plurality of words separated by a space, the input unit 403 passes the plurality of words to the specifying unit 404. When the input character string is a sentence, the input unit 403 cuts out a word from the sentence by morphological analysis and passes it to the specifying unit 404.

The specifying unit 404 specifies the description position of the search keyword group from the tree structure data T analyzed by the analysis unit 402. Here, the search keyword group may be a plurality of words input to the keyword input field 503 of FIG. 5, or may be a specification item name in the specification item name dictionary 400. If the specification item name is a character string that is not separated by spaces, the specifying unit 404 cuts out words from the character string by morphological analysis. For example, in the case of “water leakage due to loosening of bolts”, “bolt”, “loosening”, and “water leakage” are cut out. The specifying unit 404 searches for a path including the word group by tracing upward from the paragraphs serving as the leaf nodes of the tree structure data T.

FIG. 6 is an explanatory diagram showing a search example of the tree structure data T1. For example, when the search keyword group is “bolt phenomenon water leakage”, the specifying unit 404 searches from the leaf nodes N111a, N112a, N211a, N211c, and N212a of the tree structure data T1 in FIG. 1 toward the root node. The specifying unit 404 specifies the positions of the search keywords “bolt”, “phenomenon”, and “water leakage” in the node being searched.

To describe an example in which the search starts from the leaf node N111a, the specifying unit 404 first specifies the water leakage B1 in the node N111a. Next, the specifying unit 404 traces its own node and the upper nodes from the leaf node N111a, where the water leakage B1 was found, and continues searching for the search keyword group “bolt phenomenon water leakage”. The specifying unit 404 then specifies the phenomenon D1 in the node N111. The specifying unit 404 then traces its own node and the upper nodes from the node N111, where the phenomenon D1 was found, and continues searching for the search keyword group. The specifying unit 404 then specifies the bolt A1 in the node N1. When the root node N0 is reached, the search from the leaf node N111a ends. The specifying unit 404 can thereby specify the path “water leakage B1 → phenomenon D1 → bolt A1”.

To describe an example in which the search starts from the leaf node N211a, the specifying unit 404 specifies the water leakage B3 in the leaf node N211a. Thereafter, tracing upward from the node N211a, the specifying unit 404 finds the phenomenon D2 and the water leakage B2, but reaches the root node N0 without finding “bolt”. Therefore, the specifying unit 404 specifies the path “water leakage B3 → phenomenon D2 → water leakage B2”.

Next, an example in which the search starts from the leaf node N211c when the search keyword group is “pipe corner gap” will be described. The specifying unit 404 searches the leaf node N211c for “gap” but finds no hit, and therefore searches the node N211b, which is the upper node. The specifying unit 404 specifies the gap G in the node N211b. The specifying unit 404 then traces its own node and the upper nodes from the node N211b, where the gap G was found, and continues searching for the search keyword group “pipe corner gap”. The specifying unit 404 then finds the corner F in its own node.

Next, the specifying unit 404 traces its own node and the upper nodes from the node N211b, where the corner F was found, and continues searching for the search keyword group “pipe corner gap”. The specifying unit 404 then specifies the pipe E1 in the node N21. The specifying unit 404 then traces its own node and the upper nodes from the node N21, where the pipe E1 was found, and continues searching for the search keyword group. The specifying unit 404 then specifies the pipe E2 in the node N2. When the root node N0 is reached, the search from the leaf node N211c ends. The specifying unit 404 can thereby specify the path “gap G → corner F → pipe E1 → pipe E2”.

Thus, the “gap G” at the start (leaf node side) of the path does not have to be in the leaf node N211c. In addition, even when the words corner F and gap G are apart from each other within the paragraph of the node N211b, they may still be semantically related, so the specifying unit 404 specifies them.
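The leaf-to-root tracing described above can be sketched as follows. This is only an illustration under stated assumptions: keywords are matched by case-insensitive substring search, the `Node` class and `trace_path` function are hypothetical, and the leaf paragraph text is invented so that each keyword occurs once on the branch.

```python
class Node:
    def __init__(self, name, text, parent=None):
        self.name = name
        self.text = text
        self.parent = parent


def trace_path(leaf, keywords):
    """Trace from a leaf to the root, recording (keyword, node name) hits."""
    hits = []
    node = leaf
    while node is not None:
        lowered = node.text.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:  # naive substring match (assumption)
                hits.append((keyword, node.name))
        node = node.parent
    return hits


# One branch of T1; the paragraph text of N111a is a made-up example.
n0 = Node("N0", "")
n1 = Node("N1", "1. Bolt", n0)
n11 = Node("N11", "1.1 Installation of pipe end", n1)
n111 = Node("N111", "1.1.1 Phenomenon and Cause", n11)
n111a = Node("N111a", "Water leakage occurred at the pipe end.", n111)

path = trace_path(n111a, ["bolt", "phenomenon", "water leakage"])
print(path)
# [('water leakage', 'N111a'), ('phenomenon', 'N111'), ('bolt', 'N1')]
```

The result corresponds to the path “water leakage B1 → phenomenon D1 → bolt A1”; a keyword that never occurs on the branch simply produces no hit, as with “bolt” in the search from N211a.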

The specifying unit 404 also extracts the shortest path from each specified path. The shortest path is the path that includes the largest number of types of search keywords and has the smallest total (or average) inter-word distance within the specified path. The shortest path is extracted because, when the same keyword appears more than once in a specified path, occurrences at distant positions are unlikely to be related, and the path should be evaluated without them. For example, a search keyword (“pipe”) may appear more than once, as in the path “gap G → corner F → pipe E1 → pipe E2” shown in FIG. 6. In this case, the path with the shortest inter-word distance is extracted. As the inter-word distance, at least one of the inter-word distance regarding the number of nodes and the inter-word distance regarding the number of words is applied, according to the positions of the search keywords in the nodes constituting the path. For example, when the two words are in different nodes, the inter-word distance regarding the number of nodes is applied, and when the two words are in the same node, the inter-word distance regarding the number of words is applied. Details will be described later.

FIGS. 7A and 7B are explanatory diagrams showing specific examples of the shortest path. In FIGS. 7A and 7B, the search keyword group is “valve bolt water leakage”. In FIG. 7A, (A) shows a path P7 searched from the tree structure data T. The path P7 traces “water leakage 73b → bolt 72b → water leakage 73a → valve 71b → valve 71b → bolt 72a → valve 71a”. In FIG. 7A, the paths P71 and P72 in (B) are examples of paths extracted from the path P7 that include the largest number of types of search keywords (here, paths including all the search keywords). The specifying unit 404 calculates the inter-word distance for the paths P71 and P72. In the path P71, the search keyword group is specified with the order of “valve” and “bolt” in “valve bolt water leakage” switched. Paths such as “water leakage 73a → bolt 72a → valve 71a”, “water leakage 73b → bolt 72b → valve 71a”, and “water leakage 73b → bolt 72a → valve 71a” are also specified, but are omitted here for simplicity.

In the case of the path P71, the sum d1 (= d11 + d12) of the inter-word distance d11 between the valve 71b and the bolt 72a and the inter-word distance d12 between the valve 71b and the water leakage 73a is the inter-word distance in the path P71. In the case of the path P72, the total distance d2 (= d21 + d22) of the inter-word distance d21 between the valve 71b and the bolt 72b and the inter-word distance d22 between the bolt 72b and the water leakage 73b is the inter-word distance in the path P72. The identifying unit 404 identifies the path having the shorter inter-word distance among the paths P71 and P72 as the shortest path of the path P7.
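The selection of the shortest path can be sketched as an exhaustive comparison of one occurrence per keyword. The numeric positions below are invented for illustration (the source gives no numeric distances for FIG. 7A), the distance between two occurrences is simplified to the difference of their positions along the trace, and `shortest_path` is a hypothetical name.

```python
from itertools import product


def shortest_path(hits, keywords):
    """Pick one occurrence per keyword so that the total distance is smallest.

    hits: list of (keyword, label, position) tuples along one traced path.
    Only keyword types that actually occur are considered.
    """
    occurrences = [[h for h in hits if h[0] == kw] for kw in keywords]
    occurrences = [occ for occ in occurrences if occ]
    best, best_total = None, None
    for combo in product(*occurrences):
        ordered = sorted(combo, key=lambda h: h[2])
        # Sum of distances between neighbouring occurrences along the trace.
        total = sum(b[2] - a[2] for a, b in zip(ordered, ordered[1:]))
        if best_total is None or total < best_total:
            best, best_total = ordered, total
    return best, best_total


# Occurrences on path P7 with made-up positions (leaf side = 0).
hits = [
    ("water leakage", "73b", 0), ("bolt", "72b", 3), ("water leakage", "73a", 5),
    ("valve", "71b", 6), ("bolt", "72a", 7), ("valve", "71a", 12),
]
best, total = shortest_path(hits, ["valve", "bolt", "water leakage"])
print([label for _kw, label, _pos in best], total)  # ['73a', '71b', '72a'] 2
```

With these invented positions the combination corresponding to the path P71 has the smallest total; with other positions the path P72 or another combination could be selected instead.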

In FIG. 7B, the path Pb1 in (A) includes two valves, 71a and 71b. In this case, the inter-word distance between the valve 71b and the bolt 72 is shorter than the inter-word distance between the valve 71a and the bolt 72. Therefore, the specifying unit 404 specifies the path Pb2 from the path Pb1 and sets it as the shortest path.

In (B), the path Pc includes two bolts, 72a and 72b. In this case, the specifying unit 404 compares the sum of the inter-word distance between the valve 71 and the bolt 72a and the inter-word distance between the bolt 72a and the water leakage 73 with the sum of the inter-word distance between the valve 71 and the bolt 72b and the inter-word distance between the bolt 72b and the water leakage 73, and specifies the path with the shorter sum as the shortest path. Alternatively, the specifying unit 404 may give priority to the bolt 72b, whose node is closer to the water leakage 73 at the start (leaf node side) of the path Pc, and specify the path “water leakage 73 → bolt 72b → valve 71” as the shortest path.

In (C), the path Pd1 includes two water leakages, 73a and 73b. In this case, since the inter-word distance between the bolt 72 and the water leakage 73a is shorter than the inter-word distance between the bolt 72 and the water leakage 73b, the specifying unit 404 specifies the path Pd2 from the path Pd1 and sets it as the shortest path.

Even if the searched path includes a plurality of paths as shown in (A) to (C) of FIG. 7B, the path with the shortest distance between words may be specified as the shortest path.

Returning to FIG. 4, the calculation unit 405 calculates the evaluation values Pf, Ps, Pid, Pq, and IPFi for each word set specified by the specifying unit 404 in the tree structure data T. Specifically, for example, the calculation unit 405 calculates these five values and uses them to calculate an evaluation value of the path (the specified word set). For example, a weighted linear sum is used as the evaluation function, and the evaluation value is calculated with that function.

The evaluation value Pf indicates what proportion of the words in the search keyword group are included in the set of words specified by the specifying unit 404. Not all the words constituting the search keyword group are necessarily described in the document. Therefore, the evaluation value Pf is added to the evaluation function as a term whose value increases as the proportion of the search keyword words that are included increases. The evaluation value Pf is calculated by the following formula (1).
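A weighted linear sum of this kind can be sketched as below. The weights and the five values are hypothetical numbers chosen only to show the calculation and the resulting ranking; the function name `path_score` is likewise an assumption.

```python
def path_score(values, weights):
    """Weighted linear sum of the evaluation values of one path."""
    return sum(weights[name] * value for name, value in values.items())


# Hypothetical weights (for example, as set with the slider 505) and values.
weights = {"Pf": 1.0, "Ps": 1.0, "Pid": 2.0, "Pq": 1.0, "IPFi": 1.0}
candidates = {
    "path_1": {"Pf": 1.0, "Ps": 0.5, "Pid": 0.25, "Pq": 0.5, "IPFi": 0.5},
    "path_2": {"Pf": 0.25, "Ps": 0.5, "Pid": 0.5, "Pq": 0.5, "IPFi": 0.5},
}
ranking = sorted(candidates,
                 key=lambda name: path_score(candidates[name], weights),
                 reverse=True)
print(ranking)  # ['path_1', 'path_2']
```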

Pf = New / Nsw (1)

New is the number of words included in the set of words specified by the specifying unit 404, and Nsw is the number of words included in the search keyword group.

For example, in FIG. 6, when the search keyword group is “bolt phenomenon water leakage” and the specified word set is {bolt A1, phenomenon D1, water leakage B1}, New = Nsw = 3, so Pf = 1. On the other hand, when the specified word set is {phenomenon D2, water leakage B3}, New = 2 and Nsw = 3, so Pf = 2/3.

The evaluation value Ps indicates the proportion of the words in the set specified by the specifying unit 404 that appear in chapter titles. That is, the evaluation value Ps indicates whether the words in the specified set are described along the chapter structure. A word group selected as a specification item name or a search keyword group is main information in the document 1 and is often described in a title such as a chapter or section title. Therefore, the evaluation value Ps is added to the evaluation function as a term whose value increases as the proportion of words included in titles increases. The evaluation value Ps is calculated by the following formula (2).

Ps = Nsm / New (2)

Nsm is the number of chapter titles including the word specified by the specifying unit 404.
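As a minimal sketch of formula (2), the following Python code assumes that Nsm is counted as the number of specified words whose occurrence lies in a title node. That reading, the function name `title_ratio`, and the `(word, node_kind)` representation are assumptions, and the sample path is hypothetical.

```python
def title_ratio(occurrences):
    """Ps = Nsm / New for a list of (word, node_kind) occurrences."""
    if not occurrences:
        raise ValueError("the specified word set must not be empty")
    # Nsm (assumed reading): specified words that were found in a title node.
    in_title = sum(1 for _, kind in occurrences if kind == "title")
    return in_title / len(occurrences)  # New: all specified words


# Hypothetical path: two words found in titles, one in a paragraph.
path = [("bolt", "title"), ("phenomenon", "title"), ("leaked water", "paragraph")]
ps = title_ratio(path)
print(ps)
```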

The evaluation value Pid is the reciprocal of the distance between words. As the distance between words, the distance between words regarding the number of nodes and the distance between words regarding the number of words can be used.

Since the inter-word distance related to the number of words is calculated using the sequence of words arranged in reading order, the distance between a word described in the title of Chapter 1 and a word described in a paragraph in Section 1.5 is a large value. Judged by that distance alone, the two words are therefore likely to be treated as unrelated in meaning. On the other hand, the inter-word distance related to the number of nodes, which is a feature of this method, is calculated using the sequence of title and paragraph nodes traced when the specifying unit 404 searches for a set of words.

For example, in the case of the path “leakage B1 → phenomenon D1 → bolt A1” searched in FIG. 6, the inter-word distance regarding the number of nodes is first calculated for “leakage B1 → phenomenon D1”. In this case, there is no node between the node N111a of one word (leakage B1) and the node N111 of the other word (phenomenon D1). Therefore, the inter-word distance regarding the number of nodes of “leakage B1 → phenomenon D1” is 1. Next, the inter-word distance regarding the number of nodes of “phenomenon D1 → bolt A1” is calculated. In this case, there is one node between the node N1 of one word (bolt A1) and the node N111 of the other word (phenomenon D1). Therefore, the inter-word distance regarding the number of nodes of “phenomenon D1 → bolt A1” is 2. The inter-word distance of the path “leakage B1 → phenomenon D1 → bolt A1” is thus 1 + 2 = 3.

Also, the method of calculating the inter-word distance differs depending on whether the two target words are in the same node or in different nodes. The inter-word distance related to the number of words may be applied to two words in the same node, and the inter-word distance related to the number of nodes may be applied to two words in different nodes.

For example, in the case of the path “gap G → corner F → pipe E1 → pipe E2” searched in FIG. 6, “gap G → corner F → pipe E1” becomes the shortest path as shown in FIG. 7B. For “gap G → corner F” in the shortest path, the calculation unit 405 calculates an inter-word distance related to the number of words. For “corner F → pipe E1”, the calculation unit 405 calculates the inter-word distance related to the number of nodes.

Thus, for example, for a word described in the title of Chapter 1 and a word described in a paragraph in Section 1.5, the inter-word distance related to the number of nodes is 2, so the two words are judged highly likely to be connected in meaning. The smaller the calculated inter-word distance, the more probable the set of words. Therefore, the evaluation value Pid, which is the reciprocal of the inter-word distance, is added to the evaluation function as a term. The evaluation value Pid is calculated by the following formula (3).

Pid = 1 / Σ_i D_i (3)

W_i is the i-th word (i is an integer of 1 or more). D_i is the inter-word distance between W_i and W_{i+1}. The calculation method of the inter-word distance D_i is switched depending on the positional relationship between W_i and W_{i+1}.

When W_i and W_{i+1} are in the same node, the inter-word distance D_i is the inter-word distance related to the number of words, and D_i = (the number of words between W_i and W_{i+1}) + 1.

On the other hand, when W_i and W_{i+1} are in different nodes, the inter-word distance D_i is the inter-word distance related to the number of nodes, and D_i = (the number of nodes between N_i and N_{i+1}) + 1. Here, N_i is the node including W_i, and N_{i+1} is the node including W_{i+1}.
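The switching rule for D_i can be sketched in Python as follows. The sketch takes Pid as the reciprocal of the summed distances along the path, which is how the earlier path example totals 1 + 2 = 3; that reading is an assumption. The parent map, the `(node_id, word_index)` representation, and all function names are also assumptions, with node names only loosely modeled on those used in the text.

```python
def ancestors(node, parent):
    """Return [node, parent(node), ..., root]."""
    chain = [node]
    while parent.get(chain[-1]) is not None:
        chain.append(parent[chain[-1]])
    return chain


def inter_word_distance(a, b, parent):
    """D_i between two occurrences given as (node_id, word_index)."""
    (node_a, idx_a), (node_b, idx_b) = a, b
    if node_a == node_b:
        # Same node: (words between the two words) + 1 = index difference.
        return abs(idx_a - idx_b)
    # Different nodes: (nodes strictly between them on the tree path) + 1,
    # which equals the number of edges on that path.
    up_a, up_b = ancestors(node_a, parent), ancestors(node_b, parent)
    steps_in_b = {n: d for d, n in enumerate(up_b)}
    for steps_a, n in enumerate(up_a):
        if n in steps_in_b:
            return steps_a + steps_in_b[n]
    raise ValueError("nodes are not in the same tree")


def pid(path, parent):
    """Reciprocal of the summed inter-word distances along a path."""
    total = sum(inter_word_distance(a, b, parent) for a, b in zip(path, path[1:]))
    if total == 0:
        raise ValueError("path needs at least two distinct word positions")
    return 1 / total


# Hypothetical hierarchy: N1 > N11 > N111 > N111a (a paragraph under N111).
parent = {"N1": None, "N11": "N1", "N111": "N11", "N111a": "N111"}
path = [("N111a", 3), ("N111", 0), ("N1", 0)]  # leakage -> phenomenon -> bolt
print(pid(path, parent))
```

For this hypothetical tree the two steps contribute distances 1 and 2, so the summed distance is 3.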

The evaluation value Pq indicates the degree of coincidence between the word order in the search keyword group (or in the specification item name) and the order in which the words appear in the chapter structure. When the document 100 is a specification, it is often written with the hierarchical structure of the specification mapped onto the chapter structure. For this reason, the degree of coincidence becomes high when the order of the words constituting the specification item name matches the reverse of the word sequence on the path generated by tracing the chapter structure from the lower hierarchy. Therefore, the evaluation value Pq, which is the degree of coincidence of the appearance order of words on the chapter structure, is added to the evaluation function as a term. The evaluation value Pq is calculated by the following formula (4).

Pq = Neq / New (4)

Neq is the number of pairs of words W_i and W_{i+1}, adjacent in the word sequence of the search keyword group, that are found consecutively in that same order. For example, suppose the search keyword group consists of “bolt”, “leakage”, and “cause” in this order. When the specifying unit 404 finds “bolt”, “leakage”, and “cause” in this order, two pairs are counted, namely “bolt” and “leakage”, and “leakage” and “cause”, so Neq is 2. New is 3. Therefore, Pq = 2/3.

On the other hand, when the specifying unit 404 finds “bolt”, “cause”, and “leakage” in this order, Neq is 0. New is 3. Therefore, Pq = 0. Alternatively, Neq may be the number of pairs of words W_i and W_{i+1} whose order of appearance is correct, even if another search keyword is present between W_i and W_{i+1}.
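Both ways of counting Neq can be illustrated with a short Python sketch. The function name `order_agreement` and the `allow_gaps` flag are hypothetical, and each word is assumed to occur once in the found sequence.

```python
def order_agreement(keywords, found_sequence, allow_gaps=False):
    """Pq = Neq / New for the sequence in which the words were found."""
    if not found_sequence:
        raise ValueError("the found sequence must not be empty")
    position = {word: i for i, word in enumerate(found_sequence)}
    neq = 0
    # Check each pair that is adjacent in the search keyword group.
    for first, second in zip(keywords, keywords[1:]):
        if first not in position or second not in position:
            continue
        gap = position[second] - position[first]
        # Strict counting needs the pair to be adjacent; relaxed counting
        # only needs the order to be correct.
        if gap == 1 or (allow_gaps and gap > 0):
            neq += 1
    return neq / len(found_sequence)  # New: number of words found


keywords = ["bolt", "leakage", "cause"]
pq_same_order = order_agreement(keywords, ["bolt", "leakage", "cause"])
pq_swapped = order_agreement(keywords, ["bolt", "cause", "leakage"])
pq_relaxed = order_agreement(keywords, ["bolt", "cause", "leakage"], allow_gaps=True)
print(pq_same_order, pq_swapped, pq_relaxed)
```

With relaxed counting, the swapped sequence still scores one pair, because “bolt” precedes “leakage” even though “cause” lies between them.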

The evaluation value IPFi is an average value of the importance of the words, obtained using the number of nodes (titles or paragraphs) in which each word included in the set of words specified by the specifying unit 404 appears. For example, the words constituting a specification item name are also used when describing other specification items or matters other than specification items. In order to reduce erroneous extraction of other specification items and of words unrelated to specification items, when a characteristic word is included in the word set specified by the specifying unit 404, the evaluation value of that word set needs to be high.

IDF (Inverse Document Frequency) is an index of how characteristic a word is. IDF is the reciprocal of the number of documents in which a word appears; the larger this value, the fewer documents the word appears in and the more characteristic the word is. In this embodiment, the reciprocal of the number of nodes in which a word appears in the tree structure data T (IPF: Inverse Paragraph Frequency) is used instead of the number of documents. The evaluation value IPFi is calculated by the following equation (5).

IPFi = log (Ntp / Nen) (5)

The logarithm is used to reduce the change in value. Nen is the number of nodes in which the word W_i specified by the specifying unit 404 appears, and Ntp is the total number of nodes in the document.
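A minimal Python sketch of the IPF term follows. It takes the ratio as total nodes over nodes containing the word, so that rarer words score higher, in line with the description of IPF as an inverse frequency; treat that orientation, the function names, and the sample nodes as assumptions.

```python
import math


def ipf(word, nodes):
    """Inverse paragraph frequency of one word over a list of node word sets."""
    containing = sum(1 for words in nodes if word in words)  # Nen
    if containing == 0:
        raise ValueError(f"{word!r} does not appear in any node")
    return math.log(len(nodes) / containing)                 # Ntp / Nen


def mean_ipf(words, nodes):
    """Average IPF over the specified word set."""
    return sum(ipf(w, nodes) for w in words) / len(words)


# Hypothetical tree flattened to one word set per title or paragraph node.
nodes = [
    {"bolt", "leakage"},
    {"bolt"},
    {"pipe", "bolt"},
    {"leakage", "cause"},
]
print(ipf("bolt", nodes), ipf("cause", nodes), mean_ipf(["bolt", "cause"], nodes))
```

Here “cause” appears in one node out of four and “bolt” in three, so “cause” is treated as the more characteristic word.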

For each word set specified by the specifying unit 404, the calculation unit 405 calculates a final evaluation value P using the evaluation function, which is a weighted linear sum having the values Pf, Ps, Pid, Pq, and IPFi as terms. Values set in advance are used as the weights of Pf, Ps, Pid, Pq, and IPFi, but the weights can be changed by operating the slider 505 before the evaluation value P is calculated. For example, when it is desired to change the evaluation method, the weights of the evaluation values Pf, Ps, Pid, Pq, and IPFi may be changed with the slider 505. Different search results can thereby be obtained.
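The weighted linear sum and the effect of changing a weight can be sketched in Python as follows. The weight values, the candidate term values, and the names `evaluation_value` and `rank` are hypothetical and chosen only for illustration.

```python
def evaluation_value(terms, weights):
    """P as a weighted linear sum of the five terms."""
    return sum(weights[name] * terms[name] for name in weights)


def rank(candidates, weights):
    """Candidate names in descending order of P."""
    return sorted(
        candidates,
        key=lambda name: evaluation_value(candidates[name], weights),
        reverse=True,
    )


# Hypothetical term values for two specified word sets.
candidates = {
    "path A": {"Pf": 1.0, "Ps": 0.5, "Pid": 0.5, "Pq": 1.0, "IPFi": 0.2},
    "path B": {"Pf": 1.0, "Ps": 1.0, "Pid": 0.25, "Pq": 0.0, "IPFi": 0.6},
}
weights = {"Pf": 1.0, "Ps": 1.0, "Pid": 1.0, "Pq": 1.0, "IPFi": 1.0}

ranked = rank(candidates, weights)
# Setting the Pq weight to zero mimics moving one slider to its minimum.
reranked = rank(candidates, dict(weights, Pq=0.0))
print(ranked, reranked)
```

With equal weights, path A ranks first; once word order is ignored, path B overtakes it, which illustrates how changing the weights yields different search results.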

The display unit 406 generates output data based on the evaluation value P calculated by the calculation unit 405. Specifically, for example, the display unit 406 generates data (for example, XML data) related to an output screen that can display a set of words specified by the specifying unit 404 in descending order of the evaluation value P.

FIG. 8 is an explanatory diagram illustrating an output screen example 1 generated by the display unit 406. The output screen 800 of FIG. 8 is an output screen when the tree structure data T1 is searched by the specification item names of the specification item name dictionary 400. The output screen 800 includes an index portion, which is a first panel 801, and a document display portion, which is a second panel 802. The first panel 801 displays, in descending order of the evaluation value P calculated by the calculation unit 405, the specification item name used for searching the tree structure data T1, the category name corresponding to the specification item name, a link to the specified description location, and the ranking. On the second panel 802, the document 100 is displayed. When the links 813 and 823 of the first panel 801 are designated, the description locations designated by the links 813 and 823 are displayed on the second panel 802.

FIG. 9 is an explanatory diagram illustrating an output screen example 2 generated by the display unit 406. The output screen 900 of FIG. 9 is an output screen when the tree structure data T1 is searched by the input keyword group. The first panel 901 displays the tree structure data excluding paragraphs, together with links to the specified description locations. The link to a description location is displayed at the position of the word closest to the leaf node among the words included in the description location (specified path). At this time, only the description locations whose evaluation value P is larger than a threshold value designated in advance may be displayed. The document 100 is displayed on the second panel 902. When the link (description location (1)) of the first panel 901 is designated, the description location 903 designated by the link is displayed.

Note that the weights of Pf, Ps, Pid, Pq, and IPFi may be changed while the output screens 800 and 900 are displayed. In this case, the evaluation value P is recalculated by the calculation unit 405 with the changed weights, and the display of the first panels 801 and 901 is updated according to the recalculated evaluation value P.

When the output device 304 is a display device external to the information search device 300, the display unit 406 transmits information related to the output screens 800 and 900 from the communication IF 305 to the external display device, whereby the output screens 800 and 900 are displayed on the external display device.

<Example of information search processing procedure>
FIG. 10 is a flowchart illustrating an example of an information search processing procedure performed by the information search apparatus 300. The information search apparatus 300 executes an acquisition process (step S1001) by the acquisition unit 401, a chapter structure analysis process (step S1002) by the analysis unit 402, a search keyword acquisition process (step S1003) by the input unit 403 or using the specification item name dictionary 400, a specifying process (step S1004) by the specifying unit 404, a calculation process (step S1005) by the calculation unit 405, and a display process (step S1006) by the display unit 406. The series of processes is thereby completed.

FIG. 11 is a flowchart showing a detailed processing procedure example of the chapter structure analysis processing (step S1002) shown in FIG. The information search device 300 generates a character line from the character information acquired by the acquisition unit 401 (step S1101). Next, the information retrieval apparatus 300 identifies a title such as a chapter, a section, or a term from the generated set of character lines (step S1102). Further, the information search device 300 identifies a paragraph from the character line other than the title identified in step S1102 in the generated set of character lines (step S1103). Then, the information search device 300 generates tree structure data by executing a hierarchical relationship analysis of the specified title and paragraph (step S1104). As a result, the chapter structure analysis processing (step S1002) is terminated, and the process proceeds to step S1003.
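Steps S1102 to S1104 can be sketched in Python under a strong simplifying assumption: titles are recognized only by a leading section number such as "1" or "1.5", and the depth of the number gives the hierarchy level. The embodiment does not state how titles are identified, so the regular expression, the dictionary node layout, and the function name `build_tree` are all assumptions. Step S1101 (generating character lines from character positions) is taken as already done.

```python
import re

# Assumed heading form: "1 Title", "1.5 Title", "2. Title".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def build_tree(lines):
    """Build a title/paragraph tree from already extracted character lines."""
    root = {"kind": "root", "text": "", "depth": 0, "children": []}
    stack = [root]  # holds the root and the currently open titles only
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            depth = match.group(1).count(".") + 1
            node = {"kind": "title", "text": match.group(2),
                    "depth": depth, "children": []}
            # Close titles at the same or a deeper level.
            while stack[-1]["depth"] >= depth:
                stack.pop()
        else:
            node = {"kind": "paragraph", "text": line,
                    "depth": None, "children": []}
        stack[-1]["children"].append(node)
        if match:
            stack.append(node)
    return root


lines = ["1 Bolt", "1.1 Phenomenon", "Water leaked from the joint.",
         "1.2 Cause", "2 Pipe"]
tree = build_tree(lines)
bolt = tree["children"][0]
print([child["text"] for child in bolt["children"]])
```

A paragraph that happens to begin with a number would be misread as a title by this heuristic, so a practical analyzer would need additional cues such as the character positions acquired by the acquisition unit 401.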

FIG. 12 is a flowchart showing a detailed processing procedure example of the specifying process (step S1004). The information search apparatus 300 specifies paths including the search keyword group from the tree structure data T1 and T2 obtained by the chapter structure analysis process (step S1002) (step S1201). In step S1201, any path including the search keyword group qualifies, and the order in which the search keywords appear on the path may differ from the input order of the search keyword group. Further, the path need not include all the search keywords; it may include only a part of the search keyword group.

In this case, for example, the number of search keywords allowed to be missing is set in advance, and the information search device 300 specifies a path if the number of missing search keywords is within the allowable range. For example, if the number of search keywords allowed to be missing is 1 and the search keywords are W1 to W3, a path including at least two of W1 to W3 is specified. Next, the information retrieval apparatus 300 identifies the shortest path for each specified path (step S1202).
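The tolerance for missing keywords and the shortest-path choice can be sketched in Python as follows. The sketch prefers the candidate with the most distinct search keywords and then the smallest summed inter-word distance, following the wording of the claims; the candidate dictionaries with precomputed distances and the name `pick_path` are assumptions.

```python
def pick_path(candidates, keywords, allowed_missing=1):
    """Choose a path that covers enough keywords and is shortest."""
    wanted = set(keywords)
    minimum = len(wanted) - allowed_missing

    def coverage(candidate):
        return len(wanted & set(candidate["words"]))

    eligible = [c for c in candidates if coverage(c) >= minimum]
    if not eligible:
        return None
    # Most keywords first, then the shortest summed inter-word distance.
    return min(eligible, key=lambda c: (-coverage(c), c["distance"]))


keywords = ["W1", "W2", "W3"]
candidates = [
    {"words": ["W1"], "distance": 0},              # too many keywords missing
    {"words": ["W1", "W3"], "distance": 2},        # one missing: allowed
    {"words": ["W1", "W2", "W3"], "distance": 5},
    {"words": ["W1", "W2", "W3"], "distance": 3},  # full coverage, shortest
]
best = pick_path(candidates, keywords)
print(best)
```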

The specifying process (step S1004) is thereby terminated, and the process proceeds to step S1005.

As described above, this embodiment suppresses erroneous retrieval of word sets that a neighborhood search would return because their inter-word distance related to the number of words is short, but that do not actually make sense together. Also, when nodes are in a direct hierarchical relationship, the inter-word distance related to the number of nodes is applied however deep the hierarchy is, so probable search keywords can be narrowed down even when the inter-word distance related to the number of words is long. Search accuracy is thereby improved. In addition, in this embodiment, even if there is an error in the chapter structure, the candidates with the shortest inter-word distance related to the number of titles are ranked at the top while the chapter structure is observed, so the description location can still be specified.

The present invention is not limited to the above-described embodiments, and includes various modifications and equivalent configurations within the scope of the appended claims. For example, the above-described embodiments have been described in detail for easy understanding of the present invention, and the present invention is not necessarily limited to those having all the configurations described. A part of the configuration of one embodiment may be replaced with the configuration of another embodiment. Moreover, the configuration of another embodiment may be added to the configuration of a certain embodiment. In addition, for a part of the configuration of each embodiment, another configuration may be added, deleted, or replaced.

In addition, each of the above-described configurations, functions, processing units, processing means, and the like may be realized in hardware by designing a part or all of them as, for example, an integrated circuit, or may be realized in software by a processor interpreting and executing a program that realizes each function.

Information such as programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, and an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, and a DVD.

Also, the control lines and information lines shown are those considered necessary for the explanation, and not all the control lines and information lines necessary for implementation are necessarily shown. In practice, almost all the components may be considered to be connected to each other.

Claims

An information retrieval apparatus comprising a processor that executes a program and a storage device that stores the program,
wherein the processor executes:
a specifying process of searching tree structure data, in which titles and paragraphs in a document are hierarchized as nodes based on character information indicating characters in the document and positions of the characters, for a path that traces a search keyword group, and thereby specifying positions on the tree structure data of search keywords included in the path;
a calculation process of calculating an evaluation value regarding the specifying accuracy of the search keyword group based on the positions on the tree structure data of the search keywords included in the path specified by the specifying process; and
a display process of displaying a description location corresponding to the path in the document based on the evaluation value calculated by the calculation process.
The information retrieval apparatus according to claim 1, wherein, in the calculation process, the processor calculates the evaluation value based on an inter-word distance between a first search keyword among the search keywords included in the path and a second search keyword connected to the first search keyword by the path.
The information retrieval apparatus according to claim 2, wherein, in the calculation process, when the first search keyword and the second search keyword are included in different nodes, the processor calculates the evaluation value based on the inter-word distance corresponding to the number of nodes between the first search keyword and the second search keyword.
The information retrieval apparatus according to claim 3, wherein, in the calculation process, when the first search keyword and the second search keyword are included in the same node, the processor calculates the evaluation value based on the inter-word distance corresponding to the number of words between the first search keyword and the second search keyword.
The information retrieval apparatus according to claim 1, wherein, in the specifying process, the processor specifies, among the paths, the path that includes the largest number of types of search keywords in the search keyword group and that has the shortest sum of inter-word distances between a first search keyword and a second search keyword connected to the first search keyword by the path, and specifies the positions on the tree structure data of the search keywords included in the specified path.
The information retrieval apparatus according to claim 5, wherein, in the specifying process, when the first search keyword and the second search keyword are included in different nodes, the processor uses the inter-word distance corresponding to the number of nodes between the first search keyword and the second search keyword to specify the shortest path.
The information retrieval apparatus according to claim 6, wherein, in the specifying process, when the first search keyword and the second search keyword are included in the same node, the processor uses the inter-word distance corresponding to the number of words between the first search keyword and the second search keyword to specify the shortest path.
The information retrieval apparatus according to claim 1, wherein, in the calculation process, the processor calculates the evaluation value based on the ratio of the search keywords included in the path to the search keyword group.
The information retrieval apparatus according to claim 1, wherein, in the calculation process, the processor calculates the evaluation value based on the rate at which the search keywords included in the path appear in the titles.
The information retrieval apparatus according to claim 1, wherein, in the calculation process, the processor calculates the evaluation value based on a degree of coincidence between the order in which the search keyword group is arranged and the order in which the search keywords included in the path appear on the path.
The information retrieval apparatus according to claim 1, wherein, in the calculation process, the processor calculates the evaluation value based on an importance of the search keywords included in the path, the importance being obtained from the number of paragraphs in which a search keyword included in the path appears, the total number of paragraphs in the document, the number of titles in which a search keyword included in the path appears, and the total number of titles in the document.
An information retrieval method in which a processor that executes a program stored in a storage device executes:
a specifying process of searching tree structure data, in which titles and paragraphs in a document are hierarchized as nodes based on character information indicating characters in the document and positions of the characters, for a path that traces a search keyword group, and thereby specifying positions on the tree structure data of search keywords included in the path;
a calculation process of calculating an evaluation value regarding the specifying accuracy of the search keyword group based on the positions on the tree structure data of the search keywords included in the path specified by the specifying process; and
a display process of displaying a description location corresponding to the path in the document based on the evaluation value calculated by the calculation process.
An information retrieval program that causes a processor to execute:
a specifying process of searching tree structure data, in which titles and paragraphs in a document are hierarchized as nodes based on character information indicating characters in the document and positions of the characters, for a path that traces a search keyword group, and thereby specifying positions on the tree structure data of search keywords included in the path;
a calculation process of calculating an evaluation value regarding the specifying accuracy of the search keyword group based on the positions on the tree structure data of the search keywords included in the path specified by the specifying process; and
a display process of displaying a description location corresponding to the path in the document based on the evaluation value calculated by the calculation process.