WO2016031055A1 - Information retrieval apparatus, information retrieval method, and information retrieval program - Google Patents

Info

Publication number
WO2016031055A1
Authority
WO
WIPO (PCT)
Prior art keywords
search keyword
path
search
evaluation value
words
Prior art date
Application number
PCT/JP2014/072762
Other languages
French (fr)
Japanese (ja)
Inventor
関 峰伸
義行 小林
Original Assignee
株式会社日立製作所
Priority date
Filing date
Publication date
Application filed by 株式会社日立製作所
Priority to PCT/JP2014/072762
Publication of WO2016031055A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to an information search device, an information search method, and an information search program for searching for information.
  • when a search query includes a plurality of keywords, there is a method called neighborhood search in which search results are ranked using the distance between the keywords as they appear in a document.
  • the search result ranking apparatus of Patent Document 1 ranks search results using the proximity of the keywords of an input search query in a search over a structured text set, and evaluates the distance between keywords using the meaning of the text structure. The search result ranking apparatus obtains the proximity between the keywords using the distance, and assesses the relevance of a text based on the number of occurrences of the query keywords in the entire text and the number of occurrences of the keywords in the text. Thereafter, the search result ranking apparatus calculates a document score from the proximity between keywords and the relevance of the text, and ranks the search results.
  • the above-described conventional technique has a problem of erroneously extracting word pairs that are not meaningful. That is, in the above-described prior art, since the hierarchical chapter structure of the document is not grasped, a word set cannot be extracted for each group of meanings of the document. For example, when the extraction range is narrower than the distance between the words of a word set, the word set cannot be extracted. On the other hand, if the extraction range is much wider than the distance between words, words that are used with different meanings are extracted in combination, so a word set that has no meaning is erroneously extracted.
  • the distance between words is calculated when one word constituting a word set exists in the section title and the other word exists in the body of the section.
  • since the hierarchical chapter structure of the document is not grasped, it is unclear how the distance between words should be calculated when the words of a word set exist in different chapters of the document. Therefore, for example, when the words of a word set are described separately in the title of Chapter 1 and in a paragraph of Section 1.5, they cannot be extracted as a set.
  • the present invention aims to improve search accuracy.
  • An information search apparatus, an information search method, and an information search program according to an aspect of the invention disclosed in the present application are such that a processor includes a title in a document and a title in the document based on character information indicating a character in the document and the position of the character.
  • FIG. 1 is an explanatory diagram of an information search example 1 according to the present embodiment.
  • the information retrieval target document 100 is an electronic document converted into text data, and examples thereof include a product instruction manual, a maintenance manual, and a requirements specification.
  • an infrastructure facility maintenance manual (hereinafter simply referred to as “maintenance manual”) will be described as an example.
  • the information retrieval apparatus acquires the document 100 and performs chapter structure analysis on the acquired document 100.
  • the chapter structure analysis is processing for analyzing the chapter structure.
  • the chapter structure is represented as tree structure data T1 indicating a logical hierarchical relationship composed of the chapters, sections, and subsections included in the document 100.
  • chapter, section, and subsection titles and paragraphs are structured as hierarchical nodes starting from the root node.
  • the title is an intermediate node and the paragraph is a leaf node.
  • a character string in a chart belongs to the layer below the paragraph and becomes a leaf node.
  • a symbol N # (# is a number) indicates a node constituting the tree structure data T1.
  • the tree structure data T1 is obtained as the result of the chapter structure analysis.
  • the information search device searches with reference to the tree structure data T1.
  • the search keywords are, for example, input keywords given by a user's operation input or setting keywords set in advance in a dictionary (for example, a specification item name dictionary 400 described later).
  • (A1, B1) is specified as a set of “bolt” and “water leakage”.
  • the text including the water leakage B1 is presumed to be an explanatory text about “water leakage” caused by the “bolt”. In this way, when the nodes are in a serial hierarchical relationship, the search is performed assuming that there is a relationship between the words.
  • the number of words between the bolt 11 and the water leakage 12 is small and the two words are close to each other, but the nodes A2 and B2 for the bolt 11 and the water leakage 12 are not in a serial hierarchical relationship in the tree structure data T1. Therefore, this pair is not specified from the tree structure data T1, and it is presumed that the sentence (section 2.1 title) including the water leakage B1 does not describe “water leakage” caused by the “bolt”.
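  • The serial hierarchical relationship described above can be sketched as an ancestor/descendant test on the tree. This is a minimal illustration, not the apparatus's actual data model: the `Node` class, its field names, and `in_serial_relationship` are hypothetical, and the tiny tree in the usage note is a simplified stand-in for the tree structure data T1.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Node:
    """One title or paragraph node (hypothetical layout, for illustration only)."""
    label: str                                   # e.g. "N111a"
    text: str                                    # title or paragraph text
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def ancestors_or_self(node: Optional[Node]) -> Iterator[Node]:
    """Yield the node itself, then each upper node up to the root."""
    while node is not None:
        yield node
        node = node.parent


def in_serial_relationship(a: Node, b: Node) -> bool:
    """True when one node lies on the path from the other node up to the root."""
    return any(n is b for n in ancestors_or_self(a)) or any(
        n is a for n in ancestors_or_self(b)
    )
```

With a root `N0`, a chapter node `N1` holding a paragraph `N111a`, and a sibling chapter `N2`, the test returns true for (`N111a`, `N1`) and false for (`N111a`, `N2`), which mirrors why a word set is accepted or rejected above.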
  • the distance derived from the number of words between two words is called the inter-word distance related to the number of words.
  • the distance between words related to the number of words is the number of words included between the words + 1.
  • the inter-word distance regarding the number of words is an index value indicating the number of words between words, and the shorter the distance between two words is, the closer the two words are in the document 100.
  • the distance between words related to the number of nodes is applied.
  • the inter-word distance related to the number of nodes is the number of nodes between two nodes having a serial hierarchical relationship in the tree structure data T1.
  • the distance between words related to the number of nodes is the distance between the node containing word A and the node containing word B in a path connecting the node containing word A and the node containing word B on the tree structure data.
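  • The two distance measures can be written out as follows. This is a sketch under stated assumptions: words are addressed by their index in reading order, a path is a list of node labels ordered from the leaf side to the root, and the function names are hypothetical. The intermediate node `N11` in the usage note is assumed for illustration.

```python
def word_distance(index_a: int, index_b: int) -> int:
    """Inter-word distance by word count: words strictly between the two, plus 1."""
    words_between = abs(index_a - index_b) - 1
    return words_between + 1


def node_distance(path: list, node_a: str, node_b: str) -> int:
    """Inter-word distance by node count along one leaf-to-root path:
    nodes strictly between the two nodes, plus 1."""
    nodes_between = abs(path.index(node_a) - path.index(node_b)) - 1
    return nodes_between + 1
```

For a hypothetical path `["N111a", "N111", "N11", "N1", "N0"]`, `node_distance(path, "N111a", "N111")` evaluates to 1, which agrees with the distance of 1 stated later for “leakage B1 → phenomenon D1”.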
  • FIG. 2 is an explanatory diagram of an information search example 2 according to the present embodiment.
  • FIG. 2 is an example of information retrieval when there is an error in the structural analysis.
  • the document 200 is a document in which the manual of the operation method and the maintenance content manual shown in FIG. 1 are combined as one file. Here, it is assumed that “valve”, “bolt”, and “water leakage” are given as search keywords.
  • (C, A3, B3), (C, A1, B1) are specified as a set of “valve”, “bolt”, and “leakage” by referring to the tree structure data T2.
  • the nodes are in a serial hierarchical relationship. Therefore, the water leaks B1 and B3 are presumed to be the places where the explanatory text about the “water leak” due to the “valve” and the “bolt” is written.
  • (C, A3, B3) is ranked higher than (C, A1, B1) because the inter-word distance regarding the number of nodes is shorter than (C, A1, B1).
  • even if there is an error in the chapter structure analysis, the information retrieval device follows the chapter structure and ranks candidates with a short inter-word distance related to the number of titles higher.
  • the place to be described can be specified.
  • FIG. 3 is a block diagram illustrating a hardware configuration example of the information search apparatus.
  • the information search apparatus 300 includes a processor 301, a storage device 302, an input device 303, an output device 304, and a communication interface (communication IF) 305.
  • the processor 301, the storage device 302, the input device 303, the output device 304, and the communication IF 305 are connected by a bus.
  • the processor 301 controls the information search device 300.
  • the storage device 302 serves as a work area for the processor 301.
  • the storage device 302 is a non-transitory or transitory recording medium that stores various programs and data.
  • Examples of the storage device 302 include a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), and a flash memory.
  • the input device 303 inputs data. Examples of the input device 303 include a keyboard, a mouse, a touch panel, a numeric keypad, and a scanner.
  • the output device 304 outputs data. Examples of the output device 304 include a display and a printer.
  • the communication IF 305 is connected to a network and transmits / receives data.
  • FIG. 4 is a block diagram illustrating a functional configuration example of the information search apparatus 300 according to the present embodiment.
  • the information search apparatus 300 includes a specification item name dictionary 400, an acquisition unit 401, an analysis unit 402, an input unit 403, a specifying unit 404, a calculation unit 405, and a display unit 406.
  • the specification item name dictionary 400 realizes its function as information stored in the storage device 302 shown in FIG. 3, for example.
  • the functions of the acquisition unit 401, the analysis unit 402, the input unit 403, the specifying unit 404, the calculation unit 405, and the display unit 406 are realized, for example, by causing the processor 301 to execute a program stored in the storage device 302.
  • the specification item name dictionary 400 is information that stores specification item names given to the specifying unit 404 as setting keywords.
  • the specification item name dictionary 400 is a table in which category names and specification item names are associated with each other.
  • the specification item name dictionary 400 is information given in advance when the documents 100 and 200 (hereinafter collectively referred to as the document 1) are specifications. Since the section title in the specification often includes the specification item name, the retrieval process as in this embodiment is effective.
  • the category name is a name that identifies the category (type) to which the document 1 belongs.
  • the acquisition unit 401 acquires character information and graphic information from the document 1.
  • the character information includes a character code of a character existing in the document 1 and position information of the character.
  • the position information of the character includes a description page number of the character and four corner coordinates.
  • the four corner coordinates are coordinate values of four vertices of a rectangle surrounding the character when the origin is the lower left corner of the page specified by the page number of the character.
  • the character position can be specified by the character information.
  • the graphic information includes graphic data such as image data and table data existing in the document 1 and position information of the graphic data.
  • the position information of the graphic data includes a description page number of the graphic data and four corner coordinates.
  • the four corner coordinates are coordinate values of four vertices of a rectangle surrounding the graphic data when the origin is the lower left corner of the page specified by the page number of the graphic data.
  • when the graphic data contains characters, the acquisition unit 401 also acquires those characters as character information. In this case, the position information of each character is calculated using the position information of the graphic data.
  • the analysis unit 402 analyzes the chapter structure of the document 1. Specifically, for example, the analysis unit 402 extracts character lines using the character position information acquired by the acquisition unit 401, and determines whether each character line is a title of a chapter, a section, or a subsection.
  • a character line is a character string obtained by concatenating, within a set of characters arranged in the horizontal direction, the characters whose distance to the adjacent character is within a predetermined distance. For example, when there is a number at the beginning of a character line, the analysis unit 402 determines that the character line is a title. Further, the analysis unit 402 may determine whether or not a character line is a title by using differences in font and character size from other character strings and the distance between character lines.
  • the analysis unit 402 identifies a paragraph including a plurality of character lines between titles.
  • a paragraph is a grouping of meanings in a sentence. The paragraph can be specified because it is separated from the title by indenting or widening the space between character lines.
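  • The character-line extraction and the leading-number title check can be sketched as below. Several simplifications are assumed: each character carries an axis-aligned bounding box instead of four separate corner points, characters on one row share nearly the same baseline, and the thresholds `max_gap` and `row_tol` are illustrative values. The names `Char`, `character_lines`, and `looks_like_title` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Char(NamedTuple):
    code: str    # the character itself
    page: int    # page number on which the character appears
    x0: float    # left edge (origin at the lower-left corner of the page)
    y0: float    # bottom edge
    x1: float    # right edge
    y1: float    # top edge


def character_lines(chars: List[Char], max_gap: float = 5.0,
                    row_tol: float = 1.0) -> List[str]:
    """Concatenate horizontally adjacent characters whose gap is within max_gap."""
    lines, current = [], []
    # Reading order: page, then top to bottom (larger y first), then left to right.
    for ch in sorted(chars, key=lambda c: (c.page, -c.y0, c.x0)):
        if current:
            prev = current[-1]
            same_row = ch.page == prev.page and abs(ch.y0 - prev.y0) <= row_tol
            if not (same_row and ch.x0 - prev.x1 <= max_gap):
                lines.append("".join(c.code for c in current))
                current = []
        current.append(ch)
    if current:
        lines.append("".join(c.code for c in current))
    return lines


# Heuristic only: a line starting with a number such as "1", "1.1", or "1.1.1".
TITLE_RE = re.compile(r"^\d+(\.\d+)*\s*\S")


def looks_like_title(line: str) -> bool:
    return bool(TITLE_RE.match(line))
```

The leading-number test alone would also accept a paragraph that happens to start with a digit, which is presumably why the text mentions font, character size, and line spacing as additional cues.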
  • the analysis unit 402 analyzes the hierarchical relationship between successive titles using the number of hierarchy levels and the continuity of the title numbers.
  • the number of hierarchy levels is the depth of the hierarchy. For example, the chapter “1” has hierarchy level 1, the section “1.1” has hierarchy level 2, and the subsection “1.1.1” has hierarchy level 3.
  • the continuity is a feature indicating that title numbers appear in ascending order: for example, the section “1.2” appears after the section “1.1”, and the section “1.3” or the chapter “2” appears after the section “1.2”.
  • tree structure data T1 (hereinafter collectively referred to as tree structure data T) is generated.
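  • One way to turn numbered titles into tree structure data is a stack keyed on the hierarchy level. The sketch below assumes titles arrive in reading order as (number, text) pairs and uses plain dictionaries as nodes; `title_level` and `build_tree` are hypothetical names, and paragraphs and the continuity check are left out for brevity.

```python
def title_level(number: str) -> int:
    """Hierarchy level of a title number: "1" -> 1, "1.1" -> 2, "1.1.1" -> 3."""
    return len(number.strip(".").split("."))


def build_tree(titles):
    """Build a nested tree from (number, text) title pairs given in reading order."""
    root = {"number": "", "text": "ROOT", "level": 0, "children": []}
    stack = [root]  # path from the root to the most recently added title
    for number, text in titles:
        level = title_level(number)
        # Pop until the top of the stack is a strictly shallower title.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"number": number, "text": text, "level": level, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root
```

Given the titles 1, 1.1, 1.1.1, 1.2, and 2 in that order, the root receives chapters 1 and 2, chapter 1 receives sections 1.1 and 1.2, and section 1.1 receives subsection 1.1.1.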
  • FIG. 5 is an explanatory view showing an example of a chapter structure analysis result screen.
  • the analysis result display screen 500 includes a first panel 501, a second panel 502, a keyword input field 503, a search button 504, and a slider 505.
  • the tree structure data T is displayed in the index portion that is the first panel 501, and the document 1 is displayed in the document display portion that is the second panel 502. Further, by selecting a title of the tree structure data T from the input device 303, a sentence corresponding to the selected title is displayed on the document display unit.
  • the keyword input field 503 is an area for receiving a character string input by operating the input device 303.
  • the slider 505 is an interface for setting the weights of evaluation values Pf, Ps, Pid, Pq, and IPFi described later. Moving the slider to the left decreases the weight, and moving it to the right increases the weight.
  • the input unit 403 receives a character string input by operating the input device 303 from the keyword input field 503 of the analysis result display screen 500.
  • the input unit 403 analyzes the input character string when the search button 504 is pressed.
  • the input unit 403 passes the plurality of words to the specifying unit 404.
  • the input unit 403 cuts out a word from the sentence by morphological analysis and passes it to the specifying unit 404.
  • the specifying unit 404 specifies the description position of the search keyword group from the tree structure data T analyzed by the analyzing unit 402.
  • the search keyword group may be a plurality of words input to the keyword input field 503.
  • the specification item name of the specification item name dictionary 400 may be used. If the specification item name is a character string that is not separated by a space, the specifying unit 404 cuts out a word from the character string by morphological analysis. For example, in the case of “water leakage due to loosening of bolts”, “bolt”, “loosening”, and “water leakage” are cut out.
  • the specifying unit 404 searches for a path including a word group by tracing from the paragraph serving as the leaf node of the tree structure data T to the top.
  • FIG. 6 is an explanatory diagram showing a search example of the tree structure data T1.
  • the search keyword group is “bolt phenomenon water leakage”.
  • the specifying unit 404 searches from the leaf nodes N111a, N112a, N211a, N211c, and N212a of the tree structure data T1 in FIG. 1 toward the root node.
  • the specifying unit 404 specifies the positions of the words of the search keyword group “bolt phenomenon water leakage” in the node being searched.
  • the specifying unit 404 specifies the water leakage B1 at the node N111a.
  • the specifying unit 404 traces the node itself and its upper nodes from the leaf node N111a in which the water leakage B1 was found, and continues to search for the search keyword group “bolt phenomenon water leakage”. The specifying unit 404 then specifies the phenomenon D1 at the node N111. The specifying unit 404 traces the node itself and its upper nodes from the node N111 in which the phenomenon D1 was found, and continues to search for the search keyword group “bolt phenomenon water leakage”. The specifying unit 404 then specifies the bolt A1 at the node N1. When the root node N0 is reached, the search from the leaf node N111a ends. Thereby, the specifying unit 404 can specify the path “water leakage B1 → phenomenon D1 → bolt A1”.
  • the specifying unit 404 specifies the water leakage B3 at the leaf node N211a. Thereafter, the specifying unit 404 finds the phenomenon D2 and the water leakage B2 by tracing upward from the node N211a in which the water leakage B3 was found, but reaches the root node N0 without finding the bolt. Therefore, the specifying unit 404 specifies the path “water leakage B3 → phenomenon D2 → water leakage B2”.
  • the specifying unit 404 searches the leaf node N211c for “gap” but does not find it, and thus searches the node N211b, which is the upper node.
  • the specifying unit 404 specifies the gap G at the node N211b. The specifying unit 404 then traces the node itself and its upper nodes from the node N211b in which the gap G was found, and continues to search for the search keyword group “pipe corner gap”. The specifying unit 404 then finds the corner F in the same node.
  • the specifying unit 404 traces the node itself and its upper nodes from the node N211b in which the corner F was found, and continues to search for the search keyword group “pipe corner gap”. The specifying unit 404 then specifies the pipe E1 at the node N21. The specifying unit 404 traces the node itself and its upper nodes from the node N21 in which the pipe E1 was found, and continues to search for the search keyword group “pipe corner gap”. The specifying unit 404 then specifies the pipe E2 at the node N2. When the root node N0 is reached, the search from the leaf node N211c ends. Thereby, the specifying unit 404 can specify the path “gap G → corner F → pipe E1 → pipe E2”.
  • the “gap G” that is the start end (leaf node side) of the path need not exist in the leaf node N211c.
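  • The leaf-to-root search walked through above can be sketched with a parent map and a per-node word list. The data layout is an assumption made for illustration: `parent` maps each node label to its upper node, `words_in_node` lists the words of each node in reading order, and words inside one node are reported last-to-first so that the leaf side of the path comes first. The function name `search_path` and the intermediate node `N11` in the usage note are hypothetical.

```python
def search_path(leaf, parent, words_in_node, keywords):
    """Trace from a leaf node to the root and collect (keyword, node) hits.

    The result is ordered from the leaf side toward the root.
    """
    hits = []
    node = leaf
    while node is not None:
        # Within one node, visit later words first (closer to the leaf side).
        for word in reversed(words_in_node.get(node, [])):
            if word in keywords:
                hits.append((word, node))
        node = parent.get(node)  # None once the root has been processed
    return hits
```

With the chain `N111a → N111 → N11 → N1 → N0` and the words "leakage", "phenomenon", and "bolt" placed in `N111a`, `N111`, and `N1`, the function returns the three hits in that order, corresponding to the path “water leakage B1 → phenomenon D1 → bolt A1”.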
  • the specifying unit 404 specifies.
  • the identifying unit 404 extracts the shortest path for each identified path.
  • the shortest path is a path that includes the largest number of types of search keywords and has the smallest total (or average) distance between words in the identified paths.
  • the shortest path is extracted because, when the same keyword appears multiple times in an identified path, the occurrences at distant positions are unlikely to be related; the path is therefore evaluated using the shortest path.
  • a search keyword may appear a plurality of times, as in the path “gap G → corner F → pipe E1 → pipe E2” described above. In this case, a path with the shortest distance between words is extracted.
  • the distance between words at least one of the distance between words related to the number of nodes and the distance between words related to the number of words is applied according to the position of the search keyword included in the nodes constituting the path. For example, when both words exist in different nodes, the distance between words related to the number of nodes is applied, and when both words exist in the same node, the distance between words related to the number of words is applied. Details will be described later.
  • FIGS. 7A and 7B are explanatory diagrams showing a specific example of the shortest path.
  • the search keyword group is “valve bolt leakage”.
  • in FIG. 7A, (A) is a path P7 searched from the tree structure data T.
  • the path P7 is a path that follows “leakage 73b → bolt 72b → leakage 73a → valve 71b → valve 71b → bolt 72a → valve 71a”.
  • a path P71 and a path P72 in (B) are examples of paths including the largest number of types of search keywords from the path P7 (in this case, paths including all search keywords).
  • the identifying unit 404 calculates the inter-word distance for the paths P71 and P72.
  • the path P71 specifies the search keyword group with the order of “valve” and “bolt” in “valve bolt leakage” switched. Paths such as “leakage 73a → bolt 72a → valve 71a”, “leakage 73b → bolt 72b → valve 71a”, and “leakage 73b → bolt 72a → valve 71a” are also specified, but are omitted here for simplicity.
  • the identifying unit 404 identifies the path having the shorter inter-word distance among the paths P71 and P72 as the shortest path of the path P7.
  • the path Pb1 in (A) includes two valves 71a and 71b.
  • the inter-word distance between the valve 71b and the bolt 72 is shorter than the inter-word distance between the valve 71a and the bolt 72. Therefore, the identifying unit 404 identifies the path Pb2 from the path Pb1 and determines it as the shortest path.
  • the path Pc includes two bolts 72a and 72b.
  • the specifying unit 404 compares the sum of the inter-word distance between the valve 71 and the bolt 72a and the inter-word distance between the bolt 72a and the water leakage 73 with the sum of the inter-word distance between the valve 71 and the bolt 72b and the inter-word distance between the bolt 72b and the water leakage 73, and specifies the path with the smaller sum as the shortest path.
  • alternatively, the specifying unit 404 may give priority to the bolt 72b, which is in a node close to the water leakage 73 at the start end of the path Pc on the leaf node side, and specify the path “water leakage 73 → bolt 72b → valve 71” as the shortest path.
  • Path Pd1 includes two water leaks 73a and 73b.
  • the specifying unit 404 specifies the path Pd2 from the path Pd1 and sets it as the shortest path.
  • the path with the shortest distance between words may be specified as the shortest path.
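  • Choosing among repeated keywords can be sketched as a small exhaustive search: one occurrence is picked per keyword, and the combination with the smallest total distance between consecutive keywords wins. The sketch assumes each occurrence is identified only by its node index on the path (0 at the leaf side), so only the node-count distance is used; `shortest_path` and the input layout are hypothetical.

```python
from itertools import product


def shortest_path(occurrences, keyword_order):
    """Pick one occurrence per keyword so that the summed distance is smallest.

    occurrences: {keyword: [node_index_on_path, ...]}
    keyword_order: keywords in the order used to sum consecutive distances.
    Returns (total_distance, {keyword: chosen_node_index}).
    """
    present = [k for k in keyword_order if occurrences.get(k)]
    best_total, best_choice = None, {}
    for combo in product(*(occurrences[k] for k in present)):
        total = sum(abs(a - b) for a, b in zip(combo, combo[1:]))
        if best_total is None or total < best_total:
            best_total, best_choice = total, dict(zip(present, combo))
    return best_total, best_choice
```

For a path with two valves at node indices 5 and 3, one bolt at 2, and one water leakage at 0, the valve at index 3 is selected, in the same spirit as choosing the valve 71b over the valve 71a above. Exhaustive enumeration is adequate for the handful of occurrences on a single path; a longer path might call for a dynamic-programming variant.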
  • the calculation unit 405 calculates the evaluation values Pf, Ps, Pid, Pq, and IPFi for each word set specified by the specifying unit 404 in the tree structure data T. Specifically, for example, the calculation unit 405 calculates the following five values Pf, Ps, Pid, Pq, and IPFi, and uses these five values to calculate an evaluation value of a path (a specified word set). For example, a weighted linear sum is used as an evaluation function, and an evaluation value based on the evaluation function is calculated.
  • the evaluation value Pf is a value indicating how much the word included in the set of words specified by the specifying unit 404 is included in the search keyword group.
  • the words that make up the search keyword group are not all described in the document. Therefore, the evaluation value Pf is added to the evaluation function as a term having a higher value as the ratio including the words constituting the search keyword group increases.
  • the evaluation value Pf is calculated by the following formula (1).
  • New is the number of words included in the set of words specified by the specifying unit 404, and Nsw is the number of words included in the search keyword group.
  • the evaluation value Ps is an evaluation value indicating the ratio of the words included in the set of words specified by the specifying unit 404 and the words in the chapter title. That is, the evaluation value Ps is a value indicating whether or not a word included in the set of words specified by the specifying unit 404 is described along the chapter structure.
  • a word group selected as a specification item name or a search keyword group is main information in the document 1 and is often described in a title such as a chapter or a section. Therefore, the evaluation value Ps is added to the evaluation function as a term that becomes higher as the ratio of words included in the title increases.
  • the evaluation value Ps is calculated by the following formula (2).
  • Nsm is the number of chapter titles including the word specified by the specifying unit 404.
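  • Formulas (1) and (2) are not reproduced in this text, so the ratios below are only an assumed reading of the surrounding definitions: Pf is taken as New divided by Nsw, and Ps as Nsm divided by New. The function names are hypothetical.

```python
def pf(found_words, search_keywords):
    """Assumed form of Pf: New / Nsw, the share of search keywords that were found."""
    nsw = len(set(search_keywords))
    new = len(set(found_words) & set(search_keywords))
    return new / nsw if nsw else 0.0


def ps(nsm, new):
    """Assumed form of Ps: Nsm / New, the share of found words that sit in titles."""
    return nsm / new if new else 0.0
```

Under these assumptions, a word set that covers two of three search keywords, one of them in a title, would score Pf = 2/3 and Ps = 1/2.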
  • the evaluation value Pid is the reciprocal of the distance between words. As the distance between words, the distance between words regarding the number of nodes and the distance between words regarding the number of words can be used.
  • when the inter-word distance related to the number of words is calculated using a sequence of words arranged in reading order, the distance between a word described in the title of Chapter 1 and a word described in a paragraph of Section 1.5 becomes a large value. Thus, it is highly likely that the two words are judged not to be connected in meaning.
  • the inter-word distance related to the number of nodes, which is a feature of this method, is calculated using the sequence of paragraph and title nodes traced when the specifying unit 404 searches for a set of words.
  • the inter-word distance regarding the number of nodes is calculated for “leakage B1 → phenomenon D1”.
  • the distance between words regarding the number of nodes of “leakage B1 → phenomenon D1” is 1.
  • the distance between words regarding the number of nodes of “phenomenon D1 → bolt A1” is calculated.
  • the calculation method of the inter-word distance related to the number of nodes differs depending on whether two target words are present in the same node or in different nodes.
  • the distance between words regarding the number of words may be applied to two words in the same node, and the distance between words regarding the number of nodes may be applied to two words at different nodes.
  • the calculation unit 405 calculates an inter-word distance related to the number of words. For “corner F ⁇ pipe E1”, the calculation unit 405 calculates the inter-word distance related to the number of nodes.
  • the distance between the words is 2, and there is a high possibility that the meanings of the words are connected.
  • an evaluation value Pid that is the reciprocal of the distance between words is added to the evaluation function as a term of the evaluation function.
  • the evaluation value Pid is calculated by the following formula (3).
  • $W_i$ is the i-th word (i is an integer of 1 or more).
  • $D_i$ is the inter-word distance between $W_i$ and $W_{i+1}$.
  • the calculation method of the inter-word distance $D_i$ is switched depending on the positional relationship between $W_i$ and $W_{i+1}$.
  • when the inter-word distance $D_i$ is the inter-word distance related to the number of nodes, $D_i = (\text{number of nodes between } N_i \text{ and } N_{i+1}) + 1$.
  • $N_i$ is a node including $W_i$.
  • $N_{i+1}$ is a node including $W_{i+1}$.
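  • The switching rule for the inter-word distance can be sketched as follows. Formula (3) is not reproduced in this text, so taking Pid as the reciprocal of the summed distances is an assumption, as are the function names and the representation of each word as a (node index on the path, word index inside the node) pair.

```python
def inter_word_distance(a, b):
    """Distance between two consecutive words on a path.

    a, b: (node_index_on_path, word_index_in_node).
    Different nodes: nodes strictly between them, plus 1.
    Same node: words strictly between them, plus 1.
    """
    if a[0] != b[0]:
        return (abs(a[0] - b[0]) - 1) + 1
    return (abs(a[1] - b[1]) - 1) + 1


def pid(words):
    """Assumed form of Pid: reciprocal of the total inter-word distance."""
    total = sum(inter_word_distance(a, b) for a, b in zip(words, words[1:]))
    return 1.0 / total if total else 0.0
```

With hypothetical positions for “gap G → corner F → pipe E1”, say gap at (1, 4), corner at (1, 2), and pipe at (2, 0), the first distance is word-based (2), the second is node-based (1), and Pid comes out as 1/3.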
  • the evaluation value Pq is an evaluation value indicating the degree of coincidence between the word order in the search keyword group or the word order in the specification item name and the appearance order of the words in the chapter structure.
  • the degree of coincidence of the appearance order of words on the chapter structure takes a high value when the order of the words constituting the specification item name matches the reverse of the word order on the path generated by tracing the chapter structure from the lower layer. Therefore, the evaluation value Pq, which is the degree of coincidence of the appearance order of words on the chapter structure, is added to the evaluation function as a term of the evaluation function.
  • the evaluation value Pq is calculated by the following formula (4).
  • Neq is the number of pairs of words $W_i$ and $W_{i+1}$ whose appearance order is correct (another search keyword may be present between $W_i$ and $W_{i+1}$).
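  • The order-agreement count Neq can be sketched as below. Formula (4) is not reproduced in this text, so dividing Neq by the number of adjacent pairs is an assumed normalization; `pq` is a hypothetical name, and the path is assumed to be listed from the leaf side, so it is reversed before being compared with the keyword order.

```python
def pq(keyword_order, path_words):
    """Assumed form of Pq: share of adjacent word pairs whose order is correct.

    keyword_order: words as written in the search keyword group or item name.
    path_words: words on the specified path, leaf side first.
    """
    rank = {word: i for i, word in enumerate(keyword_order)}
    # Reverse the path so that it runs from the root side to the leaf side.
    sequence = [rank[w] for w in reversed(path_words) if w in rank]
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return 0.0
    neq = sum(1 for a, b in pairs if a < b)  # pairs appearing in the correct order
    return neq / len(pairs)
```

A path “leakage → bolt → valve” scores 1.0 against the keyword order “valve bolt leakage”, while a path with valve and bolt swapped scores 0.5 under this normalization.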
  • the evaluation value IPFi is an average value of the importance of the word using the number of nodes (title or paragraph) in which the word included in the set of words specified by the specifying unit 404 appears.
  • the words constituting the specification item name are also used when describing other specification items or other than specification items.
  • the evaluation value of the specified word set needs to be high.
  • IDF Inverse Document Frequency
  • $\mathrm{IPF}_i = \log(\mathrm{Nen} / \mathrm{Ntp})$ (5)
  • Nen is the number of nodes in which the word $W_i$ specified by the specifying unit 404 appears.
  • Ntp is the total number of nodes in the document.
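  • Formula (5) can be evaluated directly. The sketch follows the ratio exactly as printed here (Nen / Ntp), which yields values of zero or below; the conventional IDF uses the inverse ratio, so the printed orientation may be worth checking against the original publication. Averaging over the words of a set follows the description of IPFi as an average value; the function names are hypothetical.

```python
import math


def ipf(nen: int, ntp: int) -> float:
    """Formula (5) as printed: log(Nen / Ntp)."""
    return math.log(nen / ntp)


def mean_ipf(node_counts, ntp: int) -> float:
    """Average of formula (5) over the words of one specified word set.

    node_counts: for each word, the number of nodes in which it appears (Nen).
    """
    return sum(ipf(nen, ntp) for nen in node_counts) / len(node_counts)
```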
  • the calculation unit 405 calculates a final evaluation value P using the evaluation function, which is a weighted linear sum having these values Pf, Ps, Pid, Pq, and IPFi as terms.
  • the weights of Pf, Ps, Pid, Pq, and IPFi values set in advance are used, but can be changed by operating the slider 505 before the evaluation value P is calculated. For example, when it is desired to change the evaluation method, the weights of the evaluation values Pf, Ps, Pid, Pq, and IPFi may be changed by the slider 505. Thereby, different search results can be obtained.
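  • The weighted linear sum can be sketched in a few lines. The dictionary keys and the function name `evaluate` are illustrative; the weights stand in for the values set in advance or adjusted with the slider 505.

```python
TERMS = ("Pf", "Ps", "Pid", "Pq", "IPFi")


def evaluate(values: dict, weights: dict) -> float:
    """Final evaluation value P as a weighted linear sum of the five terms."""
    return sum(weights[name] * values[name] for name in TERMS)
```

Changing one weight and calling `evaluate` again is enough to re-rank the candidates, which matches the recalculation described for the slider.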
  • the display unit 406 generates output data based on the evaluation value P calculated by the calculation unit 405. Specifically, for example, the display unit 406 generates data (for example, XML data) related to an output screen that can display a set of words specified by the specifying unit 404 in descending order of the evaluation value P.
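  • Generating ranked output data can be sketched with the Python standard library. The XML element and attribute names (`results`, `result`, `rank`, `score`, `words`, `location`) are invented for illustration and are not taken from the source, as is the input layout of `build_output`.

```python
import xml.etree.ElementTree as ET


def build_output(results):
    """Serialize specified word sets as XML in descending order of P.

    results: list of dicts with keys "words" (list of str),
             "location" (str), and "P" (float).
    """
    root = ET.Element("results")
    ranked = sorted(results, key=lambda r: r["P"], reverse=True)
    for rank, r in enumerate(ranked, start=1):
        item = ET.SubElement(root, "result", rank=str(rank), score=f"{r['P']:.3f}")
        ET.SubElement(item, "words").text = " ".join(r["words"])
        ET.SubElement(item, "location").text = r["location"]
    return ET.tostring(root, encoding="unicode")
```

The returned string can be parsed again with `ET.fromstring` by whatever component renders the output screen.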
  • FIG. 8 is an explanatory diagram illustrating an output screen example 1 generated by the display unit 406.
  • the output screen 800 of FIG. 8 is an output screen when the tree structure data T1 is searched by the specification item name of the specification item name dictionary 400.
  • the output screen 800 includes an index portion that is a first panel 801 and a document display portion that is a second panel 802.
  • on the first panel 801, the specification item names used for searching the tree structure data T1, the category names corresponding to the specification item names, the links to the specified description locations, and the rankings are displayed in descending order of the evaluation value P calculated by the calculation unit 405.
  • on the second panel 802, the document 100 is displayed.
  • when the link 813 or 823 of the first panel 801 is designated, the description location specified by that link is displayed on the second panel 802.
  • FIG. 9 is an explanatory diagram illustrating an output screen example 2 generated by the display unit 406.
  • the output screen 900 of FIG. 9 is an output screen when the tree structure data T1 is searched by the input keyword group.
  • the first panel 901 displays the tree structure data, excluding paragraphs, together with links to the specified description locations.
  • the link to a description location is displayed at the position of the word closest to the leaf node among the words included in the description location (specified path). At this time, only the description locations whose evaluation value P is larger than a threshold value specified in advance may be displayed.
  • the document 100 is displayed on the second panel 902. By specifying the link (description location (1)) of the first panel 901, the description location 903 designated by the link is displayed.
  • the weights of Pf, Ps, Pid, Pq, and IPFi may also be changed while the output screen 800 or 900 is displayed.
  • in that case, the evaluation value P is recalculated by the calculation unit 405 with the changed weights, and the display of the first panels 801 and 901 is updated according to the recalculated evaluation value P.
  • the display unit 406 may transmit information related to the output screens 800 and 900 from the communication IF 305 to an external display device, so that the output screens 800 and 900 are displayed on the external display device.
  • FIG. 10 is a flowchart illustrating an example of an information search processing procedure performed by the information search apparatus 300.
  • the information search apparatus 300 executes an acquisition process by the acquisition unit 401 (step S1001), a chapter structure analysis process by the analysis unit 402 (step S1002), a search keyword acquisition process by the input unit 403 or using the specification item name dictionary 400 (step S1003), a specifying process by the specifying unit 404 (step S1004), a calculation process by the calculation unit 405 (step S1005), and a display process by the display unit 406 (step S1006). This completes the series of processes.
  • FIG. 11 is a flowchart showing a detailed processing procedure example of the chapter structure analysis processing (step S1002) shown in FIG.
  • the information search device 300 generates a character line from the character information acquired by the acquisition unit 401 (step S1101).
  • the information search device 300 identifies titles of chapters, sections, items, and the like from the generated set of character lines (step S1102).
  • the information search device 300 identifies a paragraph from the character line other than the title identified in step S1102 in the generated set of character lines (step S1103).
  • the information search device 300 generates tree structure data by executing a hierarchical relationship analysis of the specified title and paragraph (step S1104).
  • the chapter structure analysis process (step S1002) then ends, and the process proceeds to step S1003.
  • FIG. 12 is a flowchart showing a detailed processing procedure example of the specifying process (step S1004) shown in FIG. 10.
  • the information search apparatus 300 specifies a path including the search keyword group from the tree structure data T1 and T2 obtained by the chapter structure analysis process (step S1002) (step S1201).
  • any path that includes the search keyword group may be specified; the order in which the search keywords appear on the path may differ from the order in which the search keywords were input. Further, the path need not include all the search keywords, and may include only part of the search keyword group.
  • the number of search keywords that are allowed to be missing is set in advance, and the information search device 300 specifies a path if the number of missing keywords is within that allowance. For example, if one search keyword is allowed to be missing and the search keywords are W1 to W3, a path including any two of W1 to W3 is specified. Next, the information search device 300 identifies the shortest path for each specified path (step S1202).
  • the specifying process (step S1004) then ends, and the process proceeds to step S1005.
  • a set of words that a neighborhood search would retrieve because its inter-word distance related to the number of words is short, but that does not substantially make sense, can be prevented from being erroneously retrieved.
  • when the nodes are in a serial hierarchical relationship, the inter-word distance related to the number of nodes is applied even if the hierarchy is deep, in other words, even if the inter-word distance related to the number of words is long, so probable sets of search keywords can be narrowed down. Search accuracy is thereby improved.
  • even if the chapter structure contains an error, candidates with a short inter-word distance related to the number of titles are ranked higher while the chapter structure is still observed, so a probable description location can be specified.
  • the present invention is not limited to the above-described embodiments, and includes various modifications and equivalent configurations within the scope of the appended claims.
  • the above-described embodiments have been described in detail for easy understanding of the present invention, and the present invention is not necessarily limited to those having all the configurations described.
  • a part of the configuration of one embodiment may be replaced with the configuration of another embodiment.
  • another configuration may be added, deleted, or replaced.
  • each of the above-described configurations, functions, processing units, processing means, and the like may be realized in hardware by designing part or all of them as, for example, an integrated circuit, or may be realized in software by a processor interpreting and executing a program that realizes each function.
  • Information such as programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, and an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, and a DVD.
  • the control lines and information lines shown are those considered necessary for the explanation, and do not necessarily represent all the control lines and information lines needed for implementation. In practice, almost all the components can be considered to be connected to each other.
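As an illustration of the evaluation function described earlier in this list (a weighted linear sum with Pf, Ps, Pid, Pq, and IPFi as terms), the following is a minimal sketch. The function names, the dictionary layout, and all numeric values are assumptions made for the example; the source does not specify how the individual terms or the default weights are obtained.

```python
# Minimal sketch of a weighted linear sum over the five evaluation terms.
# The names below and every number used are hypothetical illustrations.
TERMS = ("Pf", "Ps", "Pid", "Pq", "IPFi")


def evaluation_value(values, weights):
    """Return P as the sum of weight * value over the five terms."""
    return sum(weights[t] * values[t] for t in TERMS)


def rank(candidates, weights):
    """Sort candidate word sets in descending order of P."""
    return sorted(
        candidates,
        key=lambda c: evaluation_value(c["values"], weights),
        reverse=True,
    )


# Hypothetical candidates; changing the weights (as the slider 505 does)
# can change the resulting order.
candidates = [
    {"name": "set1", "values": {"Pf": 1, "Ps": 0, "Pid": 2, "Pq": 1, "IPFi": 0}},
    {"name": "set2", "values": {"Pf": 0, "Ps": 3, "Pid": 0, "Pq": 1, "IPFi": 1}},
]
equal_weights = {t: 1 for t in TERMS}
pf_only = {"Pf": 1, "Ps": 0, "Pid": 0, "Pq": 0, "IPFi": 0}
```

Recomputing `rank` with a different weight dictionary corresponds to the recalculation that the description attributes to the calculation unit 405 after the weights are changed.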

Abstract

This information retrieval apparatus executes: identification processing for tracing a path in search of a retrieval keyword group in tree-structured data constructed by hierarchizing, in the form of nodes, a title and paragraphs in a document on the basis of text information in the document to identify the positions of retrieval keywords along said path in the tree-structured data; calculation processing for calculating an evaluation value regarding the accuracy of identification of said retrieval keyword group on the basis of the positions of the retrieval keywords along the path in the tree-structured data, said positions being identified by the identification processing; and display processing for displaying the location of a description corresponding to the path in the document on the basis of the evaluation value calculated by the calculation processing.

Description

Information search device, information search method, and information search program
The present invention relates to an information search device, an information search method, and an information search program for searching for information.
Conventionally, when a search query consists of a plurality of keywords, there is a method called neighborhood search that ranks search results using the distance between the keywords as they appear in a document.
The search result ranking apparatus of Patent Document 1 evaluates the distance between keywords based on the meaning of the text structure when, in a search of a set of structured texts, it ranks search results using the proximity of the keywords in an input search query. The search result ranking apparatus obtains the proximity between keywords from that distance, and evaluates the relevance of a text to the search query based on the number of occurrences of each query keyword in the texts as a whole and the number of occurrences of that keyword within the text. The search result ranking apparatus then calculates a document score from the proximity between keywords and the relevance of the text, and ranks the search results.
Patent Document 1: JP 2010-282480 A
However, the above-described conventional technique has the problem of erroneously extracting sets of words that are not semantically connected. That is, because the conventional technique does not grasp the hierarchical chapter and section structure of a document, it cannot extract a word set for each semantic unit of the document. For example, when the extraction range is narrower than the inter-word distance of a word set, the word set cannot be extracted. Conversely, when the extraction range is much wider than the inter-word distance, distant words used in different senses are combined and extracted, so word sets that are not actually semantically connected are erroneously extracted.
Further, the above-described conventional technique calculates an inter-word distance when one word of a word set is in the title of a section and the other word is in the body of that section. However, because it does not grasp the hierarchical chapter and section structure of the document, it is unclear how the inter-word distance should be calculated when the words of a word set are in different chapters or sections of the document. Therefore, for example, when the words of a word set are split between the title of Chapter 1 and a paragraph in Section 1.5, they cannot be extracted as a set.
An object of the present invention is to improve search accuracy.
An information search device, an information search method, and an information search program according to one aspect of the invention disclosed in the present application are characterized in that a processor executes: a specifying process of searching tree structure data, in which titles and paragraphs in a document are hierarchically structured as nodes based on character information indicating the characters in the document and the positions of those characters, for a path that traces a search keyword group, thereby specifying the positions on the tree structure data of the search keywords included in the path; a calculation process of calculating an evaluation value related to the accuracy of specifying the search keyword group, based on the positions on the tree structure data of the search keywords included in the path specified by the specifying process; and a display process of displaying, based on the evaluation value calculated by the calculation process, the description location in the document that corresponds to the path.
According to a representative embodiment of the present invention, search accuracy can be improved. Problems, configurations, and effects other than those described above will become apparent from the following description of the embodiments.
Explanatory diagram showing information search example 1 according to the present embodiment.
Explanatory diagram showing information search example 2 according to the present embodiment.
Block diagram showing a hardware configuration example of the information search device.
Block diagram showing a functional configuration example of the information search device according to the present embodiment.
Explanatory diagram showing a screen example of a chapter structure analysis result.
Explanatory diagram showing a search example of tree structure data.
Explanatory diagram showing an example of specifying the shortest path.
Explanatory diagram showing an example of specifying the shortest path.
Explanatory diagram showing output screen example 1 generated by the display unit.
Explanatory diagram showing output screen example 2 generated by the display unit.
Flowchart showing an example of the information search processing procedure performed by the information search device.
Flowchart showing a detailed processing procedure example of the chapter structure analysis process (step S1002) shown in FIG. 10.
Flowchart showing a detailed processing procedure example of the specifying process (step S1004) shown in FIG. 10.
<Example of information search>
FIG. 1 is an explanatory diagram showing information search example 1 according to the present embodiment. The document 100 to be searched is an electronic document converted into text data, such as a product instruction manual, a maintenance manual, or a requirements specification. In this embodiment, a maintenance manual for an infrastructure facility (hereinafter simply "maintenance manual") is used as an example. The information search device acquires the document 100 and executes chapter structure analysis on it. Chapter structure analysis is a process of analyzing the chapter and section structure.
The chapter structure is tree structure data T1 indicating the logical hierarchical relationship formed by the chapters, sections, and items included in the document 100. In the tree structure data T1, the titles of chapters, sections, and items and the paragraphs are organized as hierarchical nodes starting from the root node. Specifically, in the tree structure data T1, a title is an intermediate node and a paragraph is a leaf node. When a paragraph contains a figure or table, the character strings in the figure or table form a level below the paragraph and become leaf nodes. The symbol N# (# is a number) denotes a node of the tree structure data T1.
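The node hierarchy described above can be sketched as a small data structure. This is only an illustrative representation: the class name `Node`, its fields, and the exact nesting of the sample titles are assumptions, since the source does not prescribe how the tree structure data T1 is stored.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of the chapter/section tree (hypothetical representation)."""
    text: str   # title text or paragraph text
    kind: str   # "root", "title", or "paragraph"
    children: List["Node"] = field(default_factory=list)
    # The parent link is excluded from repr/compare to avoid cyclic recursion.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it, so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child

    def is_leaf(self) -> bool:
        return not self.children


# An assumed fragment of the maintenance manual: titles are intermediate
# nodes, and the paragraph is a leaf node.
root = Node("maintenance manual", "root")
ch1 = root.add(Node("1. Bolt", "title"))
sec11 = ch1.add(Node("1.1 Installation of pipe end", "title"))
item111 = sec11.add(Node("1.1.1 Phenomenon and cause", "title"))
para = item111.add(
    Node("The bolt loosened due to water pressure and vibration, and water leaked.",
         "paragraph")
)
```

Keeping a parent link on each node makes it straightforward to walk from a paragraph back up through its item, section, and chapter titles.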
In the maintenance manual 100 of FIG. 1, "1. Bolt", "2. Piping", ... are chapter titles. "1.1 Installation of pipe end", "2.1 Water leakage from piping", ... are section titles. "1.1.1 Phenomenon and cause", "2.1.2 Countermeasures", ... are item titles. A sentence with no item number at its head, such as "The bolt loosened due to water pressure and vibration, and water leaked.", is a paragraph.
Chapter structure analysis yields the tree structure data T1 as its result. When search keywords are given, the information search device searches with reference to the tree structure data T1. Here, as an example, "bolt" and "water leakage" are given as search keywords. There are two types of search keywords: input keywords given by a user's operation input, and set keywords set in advance as a dictionary (for example, the specification item name dictionary 400 described later).
By referring to the tree structure data T1, (A1, B1) is specified as a set of "bolt" and "water leakage". Because the nodes of (A1, B1) are in a serial hierarchical relationship, the text containing water leakage B1 is presumed to be a description of "water leakage" caused by the "bolt". When nodes are in a serial hierarchical relationship like this, the words are treated as related and are found by the search.
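One way to read "serial hierarchical relationship" is that one node is an ancestor of the other, that is, both lie on a single root-to-leaf path. The sketch below tests that condition under this reading; the node identifiers and the parent table are hypothetical and do not reproduce the actual nodes of FIG. 1.

```python
# Hypothetical node identifiers; each node maps to its parent (None for the root).
PARENT = {
    "root": None,
    "ch1": "root",       # e.g. "1. Bolt"
    "sec1_1": "ch1",     # e.g. "1.1 Installation of pipe end"
    "para_a": "sec1_1",  # a paragraph mentioning water leakage
    "ch2": "root",       # e.g. "2. Piping"
    "sec2_1": "ch2",     # e.g. "2.1 Water leakage from piping"
}


def ancestors(node):
    """Return the chain of ancestors of a node, nearest first."""
    result = []
    current = PARENT[node]
    while current is not None:
        result.append(current)
        current = PARENT[current]
    return result


def in_serial_relationship(a, b):
    """True when a and b lie on one root-to-leaf path (or are the same node)."""
    return a == b or a in ancestors(b) or b in ancestors(a)
```

Under this reading, a keyword in a chapter title and a keyword in a paragraph of that chapter are related, while keywords in sibling chapters are not, however close they are in the running text.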
Note that in the document 100, for example, bolt 11 and water leakage 12 are close to each other with few words between them, but because their nodes A2 and B2 are not in a serial hierarchical relationship in the tree structure data T1, they are not specified from the tree structure data T1. Therefore, the text containing water leakage B2 (the title of Section 2.1) is presumed not to describe "water leakage" caused by the "bolt".
Here, the number of words between two words is referred to as the inter-word distance related to the number of words. Specifically, for example, it is the number of words lying between the two words, plus 1. This distance is an index of how many words separate the two words: the shorter it is, the closer the two words are in the document 100.
As a feature of the present invention, an inter-word distance related to the number of nodes is also applied. This is the number of nodes between two nodes that are in a serial hierarchical relationship in the tree structure data T1. Specifically, for example, on the path in the tree structure data that connects the node containing word A and the node containing word B, it is the number of nodes (titles and paragraphs) lying between those two nodes, plus 1.
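The two distances defined above can be written out directly. The sketch below assumes a document already split into word tokens and a path given as a list of node texts from the upper node to the lower node; both function names and the sample data are illustrative assumptions.

```python
def word_count_distance(tokens, word_a, word_b):
    """Number of words between the two words, plus 1.

    With token positions i and j, the words strictly between them number
    |i - j| - 1, so the distance is |i - j|.
    """
    i, j = tokens.index(word_a), tokens.index(word_b)
    return abs(i - j)


def node_count_distance(path, word_a, word_b):
    """Number of nodes between the two containing nodes on the path, plus 1."""
    ia = next(k for k, text in enumerate(path) if word_a in text)
    ib = next(k for k, text in enumerate(path) if word_b in text)
    return abs(ia - ib)


# Assumed sample data for illustration only.
tokens = "the bolt loosened and water leaked".split()
path = [
    "1. Bolt",
    "1.1 Installation of pipe end",
    "1.1.1 Phenomenon and cause",
    "The bolt loosened due to water pressure and vibration, and water leaked.",
]
```

For the sample path, the chapter title and the paragraph have two title nodes between them, so the node-count distance is 3 regardless of how many words separate the two keywords in the running text.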
Because a neighborhood search uses only the inter-word distance related to the number of words, it erroneously retrieves sets such as (A2, B2) that are not actually semantically connected. In contrast, searching with the tree structure data T1 suppresses the erroneous retrieval of such sets. Furthermore, using the inter-word distance related to the number of nodes allows probable combinations to be ranked higher, because semantically connected words are also close to each other on the tree structure.
FIG. 2 is an explanatory diagram showing information search example 2 according to the present embodiment. FIG. 2 is an example of an information search when the structure analysis contains an error. The document 200 is a document in which a work method manual and the maintenance manual shown in FIG. 1 are combined into one file. Here, "valve", "bolt", and "water leakage" are given as search keywords.
In the document 200, "1. Bolt", which is Chapter 1 of the maintenance manual, appears after "5. Valve", which is Chapter 5 of the work method manual. The tree structure data T2 has the nodes N1, N11, N111, N111a, N112, N112a, ... from "1. Bolt" onward below the node N512, which is "5.1.2 Procedure 2". Properly, "1. Bolt" should be parallel to "5. Valve". However, manuals and design documents tend to have many levels of chapters and sections, and a new chapter structure may be nested inside an item such as 5.1.2, so the chapter structure analysis process may fail to distinguish the two cases.
By referring to the tree structure data T2, (C, A3, B3) and (C, A1, B1) are specified as sets of "valve", "bolt", and "water leakage". In both sets, the nodes are in a serial hierarchical relationship. Therefore, water leakage B1 and B3 are presumed to be locations that describe "water leakage" caused by the "valve" and the "bolt". However, (C, A3, B3) has a shorter inter-word distance related to the number of nodes than (C, A1, B1), and is therefore ranked higher than (C, A1, B1).
In this way, even if the chapter structure contains an error, the information search device ranks candidates with a short inter-word distance related to the number of titles higher while still observing the chapter structure, so it can specify a probable description location.
<Hardware configuration example>
FIG. 3 is a block diagram showing a hardware configuration example of the information search device. The information search device 300 has a processor 301, a storage device 302, an input device 303, an output device 304, and a communication interface (communication IF) 305. The processor 301, the storage device 302, the input device 303, the output device 304, and the communication IF 305 are connected by a bus. The processor 301 controls the information search device 300. The storage device 302 serves as a work area for the processor 301. The storage device 302 is also a non-transitory or transitory recording medium that stores various programs and data. Examples of the storage device 302 include a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), and a flash memory. The input device 303 inputs data. Examples of the input device 303 include a keyboard, a mouse, a touch panel, a numeric keypad, and a scanner. The output device 304 outputs data. Examples of the output device 304 include a display and a printer. The communication IF 305 connects to a network and transmits and receives data.
<Functional configuration example>
FIG. 4 is a block diagram showing a functional configuration example of the information search device 300 according to the present embodiment. The information search device 300 has a specification item name dictionary 400, an acquisition unit 401, an analysis unit 402, an input unit 403, a specifying unit 404, a calculation unit 405, and a display unit 406. Specifically, for example, the specification item name dictionary 400 realizes its function as information stored in the storage device 302 shown in FIG. 3. The acquisition unit 401, the analysis unit 402, the input unit 403, the specifying unit 404, the calculation unit 405, and the display unit 406 realize their functions, for example, by causing the processor 301 to execute a program stored in the storage device 302 shown in FIG. 3.
The specification item name dictionary 400 is information that stores the specification item names given to the specifying unit 404 as set keywords. Specifically, for example, the specification item name dictionary 400 is a table that associates category names with specification item names. The specification item name dictionary 400 is given in advance when the documents 100 and 200 (hereinafter collectively "document 1") are specifications. Because chapter and section titles in a specification often contain specification item names, the search processing of this embodiment is effective. A category name is a name that identifies the category (type) to which the document 1 belongs.
The acquisition unit 401 acquires character information and graphic information from the document 1. The character information includes the character code of each character in the document 1 and the position information of that character. The position information of a character includes the number of the page on which the character appears and its four-corner coordinates. The four-corner coordinates are the coordinates of the four vertices of the rectangle enclosing the character, with the lower-left corner of the page identified by that page number as the origin. The character information makes it possible to specify the position of each character. The graphic information includes graphic data such as image data and table data in the document 1, and the position information of that graphic data. The position information of graphic data includes the number of the page on which the graphic data appears and its four-corner coordinates, which are the coordinates of the four vertices of the rectangle enclosing the graphic data, with the lower-left corner of that page as the origin. When the graphic information contains characters, the acquisition unit 401 also acquires those characters as character information. In that case, the position information of each character is calculated using the position information of the graphic data.
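The character information described above (character code, page number, and four-corner coordinates with the origin at the lower-left of the page) might be held in a record like the one below. The class name, field names, and corner ordering are assumptions for illustration; the source only lists what the information contains.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]  # (x, y), origin at the lower-left corner of the page


@dataclass(frozen=True)
class CharInfo:
    """Hypothetical record for one character extracted from a document."""
    code: int                                    # character code
    page: int                                    # page on which the character appears
    corners: Tuple[Point, Point, Point, Point]   # four vertices of the enclosing rectangle

    @property
    def char(self) -> str:
        return chr(self.code)

    @property
    def left(self) -> float:
        return min(x for x, _ in self.corners)

    @property
    def right(self) -> float:
        return max(x for x, _ in self.corners)

    @property
    def bottom(self) -> float:
        return min(y for _, y in self.corners)

    @property
    def top(self) -> float:
        return max(y for _, y in self.corners)


# Assumed sample: the letter "A" on page 1, enclosed by an 8 x 10 rectangle.
sample = CharInfo(ord("A"), 1, ((10.0, 20.0), (18.0, 20.0), (18.0, 30.0), (10.0, 30.0)))
```

Deriving `left`, `right`, `bottom`, and `top` from the corners keeps the record independent of the order in which the four vertices are stored.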
The analysis unit 402 analyzes the chapter structure of the document 1. Specifically, for example, the analysis unit 402 extracts character lines using the character position information acquired by the acquisition unit 401, and determines whether each character line is a title of a chapter, section, item, or the like. A character line is a character string formed by concatenating, from a set of characters aligned horizontally, those characters whose distance to the adjacent character is within a predetermined distance. The analysis unit 402 determines that a character line is a title when, for example, the line begins with a number. The analysis unit 402 may additionally use differences in font or character size from other character strings, and the distance between character lines, to determine whether a line is a title.
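The two steps in this paragraph, joining horizontally adjacent characters into character lines and treating a line that begins with a number as a title, can be sketched as follows. The gap threshold, the tuple layout, and the regular expression are assumptions; the sketch applies only the leading-number rule and omits the font, size, and line-spacing cues that the description also allows.

```python
import re

# A line counts as a title here if it starts with a number such as "1.", "1.1" or "2.1.2".
TITLE_RE = re.compile(r"^\s*\d+(\.\d+)*\.?\s*\S")


def is_title(line: str) -> bool:
    """Leading-number heuristic only; a paragraph that happens to start with a digit would also match."""
    return bool(TITLE_RE.match(line))


def build_lines(chars, max_gap=5.0):
    """Join characters on one baseline into character lines.

    chars: list of (character, left_x, right_x), sorted left to right.
    A new line starts when the gap to the previous character exceeds max_gap.
    """
    lines, current, prev_right = [], "", None
    for ch, left, right in chars:
        if prev_right is not None and left - prev_right > max_gap:
            lines.append(current)
            current = ""
        current += ch
        prev_right = right
    if current:
        lines.append(current)
    return lines
```

For example, `is_title("1.1 Installation of pipe end")` holds under this heuristic, while a sentence that starts with a letter does not.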
The analysis unit 402 also specifies paragraphs, each consisting of the character lines located between titles. A paragraph is a unit of meaning within the text. A paragraph can be specified because it is separated from a title by indentation or by wider line spacing. The analysis unit 402 further analyzes the hierarchical relationship of consecutive titles using the number of levels in each title number. The number of levels is the depth of the hierarchy: for example, 1 for chapter "1", 2 for section "1.1", and 3 for item "1.1.1". Continuity is the property that title numbers appear in ascending order, for example, section "1.1" is followed by section "1.2", and section "1.2" is followed by section "1.3" or chapter "2". In this way, the tree structure data T1 and T2 (hereinafter collectively "tree structure data T") are generated.
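The use of the number of levels can be sketched with a stack that tracks the current chain of titles. This is a simplified illustration under assumptions: it uses only the depth of each title number, ignores the continuity check mentioned above, and stores nodes as plain dictionaries; none of these choices comes from the source.

```python
import re

NUMBER_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)")


def depth_of(title: str) -> int:
    """Number of levels in a title number: "1" -> 1, "1.1" -> 2, "1.1.1" -> 3."""
    match = NUMBER_RE.match(title)
    if match is None:
        raise ValueError("line does not start with a title number")
    return len(match.group(1).split("."))


def build_tree(lines):
    """Build a nested dict tree from text lines (titles and paragraphs)."""
    root = {"text": "ROOT", "children": []}
    stack = [(0, root)]  # (depth, node) for the current chain of titles
    for line in lines:
        node = {"text": line, "children": []}
        if NUMBER_RE.match(line):
            depth = depth_of(line)
            # Pop back to the nearest title that is shallower than this one.
            while stack[-1][0] >= depth:
                stack.pop()
            stack[-1][1]["children"].append(node)
            stack.append((depth, node))
        else:
            # A paragraph becomes a leaf under the most recent title.
            stack[-1][1]["children"].append(node)
    return root


sample_lines = [
    "1. Bolt",
    "1.1 Installation of pipe end",
    "1.1.1 Phenomenon and cause",
    "The bolt loosened.",
    "2. Piping",
    "2.1 Water leakage from piping",
]
tree = build_tree(sample_lines)
```

Because this sketch looks at depth alone, it places every depth-1 title directly under the root; an analysis that also weighs continuity and nested structures can reach a different result, as the example of the document 200 shows.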
 FIG. 5 is an explanatory diagram showing an example screen of the chapter and section structure analysis result. The analysis result display screen 500 includes a first panel 501, a second panel 502, a keyword input field 503, a search button 504, and sliders 505. The tree structure data T is displayed in the index part, which is the first panel 501, and the document 1 is displayed in the document display part, which is the second panel 502. When a title in the tree structure data T is selected with the input device 303, the text at the location corresponding to the selected title is displayed in the document display part. The keyword input field 503 is an area that accepts a character string entered by operating the input device 303. When the search button 504 is pressed, a search is executed for the entered word group. The sliders 505 are an interface for setting the weights of the evaluation values Pf, Ps, Pid, Pq, and IPFi described later. Moving a slider to the left decreases the weight, and moving it to the right increases the weight.
 Returning to FIG. 4, the input unit 403 accepts a character string entered in the keyword input field 503 of the analysis result display screen 500 by operating the input device 303. When the search button 504 is pressed, the input unit 403 analyzes the entered character string. When the entered character string consists of a plurality of words separated by spaces, the input unit 403 passes those words to the specifying unit 404. When the entered character string is a sentence, the input unit 403 extracts words from the sentence by morphological analysis and passes them to the specifying unit 404.
 The specifying unit 404 specifies the locations where the search keyword group is described in the tree structure data T analyzed by the analysis unit 402. Here, the search keyword group may be the plurality of words entered in the keyword input field 503 of FIG. 5 with the input device 303, or may be a specification item name in the specification item name dictionary 400. When a specification item name is a character string that is not delimited by spaces, the specifying unit 404 extracts words from the character string by morphological analysis. For example, from "water leakage due to loosening of bolts", the words "bolt", "loosening", and "water leakage" are extracted. The specifying unit 404 searches for a path containing the word group by tracing upward from the paragraphs that are the leaf nodes of the tree structure data T.
 FIG. 6 is an explanatory diagram showing an example of searching the tree structure data T1. For example, when the search keyword group is "bolt, phenomenon, water leakage", the specifying unit 404 searches from each of the leaf nodes N111a, N112a, N211a, N211c, and N212a of the tree structure data T1 in FIG. 1 toward the root node. In each node being searched, the specifying unit 404 specifies the positions of the search keywords "bolt", "phenomenon", and "water leakage".
 Taking a search that starts from the leaf node N111a as an example, the specifying unit 404 first specifies the water leakage B1 in the node N111a. Next, starting from the leaf node N111a in which the water leakage B1 was found, the specifying unit 404 traces the node itself and its upper nodes and searches for the search keyword group "bolt, phenomenon, water leakage". The specifying unit 404 then specifies the phenomenon D1 in the node N111. Starting from the node N111 in which the phenomenon D1 was found, the specifying unit 404 again traces the node itself and its upper nodes and continues the search. The specifying unit 404 then specifies the bolt A1 in the node N1. When the root node N0 is reached, the search from the leaf node N111a ends. In this way, the specifying unit 404 can specify the path "water leakage B1 → phenomenon D1 → bolt A1".
 Taking a search that starts from the leaf node N211a as another example, the specifying unit 404 specifies the water leakage B3 in the leaf node N211a. After that, by tracing the node itself and its upper nodes from the node N211a in which the water leakage B3 was found, the specifying unit 404 can find the phenomenon D2 and the water leakage B2, but it reaches the root node N0 without finding "bolt". The specifying unit 404 therefore specifies the path "water leakage B3 → phenomenon D2 → water leakage B2".
 Next, for the case where the search keyword group is "pipe, corner, gap", an example in which the search starts from the leaf node N211c is described. The specifying unit 404 searches the leaf node N211c for "gap" but finds no hit, so it searches the node N211b, which is the upper node. The specifying unit 404 specifies the gap G in the node N211b. Starting from the node N211b in which the gap G was found, the specifying unit 404 traces the node itself and its upper nodes and continues the search for the search keyword group "pipe, corner, gap". The specifying unit 404 then finds the corner F within the same node.
 Next, starting from the node N211b in which the corner F was found, the specifying unit 404 traces the node itself and its upper nodes and continues the search for the search keyword group "pipe, corner, gap". The specifying unit 404 then specifies the pipe E1 in the node N21. Starting from the node N21 in which the pipe E1 was found, the specifying unit 404 traces the node itself and its upper nodes and continues the search. The specifying unit 404 then specifies the pipe E2 in the node N2. When the root node N0 is reached, the search from the leaf node N211c ends. In this way, the specifying unit 404 can specify the path "gap G → corner F → pipe E1 → pipe E2".
 Thus, the "gap G" that is the start of the path (the leaf-node side) does not have to be located in the leaf node N211c. In addition, even when the words corner F and gap G are located apart from each other within the paragraph shown in the node N211b, they may still be semantically connected, so the specifying unit 404 specifies both.
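 The leaf-to-root search described above can be sketched as follows. The `Node` class, the `trace_path` function, and the contents of the example nodes are hypothetical and only illustrate the idea of collecting keyword hits while walking from a leaf node up to the root node; the order in which hits inside one node are reported is also an assumption.

```python
class Node:
    """One title or paragraph of the tree structure data (hypothetical layout)."""
    def __init__(self, name, words, parent=None):
        self.name = name      # node label, e.g. "N111a"
        self.words = words    # words of the title or paragraph, in reading order
        self.parent = parent  # None for the root node

def trace_path(leaf, keywords):
    """Walk from a leaf to the root and collect (keyword, node name) hits in visiting order."""
    hits = []
    node = leaf
    while node is not None:
        # Assumption: inside one node, later words are reported first.
        for word in reversed(node.words):
            if word in keywords:
                hits.append((word, node.name))
        node = node.parent
    return hits

# Hypothetical tree loosely following the FIG. 6 example.
n0 = Node("N0", [])
n1 = Node("N1", ["bolt"], n0)
n11 = Node("N11", ["inspection"], n1)
n111 = Node("N111", ["phenomenon"], n11)
n111a = Node("N111a", ["leakage", "was", "observed"], n111)

print(trace_path(n111a, {"bolt", "phenomenon", "leakage"}))
```

 When a keyword is missing along the way, as in the example starting from the leaf node N211a, the walk simply reaches the root with fewer hits.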
 The specifying unit 404 also extracts the shortest path for each specified path. The shortest path is the path that, within a specified path, contains the largest number of distinct search keywords and has the smallest total (or average) inter-word distance. The shortest path is extracted because, when the same keyword appears more than once in a specified path, occurrences that are far apart are unlikely to be related, so the path is evaluated with the distant occurrences excluded. For example, a search keyword ("pipe") may appear more than once, as in the path "gap G → corner F → pipe E1 → pipe E2" shown in FIG. 6. In this case, the path with the shortest inter-word distance is extracted. As the inter-word distance, at least one of an inter-word distance in terms of the number of nodes and an inter-word distance in terms of the number of words is applied, depending on the positions of the search keywords in the nodes constituting the path. For example, when the two words are in different nodes, the inter-word distance in terms of the number of nodes is applied, and when the two words are in the same node, the inter-word distance in terms of the number of words is applied. Details are described later.
 FIGS. 7A and 7B are explanatory diagrams showing examples of specifying the shortest path. In FIGS. 7A and 7B, the search keyword group is "valve, bolt, water leakage". In FIG. 7A, (A) shows a path P7 found in the tree structure data T. The path P7 follows "water leakage 73b → bolt 72b → water leakage 73a → valve 71b → valve 71b → bolt 72a → valve 71a". In FIG. 7A, the paths P71 and P72 in (B) are examples of paths taken from the path P7 that contain the largest number of distinct search keywords (in this case, paths containing all of the search keywords). The specifying unit 404 calculates the inter-word distance for each of the paths P71 and P72. The path P71 is specified even though the order of "valve" and "bolt" is reversed relative to the search keyword group "valve, bolt, water leakage". Paths such as "water leakage 73a → bolt 72a → valve 71a", "water leakage 73b → bolt 72b → valve 71a", and "water leakage 73b → bolt 72a → valve 71a" are also specified, but they are omitted here to simplify the description.
 For the path P71, the inter-word distance is the sum d1 (= d11 + d12) of the inter-word distance d11 between the valve 71b and the bolt 72a and the inter-word distance d12 between the valve 71b and the water leakage 73a. For the path P72, the inter-word distance is the sum d2 (= d21 + d22) of the inter-word distance d21 between the valve 71b and the bolt 72b and the inter-word distance d22 between the bolt 72b and the water leakage 73b. The specifying unit 404 specifies whichever of the paths P71 and P72 has the shorter inter-word distance as the shortest path of the path P7.
 In FIG. 7B, the path Pb1 in (A) contains two valves 71a and 71b. In this case, the inter-word distance between the valve 71b and the bolt 72 is shorter than the inter-word distance between the valve 71a and the bolt 72, so the specifying unit 404 specifies the path Pb2 from the path Pb1 as the shortest path.
 The path Pc in (B) contains two bolts 72a and 72b. In this case, the specifying unit 404 compares the sum of the inter-word distance between the valve 71 and the bolt 72a and the inter-word distance between the bolt 72a and the water leakage 73 with the sum of the inter-word distance between the valve 71 and the bolt 72b and the inter-word distance between the bolt 72b and the water leakage 73, and specifies the path with the shorter sum as the shortest path. Alternatively, the specifying unit 404 may give priority to the bolt 72b, which is closer to the water leakage 73 at the start of the path Pc on the leaf-node side, and specify the path "water leakage 73 → bolt 72b → valve 71" as the shortest path.
 The path Pd1 in (C) contains two water leakages 73a and 73b. In this case, the inter-word distance between the bolt 72 and the water leakage 73a is shorter than the inter-word distance between the bolt 72 and the water leakage 73b, so the specifying unit 404 specifies the path Pd2 from the path Pd1 as the shortest path.
 Even when a found path contains more than one of the patterns shown in (A) to (C) of FIG. 7B, the path with the shortest inter-word distance is specified as the shortest path.
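 The selection of a shortest path among repeated keyword occurrences can be sketched as below. This is a simplified illustration, not the method of the specifying unit 404: it assumes that every occurrence has already been reduced to a single integer position along the traced path, that every keyword occurs at least once, and that the total distance is the sum of the gaps between neighbouring chosen occurrences. The function name and the example positions are hypothetical.

```python
from itertools import product

def shortest_combination(occurrences):
    """Pick one occurrence per keyword so that the summed gaps are minimal.

    occurrences: dict mapping keyword -> list of integer positions along the path.
    Returns (total_distance, {keyword: chosen position}), or None if a keyword has no occurrence.
    """
    keywords = list(occurrences)
    best = None
    for combo in product(*(occurrences[k] for k in keywords)):
        ordered = sorted(combo)
        # Sum of distances between neighbouring chosen occurrences.
        total = sum(b - a for a, b in zip(ordered, ordered[1:]))
        if best is None or total < best[0]:
            best = (total, dict(zip(keywords, combo)))
    return best

# Hypothetical positions: each keyword appears twice on the path.
print(shortest_combination({"valve": [2, 9], "bolt": [4, 7], "leakage": [0, 5]}))
```

 Exhaustive enumeration is adequate here because a path contains only a handful of keyword occurrences; the mixed node-count and word-count distance of the description would replace the simple position difference used in this sketch.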
 Returning to FIG. 4, the calculation unit 405 calculates, for each set of words specified by the specifying unit 404 in the tree structure data T, the evaluation values Pf, Ps, Pid, Pq, and IPFi based on the locations of the words in the set. Specifically, for example, the calculation unit 405 calculates the five values Pf, Ps, Pid, Pq, and IPFi described below and uses them to calculate an evaluation value of the path (the specified set of words). For example, a weighted linear sum is used as the evaluation function, and the evaluation value is calculated with that evaluation function.
 The evaluation value Pf indicates the proportion of the words in the search keyword group that are included in the set of words specified by the specifying unit 404. Not all of the words constituting the search keyword group necessarily appear in the document. Therefore, the evaluation value Pf is added to the evaluation function as a term whose value increases as the proportion of search keyword words included increases. The evaluation value Pf is calculated by the following formula (1).
 Pf = New / Nsw ... (1)
 New is the number of words included in the set of words specified by the specifying unit 404, and Nsw is the number of words included in the search keyword group.
 For example, in FIG. 6, when the search keyword group is "bolt, phenomenon, water leakage" and the specified set of words is {bolt A1, phenomenon D1, water leakage B2}, New = Nsw = 3, so Pf = 1. On the other hand, when the specified set of words is {phenomenon D2, water leakage B3}, New = 2 and Nsw = 3, so Pf = 2/3.
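 Formula (1) can be checked with a few lines of code. The function name and the use of sets to count distinct words are assumptions made for illustration.

```python
def coverage_pf(found_words, search_keywords):
    """Pf = New / Nsw: share of the search keywords present in the specified word set."""
    nsw = len(set(search_keywords))
    if nsw == 0:
        raise ValueError("the search keyword group must not be empty")
    new = len(set(found_words) & set(search_keywords))
    return new / nsw

keywords = ["bolt", "phenomenon", "leakage"]
print(coverage_pf(["bolt", "phenomenon", "leakage"], keywords))  # all three keywords found
print(coverage_pf(["phenomenon", "leakage"], keywords))          # two of three keywords found
```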
 The evaluation value Ps indicates the proportion of the words in the set specified by the specifying unit 404 that match words in chapter or section titles. That is, the evaluation value Ps indicates whether the words in the specified set are described in line with the chapter and section structure. The words chosen as a specification item name or a search keyword group are the main information in the document 1 and are often written in the titles of chapters and sections. Therefore, the evaluation value Ps is added to the evaluation function as a term whose value increases as the proportion of words contained in titles increases. The evaluation value Ps is calculated by the following formula (2).
 Ps = Nsm / New ... (2)
 Nsm is the number of chapter or section titles that contain a word specified by the specifying unit 404.
 The evaluation value Pid is the reciprocal of the inter-word distance. As the inter-word distance, the inter-word distance in terms of the number of nodes and the inter-word distance in terms of the number of words can be used.
 The inter-word distance in terms of the number of words is calculated from the sequence of words arranged in reading order. The distance between a word in the title of chapter 1 and a word in a paragraph of section 1.5 therefore becomes large, and the two words end up being treated as unlikely to be semantically connected. In contrast, the inter-word distance in terms of the number of nodes, which is a feature of this method, is calculated from the sequence of paragraph and title nodes traced when the specifying unit 404 searches for a set of words.
 For example, for the path "water leakage B1 → phenomenon D1 → bolt A1" found in FIG. 6, the inter-word distance in terms of the number of nodes is first calculated for "water leakage B1 → phenomenon D1". In this case, there is no node between the node N111a of one word (water leakage B1) and the node N111 of the other word (phenomenon D1). The inter-word distance in terms of the number of nodes for "water leakage B1 → phenomenon D1" is therefore 1. The inter-word distance in terms of the number of nodes is then calculated for "phenomenon D1 → bolt A1". In this case, there is one node between the node N1 of one word (bolt A1) and the node N111 of the other word (phenomenon D1). The inter-word distance in terms of the number of nodes for "phenomenon D1 → bolt A1" is therefore 2. The inter-word distance of the path "water leakage B1 → phenomenon D1 → bolt A1" is thus 1 + 2 = 3.
 As described above, the method of calculating the inter-word distance differs depending on whether the two target words are in the same node or in different nodes. For two words in the same node, the inter-word distance in terms of the number of words may be applied, and for two words in different nodes, the inter-word distance in terms of the number of nodes may be applied.
 For example, for the path "gap G → corner F → pipe E1 → pipe E2" found in FIG. 6, "gap G → corner F → pipe E1" is the shortest path, as shown in FIG. 7B. In this shortest path, the calculation unit 405 calculates the inter-word distance in terms of the number of words for "gap G → corner F", and the inter-word distance in terms of the number of nodes for "corner F → pipe E1".
 In this way, for example, when there is a word in the title of chapter 1 and a word in a paragraph of section 1.5, the inter-word distance is 2, which indicates that the two words are likely to be semantically connected. The smaller the calculated inter-word distance, the more plausible the set of words. Therefore, the evaluation value Pid, which is the reciprocal of the inter-word distance, is added to the evaluation function as a term. The evaluation value Pid is calculated by the following formula (3).
 [Formula (3): equation image JPOXMLDOC01-appb-M000001, not reproduced in this text]
 W_i is the i-th word (i is an integer of 1 or more). D_i is the inter-word distance between W_i and W_{i+1}. The method of calculating the inter-word distance D_i is switched depending on the positional relationship between W_i and W_{i+1}.
 When W_i and W_{i+1} are in the same node, the inter-word distance D_i is the inter-word distance in terms of the number of words: D_i = (number of words between W_i and W_{i+1}) + 1.
 When W_i and W_{i+1} are in different nodes, the inter-word distance D_i is the inter-word distance in terms of the number of nodes: D_i = (number of nodes between N_i and N_{i+1}) + 1. Here, N_i is the node containing W_i, and N_{i+1} is the node containing W_{i+1}.
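 The two distance rules can be sketched as follows. Each word occurrence is represented by a hypothetical pair (node index along the traced path, word index inside the node). Because formula (3) is not reproduced in this text, the sketch assumes that Pid is the reciprocal of the sum of the D_i, in line with the statement that Pid is the reciprocal of the inter-word distance and with the worked example 1 + 2 = 3; that form is an assumption, not a quotation of the formula.

```python
def inter_word_distance(a, b):
    """D_i for two occurrences given as (node_index, word_index) pairs.

    Same node: words in between + 1, i.e. the difference of word indices.
    Different nodes: nodes in between + 1, i.e. the difference of node indices.
    """
    if a[0] == b[0]:
        return abs(a[1] - b[1])
    return abs(a[0] - b[0])

def path_distance(occurrences):
    """Sum of D_i over consecutive occurrences on the path."""
    return sum(inter_word_distance(a, b) for a, b in zip(occurrences, occurrences[1:]))

def pid(occurrences):
    """Assumed form of formula (3): reciprocal of the summed inter-word distance."""
    total = path_distance(occurrences)
    if total == 0:
        raise ValueError("at least two distinct occurrences are required")
    return 1 / total

# "water leakage B1 -> phenomenon D1 -> bolt A1": nodes N111a (0), N111 (1), N1 (3).
example = [(0, 5), (1, 0), (3, 0)]
print(path_distance(example), pid(example))
```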
 The evaluation value Pq indicates the degree of agreement between the order of the words in the search keyword group, or the order of the words in the specification item name, and the order in which the words appear in the chapter and section structure. When the document 100 is a specification, it is often written so that the hierarchy of the specification maps onto the chapter and section structure. The degree of agreement therefore becomes higher the more closely the order of the words constituting the specification item name matches the reverse of the order of the words on the path generated by tracing the chapter and section structure from the lower levels. Accordingly, the evaluation value Pq, which is this degree of agreement in word order, is added to the evaluation function as a term. The evaluation value Pq is calculated by the following formula (4).
 Pq = Neq / New ... (4)
 Neq is the number of times a word W_i and the next word W_{i+1} in the word sequence of the search keyword group appear consecutively. For example, when the words of the search keyword group are ordered "bolt", "water leakage", "cause" and the specifying unit 404 finds them in the order "bolt", "water leakage", "cause", Neq is 2: the pair "bolt" and "water leakage" and the pair "water leakage" and "cause". New is 3. Therefore, Pq = 2/3.
 On the other hand, when the specifying unit 404 finds the words in the order "bolt", "cause", "water leakage", Neq is 0. New is 3. Therefore, Pq = 0. Alternatively, Neq may be defined as the number of pairs for which the word W_i and the word W_{i+1} appear in the correct relative order (another search keyword may lie between W_i and W_{i+1}).
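 The two worked examples for formula (4) can be reproduced with the sketch below. It implements only the strict variant, in which W_i must be followed immediately by W_{i+1}; the function name and the choice to take New as the number of found words are assumptions for illustration.

```python
def order_match_pq(found_order, search_keywords):
    """Pq = Neq / New for the strict variant: W_i immediately followed by W_{i+1}."""
    if not found_order:
        raise ValueError("at least one found word is required")
    index = {word: i for i, word in enumerate(search_keywords)}
    neq = sum(
        1
        for a, b in zip(found_order, found_order[1:])
        if a in index and b in index and index[b] == index[a] + 1
    )
    return neq / len(found_order)

keywords = ["bolt", "leakage", "cause"]
print(order_match_pq(["bolt", "leakage", "cause"], keywords))  # two matching pairs out of New = 3
print(order_match_pq(["bolt", "cause", "leakage"], keywords))  # no matching pair
```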
 The evaluation value IPFi is the average importance of the words in the set specified by the specifying unit 404, where the importance of a word is based on the number of nodes (titles or paragraphs) in which that word appears. For example, the words constituting a specification item name are also used when describing other specification items or content other than specification items. To reduce erroneous extraction of other specification items and of words unrelated to specification items, the evaluation value of a specified set of words needs to be raised when the set contains characteristic words.
 IDF (Inverse Document Frequency) is an index of how characteristic a word is. IDF is the reciprocal of the number of documents in which a word appears; the larger its value, the fewer the documents in which the word appears and the more characteristic the word is. In this embodiment, instead of the number of documents, the reciprocal of the number of nodes in which the word appears in the tree structure data T (IPF: Inverse Paragraph Frequency) is used. The evaluation value IPFi is calculated by the following formula (5).
 IPFi = log(Nen / Ntp) ... (5)
 The logarithm is taken to reduce the variation in the value. Nen is the number of nodes in which the word W_i specified by the specifying unit 404 appears, and Ntp is the total number of nodes in the document.
 For each set of words specified by the specifying unit 404, the calculation unit 405 calculates the final evaluation value P with an evaluation function that is a weighted linear sum having the values Pf, Ps, Pid, Pq, and IPFi as terms. Preset values are used for the weights of Pf, Ps, Pid, Pq, and IPFi, but they can be changed by operating the sliders 505 before the evaluation value P is calculated. For example, to change the evaluation method, the weights of the evaluation values Pf, Ps, Pid, Pq, and IPFi can be changed with the sliders 505. This makes it possible to obtain different search results.
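 The weighted linear sum can be sketched in a few lines. The dictionary keys, the example values, and the default weight of 1.0 are illustrative assumptions; the description does not give concrete weights.

```python
def evaluation_value(values, weights):
    """P = sum of weight * value over the terms Pf, Ps, Pid, Pq, and IPFi."""
    return sum(weights[name] * values[name] for name in values)

# Hypothetical term values for one specified set of words.
values = {"Pf": 1.0, "Ps": 0.5, "Pid": 0.25, "Pq": 0.5, "IPFi": 0.0}
weights = {name: 1.0 for name in values}  # e.g. all sliders at the same setting

print(evaluation_value(values, weights))

# Moving the Pid slider to the right emphasises proximity and changes the ranking score.
weights["Pid"] = 2.0
print(evaluation_value(values, weights))
```

 Recomputing P after a weight change, as in the last two lines, corresponds to re-ranking the same sets of words without repeating the search.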
 The display unit 406 generates output data based on the evaluation value P calculated by the calculation unit 405. Specifically, for example, the display unit 406 generates data (for example, XML data) for an output screen that can display the sets of words specified by the specifying unit 404 in descending order of the evaluation value P.
 FIG. 8 is an explanatory diagram showing output screen example 1 generated by the display unit 406. The output screen 800 of FIG. 8 is the output screen obtained when the tree structure data T1 is searched using the specification item names in the specification item name dictionary 400. The output screen 800 has an index part, which is a first panel 801, and a document display part, which is a second panel 802. The first panel 801 displays, ranked in descending order of the evaluation value P calculated by the calculation unit 405, the specification item name used to search the tree structure data T1, the category name corresponding to that specification item name, and a link to the specified description location. The second panel 802 displays the document 100. When one of the links 813 and 823 in the first panel 801 is selected, the description location designated by that link is displayed in the second panel 802.
 FIG. 9 is an explanatory diagram showing output screen example 2 generated by the display unit 406. The output screen 900 of FIG. 9 is the output screen obtained when the tree structure data T1 is searched using an input keyword group. The first panel 901 displays the tree structure data excluding paragraphs, together with links to the specified description locations. A link to a description location is displayed at the position of the word closest to the leaf node among the words contained in the description location (the specified path). At this time, only description locations whose evaluation value P is larger than a threshold designated in advance may be displayed. The second panel 902 displays the document 100. When a link (description location (1)) in the first panel 901 is selected, the description location 903 designated by that link is displayed.
 The weights of Pf, Ps, Pid, Pq, and IPFi may be changed while the output screens 800 and 900 of FIGS. 8 and 9 are displayed. In this case, the calculation unit 405 recalculates the evaluation value P with the changed weights, and the display of the first panels 801 and 901 is updated according to the recalculated evaluation value P.
 When the output device 304 is a display device external to the information search device 300, the display unit 406 transmits information on the output screens 800 and 900 from the communication IF 305 to the external display device, thereby causing the external display device to display the output screens 800 and 900.
<Example of information search processing procedure>
FIG. 10 is a flowchart illustrating an example of an information search processing procedure performed by the information search apparatus 300. The information search apparatus 300 includes an acquisition process (step S1001) by the acquisition unit 401, a chapter structure analysis process (step S1002) by the analysis unit 402, a search keyword acquisition process by the input unit 403 or using the specification item name dictionary 400 ( Step S1003), a specifying process by the specifying unit 404 (Step S1004), a calculating process by the calculating unit 405 (Step S1005), and a display process by the display unit 406 (Step S1006) are executed. As a result, the series of processes is completed.
FIG. 11 is a flowchart showing a detailed example of the chapter-section structure analysis process (step S1002) shown in FIG. 10. The information search device 300 generates character lines from the character information acquired by the acquisition unit 401 (step S1101). Next, the information search device 300 identifies titles of chapters, sections, items, and the like from the set of generated character lines (step S1102). The information search device 300 also identifies paragraphs from the character lines other than the titles identified in step S1102 (step S1103). The information search device 300 then generates tree structure data by analyzing the hierarchical relationships among the identified titles and paragraphs (step S1104). This ends the chapter-section structure analysis process (step S1002), and the procedure moves to step S1003.
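Steps S1102 to S1104 can be sketched as follows. This is a minimal illustration rather than the method of the embodiment. It assumes that titles are recognizable by a leading section number such as "1." or "1.2" (a hypothetical heuristic, since this passage does not say how titles are identified), and that each remaining character line is one paragraph. The names `Node`, `TITLE_RE`, and `build_tree` are introduced for the example only.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical title heuristic: a section number ("1.", "1.2", "1.2.3")
# followed by whitespace and the title text.
TITLE_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


@dataclass
class Node:
    kind: str   # "root", "title" or "paragraph"
    text: str
    depth: int
    children: List["Node"] = field(default_factory=list)


def build_tree(lines):
    """Build tree structure data from character lines (cf. S1102-S1104)."""
    root = Node("root", "", 0)
    stack = [root]  # path from the root to the most recent title
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = TITLE_RE.match(line)
        if match:
            # Title depth = number of components in the section number.
            depth = match.group(1).count(".") + 1
            while stack[-1].depth >= depth:
                stack.pop()
            node = Node("title", match.group(2), depth)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # A non-title line becomes a paragraph under the current title.
            parent = stack[-1]
            parent.children.append(Node("paragraph", line, parent.depth + 1))
    return root


lines = [
    "1. Specifications",
    "1.1 Power supply",
    "Rated voltage is 100 V.",
    "1.2 Dimensions",
    "Width is 300 mm.",
    "2. Operation",
    "Press the start button.",
]
root = build_tree(lines)
```

Paragraphs are kept as leaf nodes under the nearest preceding title, so each root-to-leaf chain of titles and a paragraph forms a path that can later be searched for keywords.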
FIG. 12 is a flowchart showing a detailed example of the specifying process (step S1004) shown in FIG. 10. The information search device 300 identifies paths that include the search keyword group from the tree structure data T1 and T2 obtained by the chapter-section structure analysis process (step S1002) (step S1201). In step S1201, any path that includes the search keyword group qualifies; the order in which the search keywords appear on the path may differ from the order in which the search keyword group was input. A path also need not include all of the search keywords and may include only part of the search keyword group.
In this case, for example, the number of search keywords allowed to be missing is set in advance, and the information search device 300 identifies a path if the number of missing keywords is within the allowed range. For example, if the allowed number of missing search keywords is 1 and there are three search keywords W1 to W3, any path that includes at least two of W1 to W3 is identified. Next, the information search device 300 identifies the shortest path for each identified path (step S1202).
This ends the specifying process (step S1004), and the procedure moves to step S1005.
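The path identification of step S1201, including the allowance for missing keywords, can be sketched as follows. The sketch makes several assumptions that are not in the description: each path is given as a list of node texts from the root to a leaf, a keyword counts as present if it occurs as a substring of any node text, and the function names are hypothetical. The shortest-path selection of step S1202 is not covered.

```python
def keywords_on_path(path, keywords):
    """Return the keywords that occur in any node text along the path."""
    return [kw for kw in keywords if any(kw in text for text in path)]


def identify_paths(paths, keywords, allowed_missing=0):
    """Keep paths that miss at most `allowed_missing` of the keywords.

    Appearance order on the path is ignored, as in step S1201.
    """
    required = len(keywords) - allowed_missing
    hits = []
    for path in paths:
        found = keywords_on_path(path, keywords)
        if len(found) >= required:
            hits.append((path, found))
    return hits


# Root-to-leaf paths: titles first, paragraph last (illustrative data).
paths = [
    ["Specifications", "Power supply", "Rated voltage is 100 V."],
    ["Specifications", "Dimensions", "Width is 300 mm."],
    ["Operation", "Press the start button."],
]
keywords = ["Power", "voltage", "Width"]

strict = identify_paths(paths, keywords, allowed_missing=0)
relaxed = identify_paths(paths, keywords, allowed_missing=1)
```

With three keywords and one allowed omission, only the first path qualifies because it contains two of the three keywords; with no omissions allowed, no path qualifies.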
As described above, this embodiment can suppress erroneous retrieval of word sets that a proximity search would return because their inter-word distance measured in number of words is short, but that are in fact not semantically connected. In addition, when nodes are in a serial hierarchical relationship, the inter-word distance measured in number of nodes is applied. Therefore, even when the hierarchy is deep, in other words even when the inter-word distance measured in number of words is long, plausible sets of search keywords can be narrowed down. This improves search accuracy. Furthermore, in this embodiment, candidates with a short inter-word distance measured in number of titles are ranked higher while the chapter-section structure is respected, so a plausible description location can be identified even if the chapter-section structure contains errors.
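The contrast between the two distance measures can be illustrated with a small sketch. It assumes a path given as a list of node texts, whitespace-separated words, exact word matching, and a node distance equal to the difference of node indices on the path. These conventions and the function names are hypothetical choices for the example; how the two units are scaled or combined in the evaluation value is not specified in this passage.

```python
def locate(path, keyword):
    """Return (node index, word index) of the first occurrence, or None."""
    for node_index, text in enumerate(path):
        words = text.split()
        if keyword in words:
            return node_index, words.index(keyword)
    return None


def structural_distance(path, kw_a, kw_b):
    """Node-count distance across nodes, word-count distance within a node."""
    a, b = locate(path, kw_a), locate(path, kw_b)
    if a is None or b is None:
        return None
    if a[0] != b[0]:
        return abs(a[0] - b[0])   # different nodes: count nodes
    return abs(a[1] - b[1])       # same node: count words


def flat_word_distance(path, kw_a, kw_b):
    """Plain proximity-search distance over the concatenated text."""
    words = " ".join(path).split()
    if kw_a not in words or kw_b not in words:
        return None
    return abs(words.index(kw_a) - words.index(kw_b))


path = [
    "Motor specifications",
    "General notes on installation and wiring",
    "Rated voltage is 100 V",
]
node_based = structural_distance(path, "Motor", "voltage")
word_based = flat_word_distance(path, "Motor", "voltage")
```

Here "Motor" and "voltage" are nine words apart in the flattened text but only two nodes apart on the path, so a node-based measure still treats them as closely related even though the intervening title is long.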
The present invention is not limited to the embodiments described above and includes various modifications and equivalent configurations within the spirit of the appended claims. For example, the embodiments above have been described in detail to make the present invention easy to understand, and the present invention is not necessarily limited to configurations that include every element described. Part of the configuration of one embodiment may be replaced with the configuration of another embodiment, and the configuration of another embodiment may be added to the configuration of a given embodiment. For part of the configuration of each embodiment, other configurations may be added, deleted, or substituted.
Some or all of the configurations, functions, processing units, processing means, and the like described above may be implemented in hardware, for example by designing them as an integrated circuit, or may be implemented in software by having a processor interpret and execute programs that implement the respective functions.
Information such as programs, tables, and files that implement each function can be stored in a storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a recording medium such as an IC card, an SD card, or a DVD.
The control lines and information lines shown are those considered necessary for the explanation and do not necessarily represent all of the control lines and information lines required in an implementation. In practice, almost all of the components may be considered to be interconnected.

Claims (13)

  1.  An information search device comprising a processor that executes a program and a storage device that stores the program, wherein
     the processor executes:
     a specifying process of searching tree structure data, in which titles and paragraphs in a document are hierarchically structured as nodes based on character information indicating characters in the document and the positions of the characters, for a path that traces a search keyword group, thereby identifying positions on the tree structure data of the search keywords included in the path;
     a calculation process of calculating an evaluation value relating to the identification accuracy of the search keyword group, based on the positions on the tree structure data of the search keywords included in the path identified by the specifying process; and
     a display process of displaying the description location in the document corresponding to the path, based on the evaluation value calculated by the calculation process.
  2.  The information search device according to claim 1, wherein, in the calculation process, the processor calculates the evaluation value based on an inter-word distance between a first search keyword among the search keywords included in the path and a second search keyword connected to the first search keyword by the path.
  3.  The information search device according to claim 2, wherein, in the calculation process, when the first search keyword and the second search keyword are included in different nodes, the processor calculates the evaluation value based on an inter-word distance that depends on the number of nodes between the first search keyword and the second search keyword.
  4.  The information search device according to claim 3, wherein, in the calculation process, when the first search keyword and the second search keyword are included in the same node, the processor calculates the evaluation value based on an inter-word distance that depends on the number of words between the first search keyword and the second search keyword.
  5.  The information search device according to claim 1, wherein, in the specifying process, the processor identifies, from among the paths, a path that includes the largest number of distinct search keywords and for which the sum of the inter-word distances between a first search keyword of the search keyword group and a second search keyword connected to the first search keyword by the path is the shortest, and identifies the positions on the tree structure data of the search keywords included in the identified path.
  6.  The information search device according to claim 5, wherein, in the specifying process, when the first search keyword and the second search keyword are included in different nodes, the processor uses an inter-word distance that depends on the number of nodes between the first search keyword and the second search keyword to identify the shortest path.
  7.  The information search device according to claim 6, wherein, in the specifying process, when the first search keyword and the second search keyword are included in the same node, the processor uses an inter-word distance that depends on the number of words between the first search keyword and the second search keyword to identify the shortest path.
  8.  The information search device according to claim 1, wherein, in the calculation process, the processor calculates the evaluation value based on the proportion of the search keyword group that is included in the path.
  9.  The information search device according to claim 1, wherein, in the calculation process, the processor calculates the evaluation value based on the proportion of the search keywords included in the path that appear in the titles.
  10.  The information search device according to claim 1, wherein, in the calculation process, the processor calculates the evaluation value based on the degree of agreement between the order of the search keyword group and the order in which the search keywords included in the path appear on the path.
  11.  The information search device according to claim 1, wherein, in the calculation process, the processor calculates the evaluation value based on an importance of each search keyword included in the path, the importance being obtained from the number of paragraphs in which the search keyword appears, the total number of paragraphs in the document, the number of titles in which the search keyword appears, and the total number of titles in the document.
  12.  An information search method in which a processor that executes a program stored in a storage device executes:
     a specifying process of searching tree structure data, in which titles and paragraphs in a document are hierarchically structured as nodes based on character information indicating characters in the document and the positions of the characters, for a path that traces a search keyword group, thereby identifying positions on the tree structure data of the search keywords included in the path;
     a calculation process of calculating an evaluation value relating to the identification accuracy of the search keyword group, based on the positions on the tree structure data of the search keywords included in the path identified by the specifying process; and
     a display process of displaying the description location in the document corresponding to the path, based on the evaluation value calculated by the calculation process.
  13.  An information search program that causes a processor to execute:
     a specifying process of searching tree structure data, in which titles and paragraphs in a document are hierarchically structured as nodes based on character information indicating characters in the document and the positions of the characters, for a path that traces a search keyword group, thereby identifying positions on the tree structure data of the search keywords included in the path;
     a calculation process of calculating an evaluation value relating to the identification accuracy of the search keyword group, based on the positions on the tree structure data of the search keywords included in the path identified by the specifying process; and
     a display process of displaying the description location in the document corresponding to the path, based on the evaluation value calculated by the calculation process.
PCT/JP2014/072762 2014-08-29 2014-08-29 Information retrieval apparatus, information retrieval method, and information retrieval program WO2016031055A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2014/072762 WO2016031055A1 (en) 2014-08-29 2014-08-29 Information retrieval apparatus, information retrieval method, and information retrieval program

Publications (1)

Publication Number Publication Date
WO2016031055A1 true WO2016031055A1 (en) 2016-03-03

Family

ID=55398990

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/072762 WO2016031055A1 (en) 2014-08-29 2014-08-29 Information retrieval apparatus, information retrieval method, and information retrieval program

Country Status (1)

Country Link
WO (1) WO2016031055A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597341A (en) * 2018-05-25 2021-04-02 中科寒武纪科技股份有限公司 Video retrieval method and video retrieval mapping relation generation method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07225770A (en) * 1994-02-10 1995-08-22 Fuji Xerox Co Ltd Data retrieval device
JP2008146209A (en) * 2006-12-07 2008-06-26 Just Syst Corp Document retrieval device, document retrieval method and document retrieval program
WO2009048130A1 (en) * 2007-10-12 2009-04-16 Nec Corporation Document rating calculation system, document rating calculation method and program

Similar Documents

Publication Publication Date Title
JP5316158B2 (en) Information processing apparatus, full-text search method, full-text search program, and recording medium
US10360294B2 (en) Methods and systems for efficient and accurate text extraction from unstructured documents
JP6775935B2 (en) Document processing equipment, methods, and programs
JP2013105321A (en) Document processing device, method of analyzing relationship between document constituents and program
JP2007072646A (en) Retrieval device, retrieval method, and program therefor
JP4832952B2 (en) Database analysis system, database analysis method and program
JP6108212B2 (en) Synonym extraction system, method and program
JP5780036B2 (en) Extraction program, extraction method and extraction apparatus
JP5869948B2 (en) Passage dividing method, apparatus, and program
WO2016031055A1 (en) Information retrieval apparatus, information retrieval method, and information retrieval program
KR101113787B1 (en) Apparatus and method for indexing text
JP2010272006A (en) Relation extraction apparatus, relation extraction method and program
JP2019211959A (en) Search method, search program, and search device
JP5541014B2 (en) Book information search device, book information search system, book information search method and program
KR20230003184A (en) information retrieval system
WO2012061983A1 (en) Seed set expansion
JP6777445B2 (en) Citation map generator, citation map generation method and computer program
JPWO2020157887A1 (en) Sentence structure vectorization device, sentence structure vectorization method, and sentence structure vectorization program
JP6565565B2 (en) Information processing apparatus, name determination method, and name determination program
JP5417359B2 (en) Document evaluation support system and document evaluation support method
JP2014235584A (en) Document analysis system, document analysis method, and program
JP6213019B2 (en) Sequence extraction method, sequence extraction program, and sequence extraction device
US11100099B2 (en) Data acquisition device, data acquisition method, and recording medium
JP5998779B2 (en) SEARCH DEVICE, SEARCH METHOD, AND PROGRAM
JP2014146076A (en) Character string extraction method, character string extraction apparatus, and character string extraction program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14900539

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14900539

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP