WO2016162961A1 - Text search device - Google Patents

Text search device

Info

Publication number
WO2016162961A1
WO2016162961A1 PCT/JP2015/060904 JP2015060904W WO2016162961A1 WO 2016162961 A1 WO2016162961 A1 WO 2016162961A1 JP 2015060904 W JP2015060904 W JP 2015060904W WO 2016162961 A1 WO2016162961 A1 WO 2016162961A1
Authority
WO
WIPO (PCT)
Prior art keywords
specification item
sentence
text
representative
search
Application number
PCT/JP2015/060904
Other languages
French (fr)
Japanese (ja)
Inventor
貴元 松井
慶 今沢
Original Assignee
株式会社日立製作所
Application filed by 株式会社日立製作所
Priority to PCT/JP2015/060904
Publication of WO2016162961A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • The present invention relates to a text search device that extracts specific text from an electronic document.
  • Patent Document 1: Japanese Patent No. 2885487
  • Patent Document 1 describes extracting word notations and semantic categories from the search sentence, calculating their similarity to the index of registered documents (an index of the word notations and semantic categories attached to headings and paragraphs), and searching and displaying results starting from the chapter with the largest sum of heading similarity and paragraph similarity and, within the same chapter, starting from the paragraph with the highest similarity.
  • In the left column on page 3 of that publication, the word notations “書式” (“format”) and “フォーマット” (also “format”, written as a loanword) are assigned the semantic category [FMT], the word notations “出力” (“output”) and “書き込み” (“write”) the semantic category [OUT], and the word notation “指定” (“designation”) the semantic category [SITE].
  • The similarity of each paragraph is calculated from the presence or absence of the corresponding semantic category.
  • However, only the total value determines the order of search and display. Therefore, even among paragraphs retrieved with the same similarity, one may be the content the user wanted to find while another is content the user did not want to detect, and the wanted content may even be the one with the lower total similarity score. In other words, erroneous searches (false detections) are likely to occur.
  • In some documents, a specific word may be omitted. In that case the similarity is low, and as a result the passage is missed by the search.
  • An object of the present invention is to improve search accuracy.
  • According to the present invention, search accuracy can be improved.
  • FIG. 1 shows a configuration example of the text search device 100 that constitutes the system of the present embodiment.
  • The text search device 100 includes a data storage unit 101, a model generation processing unit 102, a model storage unit 103, a text extraction processing unit 104, an interface unit 105, and a bus, among other elements.
  • The interface unit 105 comprises input devices with which the user enters technical documents such as requirement specifications, and output devices that output extracted text and the like. Examples include a keyboard, a mouse, a display, and a printer.
  • In this system, a graphical user interface (GUI) is configured on the screen of the interface unit 105 based on the processing of the text extraction processing unit 104, and various types of information are displayed.
  • The data storage unit 101 is composed of known elements such as an HDD or MO, and holds the specification item tree structure 1011, the technical document results 1012, the sentence-specific topic probability results 1013, and the sentence-specific representative specification item results 1014. These data and programs may instead be acquired or referenced from outside via a communication network.
  • The model generation processing unit 102 is composed of known elements such as a CPU, RAM, and ROM.
  • The model generation processing unit 102 performs processing that realizes the characteristic functions of the present embodiment. It includes a topic-to-representative-specification-item linking engine 1021, a sentence-specific topic probability estimation model generation engine 1022, and a representative specification item certainty factor calculation model generation engine 1023.
  • A representative specification item is a specification item that is defined at an upper node of the search target specification item in the specification item tree structure 1011 and that encompasses lower specification items. The following embodiments describe the case where the representative specification item is the specification item of the highest node, although it may also be the specification item of an upper node in an intermediate layer.
  • The model storage unit 103 is composed of known elements such as an HDD or MO, and holds the sentence-specific topic probability estimation model 1031 and the representative specification item certainty factor estimation model 1032. These data and programs may instead be acquired or referenced from outside via a communication network.
  • The text extraction processing unit 104 is composed of known elements such as a CPU, RAM, and ROM.
  • The text extraction processing unit 104 performs processing that realizes the characteristic functions of the present embodiment. It includes a sentence-specific topic probability calculation engine 1041, a representative specification item certainty factor calculation engine 1042, an in-sentence search word proximity calculation engine 1043, and a composite index calculation engine 1044.
  • The device 100 also has known elements such as an OS, middleware, and applications, and in particular has an existing processing function for displaying a GUI screen in web page format on the interface unit 105, such as a display.
  • The text extraction processing unit 104 uses this existing processing function to draw and display predetermined screens and to process the data that the user enters on those screens.
  • FIG. 2 is a data definition example of the specification item tree structure 1011.
  • The specification item tree structure 1011 is a hierarchical definition of the group of specification items that may be used to determine product specifications.
  • “Specification item 3” 101103, “Specification item 2” 110102, and “Specification item 1” 110101 describe specification items in nodes (vertices), with higher-level specifications in the upper nodes, and the parent-child relationships in which the nodes are connected by edges (connecting lines) are defined as a tree structure.
  • FIG. 3 is a table example of the technical document results 1012.
  • On the vertical axis (rows), sentence IDs are arranged that uniquely identify each sentence of a document constituent unit in the technical documents stored in the technical document results 1012.
  • The horizontal axis (columns) stores, from the right side, the information that together with the sentence ID uniquely identifies a sentence, and the document constituent unit to which the sentence belongs.
  • A file name is stored as the information for uniquely identifying a sentence, and a chapter title and a section title are stored as the information for uniquely identifying a document constituent unit.
  • A document constituent unit is a unit that hierarchically divides the text. It may also be a “part”, a “subsection”, a “section”, a “paragraph”, or the like.
  • FIG. 4 is a table example of the sentence-specific topic probability results 1013.
  • Sentence IDs of the technical documents are arranged on the vertical axis (rows).
  • On the horizontal axis (columns), topic IDs are arranged that uniquely identify the topics contained in the sentence with each sentence ID.
  • Each cell stores the probability that the sentence with that sentence ID is about the topic with that topic ID.
  • For example, the cell where the row “sentence A” and the column “topic 2” intersect contains “0.2”, which means that the probability that sentence A deals with topic 2 was 0.2.
  • FIG. 5 is a table example of the sentence-specific representative specification item results 1014.
  • This table indicates which representative specification item the text of each document constituent unit relates to. Sentence IDs are arranged on the vertical axis (rows), and representative specification item IDs are arranged on the horizontal axis (columns).
  • Each cell where a row “sentence ID” and a column “representative specification item ID” intersect stores flag data indicating whether the corresponding word group appears in that sentence (in this example, “1” if present and “0” if absent). Taking FIG. 5 as an example, the cell where the row “sentence 1” and the column “representative specification item B” intersect contains “1”, which means that the word group for representative specification item B appears in sentence 1.
  • FIG. 6 is an example formula for the sentence-specific topic probability estimation model. This model takes the per-sentence word distribution of each document constituent unit as input and estimates the topic probabilities of the sentence. Latent Dirichlet Allocation (LDA) is one technique that can be used, and FIG. 6 is an example formula for the case where LDA is used.
  • FIG. 7 is an example formula for the representative specification item certainty factor estimation model. This model estimates, from the topic probabilities of each sentence, which representative specification items the sentence describes. The formula in FIG. 7 is an example for the case where a Support Vector Machine (SVM) is used.
  • FIG. 8 is a diagram showing the overall processing flow.
  • In step S801, a requirement specification, which is a technical document, and a specification item created from the requirement specification are acquired as input information via the interface unit 105.
  • Hereinafter, the requirement specification entered here is referred to as the “new technical document”, and the specification item is referred to as the “search specification item”.
  • In step S802, each sentence of a document constituent unit (e.g., chapter or section) of the new technical document is extracted, each sentence is divided into words, and the number of occurrences of each word (hereinafter referred to as the “word distribution”) is counted.
  • FIG. 12 shows an example of the totaled word distribution by sentence.
  • The sentence-specific word distribution obtained here and the sentence-specific topic probability estimation model 1031 are input to the sentence-specific topic probability calculation engine 1041 to calculate the topic probabilities of each sentence. Details are described later in (9) and the corresponding figure.
  • In step S803, the representative specification item defined at the parent node of the search specification item is acquired from the specification item tree structure 1011. If the search specification item has no parent node, the search specification item itself is used as the representative specification item.
  • Hereinafter, this representative specification item is referred to as the “search representative specification item”.
  • The specification item tree structure is generated by the engine 1021 from the specification items and the representative specification items.
  • In step S804, the search representative specification item acquired in step S803, the topic probabilities of each sentence calculated in step S802, and the representative specification item certainty factor estimation model 1032 are input to the representative specification item certainty factor calculation engine 1042. The engine calculates the representative specification item certainty factor (hereinafter referred to as the “certainty factor”), which is the probability that each sentence of a document constituent unit of the new technical document describes a topic related to the search representative specification item. Details of these processing steps are described later in (10) and (11) and with reference to FIG. 13 and the related figure.
  • In step S805, the words constituting the search specification item are input to the in-sentence search word proximity calculation engine 1043. The engine calculates the in-sentence search word proximity (hereinafter referred to as the “proximity”), which is the probability that each sentence of a document constituent unit of the new technical document contains a description corresponding to the search specification item.
  • The input here may be limited to search specification items whose certainty factor is at or above a certain level.
  • In step S806, the certainty factor calculated in step S804 and the proximity calculated in step S805 are input to the composite index calculation engine 1044, which calculates a composite index. A high composite index means both a high probability that the text of the new technical document describes a topic related to the representative specification item and a high probability that it contains a description related to the search specification item.
  • In step S807, the sentences relating to each search specification item in the new technical document are extracted based on the composite index from step S806 and displayed as the extraction result.
  • The preceding and following sentences are displayed as well, and the sentence relating to the search specification item is highlighted.
  • In step S8021, each sentence is extracted per document constituent unit of the new technical document acquired in step S801, and morphological analysis is performed to create a word list in which each sentence is divided into words.
  • FIG. 10 shows an example of a word list.
  • From the new technical document 1001 reading “the arrangement of device A and device B has been changed, so please extend the wiring length.”, a word list 1002 is created in which the sentence is divided into words such as “device A”, “and”, “device B”, “arrangement”, “change”, “so”, “,”, “wiring”, “length”, “extension”, “please”, and “.”, together with the particles between them.
  • In step S8022, the part of speech of each word in the word list 1002 created in step S8021 is determined.
  • In step S8023, only the words of a chosen part of speech are selected, based on the parts of speech determined in step S8022.
  • In this example, nouns are extracted. As words that express the meaning of product or service specification items, nouns are more relevant than modifiers such as adjectives and adverbs. Selecting only the nouns from all the words therefore shortens processing time with almost no loss of extraction accuracy.
  • In step S8024, for the words of the part of speech selected in step S8023, the word appearance frequency (word distribution) in each sentence is totaled per sentence ID.
  • In step S8025, it is checked whether the sentence-specific topic probability estimation model 1031 is up to date with respect to the sentences stored in the technical document results 1012. If the model is up to date, it is acquired as is in step S8026. If it is not, the model 1031 is updated and then acquired in step S8027. Details of the update process are described later with reference to the corresponding figure.
  • In step S8028, the sentence-specific topic probability calculation engine 1041 calculates the topic probabilities of each sentence using the sentence-specific topic probability estimation model acquired in step S8026 or S8027.
  • Update processing flow of the sentence-specific topic probability estimation model. FIG. 11 is a diagram showing this update processing flow.
  • In step S80271, technical documents that have not yet undergone this processing are acquired from the technical document results 1012. The sentence-specific topic probability estimation model 1031 may also be re-created with already processed technical documents included as processing targets.
  • In step S80272, as in step S8021, morphological analysis is performed on the technical text acquired in step S80271 to generate a word list.
  • In step S80273, as in step S8022, the part of speech of each word in the word list created in step S80272 is determined.
  • In step S80274, as in step S8023, only the words of a chosen part of speech are selected, based on the parts of speech determined in step S80273.
  • In step S80275, as in step S8024, the appearance frequencies of the words of the part of speech selected in step S80274 are totaled.
  • In step S80276, the data processed in steps S80271 to S80275 and the sentence-specific topic probability estimation model 1031 are input to the sentence-specific topic probability estimation model generation engine 1022 to update the model 1031.
  • (11) Representative specification item certainty factor calculation processing flow. FIG. 13 is a diagram showing this calculation processing flow.
  • In step S8041, it is checked whether the representative specification item certainty factor estimation model 1032 is up to date with respect to the sentence-specific representative specification item results 1014.
  • If the model is up to date, the representative specification item certainty factor estimation model 1032 is acquired in step S8042.
  • If it is not, the representative specification item certainty factor estimation model is updated and then acquired in step S8043. Details of the update process are described later with reference to the corresponding figure.
  • FIG. 14 is a diagram showing the update processing flow of the representative specification item certainty factor estimation model.
  • In step S80431, the sentence-specific topic probability results newly registered since the previous processing are acquired from the sentence-specific topic probability results 1013.
  • In step S80432, the sentence-specific representative specification item results newly registered since the previous processing are acquired from the sentence-specific representative specification item results 1014.
  • In step S80433, the data added to the sentence-specific topic probability results acquired in step S80431 and to the sentence-specific representative specification item results acquired in step S80432, together with the representative specification item certainty factor estimation model 1032, are input to the representative specification item certainty factor calculation model generation engine 1023, and the representative specification item certainty factor estimation model is updated.
  • Output screen. FIG. 15 shows an output screen. An example of the GUI screen displayed on the interface unit 105 is described with reference to this figure.
  • The GUI screen mainly consists of a document structure display area 1501 for the new technical document, a specification hierarchical structure display area 1502 for the product or service related to the new technical document, and a text display area 1503 that displays the text of the document constituent units of the new technical document.
  • In the document structure display area 1501, chapter titles, section titles, and the like are displayed hierarchically based on the chapter attribute information stored in the technical document results 1012.
  • The document structure highlighting display 1504 indicates the position in the document of the sentence corresponding to the specification item designated by the user.
  • The document structure probability display 1505 indicates the probability (representative specification item certainty factor) that a topic related to the representative specification item defined at the parent node of the user-designated specification item is described. In the example of FIG. 15, the probability is shown as a 5-level bar, but other forms of probability display may be used.
  • In the specification hierarchical structure display area 1502, the specification items related to the target product or service are displayed hierarchically based on the specification item tree structure 1011. Below each corresponding specification item, the extraction results produced by the processing of the present invention are displayed from the top in descending order of their probability of being correct (in-sentence search word proximity).
  • The specification hierarchical structure highlight display 1506 indicates the candidate extraction results for the specification item designated by the user.
  • The specification hierarchical structure highlight display 1507 indicates the representative specification item defined at the parent node of the specification item designated by the user.
  • The specification hierarchical structure probability display 1508 indicates the probability that the extraction result for the corresponding specification item is correct (in-sentence search word proximity). In the example of FIG. 15, the probability is shown as a 5-level bar, but other forms of probability display may be used.
  • The in-text highlighting area 1509 shows the sentence corresponding to the specification hierarchical structure highlight display 1506 designated by the user.
  • This screen also has a composite index ratio setting 1510 for setting the ratio at which the representative specification item certainty factor and the in-sentence search word proximity are combined. When the user operates the setting lever, the probabilities displayed on the output screen are recalculated and the probability displays are updated dynamically.
  • Using this output screen, the user sequentially designates the candidate extraction results for the specification items to be checked in the new technical document, and confirms where in the document structure each candidate is located, how probable it is, and how it is specifically described in the text. This allows the user to quickly identify, within a technical document such as a requirement specification, the locations where information needed for engineering work is described. As a result, the time required to read and understand the technical document, and for the engineering work as a whole, can be shortened.
  • DESCRIPTION OF SYMBOLS 100 ... Text search device 101 ... Data storage part 102 ... Model production

Abstract

The text search device according to the present invention stores, in advance, specification item tree structures, each for managing specification items in a tree structure, and also stores, in advance, topic probabilities and a representative specification item for each item of text in other selected documents. The text search device then performs: a process for acquiring at least one specification item, which serves as a search key, and a technical document; a process for calculating topic probabilities for each of the individual texts constituting the technical document, on the basis of the word distribution in the text; a process for determining, from a specification item tree structure, the representative specification item associated with the search specification item; a process for calculating a representative specification item certainty level for each text on the basis of the topic probabilities for the text; a process for calculating an intra-text word proximity level for each text on the basis of a word or words constituting the search specification item; a process for combining the certainty level and the proximity level to calculate a combined measure; and a process for extracting one or more texts relating to the search specification item for a new technical document, on the basis of the combined measure, and displaying the extracted one or more texts as extraction results. In this way the text search device improves the accuracy of text search results.
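
The combining and extraction steps summarized above can be sketched as follows. This is a minimal illustration, not the method of the publication: the source only states that the certainty level and the proximity level are combined at a settable ratio, so the linear weighted sum below is an assumption, and the names `ScoredText`, `combined_measure`, and `extract_texts` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ScoredText:
    text_id: str
    certainty: float  # representative specification item certainty level, 0..1
    proximity: float  # intra-text word proximity level, 0..1


def combined_measure(certainty: float, proximity: float, ratio: float = 0.5) -> float:
    """Assumed form: ratio weights certainty, (1 - ratio) weights proximity."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return ratio * certainty + (1.0 - ratio) * proximity


def extract_texts(texts: List[ScoredText], ratio: float = 0.5,
                  top_n: int = 3) -> List[Tuple[str, float]]:
    """Return the top_n (text_id, combined measure) pairs, highest first."""
    scored = [(t.text_id, combined_measure(t.certainty, t.proximity, ratio))
              for t in texts]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Hypothetical scores for three texts of a new technical document.
    candidates = [
        ScoredText("text-1", certainty=0.9, proximity=0.2),
        ScoredText("text-2", certainty=0.4, proximity=0.8),
        ScoredText("text-3", certainty=0.1, proximity=0.1),
    ]
    for text_id, score in extract_texts(candidates, ratio=0.5, top_n=2):
        print(text_id, round(score, 2))
```

Changing `ratio` and re-running `extract_texts` mirrors, in a simplified way, what a user-adjustable ratio setting would do: the same certainty and proximity values are reweighted and the ranking is recomputed.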

Description

Text search device
 The present invention relates to a text search device that extracts specific text from an electronic document.
 As background art in this technical field, there is Japanese Patent No. 2885487 (Patent Document 1). This publication describes extracting word notations and semantic categories from the search sentence, calculating their similarity to the index of registered documents (an index of the word notations and semantic categories attached to headings and paragraphs), and searching and displaying results starting from the chapter with the largest sum of heading similarity and paragraph similarity and, within the same chapter, starting from the paragraph with the highest similarity.
Japanese Patent No. 2885487
 In the left column on page 3 of the above publication, the word notations “書式” (“format”) and “フォーマット” (also “format”, written as a loanword) are assigned the semantic category [FMT], the word notations “出力” (“output”) and “書き込み” (“write”) the semantic category [OUT], and the word notation “指定” (“designation”) the semantic category [SITE], and the similarity of each paragraph is calculated from the presence or absence of the corresponding semantic category. However, only the total value determines the order of search and display. Therefore, even among paragraphs retrieved with the same similarity, one may be the content the user wanted to find while another is content the user did not want to detect, and the wanted content may even be the one with the lower total similarity score. In other words, erroneous searches (false detections) are likely to occur.
 In addition, some documents omit a specific word. In that case the similarity is low, and as a result the passage is missed by the search.
 An object of the present invention is to improve search accuracy.
 This application includes a plurality of specific solutions; a representative one is as follows.
 A text search device that stores a specification item tree structure, which manages specification items in a tree structure, together with the topic probabilities and representative specification item of each sentence of other documents, and that executes: a process of acquiring a specification item serving as a search key and a technical document; a process of calculating the topic probabilities of each sentence from the word distribution contained in each sentence of the document constituent units of the technical document; a process of acquiring the representative specification item of the specification item from the specification item tree structure; a process of calculating a representative specification item certainty factor for each sentence from the topic probabilities; a process of calculating an in-sentence search word proximity based on the words constituting the search specification item; a process of calculating a composite index of the certainty factor and the proximity; and a process of extracting, based on the composite index, the sentences relating to each search specification item in the new technical document and displaying them as the extraction result.
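
The word distribution that feeds the topic probability calculation can be sketched as below. This is a minimal illustration under stated assumptions: the input is taken to be already split into (word, part of speech) pairs, since the morphological analyzer itself is outside this sketch, and the function name `word_distribution`, the tag strings, and the sample tokens are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical input format: sentence ID -> list of (word, part_of_speech) pairs.
# In practice these pairs would come from a morphological analyzer.
TaggedSentences = Dict[str, List[Tuple[str, str]]]


def word_distribution(tagged: TaggedSentences,
                      keep_pos: str = "noun") -> Dict[str, Counter]:
    """Count, per sentence ID, the occurrences of words with the selected part of speech."""
    return {
        sentence_id: Counter(word for word, pos in tokens if pos == keep_pos)
        for sentence_id, tokens in tagged.items()
    }


if __name__ == "__main__":
    # Assumed tagging of one sentence about device arrangement and wiring length.
    tagged = {
        "sentence-1": [
            ("device A", "noun"), ("and", "particle"), ("device B", "noun"),
            ("arrangement", "noun"), ("changed", "verb"),
            ("wiring", "noun"), ("length", "noun"), ("extend", "verb"),
        ],
    }
    print(word_distribution(tagged)["sentence-1"])
```

Keeping only one part of speech before counting reduces the vocabulary that a downstream topic model has to handle, which is the stated motivation for selecting nouns.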
 According to the present invention, search accuracy can be improved.
A diagram showing a configuration example of the text search device.
A diagram showing a data definition example of the specification item tree structure.
A diagram showing a table example of the technical document results.
A diagram showing a table example of the sentence-specific topic probability results.
A diagram showing a table example of the sentence-specific representative specification item results.
A diagram showing an example formula for the sentence-specific topic probability estimation model.
A diagram showing an example formula for the representative specification item certainty factor estimation model.
A diagram showing the overall processing flow.
A diagram showing the sentence-specific topic probability calculation processing flow for a new technical document.
A diagram showing an example of a word list.
A diagram showing the update processing flow of the sentence-specific topic probability estimation model.
A diagram showing an example of the totaled word distribution by sentence.
A diagram showing the representative specification item certainty factor calculation processing flow.
A diagram showing the update processing flow of the representative specification item certainty factor estimation model.
A diagram showing the output screen.
 Embodiments are described below with reference to the drawings. In all the drawings used to describe the embodiments, identical parts are in principle denoted by the same reference symbols, and repeated description of them is omitted.
 (1) System configuration
 FIG. 1 shows a configuration example of the text search device 100 that constitutes the system of the present embodiment.
 The text search device 100 includes a data storage unit 101, a model generation processing unit 102, a model storage unit 103, a text extraction processing unit 104, an interface unit 105, and a bus, among other elements.
 The interface unit 105 comprises input devices with which the user enters technical documents such as requirement specifications, and output devices that output extracted text and the like. Examples include a keyboard, a mouse, a display, and a printer. In this system, a graphical user interface (GUI) is configured on the screen of the interface unit 105 based on the processing of the text extraction processing unit 104, and various types of information are displayed.
 The data storage unit 101 is composed of known elements such as an HDD or MO, and holds the specification item tree structure 1011, the technical document results 1012, the sentence-specific topic probability results 1013, and the sentence-specific representative specification item results 1014. These data and programs may instead be acquired or referenced from outside via a communication network.
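
One possible in-memory shape for the four stores is sketched below. The layout is an assumption made for illustration, following the table examples of FIG. 3 to FIG. 5 as described in this document; the class name `DataStore` and all field names are hypothetical, and the publication does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class DataStore:
    # 1011: specification item -> parent specification item (None at the top node)
    spec_item_parent: Dict[str, Optional[str]] = field(default_factory=dict)
    # 1012: sentence ID -> (file name, chapter title, section title)
    technical_documents: Dict[str, Tuple[str, str, str]] = field(default_factory=dict)
    # 1013: sentence ID -> {topic ID: probability}
    topic_probability: Dict[str, Dict[str, float]] = field(default_factory=dict)
    # 1014: sentence ID -> {representative specification item ID: 1 or 0}
    representative_flags: Dict[str, Dict[str, int]] = field(default_factory=dict)


if __name__ == "__main__":
    store = DataStore()
    # Values taken from the table examples: sentence A handles topic 2 with
    # probability 0.2, and the flag for sentence 1 / representative item B is 1.
    store.topic_probability["sentence A"] = {"topic 2": 0.2}
    store.representative_flags["sentence 1"] = {"representative specification item B": 1}
    print(store.topic_probability["sentence A"]["topic 2"])
```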
The model generation processing unit 102 is composed of known elements such as a CPU, RAM, and ROM. It performs the processing that realizes the characteristic functions of the present embodiment, and has a topic-to-representative-specification-item linking engine 1021, a sentence-specific topic probability estimation model generation engine 1022, and a representative specification item certainty factor estimation model generation engine 1023. A representative specification item is a specification item that is defined in an upper node of the specification item to be searched in the specification item tree structure 1011 and that encompasses lower specification items. The following embodiment describes the case where the representative specification item is the specification item of the top-level node, but it may also be the specification item of an upper node in an intermediate layer.
The model storage unit 103 is composed of known elements such as an HDD or MO, and holds a sentence-specific topic probability estimation model 1031 and a representative specification item certainty factor estimation model 1032. These data and programs may instead be acquired or referenced from outside via a communication network.
The text extraction processing unit 104 is composed of known elements such as a CPU, RAM, and ROM. It performs the processing that realizes the characteristic functions of the present embodiment, and has a sentence-specific topic probability calculation engine 1041, a representative specification item certainty factor calculation engine 1042, an in-sentence search word proximity calculation engine 1043, and a composite index calculation engine 1044.
Although not shown, the device 100 has known elements such as an OS, middleware, and applications, and in particular has an existing processing function for displaying a GUI screen, for example in Web page format, on the interface unit 105 such as a display. The text extraction processing unit 104 uses this existing processing function to draw and display predetermined screens and to process data entered by the user on the screen.
(2) Specification Item Tree Structure
FIG. 2 is a data definition example of the specification item tree structure 1011. The specification item tree structure 1011 hierarchically defines the group of specification items that may be used to determine product specifications. Each node (vertex) holds a specification item, and parent-child relationships are defined as a tree by connecting the nodes with edges (connection lines). For example, "specification item 3" 101103 is defined as the upper specification of "specification item 4" 101104 and "specification item 5" 101105, and above it in turn are "specification item 2" 110102 and "specification item 1" 110101.
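As a minimal sketch of how such a tree might be held and queried, the following Python stores each item's parent in a dictionary. The item names, the `PARENT` map, and both function names are hypothetical and chosen only for illustration; the source does not state how the tree is stored. `parent_or_self` follows the rule given later for step S803 (use the parent node, or the item itself when there is none), while `top_level` corresponds to the embodiment's example in which the representative specification item is the top-level node.

```python
from typing import Dict, Optional

# Hypothetical parent map: specification item -> parent item (None for the root).
# The chain mirrors the example above: item1 > item2 > item3 > {item4, item5}.
PARENT: Dict[str, Optional[str]] = {
    "item1": None,
    "item2": "item1",
    "item3": "item2",
    "item4": "item3",
    "item5": "item3",
}


def parent_or_self(item: str, parent: Dict[str, Optional[str]]) -> str:
    """Return the parent node, or the item itself if it has no parent."""
    p = parent[item]
    return p if p is not None else item


def top_level(item: str, parent: Dict[str, Optional[str]]) -> str:
    """Climb the tree until a node without a parent is reached."""
    current = item
    while parent[current] is not None:
        current = parent[current]
    return current
```

Which of the two lookups applies depends on where in the hierarchy the representative specification items are defined.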
(3) Technical Document Record
FIG. 3 is a table example of the technical document record 1012. The rows of the table list sentence IDs, each uniquely identifying a sentence of a document constituent unit in the technical documents stored in the technical document record 1012. The columns store, from the right, the sentence to which the sentence ID is assigned and information uniquely identifying the document constituent unit to which that sentence belongs. In this embodiment, a file name is stored as information identifying the document, and a chapter title and a section title are stored as information identifying the document constituent unit. A document constituent unit is a unit that divides text hierarchically; besides the "chapter" and "section" of this embodiment, it may be a "part", "subsection", "paragraph", or the like.
(4) Sentence-Specific Topic Probability Record
FIG. 4 is a table example of the sentence-specific topic probability record 1013. The rows list sentence IDs. The columns list topic IDs, each uniquely identifying a topic contained in the sentence with that sentence ID. The cell where a sentence ID row and a topic ID column intersect stores the probability that the sentence is about that topic. In the example of FIG. 4, the value "0.2" in the cell where "sentence A" and "topic 2" intersect means that the probability that "sentence A" deals with "topic 2" was 0.2.
(5) Sentence-Specific Representative Specification Item Record
FIG. 5 is a table example of the sentence-specific representative specification item record 1014. This table shows which representative specification item the sentence of each document constituent unit relates to; the rows list sentence IDs and the columns list representative specification item IDs. The cell where a sentence ID row and a representative specification item ID column intersect stores flag data indicating whether the word group appears in the sentence ("1" if present, "0" if absent in this embodiment). In the example of FIG. 5, the value "1" in the cell where "sentence 1" and "representative specification item B" intersect means that "representative specification item B" appears in "sentence 1".
(6) Sentence-Specific Topic Probability Estimation Model
FIG. 6 is an example formula for the sentence-specific topic probability estimation model. This model takes the word distribution of the sentence of each document constituent unit as input and estimates the topic probabilities of that sentence. Available techniques include Latent Dirichlet Allocation (LDA); FIG. 6 is an example formula for the case where LDA is used.
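For orientation, the standard generative formulation of LDA is sketched below. It is the textbook form and is not necessarily identical to the formula of FIG. 6, which is not reproduced in this text; the symbols $\alpha$, $\beta$, $\theta_d$, $\phi_k$, $z_{d,n}$, and $w_{d,n}$ are the conventional ones and are assumptions here.

```latex
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic proportions of sentence } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d) && \text{topic assigned to the } n\text{-th word of } d \\
w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) && \text{the observed word}
\end{aligned}
```

Under this reading, the topic probability of sentence $d$ for topic $k$ corresponds to the component $\theta_{d,k}$ inferred from the sentence's word distribution.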
(7) Representative Specification Item Certainty Factor Estimation Model
FIG. 7 is an example formula for the representative specification item certainty factor estimation model. This model estimates, from the sentence-specific topic probabilities, which representative specification item each sentence describes. The formula of FIG. 7 is an example for the case where a Support Vector Machine (SVM) is used.
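As a point of reference, the usual SVM decision function is given below. It is the generic form and may differ from the formula of FIG. 7, which is not reproduced in this text. Here $\mathbf{x}$ is assumed to be the topic probability vector of a sentence, $\mathbf{x}_i$ and $y_i \in \{+1, -1\}$ are assumed to be training vectors and labels derived from the records 1013 and 1014, $K$ is a kernel function, and $\alpha_i$ and $b$ are learned parameters.

```latex
f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b
```

How the decision value $f(\mathbf{x})$ is converted into a certainty factor is not stated in this text.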
(8) Overall Processing Flow
FIG. 8 shows the overall processing flow.
In step S801, a requirement specification, which is a technical document, and the specification items to be created from that requirement specification are acquired as input information via the interface unit 105. Hereinafter, the requirement specification input here is called the "new technical document" and the specification items are called the "search specification items".
In step S802, each sentence forming a document constituent unit (e.g., a chapter or section) of the new technical document is extracted, each sentence is divided into words, and the number of occurrences of each word (hereinafter, the "word distribution") is obtained. FIG. 12 shows an example of the tabulated sentence-specific word distribution. The sentence-specific word distribution obtained here and the sentence-specific topic probability estimation model 1031 are input to the sentence-specific topic probability calculation engine 1041 to calculate the topic probability of each sentence. Details are described later in (9) with reference to FIG. 9.
In step S803, the representative specification item defined in the parent node of the search specification item is acquired from the specification item tree structure 1011. If the search specification item has no parent node, the search specification item itself is used as the representative specification item. Hereinafter, this representative specification item is called the "search representative specification item". The specification item tree structure is generated by the linking engine 1021.
In step S804, the search representative specification item acquired in step S803, the topic probability of each sentence calculated in step S802, and the representative specification item certainty factor estimation model 1032 are input to the representative specification item certainty factor calculation engine 1042. The engine calculates the representative specification item certainty factor (hereinafter, the "certainty factor"), which is the probability that each sentence of a document constituent unit of the new technical document describes a topic related to the search representative specification item. Details of this step are described later in (11) and (12) with reference to FIGS. 13 and 14.
In step S805, the words constituting the search specification item are input to the in-sentence search word proximity calculation engine 1043, which calculates the in-sentence search word proximity (hereinafter, the "proximity"): the probability that each sentence of a document constituent unit of the new technical document contains a description corresponding to the search specification item. The input here may be limited to search specification items whose certainty factor is at or above a certain level.
In step S806, the certainty factor calculated in step S804 and the proximity calculated in step S805 are input to the composite index calculation engine 1044, which calculates a composite index. A high composite index means that a sentence of the new technical document has a high probability of describing a topic related to the representative specification item and a high probability of containing a description related to the search specification item.
In step S807, based on the composite index of step S806, the sentences relating to each search specification item in the new technical document are extracted and displayed as the extraction result. In the display, the preceding and following sentences are shown, and the sentence relating to the search specification item is highlighted.
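The text describes the proximity of step S805 only as a probability and gives no formula for it. Purely as an illustration, the sketch below scores a tokenized sentence by the shortest window of tokens that contains every search word: the score is the number of distinct search words divided by the window length, and 0 when any search word is missing. This scoring rule and the function name `proximity` are assumptions, not the method of the engine 1043.

```python
from typing import Dict, Optional, Sequence


def proximity(tokens: Sequence[str], query: Sequence[str]) -> float:
    """Hypothetical proximity score in [0, 1] based on the shortest covering window."""
    needed = set(query)
    if not needed:
        return 0.0
    counts: Dict[str, int] = {}
    covered = 0          # number of distinct query words inside the current window
    best: Optional[int] = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
            if counts[tok] == 1:
                covered += 1
        # Shrink the window from the left while it still covers all query words.
        while covered == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            dropped = tokens[left]
            if dropped in needed:
                counts[dropped] -= 1
                if counts[dropped] == 0:
                    covered -= 1
            left += 1
    return 0.0 if best is None else len(needed) / best
```

With this rule, search words that appear adjacent to each other score 1.0, and the score falls as unrelated words separate them.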
(9) Sentence-Specific Topic Probability Calculation Processing Flow for a New Technical Document
FIG. 9 shows the processing flow for calculating the sentence-specific topic probabilities of a new technical document.
In step S8021, each sentence is extracted from the new technical document acquired in step S801, in units of its document constituent units, and morphological analysis is performed to create a word list in which each sentence is divided into words. FIG. 10 shows an example of a word list. In the example of FIG. 10, from the new technical document 1001 reading 「機器Aと機器Bの配置が変更になりましたので、配線の長さを延長してください。」 ("The arrangement of device A and device B has been changed, so please extend the wiring length."), a word list 1002 is created by dividing it into 「機器A」「と」「機器B」「の」「配置」「が」「変更」「に」「なりました」「ので」「、」「配線」「の」「長さ」「を」「延長」「して」「ください」「。」.
In step S8022, the part of speech of each word in the word list 1002 created in step S8021 is determined.
In step S8023, only the words of chosen parts of speech are selected from those determined in step S8022. In this embodiment, only nouns are extracted. As words expressing the meaning of a specification item of a product or service, nouns are more closely related than modifiers such as adjectives and adverbs. Selecting only the nouns therefore allows processing in a short time with almost no loss of extraction accuracy.
In step S8024, for the words of the part of speech selected in step S8023, the word appearance frequency within each sentence (the word distribution) is tabulated for each sentence ID.
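Steps S8021 to S8024 can be sketched as follows. The sketch assumes that morphological analysis and part-of-speech determination have already been done and hard-codes their result for the example sentence of FIG. 10; the tag names ("noun", "particle", and so on), the `TAGGED` structure, and the function name are hypothetical, and a real system would obtain the tags from a morphological analyzer.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Hypothetical output of S8021-S8022: (word, part of speech) pairs per sentence ID.
# The tags are illustrative and not taken from any particular analyzer.
TAGGED: Dict[str, List[Tuple[str, str]]] = {
    "sentence1": [
        ("機器A", "noun"), ("と", "particle"), ("機器B", "noun"), ("の", "particle"),
        ("配置", "noun"), ("が", "particle"), ("変更", "noun"), ("に", "particle"),
        ("なりました", "verb"), ("ので", "particle"), ("、", "symbol"),
        ("配線", "noun"), ("の", "particle"), ("長さ", "noun"), ("を", "particle"),
        ("延長", "noun"), ("して", "verb"), ("ください", "verb"), ("。", "symbol"),
    ],
}


def word_distribution(
    tagged: Dict[str, List[Tuple[str, str]]],
    keep: Sequence[str] = ("noun",),
) -> Dict[str, Counter]:
    """S8023-S8024: keep only the selected parts of speech and count words per sentence ID."""
    return {
        sentence_id: Counter(word for word, pos in tokens if pos in keep)
        for sentence_id, tokens in tagged.items()
    }


distribution = word_distribution(TAGGED)
```

The resulting per-sentence counts play the role of the sentence-specific word distribution that is fed to the topic probability calculation.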
In step S8025, it is checked whether the sentence-specific topic probability estimation model 1031 is up to date with respect to the sentences stored in the technical document record 1012. If the model is not to be updated, it is acquired as is in step S8026. If the model is to be updated, it is updated and then acquired in step S8027. Details of the update processing are described later with reference to FIG. 11.
In step S8028, the sentence-specific topic probability calculation engine 1041 calculates the topic probability of each sentence using the sentence-specific topic probability estimation model acquired in step S8026 or S8027.
(10) Sentence-Specific Topic Probability Estimation Model Update Processing Flow
FIG. 11 shows the update processing flow of the sentence-specific topic probability estimation model.
In step S80271, technical documents that have not yet undergone this processing are acquired from the technical document record 1012. Technical documents that have already been processed may also be included, and the sentence-specific topic probability estimation model 1031 re-created.
In step S80272, as in step S8021, morphological analysis is performed on the technical documents acquired in step S80271 to generate word lists.
In step S80273, as in step S8022, the part of speech of each word in the word lists created in step S80272 is determined.
In step S80274, as in step S8023, only the words of chosen parts of speech are selected from those determined in step S80273.
In step S80275, as in step S8024, the appearance frequencies of the words of the part of speech selected in step S80274 are tabulated.
In step S80276, the data processed in steps S80271 to S80275 and the sentence-specific topic probability estimation model 1031 are input to the sentence-specific topic probability estimation model generation engine 1022 to update the sentence-specific topic probability estimation model 1031.
(11) Representative Specification Item Certainty Factor Calculation Processing Flow
FIG. 13 shows the processing flow for calculating the representative specification item certainty factor.
In step S8041, it is checked whether the representative specification item certainty factor estimation model 1032 is up to date with respect to the sentence-specific representative specification item record 1014.
If the representative specification item certainty factor estimation model 1032 is not to be updated, it is acquired as is in step S8042.
If the representative specification item certainty factor estimation model is to be updated, it is updated and then acquired in step S8043. Details of the update processing are described later with reference to FIG. 14.
In step S8044, the topic probability of each sentence calculated in step S802 and the representative specification item certainty factor estimation model 1032 are input to the representative specification item certainty factor calculation engine 1042 to calculate the representative specification item certainty factor of each sentence.
(12) Representative Specification Item Certainty Factor Estimation Model Update Processing Flow
FIG. 14 shows the update processing flow of the representative specification item certainty factor estimation model.
In step S80431, the sentence-specific topic probability records newly registered since the previous processing are acquired from the sentence-specific topic probability record 1013.
In step S80432, the sentence-specific representative specification item records newly registered since the previous processing are acquired from the sentence-specific representative specification item record 1014.
In step S80433, the newly added data of the sentence-specific topic probability record acquired in step S80431 and of the sentence-specific representative specification item record acquired in step S80432, together with the representative specification item certainty factor estimation model 1032, are input to the representative specification item certainty factor estimation model generation engine 1023 to update the representative specification item certainty factor estimation model.
(13) Output Screen
FIG. 15 shows the output screen. An example of the GUI screen displayed on the interface unit 105 is described with reference to FIG. 15.
The GUI screen mainly consists of a document structure display area 1501 for the new technical document, a specification hierarchy display area 1502 for the product or service related to the new technical document, and a body text display area 1503 that displays the sentences of the document constituent units of the new technical document.
The document structure display area 1501 hierarchically displays chapter titles, section titles, and the like, based on the chapter and section attribute information stored in the technical document record 1012. The in-document-structure highlight 1504 indicates where the sentence corresponding to the user-specified specification item is located in the document structure. The in-document-structure probability display 1505 shows the probability that a topic related to the representative specification item defined in the parent node of the user-specified specification item is described (the representative specification item certainty factor). In the example of FIG. 15 the probability is shown as a five-level bar, but other forms of probability display may be used.
The specification hierarchy display area 1502 hierarchically displays the specification items related to the target product or service, based on the specification item tree structure 1011. The extraction results produced by the processing of the present invention are displayed beneath the corresponding specification item, in descending order of the probability that the extraction result is correct (the in-sentence search word proximity). The in-specification-hierarchy highlight 1506 indicates a candidate extraction result for the user-specified specification item. The in-specification-hierarchy highlight 1507 indicates the representative specification item defined in the parent node of the user-specified specification item. The in-specification-hierarchy probability display 1508 shows the probability that the extraction result for the specification item is correct (the in-sentence search word proximity). In the example of FIG. 15 the probability is shown as a five-level bar, but other forms of probability display may be used.
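The text does not say how a probability is mapped onto the five bar levels. One simple possibility, given here only as an assumption, is to round the probability up to the next fifth. The function names and the characters used to draw the bar are hypothetical.

```python
import math


def bar_level(probability: float, levels: int = 5) -> int:
    """Map a probability in [0, 1] to an integer level in 0..levels (assumed rule: round up)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return math.ceil(probability * levels)


def render_bar(probability: float, levels: int = 5, filled: str = "#", empty: str = "-") -> str:
    """Render the level as a fixed-width text bar."""
    n = bar_level(probability, levels)
    return filled * n + empty * (levels - n)
```

A probability of 0.5 would then be drawn as three filled segments out of five.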
The body text display area 1503 displays the sentences of the new technical document as the search result. The in-body highlight area 1509 shows the sentence corresponding to the in-specification-hierarchy highlight 1506 specified by the user.
The screen also has a composite index ratio setting 1510 for setting the ratio in which the representative specification item certainty factor and the in-sentence search word proximity are combined. When the user operates the setting lever, the probabilities displayed on the output screen are recalculated and the probability displays are updated dynamically.
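The text states that the certainty factor and the proximity are combined in a user-set ratio but does not give the combining formula. A weighted sum is one natural reading and is used in the sketch below as an assumption; the function names, the `ratio` parameter, and the example scores are hypothetical.

```python
from typing import Dict, List, Tuple


def composite_index(certainty: float, proximity: float, ratio: float = 0.5) -> float:
    """Assumed combination: ratio * certainty + (1 - ratio) * proximity."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    return ratio * certainty + (1.0 - ratio) * proximity


def rank_sentences(scores: Dict[str, Tuple[float, float]], ratio: float = 0.5) -> List[str]:
    """Order sentence IDs by composite index, highest first.

    scores maps a sentence ID to its (certainty factor, proximity) pair.
    """
    return sorted(
        scores,
        key=lambda sentence_id: composite_index(*scores[sentence_id], ratio),
        reverse=True,
    )


# Illustrative values only.
example_scores = {"A": (0.9, 0.1), "B": (0.4, 0.8)}
```

Moving the ratio toward 1 favours sentences whose topic matches the representative specification item, and moving it toward 0 favours sentences that contain the search words close together, which is why the ranking can change as the lever is operated.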
Based on this output screen, the user successively selects candidate extraction results for the specification items to be checked in the new technical document, and can see at a glance where each candidate is located in the document structure, how plausible it is, and how it is actually worded in the body text. This makes it possible to quickly locate, in a technical document such as a requirement specification, the information needed for engineering work. As a result, the time required to read and understand technical documents, and hence the duration of the engineering work as a whole, can be shortened.
The invention made by the present inventors has been described above specifically based on the embodiment. Needless to say, the present invention is not limited to that embodiment and can be modified in various ways without departing from its gist.
DESCRIPTION OF SYMBOLS
100…text search device
101…data storage unit
102…model generation processing unit
103…model storage unit
104…text extraction processing unit
105…interface unit
1011…specification item tree structure
1012…technical document record
1013…sentence-specific topic probability record
1014…sentence-specific representative specification item record
1021…topic-to-representative-specification-item linking engine
1022…sentence-specific topic probability estimation model generation engine
1023…representative specification item certainty factor estimation model generation engine
1031…sentence-specific topic probability estimation model
1032…representative specification item certainty factor estimation model
1041…sentence-specific topic probability calculation engine
1042…representative specification item certainty factor calculation engine
1043…in-sentence search word proximity calculation engine
1044…composite index calculation engine
200…output screen

Claims (5)

  1.  A text search device that searches a technical document to be searched for sentences related to specification items concerning a product or service, the device comprising:
     a storage unit and an arithmetic processing unit,
     wherein the storage unit holds:
     a specification item tree structure that manages specification items, including representative specification items, in a tree structure;
     a sentence-specific topic probability record that stores the topic probability of each of other sentences; and
     a sentence-specific representative specification item record that stores the representative specification item of each sentence,
     and wherein the arithmetic processing unit executes:
     a process of acquiring a specification item serving as a search key and the technical document;
     a process of calculating the topic probability of each sentence from the word distribution contained in each sentence of the document constituent units of the technical document;
     a process of acquiring the representative specification item of the specification item from the specification item tree structure;
     a process of calculating, from the topic probability, a representative specification item certainty factor, which is the probability that each sentence describes a topic related to the search representative specification item;
     a process of calculating, based on the words constituting the search specification item, an in-sentence search word proximity, which is the probability that each sentence contains a description corresponding to the search specification item;
     a process of calculating, from the certainty factor and the proximity, a composite index indicating that a sentence has a high probability of describing a topic related to the representative specification item and a high probability of containing text related to the search specification item; and
     a process of extracting, based on the composite index, the sentences relating to each search specification item in the new technical document and displaying them as an extraction result.
  2.  The text search device according to claim 1,
     comprising an engine that calculates the representative specification item certainty factor,
     wherein a model used by the engine is updated.
  3.  The text search device according to claim 1,
     comprising an engine that calculates the topic probability,
     wherein a model used by the engine is updated.
  4.  The text search device according to claim 1,
     wherein the proximity is calculated only for those with a high certainty factor.
  5.  The text search apparatus according to claim 1,
     wherein, if a node has a parent node in the representative tree structure, the parent node is used as the representative specification item, and
     if the node has no parent node in the representative tree structure, the node itself is used as the representative specification item.
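
The parent-or-self rule of this claim can be sketched in a few lines of Python. The tree is represented here as a child-to-parent mapping, which is an assumption about the data structure; the item names and the function name `representative_item` are hypothetical.

```python
from typing import Dict, Optional


def representative_item(item: str, parent_of: Dict[str, Optional[str]]) -> str:
    """Return the parent node if one exists, otherwise the node itself."""
    parent = parent_of.get(item)
    return parent if parent is not None else item


# Hypothetical specification item tree: "Motor" is a root with two children.
parent_of = {
    "Motor": None,
    "Rated voltage": "Motor",
    "Rated current": "Motor",
}

child_rep = representative_item("Rated voltage", parent_of)
root_rep = representative_item("Motor", parent_of)
```

Note that the rule takes the immediate parent, not the root of the tree, so in a deeper tree a grandchild would map to its parent rather than to the top-level item.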
PCT/JP2015/060904 2015-04-08 2015-04-08 Text search device WO2016162961A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2015/060904 WO2016162961A1 (en) 2015-04-08 2015-04-08 Text search device

Publications (1)

Publication Number Publication Date
WO2016162961A1 true WO2016162961A1 (en) 2016-10-13

Family

ID=57072278

Country Status (1)

Country Link
WO (1) WO2016162961A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0628403A (en) * 1992-07-09 1994-02-04 Mitsubishi Electric Corp Document retrieving device
JPH0744579A (en) * 1993-05-25 1995-02-14 Ricoh Co Ltd Logical structure sentence retrieval system
JPH07319918A (en) * 1994-05-24 1995-12-08 Fuji Xerox Co Ltd Device for specifying retrieving object in document
JPH1145254A (en) * 1997-07-25 1999-02-16 Just Syst Corp Document retrieval device and computer readable recording medium recorded with program for functioning computer as the device
JP2013003663A (en) * 2011-06-13 2013-01-07 Sony Corp Information processing apparatus, information processing method, and program

Legal Events

Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15888454; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 15888454; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)