WO2013175524A1 - Structured document management system, structured document management method and program - Google Patents

Info

Publication number
WO2013175524A1
Authority
WO
WIPO (PCT)
Prior art keywords
index
document
word
index word
appearance
Prior art date
Application number
PCT/JP2012/003349
Other languages
French (fr)
Japanese (ja)
Inventor
坪井 創吾
佐々木 淳哉
陽二 加藤
裕子 高森
Original Assignee
株式会社 東芝
東芝ソリューション株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社 東芝, 東芝ソリューション株式会社
Priority to JP2014516505A (patent JP5971571B2)
Priority to PCT/JP2012/003349 (publication WO2013175524A1)
Publication of WO2013175524A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing

Definitions

  • Embodiments described herein relate generally to an index creation support technique in structural document management.
  • Content management systems (CMS)
  • business documents such as regulations and business manuals
  • CMS for personal content such as blogs
  • public CMS where multiple people collaboratively edit content for the same purpose, such as Wikipedia.
  • Information sharing systems using CMS are also widespread.
  • a document to be registered is often a document having a structure such as XML or HTML (hereinafter referred to as “structure document”).
  • An index page extracts words and items from documents, arranges them in a fixed order so that they can be found easily, and summarizes where the documents in which they appear are located and how to view them. Keyword search is another way to look for documents, but it cannot be used unless the user thinks of a keyword in the first place. An index page also offers the pleasure of encountering unknown knowledge through the terms listed before and after an entry.
  • the problem to be solved by the present invention is to provide a structural document management technique that enables an index page to be created and maintained simply by having a user select some index words.
  • the structural document management system includes an input unit for inputting an index word.
  • a structural document in which an index word appears is retrieved from a storage device storing the structural document.
  • An appearance condition for identifying at least a structural part where an index word appears in the retrieved structure document is determined.
  • Each structural document is grouped based on the similarity of appearance conditions.
  • a correspondence relationship between each grouped structural document and each index word is stored as index information.
  • other index words are obtained by specifying some index words. Specifically, the system searches for other words that have the structural characteristic of the appearance position (for example, an XPath expressing the appearance position of most of the index words) that is common to the specified index words.
  • the documents in which each index word appears are grouped according to the structural feature of the appearance position, and the group with the most specific feature is taken as the group of documents corresponding to the index word. For example, when the appearance position of an index word is expressed by XPath, the feature whose XPath matches the smallest number of nodes is regarded as the most specific, because it expresses a narrower range.
  • FIG. 1 is a configuration diagram of a structural document management system 100 according to the embodiment.
  • the structural document management system 100 is configured using a computer, and provides a user with an index list editing support function.
  • Each unit in the structural document management system 100, from the index word input unit 101 through the group name editing unit 112, represents a block that functions when the computer executes a program.
  • the index word input unit 101, the index list presentation unit 107, the index word confirmation unit 108, the index word recommendation unit 109, and the group name editing unit 112 provide an interface to the user via a terminal.
  • the structural document storage unit 103 and the index list storage unit 106 can be realized using a storage device.
  • the user inputs a certain number of words to be registered as index words from the index word input unit 101 via the terminal. For example, if the set of structural documents is the user's company regulations or business manuals, words such as “supervised location”, “company regulations”, “deposit”, “salary”, “vacation”, “device take-out procedure”, and “settlement” are entered.
  • When an index word is input, the word-based structural document search unit 102 accesses the storage device of the structural document storage unit 103 and searches for and identifies the structural documents in which that word appears.
  • the appearance condition determination unit 104 checks the appearance condition in the specified structural document, for example, the appearance position on the structure where the input index word appears.
  • the appearance position on the structure can be expressed by XPath which is a language syntax for designating a specific part of the XML document.
  • appearance conditions may also include identical or similar word vectors within a certain number of characters or a certain number of nodes from the appearance position, the type of the document, a combination of the schema of the structural document and the appearance position, and so on.
  • the number of steps moved up or down the document structure is referred to as the “number of nodes”.
  • the first chapter first section has 1 node
  • the first chapter second section has 2 nodes
  • the second chapter first section has 4 nodes.
  • the document type is, for example, a type such as a rule or a business manual.
  • the schema of the structure document is an XML schema or a DTD.
  • the appearance condition grouping unit 105 groups structural documents having similar appearance conditions. For example, a structural document in which the word A appears in the first chapter, first section, first paragraph and a structural document in which the word B appears in the first chapter, first section, first paragraph have the same appearance position, so they are placed in the same group.
  • the appearance condition is made ambiguous.
  • the appearance condition “appears in the first chapter, first section, first paragraph” is also included in a similar range such as “appears somewhere in the first chapter, first section”.
  • the appearance positions are not limited to the same, but may include similar ranges.
  • the structural documents are grouped according to their similarity or concreteness from the structural features of the appearance position without distinguishing the index words included. The degree of similarity will be described later.
  • Such grouping is based on an entropy-oriented hypothesis: words that are explained to some extent according to a set “pattern”, such as term definitions and descriptions of items, appear at similar places in the document structure, whereas words that are only mentioned in passing tend to appear at scattered places.
  • Each structural document grouped by the appearance condition grouping unit 105 is associated with a word that is each index word, and index information representing this correspondence is sent to the index list storage unit 106 and stored therein.
  • group A consists of the structural documents D1, D2, and D3, in which the input words W1, W2, and W3 appear in the first chapter, first section, first paragraph, and group B has the words W1, W2, W3 in the first chapter.
  • the pairs “W1-group A: D1”, “W1-group B: D4”, “W2-group A: D2”, “W2-group B: D5”, “W3-group A: D3”, and “W3-group B: D6” are stored.
  • the index information stored in the index list storage unit 106 is presented to the user by the index list presentation unit 107.
  • the index list presenting unit 107 lists, for example, each structural document whose appearance condition is stricter for each word that is an index word.
  • the index word confirmation unit 108 determines the validity and feeds back to the user.
  • the appearance condition grouping unit 105 notifies that to the input index word W4. It is assumed that the criteria for determining whether or not to include a notification in any group is part of the system settings.
  • the index word recommendation unit 109 presents unregistered index words to the user. For example, when the appearance condition of group A is the first chapter, first section, first paragraph, the appearance-condition-based structural document search unit 110 extracts, from the registered structural documents, the character strings in the first paragraph of the first section of the first chapter that match this condition. The unregistered word determination unit 111 then determines, from those character strings, characteristic words that differ from the index words already registered in the index list storage unit 106.
  • a characteristic word can be determined by extracting nouns with a morphological analysis algorithm and then identifying the characteristic words in the character string using the measure called TF-IDF. Since this method is known, it will not be described in detail.
  • the determination may also compare various characteristics with the already registered index words, for example narrowing candidates down to those with a similar average character string length, or to those with a similar number of appearances across all structural documents. “A similar number of appearances across all structural documents” means, for example, that if each registered index word appears in 1% of all registered documents, the recommended words are likewise narrowed down to those that appear in about 1% of the registered documents.
  • the words determined in this way differ from the index words that have already been registered, but they share similar appearance conditions. They are therefore likely to be strong candidates and are recommended to the user as new index word candidates.
  • the group name editing unit 112 is for editing the contents stored in the index list storage unit 106.
  • the user can delete unnecessary word-document pairs and edit group names and appearance conditions.
  • FIG. 2 is a diagram illustrating an example of a structural document stored in the structural document storage unit 103 according to the embodiment.
  • XML is handled as the structural document stored in the structural document storage unit 103; HTML or SGML may be used instead.
  • Documents 201, 202, and 203 are XML documents written in the same XML schema, and are examples in which a part of a regulation document that defines company activities and rules is stored. See DocBook: http://docbook.org/ns/docbook for the XML schema.
  • Each document has an article element at the top. Below it are an info (bibliographic information) element for entering the bibliographic information of the article and a plurality of sect1 (section) elements representing the body text. Inside the info element are a title element and an author element, and inside each sect1 element are the title element of the section and multiple para elements.
  • FIG. 3 is a diagram illustrating an example of an index word appearance condition for the structure document 202 according to the embodiment.
  • the figure shows the results of the appearance condition determination unit 104 determining the appearance condition 301 and the appearance condition 302, respectively.
  • the appearance position is given as the appearance condition, and the appearance position is represented by XPath. Since the method for obtaining the XPath from the appearance position of the character string is known, it is omitted.
  • each notation part such as “article”, “sect1”, “orderedlist”, “listitem”, and “para”, read from the root node side, is referred to as an “element name”.
  • a notation part such as the [1] attached to the element “sect1” in the appearance conditions 301 and 302, the [1] attached to the element “orderedlist” in the appearance condition 301, or the [4] attached to the element “orderedlist” in the appearance condition 302 is referred to as an “index”.
  • the index of the appearance condition 301 is [1] and the index of the appearance condition 302 is [4]; the intermediate indexes [2] and [3] are not shown.
  • in addition to the appearance position, the appearance condition includes the characters before and after the index word, a peripheral character string (a heading character string of a parent node), and a document schema.
  • the characters before and after the index word are the opening and closing quotation marks that immediately precede and follow the index word, here “company regulations” or “management section” (refer to the underlined portion of the document 202).
  • the heading character string indicates “Article 1” and “Article 4”.
  • the document schema is the DocBook schema in this example. In XML, the schema is represented by the xmlns attribute of the top element. In other words, “http://docbook.org/ns/docbook” is the schema name of this document.
  • FIG. 4 is a flowchart of processing of the appearance condition grouping unit 105 according to the embodiment.
  • the input is a list including a triplet of an index word, an appearance condition, and a document (step S401).
  • the purpose of the processing of the appearance condition grouping unit 105 is to divide the input list into a plurality of groups based on the criteria that the appearance conditions are similar.
  • the appearance condition is obscured to a certain level (step S402).
  • the method of obscuration differs depending on the contents of the appearance condition. For the XPath that represents the appearance position in the appearance condition, the appearance position can be made ambiguous by removing the designation of indexes and element names.
  • There are various ways of removal; for example, (1) removing indexes stepwise from the root node side and (2) removing elements stepwise from the root node side (step S403).
  • the appearance conditions on peripheral information, such as the preceding and following characters, the peripheral character string, and the schema, can be made ambiguous by removing the designation itself (step S404). The most effective obscuration algorithm is expected to vary with the schema of the structural document, but a simple method like this one can be implemented. Note that the processes in steps S403 and S404 may be performed in parallel.
  • the number of times of the obscuring process is stored as the number of times of obscuration (step S405).
  • the number of times of obscuration serves as a score that can be regarded as the concreteness of the appearance condition. When the appearance conditions of a plurality of index words are compared, the number of times of obscuration can also be regarded as a similarity indicating how alike the index words are.
  • entries with the same appearance condition are grouped, starting from those with the lowest obscuration count. That is, over the whole list, the search for a combination that has the same or a lower obscuration count and can group all index words is repeated (step S406). In other words, the group can include not only identical appearance conditions but also a similar range.
  • one item belongs to only one group, that is, first-come-first-served basis, and an element having the same index word and document pair as an element included in a certain group is removed.
  • FIG. 5 is a diagram illustrating an example of obscuring appearance conditions according to the embodiment.
  • the case described below is one in which the appearance condition 501, with the surrounding opening and closing quotation marks added as peripheral information, is made ambiguous.
  • the appearance condition 502 is the initial state of the appearance condition 501 of the index word, and the number of times of obscuration at this point is zero.
  • the appearance condition 503 is obtained by removing the index [1] from sect1 [1], which is a part of XPath, with respect to the appearance condition 502 (see the underlined portion of “sect1”). At this time, the number of times of obscuration increases by 1 to “1”. As a result of removing this index, it means that even if the index word “main part” appears in the sect1 element having any index, it is treated as the same thing.
  • the index is first removed stepwise, the peripheral information is removed immediately after all the indexes are removed, and then the element designation is removed.
  • the appearance condition 504 is obtained by removing the index [4] from “orderedlist[4]” in the appearance condition 503 (see the underlined portion of “orderedlist”), and the number of obscurations increases by 1 to “2”.
  • the appearance condition 505 is obtained by removing the index [2] from “listitem[2]” in the appearance condition 504 (see the underlined portion of “listitem”), and the number of obscurations increases by 1 to “3”.
  • the appearance condition 506 is obtained by removing the index [1] from “para[1]” in the appearance condition 505 (see the underlined portion of “para”), and the number of obscurations increases by 1 to “4”.
  • the appearance condition 507 is obtained by removing the surrounding quotation marks given as peripheral information from the appearance condition 506 (see the underlined portion of “peripheral information”), and the number of obscurations increases by 1 to “5”.
  • the appearance condition 508 is obtained by removing the “article” element designation from the appearance condition 507 and adding the surrounding quotation marks back as peripheral information (see the underlined part of “//sect1”).
  • the number of obscurations remains “5”, because it increases by 1 and decreases by 1.
  • the appearance condition 509 is obtained by removing the peripheral information from the appearance condition 508 (see the underlined portion of “peripheral information”), and the number of obscurations increases by 1 to “6”.
  • the appearance condition 510 is obtained by removing the element designation “sect1” from the appearance condition 509 and adding the surrounding quotation marks back as peripheral information (see the underlined part of “//orderedlist”).
  • the number of obscurations remains “6”, because it increases by 1 and decreases by 1. The subsequent obscuration steps are not shown.
  • FIG. 6 is a diagram illustrating an example of grouping based on the number of times of obscuring appearance conditions according to the embodiment.
  • index word-document pairs having appearance conditions developed as shown in FIG. 5 are compared to search for the same group.
  • the appearance condition 501 of the document 202 in which the index word “main part” 500 appears and the appearance condition 511 of the document 203 in which the index word “deposit” 600 appears are each made ambiguous step by step until they match for the first time. That is, the appearance condition 505 and the appearance condition 515 match. When there are only two index words, “main part” 500 and “deposit” 600, the document 202 and the document 203 become the index destination documents of the respective index words.
  • FIG. 7 is a diagram illustrating an example of the contents stored in the index list storage unit 106 according to the embodiment.
  • the index list storage unit 106 stores the index information output from the appearance condition grouping unit 105.
  • the index information stored in the index list storage unit 106 includes an index word 701, an obfuscation count 702, an appearance condition 703, and a document name 705.
  • the group name 704 can be displayed in place of each appearance condition on the index list presentation screen by giving a name to the grouped appearance condition group.
  • the group name 704 can be given by the user using the group name editing unit 112.
  • the index list storage unit 106 stores a group named “definition” (see data rows 505 and 515) and a group named “reference document” (see data rows 711 and 712).
  • the “definition” group is the group with the least number of obscurations, and the “reference document” group is composed of other items.
  • FIG. 8 is a diagram illustrating an example of an index presentation screen by the index list presenting unit 107 according to the embodiment.
  • the index list presenting unit 107 determines the reading of the index word, and displays it sorted by the Japanese syllabary.
  • [A] ... [ka] ... [sa] ... [shi] ... [yo] etc. are index word reading headings 801. There are various methods for acquiring kanji readings, which are well known and will be omitted.
  • index words “main part” 500 and “deposit” 600 are displayed.
  • the names of documents belonging to the group are displayed indented. For example, the documents with the smallest obscuration count, 3, are displayed first (see “Company Rules Management Rules” 202 and “Personal Information Cooperation Company Handling and Deposit Management Rules” 203), and documents with higher obscuration counts are then displayed with deeper indentation (see “Regulation Editing Manual”, “Regulation Change Request Guidelines”, “(Other 4)” 711, and “External Ordering Regulations” 712).
  • a transition is made to the display screen for that document.
  • FIG. 9 is a diagram illustrating an example of a presentation screen by the index word confirmation unit 108 according to the embodiment.
  • the user inputs a new index word “employee information” in an index word addition form 902 “Add an index word:”
  • An “add” button 903 is pressed.
  • for the appearance condition in the structural documents in which the index word “employee information” appears, the index word confirmation unit 108 refers, via the appearance condition grouping unit 105, to the appearance conditions already stored in the index list storage unit 106.
  • the index word confirmation unit 108 indicates that the word may not be appropriate as an index word (see the message in the screen area 904: “The specified ‘Employee information’ tends to be different from other index words. Are you sure you want to register?”) and prompts the user to choose the next operation for confirmation (see “add” button 905, “cancel” button 906, and “confirm registered document” button 907).
  • FIG. 10 is a diagram illustrating an example of a presentation screen by the index word recommendation unit 109 according to the embodiment.
  • using the appearance condition of the group with the lowest obscuration count, the appearance-condition-based structural document search unit 110 searches all registered documents stored in the structural document storage unit 103. Based on the result, the index word recommendation unit 109 displays index words that are not yet registered.
  • by following the “document reference” link 1003 as necessary, the contents of the document at the appearance position of the index word can be confirmed.
  • by pressing the “add to index word” button 904, the index word is added.
  • index destination document is rechecked, and words that are not yet registered as index words can be presented to the user as index word candidates.
  • the index list is created and maintained at low cost, so the viewing efficiency of document viewers increases and the maintenance cost for document editors decreases. The work efficiency of both improves, and they can concentrate on higher-value work such as understanding and editing document contents.
  • By simply specifying index words, the most appropriate documents can be determined from among the documents that include those words, and an index list in which the index words and documents are paired can be generated automatically.
  • an index list is easily created, information collection efficiency is improved for document viewers, and document maintenance costs are reduced for document editors, so work efficiency is generally improved.
  • Reference signs: 100 structural document management system; 101 index word input unit; 102 word-based structural document search unit; 103 structural document storage unit; 104 appearance condition determination unit; 105 appearance condition grouping unit; 106 index list storage unit; 107 index list presentation unit; 108 index word confirmation unit; 109 index word recommendation unit; 110 appearance-condition-based structural document search unit; 111 unregistered word determination unit; 112 group name editing unit
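
The stepwise obscuring of an appearance condition (FIG. 5) and the first-match grouping (FIG. 6) described in the excerpts above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the function names `render`, `obscure`, and `first_common_condition`, the tuple representation of a condition, the placeholder string "quoted" for the peripheral information, and the path used for “deposit” are all hypothetical. The path for “main part” follows the indexes quoted in the excerpts (sect1[1], orderedlist[4], listitem[2], para[1]).

```python
def render(steps, peripheral, absolute):
    """Format one appearance condition as (XPath-like string, peripheral info)."""
    prefix = "/" if absolute else "//"
    path = prefix + "/".join(
        name if index is None else f"{name}[{index}]" for name, index in steps
    )
    return (path, peripheral)


def obscure(steps, peripheral):
    """Return [(obscuration count, condition), ...], most specific first."""
    steps = list(steps)
    count = 0
    levels = [(count, render(steps, peripheral, True))]
    # Phase 1: remove positional indexes one at a time, root side first.
    for pos, (name, index) in enumerate(steps):
        if index is not None:
            steps[pos] = (name, None)
            count += 1
            levels.append((count, render(steps, peripheral, True)))
    # Phase 2: remove the peripheral information ("" marks it as removed).
    count += 1
    levels.append((count, render(steps, "", True)))
    # Phase 3: drop leading elements. Each drop restores the peripheral
    # information (+1 and -1, so the count is unchanged), then removes it again.
    while len(steps) > 1:
        steps = steps[1:]
        levels.append((count, render(steps, peripheral, False)))
        count += 1
        levels.append((count, render(steps, "", False)))
    return levels


def first_common_condition(all_levels):
    """Find the lowest count limit at which every entry shares a condition."""
    highest = max(levels[-1][0] for levels in all_levels)
    for limit in range(highest + 1):
        reachable = [
            {cond for count, cond in levels if count <= limit}
            for levels in all_levels
        ]
        common = set.intersection(*reachable)
        if common:
            return limit, common
    return None


# "main part": indexes taken from the excerpts above.
MAIN_PART = [("article", None), ("sect1", 1), ("orderedlist", 4),
             ("listitem", 2), ("para", 1)]
# "deposit": hypothetical path, chosen only for illustration.
DEPOSIT = [("article", None), ("sect1", 2), ("orderedlist", 1),
           ("listitem", 3), ("para", 1)]

main_levels = obscure(MAIN_PART, "quoted")
deposit_levels = obscure(DEPOSIT, "quoted")
print(first_common_condition([main_levels, deposit_levels]))
```

With these sample inputs the two conditions first coincide once the sect1, orderedlist, and listitem indexes have been removed, that is, at an obscuration count of 3. Real grouping over many index words would also need the first-come-first-served rule from the excerpts, which this sketch leaves out.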

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The objective of the present invention is to provide structured document management technology that allows creation and maintenance of an index page by merely selecting some index terms by a user. A structured document management system of the embodiment has input means for inputting an index term. A structured document in which an index term occurs is retrieved from a storage device that stores the structured document. An occurrence condition is assessed which at least identifies a structural portion in which the index term appears in the retrieved structured document. Each structured document is grouped on the basis of the degree of similarity of the occurrence conditions. A correspondence relationship between each of the grouped structured documents and each of the index terms is stored as index information.
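
As a rough illustration of the index information mentioned in the abstract, the correspondence between index terms, groups, and documents can be held as a nested mapping. The sketch below reuses the W1 to W3 and D1 to D6 labels from the example pairs in the definitions; the class name `IndexEntry` and the function `build_index` are hypothetical and are not taken from the publication.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One stored correspondence: index word, group, document (names assumed)."""
    index_word: str
    group: str
    document: str


# The six example pairs: each word maps to one document in group A and one in B.
ENTRIES = [
    IndexEntry("W1", "A", "D1"),
    IndexEntry("W1", "B", "D4"),
    IndexEntry("W2", "A", "D2"),
    IndexEntry("W2", "B", "D5"),
    IndexEntry("W3", "A", "D3"),
    IndexEntry("W3", "B", "D6"),
]


def build_index(entries):
    """Arrange entries as {index word: {group: [documents]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for entry in entries:
        index[entry.index_word][entry.group].append(entry.document)
    # Convert to plain dicts so lookups of unknown keys raise KeyError.
    return {word: dict(groups) for word, groups in index.items()}


index = build_index(ENTRIES)
print(index["W1"])
```

A presentation layer could then walk this mapping word by word and list the documents of the most specific group first.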

Description

Structural document management system, structural document management method, and program
Embodiments described herein relate generally to an index creation support technique in structural document management.
Content management systems (hereinafter referred to as “CMS”) have become remarkably widespread. There are CMSs for business documents such as regulations and business manuals, CMSs for personal content such as blogs, and public CMSs, Wikipedia being a prime example, in which multiple people collaboratively edit content for the same purpose; on the Internet in particular, CMSs are everywhere. Information sharing systems using CMSs are also widespread.
In CMS document management, the documents to be registered are in many cases documents having a structure, such as XML or HTML (hereinafter referred to as “structural documents”).
When searching a large number of documents for a target document, an index page makes the search efficient. An index page extracts words and items from documents, arranges them in a fixed order so that they can be found easily, and summarizes where the documents in which they appear are located and how to view them. Keyword search is another way to look for documents, but it cannot be used unless the user thinks of a keyword in the first place. An index page also offers the pleasure of encountering unknown knowledge through the terms listed before and after an entry.
On the other hand, it is not easy to create and maintain an index page.
(1) It is difficult to judge whether a word should be included in the index. For example, judging validity based on appearance frequency is not effective: in a document that serves as a definition, the index word actually appears rather infrequently.
(2) It is difficult to decide which of the documents containing an index word is most suitable to list in the index, and how to handle the documents that are not listed.
(3) Maintaining an index once it has been created is also laborious. Whenever any document is updated, the index page must be updated as well. Whether in business or in private use, the incentive to update a document differs from the incentive to update the index page. In business, moreover, the documents may be managed by different departments, so maintenance costs are generally high.
Conventionally, a technique is known that substitutes for an index page by listing search expressions that use the hierarchical structure and attributes of structural documents, together with their search results. However, enumerating search expressions is difficult and requires specialized knowledge. Problems (1) to (3) above need to be solved.
JP 2006-185408 A
The problem to be solved by the present invention is to provide a structural document management technique that enables an index page to be created and maintained simply by having a user select some index words.
The structural document management system according to the embodiment includes an input unit for inputting an index word. Structural documents in which the index word appears are retrieved from a storage device that stores structural documents. An appearance condition that identifies at least the structural part in which the index word appears in each retrieved structural document is determined. The structural documents are grouped based on the similarity of their appearance conditions. The correspondence between each grouped structural document and each index word is stored as index information.
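
To make the retrieval and appearance-condition steps above more concrete, here is a minimal sketch of finding where an index word appears in an XML document and expressing that position as an indexed XPath. It uses only Python's standard `xml.etree.ElementTree`; the function names `local_name` and `appearance_paths` and the sample document are assumptions made for illustration, and the publication itself does not specify how the XPath is derived.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def appearance_paths(xml_text, word):
    """Return an indexed XPath for every element whose own text contains word."""
    root = ET.fromstring(xml_text)
    found = []

    def walk(elem, path):
        # Text directly inside elem: its leading text plus the tails of children.
        own_text = (elem.text or "") + "".join(child.tail or "" for child in elem)
        if word in own_text:
            found.append(path)
        seen = {}  # 1-based position of each child among same-named siblings
        for child in elem:
            name = local_name(child.tag)
            seen[name] = seen.get(name, 0) + 1
            walk(child, f"{path}/{name}[{seen[name]}]")

    walk(root, "/" + local_name(root.tag))
    return found


# Hypothetical DocBook-like sample; not one of the documents in the figures.
SAMPLE = """<article xmlns="http://docbook.org/ns/docbook">
  <info><title>Sample rules</title></info>
  <sect1>
    <title>Article 1</title>
    <para>A "deposit" means goods received from a partner company.</para>
  </sect1>
  <sect1>
    <title>Article 2</title>
    <para>Handling of each deposit is recorded.</para>
  </sect1>
</article>"""

print(appearance_paths(SAMPLE, "deposit"))
```

The sketch matches on plain substring containment; a production system would more likely rely on a full-text index and on tokenization suited to the document language.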
FIG. 1 is a configuration diagram of the structured document management system 100 of the embodiment.
FIG. 2 is a diagram showing an example of structured documents stored in the structured document storage unit 103 of the embodiment.
FIG. 3 is a diagram showing an example of index word appearance conditions for the structured document 202 of the embodiment.
FIG. 4 is a flowchart of the processing of the appearance condition grouping unit 105 of the embodiment.
FIG. 5 is a diagram showing an example of obfuscation of an appearance condition in the embodiment.
FIG. 6 is a diagram showing an example of grouping by the obfuscation count of appearance conditions in the embodiment.
FIG. 7 is a diagram showing an example of the contents stored in the index list storage unit 106 of the embodiment.
FIG. 8 is a diagram showing an example of an index presentation screen produced by the index list presentation unit 107 of the embodiment.
FIG. 9 is a diagram showing an example of a screen presented by the index word confirmation unit 108 of the embodiment.
FIG. 10 is a diagram showing an example of a screen presented by the index word recommendation unit 109 of the embodiment.
Embodiments for carrying out the invention are described below. The solution adopted in the present embodiment is outlined in (1) to (3) below.
(1) Specifying a few index words yields other index words. Specifically, the system looks for other words whose appearance positions have the structural feature that is common to the specified set of index words (for example, an XPath that expresses the appearance positions of most of those index words).
(2) The documents in which each index word appears are grouped by the structural feature of the appearance position, and the group with the most specific feature is taken as the group of documents corresponding to the index word. For example, if the appearance position of an index word is expressed in XPath, the feature whose XPath matches the fewest nodes is regarded as the most specific, because it expresses a narrower range.
(3) When the user specifies a new index word whose appearance position differs structurally from those of the other index words, the system issues a warning that the word may not be suitable as an index word.
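The specificity criterion in (2) can be illustrated with a minimal Python sketch. It assumes the standard library's xml.etree.ElementTree, whose findall accepts a limited XPath subset with paths written relative to the root element. The sample document and the helper names matched_node_count and most_specific are hypothetical and are not part of the embodiment.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the element names follow the DocBook-style
# examples in this description, the content is invented for illustration.
DOC = """
<article>
  <sect1>
    <para>first</para>
    <para>second</para>
  </sect1>
  <sect1>
    <para>third</para>
  </sect1>
</article>
"""

def matched_node_count(root, path):
    """Number of nodes selected by an ElementTree-style path relative to root."""
    return len(root.findall(path))

def most_specific(root, paths):
    """Among the paths that match at least one node, return the one that
    matches the fewest nodes (the narrowest, hence most specific, range)."""
    counts = {p: matched_node_count(root, p) for p in paths}
    matching = {p: n for p, n in counts.items() if n > 0}
    return min(matching, key=matching.get)

root = ET.fromstring(DOC)
candidates = ["sect1/para", "sect1[1]/para", "sect1[1]/para[1]"]
for p in candidates:
    print(p, matched_node_count(root, p))
print("most specific:", most_specific(root, candidates))
```

In this sketch the path that still carries every index selects a single node, so it is treated as the most specific feature.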
FIG. 1 is a configuration diagram of the structured document management system 100 of the embodiment.
The structured document management system 100 is implemented on a computer and provides the user with a function that supports editing of an index list. The units from the index word input unit 101 through the group name editing unit 112 represent blocks that function when the computer executes a program. The index word input unit 101, the index list presentation unit 107, the index word confirmation unit 108, the index word recommendation unit 109, and the group name editing unit 112 provide an interface to the user via a terminal. The structured document storage unit 103 and the index list storage unit 106 can be realized with a storage device.
Via the terminal, the user inputs a certain number of words to be registered as index words through the index word input unit 101. For example, if the set of structured documents consists of the regulation documents and business manuals of the user's company, words such as "responsible department", "company regulations", "deposit", "salary", "leave", "equipment take-out procedure", and "expense settlement" are conceivable.
When index words are input, the word-based structured document search unit 102 accesses the storage device of the structured document storage unit 103 and searches for and identifies the structured documents in which each index word appears.
Next, the appearance condition determination unit 104 examines the appearance condition within each identified structured document, for example the structural position at which the input index word appears. When the structured document is XML, for example, the structural appearance position can be expressed in XPath, a language syntax for designating a specific part of an XML document.
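As an illustration of how a structural appearance position might be obtained, the following Python sketch walks an XML tree and returns an indexed XPath for every element whose own text contains a given word. It is a minimal sketch under stated assumptions, not the method of the embodiment: the sample document and the function name word_xpaths are hypothetical, only the text directly inside an element is inspected (text following a child element is ignored), and the root element carries no index.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document in the style of the regulation documents.
DOC = """
<article>
  <info><title>Sample regulation</title></info>
  <sect1>
    <title>Article 1</title>
    <orderedlist>
      <listitem><para>"company regulations" means ...</para></listitem>
      <listitem><para>other text</para></listitem>
    </orderedlist>
  </sect1>
</article>
"""

def word_xpaths(root, word):
    """Return indexed XPaths of the elements whose own text contains `word`."""
    found = []

    def walk(elem, path):
        if word in (elem.text or ""):
            found.append(path)
        seen = {}  # tag -> running 1-based position among same-tag siblings
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, f"/{root.tag}")
    return found

root = ET.fromstring(DOC)
print(word_xpaths(root, "company regulations"))
```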
Other possible appearance conditions include whether the word vectors within a certain number of characters or a certain number of nodes from the appearance position are the same or similar, the type of the document, and the combination of the structured document's schema and the appearance position. In the present embodiment, the number of moves up or down the document structure is called the "number of nodes". For example, Chapter 1, Section 1 has a node count of 1, Chapter 1, Section 2 has a node count of 2, and Chapter 2, Section 1 has a node count of 4. The type of document means, for example, whether the document is a regulation or a business manual. The schema of a structured document is, in the case of XML, an XML Schema or a DTD.
The appearance condition grouping unit 105 groups structured documents whose appearance conditions are close to each other. For example, a structured document in which word A appears in Chapter 1, Section 1, Paragraph 1 and a structured document in which word B appears in Chapter 1, Section 1, Paragraph 1 have the same appearance position, so they are placed in the same group.
When such strict grouping is not possible, the appearance condition is made more ambiguous (obfuscated). For example, the appearance condition "appears in Chapter 1, Section 1, Paragraph 1" is broadened to cover a similar range such as "appears somewhere in Chapter 1, Section 1". In other words, the appearance positions need not be identical; similar ranges may also be included. Regardless of which index word a document contains, the structured documents are grouped according to the similarity or specificity of the structural features of the appearance position. The similarity is described later.
This grouping rests on a hypothesis that focuses on entropy: content that is explained according to a certain "pattern", such as definitional text for terms and items, appears in structurally similar places in documents, whereas words that are merely mentioned in passing in the body text tend to appear in scattered places.
One way of obfuscating an appearance position is to progressively remove the structural restrictions, starting with those closest to the position at which the word appears.
Each structured document grouped by the appearance condition grouping unit 105 is associated with its index word, and index information representing this correspondence is sent to and stored in the index list storage unit 106. For example, suppose that group A consists of structured documents D1, D2, and D3, in which the input words W1, W2, and W3 appear in Chapter 1, Section 1, Paragraph 1, and that group B consists of structured documents D4, D5, and D6, in which the words W1, W2, and W3 appear somewhere in Chapter 1. The pairs "W1 - Group A: D1", "W1 - Group B: D4", "W2 - Group A: D2", "W2 - Group B: D5", "W3 - Group A: D3", and "W3 - Group B: D6" are then stored.
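The stored correspondence can be pictured as a small lookup structure. The Python sketch below holds the six pairs from the example above; the nested-dictionary layout and the function name build_index are assumptions made for illustration and say nothing about how the index list storage unit 106 is actually organized.

```python
from collections import defaultdict

# (index word, group, document) triples taken from the example in the text.
pairs = [
    ("W1", "A", "D1"), ("W1", "B", "D4"),
    ("W2", "A", "D2"), ("W2", "B", "D5"),
    ("W3", "A", "D3"), ("W3", "B", "D6"),
]

def build_index(pairs):
    """Map each index word to {group: [documents]} (hypothetical layout)."""
    index = defaultdict(lambda: defaultdict(list))
    for word, group, doc in pairs:
        index[word][group].append(doc)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {word: dict(groups) for word, groups in index.items()}

index = build_index(pairs)
print(index["W1"])
```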
The index information stored in the index list storage unit 106 is presented to the user by the index list presentation unit 107. For each index word, the index list presentation unit 107 lists the structured documents, for example with those that have stricter appearance conditions listed first.
When the user adds a new index word, the index word confirmation unit 108 judges its validity and gives feedback to the user. If the appearance condition grouping unit 105 finds that no structured document belonging to group A above exists for an input index word W4, the user is notified to that effect. The criterion for deciding which group's absence triggers a notification is assumed to be part of the system settings.
The index word recommendation unit 109 presents unregistered index words to the user. For example, when the appearance condition of group A is Chapter 1, Section 1, Paragraph 1, the appearance-condition-based structured document search unit 110 extracts, from the registered structured documents, the character strings in Chapter 1, Section 1, Paragraph 1 that satisfy that appearance condition. The unregistered word determination unit 111 then identifies, among those character strings, characteristic words that are not yet registered as index words in the index list storage unit 106.
Characteristic words can be identified, for example, by applying a morphological analysis algorithm to extract nouns and then using the TF-IDF measure to determine which words are characteristic of the character string. This technique is well known and is not described in detail.
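For orientation, a TF-IDF score can be sketched in a few lines of Python. The sketch assumes that morphological analysis has already reduced each character string to a list of nouns, uses invented English tokens as stand-ins, and applies one common smoothed IDF variant; the embodiment does not specify which TF-IDF formula is used, and the function name tf_idf is hypothetical.

```python
import math
from collections import Counter

def tf_idf(target_tokens, corpus):
    """Score each token of target_tokens against corpus (a list of token lists).

    TF is the relative frequency within the target; IDF uses the smoothed form
    log((1 + N) / (1 + df)) + 1, which is an assumption of this sketch.
    """
    tf = Counter(target_tokens)
    n_docs = len(corpus)
    scores = {}
    for token, count in tf.items():
        df = sum(1 for doc in corpus if token in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[token] = (count / len(target_tokens)) * idf
    return scores

# Invented token lists standing in for extracted nouns.
corpus = [
    ["deliverable", "definition", "company"],
    ["company", "rule"],
    ["company", "procedure"],
]
scores = tf_idf(corpus[0], corpus)
for token, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{token}: {score:.3f}")
```

A token that occurs in every string, such as "company" here, receives a lower score than a token confined to one string, which is the property used to pick characteristic words.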
A further check that compares various properties with those of the already registered index words may be added. For example, the candidates may be narrowed to words of similar average string length, or to words whose number of appearances across all structured documents is similar. "A similar number of appearances across all structured documents" means, for example, that if each already registered index word appears in 1% of all registered documents, the words recommended as candidates are likewise narrowed to those that appear in about 1% of the registered documents.
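The document-frequency comparison might look as follows in Python. This is only a sketch: the relative tolerance of 50%, the function names, and the toy documents are assumptions, and the text does not say how close "about 1%" has to be.

```python
def document_ratio(word, docs):
    """Fraction of documents (given as sets of words) that contain `word`."""
    return sum(1 for d in docs if word in d) / len(docs)

def filter_by_document_ratio(candidates, registered, docs, tolerance=0.5):
    """Keep candidates whose document ratio is within `tolerance` (relative)
    of the mean ratio of the registered index words. The tolerance value is a
    hypothetical parameter, not something given in the description."""
    target = sum(document_ratio(w, docs) for w in registered) / len(registered)
    return [c for c in candidates
            if abs(document_ratio(c, docs) - target) <= tolerance * target]

# Ten toy documents: "the" is everywhere, the other words are rare.
docs = [{"the"} for _ in range(10)]
docs[0].update({"deposit", "deliverable"})
docs[1].update({"deposit", "deliverable"})

kept = filter_by_document_ratio(["deliverable", "the"], ["deposit"], docs)
print(kept)
```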
A word identified in this way is not one of the already registered index words, but it has a similar appearance condition and therefore shares a common tendency with them. It is thus regarded as likely to be a word that should be registered as an index word, and it is recommended to the user as a new index word candidate.
The group name editing unit 112 is used to edit the contents stored in the index list storage unit 106. The user can delete unnecessary word-document pairs and edit group names and appearance conditions.
FIG. 2 is a diagram showing an example of structured documents stored in the structured document storage unit 103 of the embodiment.
In the present embodiment, XML is handled as the structured documents stored in the structured document storage unit 103; HTML or SGML may be used instead. Documents 201, 202, and 203 are XML documents written in the same XML schema, and each is an example that stores part of a regulation document defining the activities and rules of a company. For the XML schema, see DocBook: http://docbook.org/ns/docbook.
Each document has an article element at the top. Inside the article element are an info (bibliographic information) element, which holds the bibliographic information of the article, and a plurality of sect1 (section) elements, which represent the body. The info element contains a title element and an author element, and each sect1 element contains the title element of that section and a plurality of para (paragraph) elements. There are also orderedlist (numbered list) elements and listitem (single list item) elements.
FIG. 3 is a diagram showing an example of index word appearance conditions for the structured document 202 of the embodiment.
For the document 202 shown earlier as an example XML structured document, FIG. 3 shows the results obtained when the appearance condition determination unit 104 determines the appearance conditions 301 and 302 for the index words "company regulations" and "responsible department", respectively. In this example the appearance position is used as the appearance condition, and the appearance position is expressed in XPath. The method of obtaining an XPath from the position at which a character string appears is well known and is omitted here.
In the present embodiment, in an XPath that represents an appearance position, each notation such as "article", "sect1", "orderedlist", "listitem", and "para", listed from the root node side in appearance condition 301 or 302, is called an "element name".
Likewise, notations such as the [1] attached to the element "sect1" in appearance condition 301, the [1] attached to the element "sect1" in appearance condition 302, the [1] attached to the element "orderedlist" in appearance condition 301, and the [4] attached to the element "orderedlist" in appearance condition 302 are called "indexes". Regarding the relationship between appearance conditions 301 and 302 in FIG. 3, both belong to the same level for "sect1[1]". At the lower level "orderedlist", however, the index in appearance condition 301 is [1] and the index in appearance condition 302 is [4]; the intermediate indexes [2] and [3] are omitted from the figure.
In the example of FIG. 3 only the appearance position is used as the appearance condition, but other parameters can be combined as part of the appearance condition. Possible examples are peripheral information such as the characters or character strings before and after the index word, the heading character string of the parent node (hereinafter called the "peripheral character string"), and the schema of the document.
In this example, the characters before and after the index word are the characters "「" and "」" that precede and follow the index words "company regulations" and "responsible department" (see the underlined parts of document 202). The heading character strings of the parent nodes are "Article 1" and "Article 4" in this example. The schema of the document is the DocBook schema in this example. In XML, the schema is indicated by the xmlns attribute of the top-level element; that is, "http://docbook.org/ns/docbook" is the schema name of this document.
FIG. 4 is a flowchart of the processing of the appearance condition grouping unit 105 of the embodiment.
The input is a list of triples, each consisting of an index word, an appearance condition, and a document (step S401). The purpose of the processing of the appearance condition grouping unit 105 is to divide this input list into a plurality of groups on the basis of similar appearance conditions.
For each entry of the input list, the appearance condition is obfuscated up to a certain level (step S402). The obfuscation method depends on the content of the appearance condition. For the XPath that represents the appearance position, the position can be made more ambiguous by progressively removing index and element name designations. There are various ways of removing them; one method is (1) to remove the indexes step by step from the root node side and then (2) to remove the elements step by step from the root node side (step S403).
Appearance conditions based on peripheral information, such as the preceding and following characters, the peripheral character string, and the schema, can be obfuscated by dropping the designation itself (step S404). The most effective obfuscation algorithm is expected to depend on the schema of the structured documents, but the embodiment can be carried out even with this simple method. Steps S403 and S404 may be performed in either order, or in parallel.
The number of obfuscation steps applied is stored as the obfuscation count (step S405). The obfuscation count is a score and can be regarded as the specificity of the appearance condition. When the appearance conditions of a plurality of index words are compared with each other, the obfuscation count can also be regarded as a similarity that expresses how similar the index words are.
Next, entries whose appearance conditions match are grouped, starting with those that have the lowest obfuscation count. That is, over the entire list, the unit repeatedly searches for a combination, at an equal or lower obfuscation count, with which all index words can be grouped (step S406). In other words, the appearance conditions need not be identical; similar ranges may also be included.
However, one item belongs to only one group, on a first-come basis, and any element that has the same index word and document pair as an element already placed in a group is removed.
As a result of the above processing, the final output of the appearance condition grouping unit 105 is a list of quadruples, each consisting of an index word, an appearance condition, the maximum obfuscation count, and a list of documents (step S407).
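Steps S402 to S407 can be pictured with the simplified Python sketch below. It is an interpretation under several assumptions rather than the flowchart itself: only the XPath part of the appearance condition is obfuscated (indexes from the root side, then leading elements), a group is formed as soon as a condition is shared by two or more index words, and the function names, the second and third XPaths, and the document labels are hypothetical. The first XPath uses the indexes given for appearance condition 302.

```python
import re
from collections import defaultdict

def obfuscate(xpath, count):
    """Apply `count` obfuscation steps: drop indexes from the root side first,
    then drop leading elements (an XPath-only simplification)."""
    steps = xpath.strip("/").split("/")
    dropped = False
    for _ in range(count):
        indexed = [i for i, s in enumerate(steps) if s.endswith("]")]
        if indexed:
            i = indexed[0]
            steps[i] = re.sub(r"\[\d+\]$", "", steps[i])
        elif len(steps) > 1:
            steps = steps[1:]
            dropped = True
    return ("//" if dropped else "/") + "/".join(steps)

def group_entries(entries, max_count):
    """entries: list of (index_word, xpath, document) triples.
    Returns (groups, remaining); each group is (condition, count, members)."""
    remaining = list(entries)
    groups = []
    for count in range(max_count + 1):       # lowest obfuscation count first
        buckets = defaultdict(list)
        for entry in remaining:
            buckets[obfuscate(entry[1], count)].append(entry)
        for condition, members in buckets.items():
            # Assumption: a condition shared by 2+ index words forms a group.
            if len({word for word, _, _ in members}) >= 2:
                groups.append((condition, count,
                               [(word, doc) for word, _, doc in members]))
                # First come, first served: grouped entries are removed.
                remaining = [e for e in remaining if e not in members]
    return groups, remaining

entries = [
    ("responsible department",
     "/article/sect1[1]/orderedlist[4]/listitem[2]/para[1]", "document 202"),
    ("deposit",  # hypothetical position
     "/article/sect1[2]/orderedlist[1]/listitem[3]/para[1]", "document 203"),
    ("deposit",  # hypothetical position in a hypothetical document
     "/article/sect1[3]/para[2]", "document X"),
]
groups, remaining = group_entries(entries, max_count=3)
print(groups)
print(remaining)
```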
FIG. 5 is a diagram showing an example of obfuscation of an appearance condition in the embodiment.
For the index word 500 "responsible department", the initial state 501 of the appearance condition shown in FIG. 5 is the appearance condition 302 for the document 202 "Company Regulations Management Regulations" shown earlier in FIG. 3, with the preceding and following characters "「" and "」" added as peripheral information of the index word. The case in which this appearance condition 501 is progressively obfuscated is described below.
Appearance condition 502 is the initial state 501 of the appearance condition of the index word itself, and the obfuscation count at this point is 0.
Appearance condition 503 is obtained from appearance condition 502 by removing the index [1] from sect1[1], which is part of the XPath (see the underlined "sect1"). The obfuscation count increases by 1 to become 1. Removing this index means that the index word "responsible department" is treated as the same case no matter which index the sect1 element in which it appears has.
The example of FIG. 5 follows this sequence: the indexes are first removed step by step, the peripheral information is removed immediately after all indexes have been removed, and the element designations are then removed.
Specifically, removing the index [4] from "orderedlist[4]" in appearance condition 503 gives appearance condition 504 (see the underlined "orderedlist"), and the obfuscation count increases by 1 to become 2. Removing the index [2] from "listitem[2]" in appearance condition 504 gives appearance condition 505 (see the underlined "listitem"), and the obfuscation count increases by 1 to become 3. Removing the index [1] from "para[1]" in appearance condition 505 gives appearance condition 506 (see the underlined "para"), and the obfuscation count increases by 1 to become 4.
Since all indexes have now been removed, removing the peripheral information "「" and "」" from appearance condition 506 gives appearance condition 507 (see the underlined "peripheral information"), and the obfuscation count increases by 1 to become 5.
Next, removing the element designation "article" from appearance condition 507 and adding "「" and "」" back as peripheral information gives appearance condition 508 (see the underlined "//sect1"); the increase of 1 and the decrease of 1 cancel out, so the obfuscation count remains 5. Removing the peripheral information "「" and "」" from appearance condition 508 gives appearance condition 509 (see the underlined "peripheral information"), and the obfuscation count increases by 1 to become 6. Removing the element designation "sect1" from appearance condition 509 and adding "「" and "」" back as peripheral information gives appearance condition 510 (see the underlined "//orderedlist"); the increase of 1 and the decrease of 1 cancel out, so the obfuscation count remains 6. Further obfuscation steps are not illustrated.
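The sequence of FIG. 5 can be reproduced with a short Python generator. The starting XPath is assembled from the indexes named above for appearance condition 302; the function name obfuscation_ladder and the tuple layout are hypothetical, and continuing the same pattern beyond appearance condition 510 until a single element remains is an assumption, since the figure does not show those steps.

```python
import re

def obfuscation_ladder(xpath, peripheral):
    """Yield (count, xpath, peripheral_or_None) in the order used in FIG. 5."""
    steps = xpath.strip("/").split("/")
    count = 0
    yield count, "/" + "/".join(steps), peripheral
    # 1) Remove the indexes one at a time, starting from the root side.
    for i, step in enumerate(steps):
        if re.search(r"\[\d+\]$", step):
            steps[i] = re.sub(r"\[\d+\]$", "", step)
            count += 1
            yield count, "/" + "/".join(steps), peripheral
    # 2) Remove the peripheral information.
    count += 1
    yield count, "/" + "/".join(steps), None
    # 3) Remove elements from the root side. Each removal is listed first with
    #    the peripheral information restored (+1 and -1 cancel), then without.
    while len(steps) > 1:
        steps = steps[1:]
        yield count, "//" + "/".join(steps), peripheral
        count += 1
        yield count, "//" + "/".join(steps), None

ladder = list(obfuscation_ladder(
    "/article/sect1[1]/orderedlist[4]/listitem[2]/para[1]", ("「", "」")))
for count, path, around in ladder[:9]:   # corresponds to conditions 502 to 510
    print(count, path, around)
```

In this sketch the obfuscation count equals the number of removed indexes plus the number of removed elements, plus one when the peripheral information is absent.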
FIG. 6 is a diagram showing an example of grouping by the obfuscation count of appearance conditions in the embodiment.
It shows an example in which index word-document pairs whose appearance conditions have been expanded as in FIG. 5 are compared with each other to find pairs that belong to the same group.
When the appearance condition 501 of the document 202, in which the index word "responsible department" 500 appears, and the appearance condition 511 of the document 203, in which the index word "deposit" 600 appears, are each obfuscated step by step, they match for the first time at an obfuscation count of 3. That is, appearance condition 505 and appearance condition 515 match. If "responsible department" 500 and "deposit" 600 are the only two index words, the document 202 and the document 203 become the index destination documents of the respective index words.
FIG. 7 is a diagram showing an example of the contents stored in the index list storage unit 106 of the embodiment.
The index list storage unit 106 stores the index information output from the appearance condition grouping unit 105. The index information stored in the index list storage unit 106 consists of an index word 701, an obfuscation count 702, an appearance condition 703, and a document name 705. A group name 704 is a name given to a set of grouped appearance conditions so that it can be displayed in place of the individual appearance conditions on the index list presentation screen. The user can assign the group name 704 using the group name editing unit 112.
In FIG. 7, the index list storage unit 106 stores index information on a group named "Definition" (see data rows 505 and 515) and a group named "Reference documents" (see data rows 711 and 712). The "Definition" group is the group with the lowest obfuscation count, and the "Reference documents" group consists of everything else.
FIG. 8 is a diagram showing an example of an index presentation screen produced by the index list presentation unit 107 of the embodiment.
On the screen 800 titled "Index of registered documents", the index list presentation unit 107 determines the reading of each index word and displays the index words classified by the kana of the Japanese syllabary. The entries [a] ... [ka] ... [sa] ... [shi] ... [yo] and so on are the headings 801 for the readings of the index words. Various methods of obtaining the reading of kanji exist and are well known, so they are omitted here.
Two index words, "responsible department" 500 and "deposit" 600, are displayed. Under each index word, the names of the documents belonging to each group are displayed indented, in ascending order of the groups' obfuscation counts. For example, the documents with the smallest obfuscation count, 3, are displayed first (see "Company Regulations Management Regulations" 202 and "Regulations on Handling by Partner Companies and Deposit Management of Personal Information" 203), and the documents with higher obfuscation counts are then displayed one indentation level deeper (see "Regulations Editing Manual", "Regulation Change Request Guidelines", and "(4 others)" 711, and "External Ordering Regulations" 712). When the user selects a document name, the screen changes to the display screen for that document.
FIG. 9 is a diagram showing an example of a screen presented by the index word confirmation unit 108 of the embodiment.
On the screen 900 titled "Add index word", in the screen area 901, the user enters a new index word "employee information" in the index word addition form 902 labelled "Add an index word:" and presses the "Add" button 903. The index word confirmation unit 108 then checks, via the appearance condition grouping unit 105, the appearance conditions in the structured documents in which the index word "employee information" appears against the appearance conditions already stored in the index list storage unit 106.
If, as a result, the appearance condition of the index word "employee information" is judged not to be covered by the appearance conditions of the already registered index words, the index word confirmation unit 108 warns the user that the word may not be appropriate as an index word (see the message in the screen area 904: "The specified 'employee information' tends to differ from the other index words. Do you really want to register it?") and prompts the user for the next operation to confirm it (see the "Add" button 905, the "Cancel" button 906, and the "Check registered documents" button 907).
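The check behind this warning could be sketched as follows in Python. It assumes that a registered group condition is an XPath in which an element without an index accepts any index and a leading "//" means that only the trailing elements need to match; the function names and the position assumed for "employee information" are hypothetical, and peripheral information is left out.

```python
import re

def parse(xpath):
    """Split an XPath into (anchored, [(name, index_or_None), ...])."""
    anchored = not xpath.startswith("//")
    steps = []
    for s in xpath.strip("/").split("/"):
        m = re.fullmatch(r"([^\[\]]+)(?:\[(\d+)\])?", s)
        steps.append((m.group(1), int(m.group(2)) if m.group(2) else None))
    return anchored, steps

def satisfies(position, condition):
    """True if a concrete indexed position falls within a group condition."""
    _, pos = parse(position)
    anchored, cond = parse(condition)
    if len(cond) > len(pos) or (anchored and len(cond) != len(pos)):
        return False
    tail = pos[len(pos) - len(cond):]
    return all(name == c_name and (c_index is None or index == c_index)
               for (name, index), (c_name, c_index) in zip(tail, cond))

def check_new_index_word(positions, group_conditions):
    """Return a warning if none of the word's positions fits any group."""
    if any(satisfies(p, c) for p in positions for c in group_conditions):
        return None
    return "The specified word tends to differ from the other index words."

registered = ["/article/sect1/orderedlist/listitem/para[1]"]
# Hypothetical position of the new word "employee information".
print(check_new_index_word(["/article/sect1[1]/para[3]"], registered))
```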
FIG. 10 is a diagram illustrating an example of a screen presented by the index word recommendation unit 109 according to the embodiment.
In the example "Index word candidates" screen 1000, the structural document search unit 110, which searches by appearance condition, uses the appearance condition of the group with the smallest obfuscation count to search all registered documents stored in the structural document storage unit 103. Based on the result, the index word recommendation unit 109 displays words that have not yet been registered as index words.
The display means that the words shown in screen area 1001, such as "deliverable", "affiliated company", "export control promotion manager", "person in charge of operations", "person in charge of education", and "person in charge of examination", appear at the position /article/sect1/orderedlist/listitem/para[1] in one of the structural documents and are surrounded by the characters "「" and "」".
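The search by appearance condition described above can be sketched with Python's standard library. This is an illustration under stated assumptions, not the embodiment's method: the function name `recommend`, the choice of `xml.etree.ElementTree`, and the regular expression that treats any text between "「" and "」" as a candidate word are all choices made for the example. The path is written relative to the root `article` element because ElementTree evaluates paths from the element it is called on.

```python
import re
import xml.etree.ElementTree as ET

# Candidate words are assumed to be the text enclosed in 「 and 」.
QUOTED = re.compile(r"「([^「」]+)」")


def recommend(documents, registered,
              path="./sect1/orderedlist/listitem/para[1]"):
    """Return unregistered words found at `path` inside 「...」.

    documents:  iterable of XML strings whose root element is <article>.
    registered: set of index words that are already registered.
    """
    candidates = []
    for xml_text in documents:
        root = ET.fromstring(xml_text)
        # para[1] selects the first <para> child of each <listitem>.
        for node in root.findall(path):
            text = "".join(node.itertext())
            for word in QUOTED.findall(text):
                if word not in registered and word not in candidates:
                    candidates.append(word)
    return candidates


if __name__ == "__main__":
    doc = (
        "<article><sect1><orderedlist>"
        "<listitem><para>「成果物」とは、</para></listitem>"
        "<listitem><para>「預託」とは、</para></listitem>"
        "</orderedlist></sect1></article>"
    )
    print(recommend([doc], registered={"預託"}))
```

The returned list preserves the order in which candidates are found, which would let a screen such as the one in FIG. 10 list them with a check box next to each.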
If any of the candidates should become an index word, the user ticks the check box 1002 next to that candidate. If necessary, the user can press the "View document" link 1003 to check the content of the document at the position where the candidate appears. Pressing the "Add to index words" button 904 then adds the selected candidates as index words.
Furthermore, when the document set is updated, the documents that the index words point to can be rechecked, and words not yet registered as index words can be presented to the user as index word candidates.
As described above, according to the present embodiment, the index list is created and maintained at low cost, which raises the browsing efficiency of document readers and lowers the maintenance cost for document editors. The work efficiency of both improves, and both can concentrate on higher-value work such as understanding and editing document content.
First, simply by specifying index words, the most appropriate documents can be determined from among the documents containing those words, and an index list pairing index words with documents can be generated automatically. As a result, the index list is easy to create, document readers gather information more efficiently, and document editors spend less on document maintenance, so work efficiency improves overall.
Second, checking index words at registration makes it harder for inappropriate words to be registered in the index. This lowers the document maintenance cost for document editors and improves work efficiency.
Third, a mechanism is provided by which entering only some index words reveals the remaining ones. With this mechanism, document editors can greatly reduce the cost of maintaining index words. Readers can also use a more complete index list, which improves the work efficiency of the entire organization.
Although several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and likewise in the invention described in the claims and its equivalents.
Structural document management system ... 100
Index word input unit ... 101
Structural document search unit (by word) ... 102
Structural document storage unit ... 103
Appearance condition determination unit ... 104
Appearance condition grouping unit ... 105
Index list storage unit ... 106
Index list presentation unit ... 107
Index word confirmation unit ... 108
Index word recommendation unit ... 109
Structural document search unit (by appearance condition) ... 110
Unregistered word determination unit ... 111
Group name editing unit ... 112

Claims (6)

  1. A structural document management system comprising:
    input means for inputting an index word;
    search means for searching a storage device that stores structural documents for structural documents in which the index word appears;
    determination means for determining an appearance condition that identifies at least the structural part in which the index word appears in each retrieved structural document;
    grouping means for grouping the structural documents based on the similarity of the appearance conditions; and
    index storage means for storing, as index information, the correspondence between each grouped structural document and each index word.
  2. The structural document management system according to claim 1, wherein the appearance condition also includes the presence or absence of a character string around the index word.
  3. The structural document management system according to claim 1 or claim 2, further comprising index word confirmation means for issuing a warning and prompting for confirmation when a user adds an index word and the appearance condition in the structural documents in which that word appears is not included in the appearance conditions of the index words already stored in the index storage means.
  4. The structural document management system according to any one of claims 1 to 3, further comprising:
    second search means for searching the storage device for structural documents having a part that matches the appearance condition; and
    index word recommendation means for extracting, from the part of each retrieved structural document that matches the appearance condition, a word different from the index words already stored in the index storage means, and presenting the extracted word to the user as a candidate for a new index word.
  5. A structural document management method comprising:
    an input step of inputting a specified index word;
    a search step of searching a storage device for structural documents that contain the specified index word;
    a determination step of determining an appearance condition that identifies the structural part in which the index word appears in each retrieved structural document;
    a grouping step of grouping the structural documents based on the similarity of the appearance conditions; and
    an index storage step of storing, as index information, the correspondence between each grouped structural document and each index word.
  6. A program for causing a computer constituting the structural document management system according to any one of claims 1 to 4 to function as each of said means.
PCT/JP2012/003349 2012-05-22 2012-05-22 Structured document management system, structured document management method and program WO2013175524A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2014516505A JP5971571B2 (en) 2012-05-22 2012-05-22 Structural document management system, structural document management method, and program
PCT/JP2012/003349 WO2013175524A1 (en) 2012-05-22 2012-05-22 Structured document management system, structured document management method and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2012/003349 WO2013175524A1 (en) 2012-05-22 2012-05-22 Structured document management system, structured document management method and program

Publications (1)

Publication Number Publication Date
WO2013175524A1 true WO2013175524A1 (en) 2013-11-28

Family

ID=49623263

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2012/003349 WO2013175524A1 (en) 2012-05-22 2012-05-22 Structured document management system, structured document management method and program

Country Status (2)

Country Link
JP (1) JP5971571B2 (en)
WO (1) WO2013175524A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006185408A (en) * 2004-11-30 2006-07-13 Matsushita Electric Ind Co Ltd Database construction device, database retrieval device, and database device
JP2007226453A (en) * 2006-02-22 2007-09-06 Toshiba Corp Structured document processor, structured document processing method and structured document processing program
JP2008242605A (en) * 2007-03-26 2008-10-09 Toshiba Corp Apparatus, method and program for managing structured document

Also Published As

Publication number Publication date
JPWO2013175524A1 (en) 2016-01-12
JP5971571B2 (en) 2016-08-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12877196

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2014516505

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12877196

Country of ref document: EP

Kind code of ref document: A1