WO2013175524A1

WO2013175524A1 - Structured document management system, structured document management method and program

Info

Publication number: WO2013175524A1
Application number: PCT/JP2012/003349
Authority: WO
Inventors: 坪井　創吾; 佐々木　淳哉; 陽二加藤; 裕子高森
Original assignee: 株式会社東芝; 東芝ソリューション株式会社
Priority date: 2012-05-22
Filing date: 2012-05-22
Publication date: 2013-11-28
Also published as: JPWO2013175524A1; JP5971571B2

Abstract

The objective of the present invention is to provide structured document management technology that allows an index page to be created and maintained merely by having a user select some index terms. A structured document management system of the embodiment has input means for inputting an index term. A structured document in which an index term occurs is retrieved from a storage device that stores structured documents. An occurrence condition is determined which at least identifies the structural portion in which the index term appears in the retrieved structured document. The structured documents are grouped on the basis of the degree of similarity of the occurrence conditions. A correspondence relationship between each of the grouped structured documents and each of the index terms is stored as index information.

Description

Structural document management system, structural document management method, and program

Embodiments described herein relate generally to an index creation support technique in structural document management.

Content management systems (hereinafter referred to as “CMS”) are becoming increasingly popular. Examples include CMS for business documents such as regulations and business manuals, CMS for personal content such as blogs, and public CMS in which multiple people collaborate to edit content for a common purpose, such as Wikipedia. Information sharing systems using CMS are also widespread.

In the CMS document management technology, a document to be registered is often a document having a structure such as XML or HTML (hereinafter referred to as “structure document”).

When searching for a target document among a large number of documents, an index page is efficient. An index page is a list of words and items extracted from documents and arranged in a certain order so that they can be found easily. Keyword search is an alternative, but it cannot be used unless the user can think of a keyword. In addition, an index page offers the pleasure of encountering unfamiliar knowledge through the terms listed before and after the one being looked up.

On the other hand, it is not easy to create and maintain an index page.

(1) It is difficult to judge whether a word should be included in the index. For example, judging validity by appearance frequency is not effective: in a document that defines an index term, the frequency of that term is often rather low.

(2) It is difficult to determine which document is most suitable as an index from among documents including index words and how to handle a document that is not included in the index.

(3) Maintaining an index once it has been created is also time-consuming. Whenever any document is updated, the index page must be updated accordingly. Whether in business or in private use, the incentive to update documents differs from the incentive to update index pages. In addition, in a business setting, each document may be managed by a different department, which generally raises maintenance costs.

Conventionally, a technique is known that substitutes for an index page by listing search expressions that use the hierarchical structure and attributes of structured documents, together with their search results. However, enumerating search expressions is difficult and requires specialized knowledge. A solution to problems (1) to (3) above is therefore required.

JP 2006-185408 A

The problem to be solved by the present invention is to provide a structure document management technique that enables creation and maintenance of an index page only by a user selecting a part of index words.

The structural document management system according to the embodiment includes an input unit for inputting an index word. A structural document in which an index word appears is retrieved from a storage device storing the structural document. An appearance condition for identifying at least a structural part where an index word appears in the retrieved structure document is determined. Each structural document is grouped based on the similarity of appearance conditions. A correspondence relationship between each grouped structural document and each index word is stored as index information.

FIG. 1 is a configuration diagram of the structural document management system 100 according to the embodiment.
FIG. 2 is a diagram illustrating an example of a structural document stored in the structural document storage unit 103 according to the embodiment.
FIG. 3 is a diagram illustrating an example of index word appearance conditions for the structural document 202 according to the embodiment.
FIG. 4 is a flowchart of processing of the appearance condition grouping unit 105 according to the embodiment.
FIG. 5 is a diagram illustrating an example of obscuring appearance conditions according to the embodiment.
FIG. 6 is a diagram illustrating an example of grouping based on the number of times the appearance conditions are obscured according to the embodiment.
FIG. 7 is a diagram illustrating an example of the contents stored in the index list storage unit 106 according to the embodiment.
FIG. 8 is a diagram illustrating an example of an index presentation screen by the index list presentation unit 107 according to the embodiment.
FIG. 9 is a diagram illustrating an example of a presentation screen by the index word confirmation unit 108 according to the embodiment.
FIG. 10 is a diagram illustrating an example of a presentation screen by the index word recommendation unit 109 according to the embodiment.

Hereinafter, embodiments for carrying out the invention will be described. The outline of the solution in the present embodiment is as follows (1) to (3).

(1) Other index words are obtained by specifying only some index words. Specifically, other words are searched for that share the structural characteristic of the appearance position (for example, the XPath expressing the appearance position of most index words) common to a certain number of the specified index words.

(2) The documents in which each index word appears are grouped according to the structural features of the appearance position, and the group of documents with the most specific feature is taken as the documents corresponding to that index word. For example, assuming that the appearance position of an index word is expressed by XPath, the feature whose XPath matches the smallest number of nodes is regarded as the most specific, because it expresses the narrowest range.

(3) When a new index word is specified by the user, if the structural characteristics of its appearance position differ from those of the other index words, a warning is given that the word may not be suitable as an index word.

FIG. 1 is a configuration diagram of a structural document management system 100 according to the embodiment.

The structural document management system 100 is configured using a computer, and provides a user with an index list editing support function. Each unit from the index word input unit 101 through the group name editing unit 112 in the structural document management system 100 represents a block that functions when the computer executes a program. The index word input unit 101, the index list presentation unit 107, the index word confirmation unit 108, the index word recommendation unit 109, and the group name editing unit 112 provide an interface to the user via a terminal. The structural document storage unit 103 and the index list storage unit 106 can be realized using a storage device.

The user inputs a certain number of words to be registered as index words from the index word input unit 101 via the terminal. For example, if the set of structural documents consists of the user's company regulations and business manuals, words such as “supervised location”, “company regulations”, “deposit”, “salary”, “vacation”, “device take-out procedure”, and “settlement” are input.

When an index word is input, the word-based structural document search unit 102 accesses the storage device of the structural document storage unit 103 and searches for and identifies the structural documents in which that index word appears.

Subsequently, the appearance condition determination unit 104 checks the appearance condition in the specified structural document, for example, the appearance position on the structure where the input index word appears. For example, when the structure document is XML, the appearance position on the structure can be expressed by XPath which is a language syntax for designating a specific part of the XML document.

Other appearance conditions may include identical or similar word vectors within a certain number of characters or a certain number of nodes from the appearance position, the type of the document, the combination of the schema of the structural document and the appearance position, and so on. In the present embodiment, the number of moves up and down the document structure is referred to as the “number of nodes”. For example, the first chapter first section has 1 node, the first chapter second section has 2 nodes, and the second chapter first section has 4 nodes. The document type is, for example, a type such as a rule or a business manual. In the case of XML, the schema of the structural document is an XML schema or a DTD.

The appearance condition grouping unit 105 groups structural documents having similar appearance conditions. For example, a structural document in which the word A appears in the first chapter, first section, first paragraph and a structural document in which the word B appears in the first chapter, first section, first paragraph have the same appearance position, so they are placed in the same group.

When strict grouping cannot be performed in this way, the appearance condition is made ambiguous. For example, the appearance condition “appears in the first chapter, first section, first paragraph” is widened to a similar range such as “appears somewhere in the first chapter, first section”. In other words, the appearance positions need not be identical and may fall within a similar range. The structural documents are grouped according to the similarity or concreteness of the structural features of the appearance position, without distinguishing which index word they contain. The degree of similarity will be described later.

This grouping is based on an entropy-oriented hypothesis: words that are explained according to a fixed pattern, such as defined terms and item descriptions, appear at similar places in the document structure, whereas words that are only mentioned in passing tend to appear at scattered places.

One method of making an appearance position ambiguous is to remove the structural restrictions nearest to the appearance position of the word.

Each structural document grouped by the appearance condition grouping unit 105 is associated with its index word, and index information representing this correspondence is sent to the index list storage unit 106 and stored there. For example, suppose group A consists of the structural documents D1, D2, and D3, in which the input words W1, W2, and W3 respectively appear in the first chapter, first section, first paragraph, and group B consists of the structural documents D4, D5, and D6, in which the words W1, W2, and W3 respectively appear elsewhere in the first chapter. In that case, the pairs “W1-Group A: D1”, “W1-Group B: D4”, “W2-Group A: D2”, “W2-Group B: D5”, “W3-Group A: D3”, and “W3-Group B: D6” are stored.

The index information stored in the index list storage unit 106 is presented to the user by the index list presentation unit 107. The index list presentation unit 107 lists, for each index word, for example, the structural documents whose appearance conditions are stricter.

When the user adds a new index word, the index word confirmation unit 108 determines its validity and gives feedback to the user. For example, when no structural document containing the input index word W4 belongs to group A, the user is notified of this via the appearance condition grouping unit 105. The criterion for deciding which groups a word must fall into to avoid a notification is assumed to be part of the system settings.

The index word recommendation unit 109 presents unregistered index word candidates to the user. For example, when the appearance condition of group A is the first chapter, first section, first paragraph, the appearance-condition-based structural document search unit 110 extracts, from the registered structural documents, the character strings in the first paragraph of the first section of the first chapter, which match that appearance condition. The unregistered word determination unit 111 then determines, from those character strings, characteristic words that differ from the index words already registered in the index list storage unit 106.

A characteristic word can be determined by extracting a noun by using a morphological analysis algorithm and determining a characteristic word in the character string using an index called TF-IDF. Since this method is known, it will not be described in detail.
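As a rough illustration of the TF-IDF step mentioned above, the following Python sketch ranks unregistered words found in passages that match an appearance condition. It is a minimal sketch under stated assumptions, not the embodiment's implementation: whitespace splitting stands in for morphological analysis and noun extraction, and the function name `recommend_candidates`, its parameters, and the sample passages are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set


def recommend_candidates(passages: Iterable[str], registered: Set[str], top_n: int = 3) -> List[str]:
    """Rank unregistered words by their best TF-IDF score over the passages."""
    # Assumption: whitespace splitting replaces real morphological analysis.
    docs = [p.lower().split() for p in passages]
    docs = [d for d in docs if d]
    # Document frequency: in how many passages each word occurs.
    doc_freq = Counter(w for d in docs for w in set(d))
    best: Dict[str, float] = {}
    for d in docs:
        for word, n in Counter(d).items():
            if word in registered:
                continue  # already an index word, so not a new candidate
            score = (n / len(d)) * math.log(len(docs) / doc_freq[word])
            best[word] = max(best.get(word, 0.0), score)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(best, key=lambda w: (-best[w], w))[:top_n]


# Hypothetical passages extracted at the same appearance position.
PASSAGES = [
    "the deposit means money",
    "the deliverable means output",
    "the examiner means person",
]
print(recommend_candidates(PASSAGES, registered={"deposit"}))
```

Words that occur in every passage (here "the" and "means") receive a score of zero, so they fall to the bottom of the ranking, which is the property that makes TF-IDF useful for picking out characteristic words.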

In addition, a determination that compares various characteristics with the already registered index words may be added, for example narrowing the candidates down to those with a similar average character string length or to those with a similar number of appearances across all structural documents. “A similar number of appearances across all structural documents” means, for example, that if each registered index word appears in about 1% of all registered documents, the recommended words are narrowed down to those that appear in about 1% of the registered documents.

The words determined in this way differ from the index words that have already been registered, but they share a similar appearance condition. Such a word is therefore likely to be a strong candidate, and it is recommended to the user as a new index word candidate.

The group name editing unit 112 is for editing the contents stored in the index list storage unit 106. The user can delete unnecessary word-document pairs and edit group names and appearance conditions.

FIG. 2 is a diagram illustrating an example of a structural document stored in the structural document storage unit 103 according to the embodiment.

In the present embodiment, XML is handled as the structural document stored in the structural document storage unit 103. Alternatively, HTML or SGML may be used.

The documents 201, 202, and 203 are XML documents written in the same XML schema, and are examples storing part of a set of regulation documents that define company activities and rules. For the XML schema, see DocBook: http://docbook.org/ns/docbook.

Each document has an article element at the top. Inside the article element, there are an info (bibliographic information) element for entering the bibliographic information of the article and a plurality of sect1 (section) elements representing the text. Inside the info element are a title element and an author element, and inside the sect1 element are the title element of the section and multiple para elements. In addition, there are orderedlist (numbered list) elements and listitem (a single list item) elements.

FIG. 3 is a diagram illustrating an example of an index word appearance condition for the structure document 202 according to the embodiment.

FIG. 3 shows the appearance conditions 301 and 302 determined by the appearance condition determination unit 104 for the index words “company rules” and “main part”, respectively, in the document 202 shown earlier as an example XML structural document. In this example, the appearance position is used as the appearance condition and is represented by XPath. Since the method for obtaining the XPath from the appearance position of a character string is known, its description is omitted.
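For readers unfamiliar with that known step, the following Python sketch shows one possible way to derive an indexed XPath for each element whose own text contains an index word, using the standard library's `xml.etree.ElementTree`. It is an illustration only: the function names and the sample document are hypothetical, the sample is merely modeled on the DocBook-like structure described above, and, unlike the appearance conditions in FIG. 3, every non-root step carries an index.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix such as the DocBook namespace, if present."""
    return tag.rsplit("}", 1)[-1]


def appearance_positions(xml_text: str, word: str) -> List[str]:
    """Return an indexed XPath for every element whose own text contains word."""
    root = ET.fromstring(xml_text)
    hits: List[str] = []

    def walk(elem: ET.Element, path: str) -> None:
        # An element's own text is its leading text plus the tails of its children.
        pieces = [elem.text or ""] + [child.tail or "" for child in elem]
        if any(word in piece for piece in pieces):
            hits.append(path)
        seen: Counter = Counter()
        for child in elem:
            name = local_name(child.tag)
            seen[name] += 1  # 1-based position among same-named siblings
            walk(child, f"{path}/{name}[{seen[name]}]")

    walk(root, "/" + local_name(root.tag))
    return hits


# Hypothetical sample, not the actual content of the document 202.
SAMPLE = """<article>
  <info><title>Sample rule</title></info>
  <sect1>
    <title>Article 1</title>
    <orderedlist>
      <listitem><para>"company rules" means the rules of the company.</para></listitem>
      <listitem><para>"main part" means the department in charge.</para></listitem>
    </orderedlist>
  </sect1>
</article>"""

print(appearance_positions(SAMPLE, "main part"))
```

The traversal keeps a per-parent counter of sibling names because `ElementTree` elements do not expose their parent, so the path is built on the way down rather than reconstructed from the hit.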

In the present embodiment, in the XPath representing the appearance position, for example in the appearance condition 301 or 302, each notation part such as “article”, “sect1”, “orderedlist”, “listitem”, and “para”, listed from the root node side, is referred to as an “element name”.

Also, notation parts such as the [1] attached to the element “sect1” of the appearance condition 301, the [1] attached to the element “sect1” of the appearance condition 302, the [1] attached to the element “orderedlist” of the appearance condition 301, and the [4] attached to the element “orderedlist” of the appearance condition 302 are referred to as “indexes”. Regarding the relationship between the appearance conditions 301 and 302 in FIG. 3, both belong to the same hierarchy with respect to “sect1[1]”, but for “orderedlist”, one level lower, the index of the appearance condition 301 is [1] and the index of the appearance condition 302 is [4]; the intermediate indexes [2] and [3] are not shown.

In the example of FIG. 3, only the appearance position is set as the appearance condition, but other parameters may be combined as a part of the appearance condition. For example, peripheral information such as characters or character strings before and after an index word, a heading character string of a parent node (hereinafter referred to as “peripheral character string”), a document schema, and the like can be considered.

In this example, the characters before and after the index word are the opening and closing quotation marks immediately before and after “company rules” or “main part”, the index words (see the underlined portions of the document 202). The heading character strings in this example are “Article 1” and “Article 4”. The document schema in this example is the DocBook schema. In XML, the schema is indicated by the xmlns attribute of the top element; in other words, “http://docbook.org/ns/docbook” is the schema name of this document.

FIG. 4 is a flowchart of processing of the appearance condition grouping unit 105 according to the embodiment.

The input is a list including a triplet of an index word, an appearance condition, and a document (step S401). The purpose of the processing of the appearance condition grouping unit 105 is to divide the input list into a plurality of groups based on the criteria that the appearance conditions are similar.

For each entry in the input list, the appearance condition is obscured up to a certain level (step S402). The method of obscuration depends on the contents of the appearance condition. For the XPath that represents the appearance position, the position can be made ambiguous by removing index and element name designations (step S403). There are various ways of removal; for example, (1) removing indexes stepwise from the root node side, and (2) removing elements stepwise from the root node side.

On the other hand, appearance conditions based on peripheral information, such as the preceding and following characters, the peripheral character string, and the schema, can be made ambiguous by removing the designation itself (step S404). The most effective obscuration algorithm is expected to vary with the schema of the structural document, but a simple method like this one can be implemented. Note that the processes in step S403 and step S404 may be performed in parallel.

The number of times the obscuring process has been applied is stored as the obscuration count (step S405). The obscuration count serves as a score that can be regarded as the concreteness of the appearance condition. Further, when the appearance conditions of a plurality of index words are compared, the obscuration count can be regarded as a measure of the similarity between those index words.

Next, entries with the same appearance condition are grouped, starting from those with the lowest obscuration count. That is, over all entries, combinations that have the same or a lower obscuration count and can group all index words are searched for repeatedly (step S406). In other words, a group can contain not only identical appearance conditions but also conditions within a similar range.

However, each item is assumed to belong to only one group on a first-come, first-served basis; an element having the same index word and document pair as an element already included in some group is removed.

As a result of the above processing, the final output of the appearance condition grouping unit 105 is a list of 4-tuples, each consisting of an index word, an appearance condition, the maximum obscuration count, and a list of documents (step S407).

FIG. 5 is a diagram illustrating an example of obscuring appearance conditions according to the embodiment.

For the index word 500 “main part”, the initial state 501 of the appearance condition shown in FIG. 5 is the appearance condition 302 for the document 202 “company regulation management rule” shown in FIG. 3, with the surrounding opening and closing quotation marks added as peripheral information. The case where the appearance condition 501 is made ambiguous is described below.

The appearance condition 502 is the initial state 501 of the appearance condition of the index word, and the number of times of obscuration at this point is zero.

The appearance condition 503 is obtained by removing the index [1] from sect1 [1], which is a part of XPath, with respect to the appearance condition 502 (see the underlined portion of “sect1”). At this time, the number of times of obscuration increases by 1 to “1”. As a result of removing this index, it means that even if the index word “main part” appears in the sect1 element having any index, it is treated as the same thing.

In the example of FIG. 5, the index is first removed stepwise, the peripheral information is removed immediately after all the indexes are removed, and then the element designation is removed.

Specifically, the appearance condition 504 is obtained by removing the index [4] from “orderedlist[4]” in the appearance condition 503 (see the underlined portion of “orderedlist”), and the number of obscurations increases by 1 to “2”. The appearance condition 505 is obtained by removing the index [2] from “listitem[2]” in the appearance condition 504 (see the underlined portion of “listitem”), and the number of obscurations increases by 1 to “3”. The appearance condition 506 is obtained by removing the index [1] from “para[1]” in the appearance condition 505 (see the underlined portion of “para”), and the number of obscurations increases by 1 to “4”.

Here, since all indexes have been removed, the appearance condition 507 is obtained by removing the surrounding quotation marks, which are the peripheral information, from the appearance condition 506 (see the underlined portion of “peripheral information”), and the number of obscurations increases by 1 to “5”.

Next, the appearance condition 508 is obtained by removing the “article” element designation from the appearance condition 507 and adding the surrounding quotation marks back as peripheral information (see the underlined portion of “//sect1”). The number of obscurations remains “5”, since it increases by 1 and decreases by 1. Next, the appearance condition 509 is obtained by removing the peripheral information from the appearance condition 508 (see the underlined portion of “peripheral information”), and the number of obscurations increases by 1 to “6”. Next, the appearance condition 510 is obtained by removing the element designation “sect1” from the appearance condition 509 and adding the surrounding quotation marks back as peripheral information (see the underlined portion of “//orderedlist”). The number of obscurations remains “6”, since it increases by 1 and decreases by 1. Subsequent obscuration steps are not shown.
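The sequence of appearance conditions 502 to 510 described above can be reproduced with a short Python sketch. This is only an illustration of the ordering used in the FIG. 5 example (indexes first, then peripheral information, then element names); the names `Step`, `render`, and `obscure` are hypothetical, and continuing the same pattern beyond the appearance condition 510 is an assumption, since the later steps are not shown in the source.

```python
from typing import Iterator, List, Optional, Tuple

Step = Tuple[str, Optional[int]]  # (element name, index, or None when there is no index)


def render(steps: List[Step], root_dropped: bool) -> str:
    """Format the steps as an XPath; '//' marks that leading elements were removed."""
    body = "/".join(n if i is None else f"{n}[{i}]" for n, i in steps)
    return ("//" if root_dropped else "/") + body


def obscure(steps: List[Step]) -> Iterator[Tuple[str, bool, int]]:
    """Yield (xpath, has_peripheral_info, obscuration_count) for each stage."""
    steps = list(steps)
    count = 0
    yield render(steps, False), True, count  # initial state
    # Phase 1: remove indexes one at a time, starting from the root node side.
    for pos, (name, index) in enumerate(steps):
        if index is not None:
            steps[pos] = (name, None)
            count += 1
            yield render(steps, False), True, count
    # Phase 2: all indexes are gone, so remove the peripheral information.
    count += 1
    yield render(steps, False), False, count
    # Phase 3: remove element names from the root node side.
    while len(steps) > 1:
        steps.pop(0)
        # +1 for the removed element, -1 for the restored peripheral information.
        yield render(steps, True), True, count
        count += 1
        yield render(steps, True), False, count


# Appearance condition 302 as described in the text.
CONDITION_302 = [("article", None), ("sect1", 1), ("orderedlist", 4), ("listitem", 2), ("para", 1)]
for xpath, peripheral, count in obscure(CONDITION_302):
    print(count, xpath, "with" if peripheral else "without", "peripheral info")
```

Representing the path as a list of (name, index) pairs rather than a string keeps each obscuration step a simple list operation and avoids fragile string rewriting.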

FIG. 6 is a diagram illustrating an example of grouping based on the number of times of obscuring appearance conditions according to the embodiment.

Here, an example is shown in which index word-document pairs having appearance conditions developed as shown in FIG. 5 are compared to search for the same group.

The appearance condition 501 of the document 202 in which the index word “main part” 500 appears and the appearance condition 511 of the document 203 in which the index word “deposit” 600 appears are each obscured, and they match for the first time at the appearance condition 505 and the appearance condition 515. When there are only two index words, “main part” 500 and “deposit” 600, the document 202 and the document 203 become the index destination documents of the respective index words.
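The matching described above can be sketched in Python as a search, in ascending order of obscuration count, for the first obscured condition shared by different index words. This is a simplified reading of step S406, not the embodiment's algorithm: the rule that a group needs at least `min_words` distinct index words is an assumption, the function and type names are hypothetical, and the XPath used for “deposit” is invented for illustration because the source does not give the appearance condition 511.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Item = Tuple[str, str]          # (index word, document name)
Ladder = List[Tuple[str, int]]  # [(obscured condition, obscuration count), ...]
Group = Tuple[str, int, List[Item]]


def group_by_condition(ladders: Dict[Item, Ladder], min_words: int = 2) -> Tuple[List[Group], List[Item]]:
    """Group items at the lowest obscuration count where their conditions coincide."""
    remaining: Set[Item] = set(ladders)
    groups: List[Group] = []
    max_count = max(c for ladder in ladders.values() for _, c in ladder)
    for count in range(max_count + 1):
        buckets: Dict[str, Set[Item]] = defaultdict(set)
        for item in remaining:
            for condition, c in ladders[item]:
                if c == count:
                    buckets[condition].add(item)
        for condition in sorted(buckets):
            # Each item joins only one group (first come, first served).
            members = buckets[condition] & remaining
            if len({word for word, _ in members}) >= min_words:
                groups.append((condition, count, sorted(members)))
                remaining -= members
    return groups, sorted(remaining)


LADDERS: Dict[Item, Ladder] = {
    ("main part", "doc202"): [
        ("/article/sect1[1]/orderedlist[4]/listitem[2]/para[1]", 0),
        ("/article/sect1/orderedlist[4]/listitem[2]/para[1]", 1),
        ("/article/sect1/orderedlist/listitem[2]/para[1]", 2),
        ("/article/sect1/orderedlist/listitem/para[1]", 3),
    ],
    # Hypothetical path for "deposit"; the source does not state it.
    ("deposit", "doc203"): [
        ("/article/sect1[1]/orderedlist[2]/listitem[3]/para[1]", 0),
        ("/article/sect1/orderedlist[2]/listitem[3]/para[1]", 1),
        ("/article/sect1/orderedlist/listitem[3]/para[1]", 2),
        ("/article/sect1/orderedlist/listitem/para[1]", 3),
    ],
}
print(group_by_condition(LADDERS))
```

With these inputs the two items first coincide at obscuration count 3, which corresponds to the match between the appearance conditions 505 and 515 in the example.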

FIG. 7 is a diagram illustrating an example of the contents stored in the index list storage unit 106 according to the embodiment.

The index list storage unit 106 stores the index information output from the appearance condition grouping unit 105. The index information stored in the index list storage unit 106 includes an index word 701, an obfuscation count 702, an appearance condition 703, and a document name 705. The group name 704 can be displayed in place of each appearance condition on the index list presentation screen by giving a name to the grouped appearance condition group. The group name 704 can be given by the user using the group name editing unit 112.

In FIG. 7, the index list storage unit 106 stores a group named “definition” (see data rows 505 and 515) and a group named “reference document” (see data rows 711 and 712). The “definition” group is the group with the smallest obscuration count, and the “reference document” group consists of the remaining items.

FIG. 8 is a diagram illustrating an example of an index presentation screen by the index list presenting unit 107 according to the embodiment.

On the screen 800 titled “Registered Document Index”, the index list presentation unit 107 determines the reading of each index word and displays the index words sorted in Japanese syllabary order. [A] ... [ka] ... [sa] ... [shi] ... [yo] and so on are the index word reading headings 801. Various methods for obtaining the readings of kanji are well known, so their description is omitted.

Two index words, “main part” 500 and “deposit” 600, are displayed. Under each index word, the names of the documents belonging to each group are displayed indented, group by group in ascending order of obscuration count. For example, the documents with the smallest obscuration count, 3, are displayed first (see “Company Rules Management Rules” 202 and “Personal Information Cooperation Company Handling and Deposit Management Rules” 203), and the documents with larger obscuration counts are then displayed with a deeper indentation (see “Regulation Editing Manual”, “Regulation Change Request Guidelines”, “(Other 4)” 711, and “External Ordering Regulations” 712). When the user selects a document name, a transition is made to the display screen for that document.

FIG. 9 is a diagram illustrating an example of a presentation screen by the index word confirmation unit 108 according to the embodiment.

On the screen 900 titled “Add index word”, the user enters a new index word, “employee information”, in the index word addition form 902 labeled “Add an index word:” in the screen area 901 and presses the “add” button 903. The index word confirmation unit 108 then compares, via the appearance condition grouping unit 105, the appearance condition in the structural documents in which the index word “employee information” appears against each appearance condition already stored in the index list storage unit 106.

As a result, if it is determined that the appearance condition of the index word “employee information” is not included in the appearance conditions of the already registered index word groups, the index word confirmation unit 108 warns that the word may not be appropriate as an index word (see the message in the screen area 904: “The specified ‘Employee information’ tends to differ from other index words. Are you sure you want to register it?”) and prompts the user to choose the next operation (see the “add” button 905, the “cancel” button 906, and the “confirm registered document” button 907).
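The check performed by the index word confirmation unit 108 can be pictured with the following Python sketch. It is an assumption-laden illustration rather than the embodiment's logic: the function name `should_warn`, the `max_count` threshold (standing in for the criterion left to the system settings), and the sample paths for “employee information” are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Ladder = List[Tuple[str, int]]  # [(obscured condition, obscuration count), ...]


def should_warn(ladders: Iterable[Ladder], registered: Set[str], max_count: int) -> bool:
    """Return True when no occurrence of the new word matches a registered condition."""
    for ladder in ladders:
        for condition, count in ladder:
            # Only moderately obscured conditions count as a meaningful match.
            if count <= max_count and condition in registered:
                return False
    return True


# Appearance condition of the already registered group.
REGISTERED = {"/article/sect1/orderedlist/listitem/para[1]"}

# Hypothetical occurrence of "employee information" in running text.
EMPLOYEE_INFO = [[
    ("/article/sect1[2]/para[3]", 0),
    ("/article/sect1/para[3]", 1),
    ("/article/sect1/para", 2),
]]

if should_warn(EMPLOYEE_INFO, REGISTERED, max_count=3):
    print("The specified word tends to differ from other index words.")
```

Limiting the comparison to a maximum obscuration count matters because a sufficiently obscured condition would eventually match almost any position, which would make the warning useless.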

FIG. 10 is a diagram illustrating an example of a presentation screen by the index word recommendation unit 109 according to the embodiment.

In the example screen “index word candidate” 1000, the appearance-condition-based structural document search unit 110 searches all registered documents stored in the structural document storage unit 103 using the appearance condition of the group with the smallest obscuration count. Based on the result, the index word recommendation unit 109 displays words that are not yet registered as index words.

The words “deliverable”, “affiliated company”, “export control promotion manager”, “business manager”, “educator”, and “examiner” shown in the screen area 1001 appear at the position /article/sect1/orderedlist/listitem/para[1] of some structural document, with opening and closing quotation marks around them.

The user checks the check box 1002 next to any candidate that should become an index word. By pressing the “document reference” link 1003 as necessary, the user can confirm the contents of the document at the appearance position of the index word. Then, by pressing the “add to index word” button 904, the index word is added.

Furthermore, when the document set is updated, the index destination document is rechecked, and words that are not yet registered as index words can be presented to the user as index word candidates.

As described above, according to the present embodiment, the index list is created and maintained at a low cost, so that viewing efficiency increases for document viewers and maintenance costs decrease for document editors. Business efficiency improves for both, and they can concentrate on higher-value work such as understanding and editing document contents.

First, by simply specifying index words, the most appropriate document can be determined from documents including those words, and an index list in which the index words and documents are paired can be automatically generated. As a result, an index list is easily created, information collection efficiency is improved for document viewers, and document maintenance costs are reduced for document editors, so work efficiency is generally improved.

Second, by checking the registration of an inappropriate index word, it becomes difficult to register an inappropriate word as an index. Document maintenance costs for document editors are reduced, and work efficiency is improved.

Third, it is possible to provide a mechanism that makes it possible to clarify other index words simply by inputting some index words. This mechanism allows document editors to significantly reduce index word maintenance costs. Readers can use a more extensive index list, which improves the operational efficiency of the entire organization.

Although several embodiments of the present invention have been described, these embodiments are presented by way of example and are not intended to limit the scope of the invention. These embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the spirit of the invention. These embodiments and modifications thereof are included in the scope of the present invention and the gist thereof, and are also included in the invention described in the scope of claims and the equivalents thereof.

Structural document management system ... 100
Index word input unit ... 101
Word-based structural document search unit ... 102
Structural document storage unit ... 103
Appearance condition determination unit ... 104
Appearance condition grouping unit ... 105
Index list storage unit ... 106
Index list presentation unit ... 107
Index word confirmation unit ... 108
Index word recommendation unit ... 109
Appearance-condition-based structural document search unit ... 110
Unregistered word determination unit ... 111
Group name editing unit ... 112

Claims

An input means for inputting an index word;
Retrieval means for retrieving a structural document in which the index word appears from a storage device storing the structural document;
Determining means for determining an appearance condition for at least identifying a part on the structure in which the index word appears in the searched structure document;
Grouping means for grouping each structural document based on the similarity of the appearance conditions;
A structural document management system comprising index storage means for storing a correspondence relationship between each grouped structural document and each index word as index information.
The structural document management system according to claim 1, wherein the appearance condition includes the presence / absence of a character string around the index word.
The structural document management system according to claim 1, further comprising index word confirmation means for, when the user adds an index word, issuing a warning and prompting confirmation if the appearance condition in the structural document in which the word appears is not included in the appearance conditions for the index words already stored in the index storage means.
The structural document management system according to claim 1, further comprising: second search means for searching a storage device for a structural document having a portion that matches the appearance condition; and
index word recommendation means for extracting, from the portion that matches the appearance condition in the retrieved structural document, a word different from the index words already stored in the index storage means, and presenting the extracted word to the user as a new index word candidate.
An input step for entering a specified index word;
A search step of searching a storage device for a structured document including the specified index word;
A determination step of determining an appearance condition for identifying a structural part in which the index word appears in the searched structure document;
A grouping step of grouping each structural document based on the similarity of the appearance conditions;
A structural document management method comprising an index storage step of storing a correspondence relationship between each grouped structural document and each index word as index information.
A program for causing a computer constituting the structural document management system according to claim 1 to function as each of said means.