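WO2009063925A1 - Document management and retrieval system and document management and retrieval method - Google Patents
Document management and retrieval system and document management and retrieval method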
- Publication number
- WO2009063925A1 (application PCT/JP2008/070630)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- tag
- document
- index
- word
- key
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/81—Indexing, e.g. XML tags; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/83—Querying
- G06F16/835—Query processing
Definitions
- the present invention relates to a technique for adding a tag to a partial character string in a document and managing and retrieving document information based on the tag.
- the present invention relates to a technology that enables a phrase including a tag to be used as a processing request (query) for information retrieval.
- Figure 2 shows an example of a tagged document.
- “text” refers to data including at least a document number that is a unique identifier and a character string (main text) to be searched.
- a “tag” is data added to one or more words in a document.
- as a tag example, the tag “person name” is shown for “Taro Yamada” at the 7th to 10th characters.
- a character string representing a tag such as “company name” or “person name” is called a tag name.
- a “word” refers to a substring of the text created according to some criterion, such as morphological analysis or N-grams (the character string is divided into units of N characters).
- Consider a document management and search system that manages and searches text to which such tags have been added.
- Such a system needs a function for adding tags to, or deleting tags from, substrings in documents, and a function for searching documents by phrases that use tags.
- Searching documents by a phrase that uses tags means taking as input a continuous character string that includes tag names and outputting the set of documents that contain the phrase.
- For example, “[company name] no [person name]” is a phrase that uses tags.
- The character string enclosed by “[” and “]” is regarded as a tag name. When this phrase is given as a search query, it means returning the documents in which any word with the tag “company name”, the word “no”, and any word with the tag “person name” appear in succession.
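The bracket convention above can be sketched as a small tokenizer. This is only an illustrative sketch: the function name `parse_phrase` is hypothetical, and it assumes tokens are separated by whitespace, which the source does not specify (the original Japanese text has no spaces between words).

```python
import re

# "[name]" is a tag condition; any other run of non-space characters is a word.
# Whitespace-separated tokens are an assumption made for this sketch.
TOKEN = re.compile(r"\[([^\[\]]+)\]|(\S+)")

def parse_phrase(query):
    """Return ("tag", name) / ("word", text) pairs in query order."""
    tokens = []
    for tag_name, word in TOKEN.findall(query):
        tokens.append(("tag", tag_name) if tag_name else ("word", word))
    return tokens

# Example query taken from the text above.
tokens = parse_phrase("[company name] no [person name]")
```

Keeping the tokens in query order matters because the later steps rely on which word or tag sits to the left or right of each tag.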
- As a way to realize such a system, a known method expresses the tagged documents in a hierarchical structure description format such as XML (Extensible Markup Language) and uses an XML database (XMLDB) as the search device (see, for example, Japanese Unexamined Patent Application Publication No. 2005-18811, hereinafter referred to as Patent Document 1).
- Fig. 3 shows an example of expressing a tagged document in XML.
- Fig. 4 shows part of the document as a tree structure based on the inclusion relationship of tags.
- Fig. 5 shows the table that manages the hierarchical information.
- the ellipse node means a tag
- the rectangular leaf node means text
- the edge between them means that there is a containment relationship between those tags or text.
- information called the path hierarchy is described under each node.
- the path hierarchy is information indicating the position of each node in the document.
- the path hierarchy is a sequence of numbers, separated by the delimiter “.”, that describes the position of a node.
- For example, the “person name” node in Figure 4 has the path hierarchy “1.1.3”, which means it is the third node under the first node (the “body” node) under the first node (the “document” node) as seen from the root.
- This hierarchical information is managed in a table such as the one shown in Fig. 5. Note that this table shows the logical relationship; in practice it is often split across multiple tables.
- node ID, document number, text, tag name, and path hierarchy information are managed for the nodes in the document set.
- Node ID is a unique identifier for all nodes in the document set.
- the document number is an ID that indicates the document that contains the node.
- a text is a character string included in a leaf node.
- NULL shall be input for nodes that are not leaf nodes.
- the tag name is the tag name of each node.
- for a leaf node, “#text” is entered as the tag name.
- the path hierarchy means the path hierarchy of each node. A method for searching for such information will be described by taking the operation of the search device disclosed in Patent Document 1 as an example.
- This search device first decomposes a query into a plurality of search conditions. The query above is broken down into three parts: A, a company name tag; B, containing the word “no”; and C, a person name tag. Next, the search device refers to the table shown in Fig. 5 based on each condition and obtains a list of nodes whose tag name is “company name” (the A list), a list of nodes whose text contains “no” (the B list), and a list of nodes whose tag name is “person name” (the C list).
- Next, the search device compares the nodes in the A list, B list, and C list, picks out the combinations of nodes with the same document number, and from these retrieves the combinations in which the “company name” node from the A list, the “no” node from the B list, and the “person name” node from the C list are positioned consecutively in the same order as the query. This positional relationship is determined by comparing path hierarchies. For this query, the “company name” node, the “no” node, and the “person name” node are sibling nodes, and the search device creates the search result from nodes that satisfy the following three conditions.
- Condition 1 The path hierarchy of the “company name” node, the path hierarchy of the “no” node, and the path hierarchy of the “person name” node match except for the number at the end;
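Condition 1 can be illustrated with a short check on path hierarchy strings. This is a minimal sketch, not the search device of Patent Document 1; the function names are hypothetical, and only Condition 1 as stated above is covered.

```python
def parent_path(path):
    """Drop the final number from a path hierarchy such as "1.1.3"."""
    return tuple(int(part) for part in path.split("."))[:-1]

def are_siblings(*paths):
    """Condition 1: every path matches the others except for its last number."""
    return len({parent_path(p) for p in paths}) == 1

# Three nodes directly under the same parent "1.1".
same_parent = are_siblings("1.1.1", "1.1.2", "1.1.3")
```

Because every node's path encodes its position in the whole document, this check is cheap, but the paths themselves become stale whenever a tag is inserted above or before a node.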
- Figure 6 shows an example of changing the path hierarchy by adding tags.
- Figure 6 shows an example of adding a person name tag to a document.
- the document structure before addition is shown on the left, and the document structure after addition and the update range of its path hierarchy are shown on the right.
- the update range on the right indicates that the path hierarchy of the nodes in the range indicated by the dotted line needs to be updated.
- the path hierarchy expresses the position of the node using the hierarchical structure of the entire document, so even if a part of the document is changed, it needs to be changed significantly.
- The second problem is that search takes a long time when the search query is a phrase consisting only of common words and high-frequency tag names. When searching with common words or high-frequency tags, each individual condition matches a large number of nodes, so document numbers and positional relationships must be checked for a large number of nodes, and search speed drops. For example, the query “[company name] no [person name]” is broken down into a company name tag, the word “no”, and a person name tag, and a list of nodes matching each condition is retrieved. Because each condition is so general, a large number of nodes are found, and examining their positional relationships takes a great deal of time.
- In this way, a document management/retrieval system using XMLDB indexes even the hierarchical structure of the document, so updating (adding or deleting) tags and searching both take time. As another way to realize phrase search using tags, one can therefore consider using an inverted index, as used for full-text search, without indexing the hierarchical structure.
- Figure 7 shows an example of an inverted index.
- In an inverted index, given a word as a key, one can obtain the number (frequency) of documents containing the word and a list (hereinafter referred to as the document list) of the document numbers of the documents containing the word together with the appearance positions of the word (expressed as the number of characters from the beginning of the document).
- the inverted index of the tag shown in (b) is used in addition to the normal inverted index shown in (a).
- As in the case of a word, (b) takes a tag name as the key and yields the number (frequency) of documents containing the tag and a list (hereinafter referred to as the tag document list) of the document numbers of the documents containing the tag together with the appearance positions of the tag in each document (start position and end position, expressed as the number of characters from the beginning of the document).
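The two inverted indexes described above can be sketched as nested dictionaries. This is an illustrative sketch only: the function names and the input shapes are assumptions, and the sample document numbers and positions are made up for the example.

```python
from collections import defaultdict

def build_word_index(docs):
    """docs: {doc_no: [(word, position), ...]} -> {word: {doc_no: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_no, words in docs.items():
        for word, pos in words:
            index[word][doc_no].append(pos)
    return index

def build_tag_index(tags):
    """tags: [(tag_name, doc_no, start, end), ...] -> {tag_name: {doc_no: [(start, end)]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for tag_name, doc_no, start, end in tags:
        index[tag_name][doc_no].append((start, end))
    return index

def document_frequency(index, key):
    """Number of documents whose list contains the key (word or tag name)."""
    return len(index[key]) if key in index else 0

# Illustrative sample data (not taken from the figures).
word_index = build_word_index({333: [("Yamada", 7)], 346: [("Yamada", 4), ("Yamada", 20)]})
tag_index = build_tag_index([("person name", 333, 7, 10)])
```

The only structural difference between the two is that a word posting holds a single position while a tag posting holds a start and an end position.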
- The Nextword index is known as a technique for shortening the document lists and speeding up phrase search for search queries consisting of common words (H. E. Williams, J. Zobel and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems, 22(4), pp. 573-594, 2004, hereinafter referred to as Non-Patent Document 1).
- The Nextword index has a data structure in which the document list of a high-frequency common word is divided based on the word that follows it (assuming horizontal writing, this side is called the “right”).
- In the Nextword index, the set of words (Nextwords) appearing to the right of a word is stored with that word as the key. Furthermore, from the pair of the key word and one Nextword, one can refer to the document list for the set of documents in which the two words appear next to each other.
- Figure 9 shows an example of a Nextword index.
- Here, “Yamada” and “company” are registered as Nextwords of the word “no”, and a document list is registered for each of them: one for documents containing “no Yamada” and one for documents containing “no company”.
- A key consisting of two words (or conditions) is written as “A→B” (for example, “no→Yamada”), where A is called the primary key and B the secondary key.
- Non-Patent Document 1 improves search speed by using this Nextword index for high-frequency words.
- For example, a search for the phrase “abc industry no Yamada” proceeds as follows. First, the normal inverted index is referenced for the low-frequency word, and the document list corresponding to “abc industry” is obtained. Next, for the high-frequency word, the Nextword index is referenced, and the document list is obtained from the key “no→Yamada”. These two document lists are then compared, and the set of documents in which the document number matches and the appearance positions are in the same order as the query is output. In this way, with the Nextword index the document list can be read using the adjacency relationship between two words as a key, so search speed can be improved.
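The lookup and comparison described above can be sketched as follows. This is a simplified illustration, not the data structure of Non-Patent Document 1: the class and function names are hypothetical, every adjacent word pair is indexed here (the text applies the index to high-frequency words only), and positions are assumed to be character offsets.

```python
from collections import defaultdict

class NextwordIndex:
    """Maps a primary word to its next words, each with its own document list."""

    def __init__(self):
        # primary word -> next word -> [(doc_no, position of the primary word)]
        self.entries = defaultdict(lambda: defaultdict(list))

    def add_document(self, doc_no, words):
        """words: list of (word, position) pairs in document order."""
        for (w1, pos), (w2, _) in zip(words, words[1:]):
            self.entries[w1][w2].append((doc_no, pos))

    def nextwords(self, primary):
        return sorted(self.entries[primary]) if primary in self.entries else []

    def postings(self, primary, secondary):
        if primary in self.entries and secondary in self.entries[primary]:
            return list(self.entries[primary][secondary])
        return []

def phrase_matches(first_postings, first_len, pair_postings):
    """Documents in which the word pair starts right after the first word ends."""
    pair_set = set(pair_postings)
    return sorted({doc for doc, pos in first_postings
                   if (doc, pos + first_len) in pair_set})

# Illustrative data: the first word occupies 5 characters starting at position 1.
idx = NextwordIndex()
idx.add_document(1, [("abc industry", 1), ("no", 6), ("Yamada", 7)])
idx.add_document(2, [("no", 3), ("company", 4)])
hits = phrase_matches([(1, 1)], 5, idx.postings("no", "Yamada"))
```

The benefit is that only the postings for the pair are read, instead of the full document list of the very common word “no”.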
- FIG. 10 is a diagram explaining why tag update processing takes time in a search system that uses the Nextword index.
- It shows the range that needs to be updated when tags are added to or deleted from the string “abc industry no Yamada”.
- As shown in FIG. 10(a), in the string “abc industry no Yamada”, the tags “noun” and “company name” are added to “abc industry”, the tag “particle” to “no”, and the tag “person name” to “Yamada”.
- The eight dotted arrows in (a) represent the adjacency keys created in the Nextword index.
- “abc industry” in Figure 10 is assumed to be low-frequency and is stored in the normal inverted index.
- The Nextword index does not assume that tags will be added; if it is simply applied to tagged documents, many places must be updated, so updates take time. This is because, when tags are used as secondary keys, the references to a given tag are scattered.
Disclosure of the invention:
- The search device described in Patent Document 1 assumes not only phrase search but also queries on the hierarchical structure of tags (such as returning documents that have the structure “/document/body/company name”). Because it maintains a hierarchical index for this purpose, updating the index takes time.
- The search device described in Patent Document 1 is also based on the idea of decomposing a phrase into individual word conditions and then searching on each condition. When the individual conditions are very common, a large amount of information must be read, and the search takes time.
- Non-Patent Document 1 can reduce the amount of document list data that is read by using the adjacency relationship between two words. However, because the adjacency relationships between words and tags become complicated, tag updates take time.
- The present invention resolves these problems. Its object is to provide a document management/retrieval system and a document management/retrieval method that, for searches on phrases including tags, achieve both efficient search for queries containing common words and high-frequency tags and efficient tag updates.
- The document management/retrieval system of the present invention includes the following units.
- A word index storage unit that stores the appearance position of each word in the document set.
- A tag LR index storage unit that, for a set of tags that are added to words and indicate the attributes of those words, stores the set of words appearing to the right and left of each tag, and stores the appearance position of each tag in the document set using as a key the combination of the tag and the word appearing on its right, or the combination of the tag and the word appearing on its left.
- A document search unit that takes a phrase consisting of tags and words as a search query, refers to the tag LR index storage unit using the left-right relationships between adjacent words and tags in the phrase, and returns a list of identifiers of the documents that contain the phrase.
- A tag update unit that interprets a query for adding a tag to, or deleting a tag from, a substring in a specific document and updates the contents stored in the tag LR index storage unit.
- A document index creation unit that updates the index in the word index storage unit when one or more documents are given.
- It is desirable that the system further include a high-speed tag value determination unit that, with an arbitrary character string as a key, enables high-speed reference to the set of tag names that may be attached to that character string.
- It is also desirable that the document search unit include means for, when a phrase with consecutive tags is input as a search query, referring to the high-speed tag value determination unit and the tag LR index storage unit and executing the query narrowed down to words that may carry a specific tag name.
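One way to picture the high-speed tag value determination unit is as a table from character strings to the tag names that have been attached to them. The sketch below is an assumption about its shape, not the patent's implementation; `TagValueTable`, `record`, `may_have`, and `narrow_candidates` are hypothetical names.

```python
from collections import defaultdict

class TagValueTable:
    """Character string -> set of tag names that may be attached to it."""

    def __init__(self):
        self._tags_by_string = defaultdict(set)

    def record(self, text, tag_name):
        """Called when a tag is added to an occurrence of `text`."""
        self._tags_by_string[text].add(tag_name)

    def tag_names(self, text):
        return set(self._tags_by_string.get(text, ()))

    def may_have(self, text, tag_name):
        return tag_name in self._tags_by_string.get(text, ())

def narrow_candidates(table, words, tag_name):
    """Keep only the words that may carry `tag_name`."""
    return [w for w in words if table.may_have(w, tag_name)]

# Illustrative use: of the words seen next to one tag, keep those that can
# themselves carry the neighbouring tag in a query with consecutive tags.
table = TagValueTable()
table.record("Yamada", "person name")
table.record("Yamada", "noun")
table.record("abc industry", "company name")
candidates = narrow_candidates(table, ["Yamada", "no", "abc industry"], "person name")
```

For a query in which two tags are adjacent, this kind of narrowing avoids reading tag document lists for neighbouring words that can never carry the second tag.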
- The system may further include a bit string storage unit that stores, with high-frequency words and tag names as keys, bit strings each representing the set of documents containing that word or tag. In that case the document index creation unit stores the bit strings when creating an index from the documents.
- The tag update unit may include means for updating the bit strings in the bit string storage unit based on the added or deleted tag when a tag is updated. The document search unit may include means for, at search time, first referring to the bit string storage unit based on the high-frequency words and tag names included in the query, obtaining the set of document numbers of the documents that contain all of those high-frequency words and tag names, narrowing down the document set accordingly, and then reading the positions of the phrase within that document set.
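The narrowing step can be sketched with one integer per key used as a bit string, where bit n stands for document number n. This is an illustrative sketch under that assumption; the class and method names are hypothetical and the source does not specify the bit string layout.

```python
class BitStringStore:
    """One bit string per high-frequency word or tag name; bit n = document n."""

    def __init__(self):
        self._bits = {}

    def set_doc(self, key, doc_no):
        self._bits[key] = self._bits.get(key, 0) | (1 << doc_no)

    def clear_doc(self, key, doc_no):
        # The caller must make sure no other occurrence of the key remains
        # in the document before clearing its bit.
        self._bits[key] = self._bits.get(key, 0) & ~(1 << doc_no)

    def candidates(self, keys):
        """Document numbers that contain every key (bitwise AND of the bit strings)."""
        keys = list(keys)
        if not keys:
            return []
        combined = self._bits.get(keys[0], 0)
        for key in keys[1:]:
            combined &= self._bits.get(key, 0)
        return [n for n in range(combined.bit_length()) if (combined >> n) & 1]

# Illustrative data: only document 2 contains both keys.
store = BitStringStore()
for doc_no in (1, 2):
    store.set_doc("no", doc_no)
for doc_no in (2, 5):
    store.set_doc("[person name]", doc_no)
docs = store.candidates(["no", "[person name]"])
```

The AND is only a pre-filter: a surviving document contains all the keys somewhere, so the positions still have to be read and checked to confirm the phrase.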
- The system may further include a tag NLR index storage unit that stores, for the set of tags and with each tag name as a key, the appearance positions of the tag in the document set together with the words on its left and right; conversion means for converting an index in the tag NLR index storage unit into an index in the tag LR index storage unit; and management means for changing the index storage method based on the appearance frequency of the tag.
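The conversion between the two storage methods can be sketched as regrouping one tag's NLR entries by their left and right words. This is a sketch under assumptions: the function names, the entry layout, the threshold value, and the rule that tags at or above the threshold use the LR form are all hypothetical choices, since the source only says the storage method changes with tag frequency.

```python
from collections import defaultdict

def nlr_to_lr(nlr_entries):
    """Regroup one tag's NLR entries into L and R indexes.

    nlr_entries: [(doc_no, start, end, left_word, right_word), ...]
    Returns ({left_word: [(doc_no, start, end)]}, {right_word: [(doc_no, start, end)]}).
    """
    left_index = defaultdict(list)
    right_index = defaultdict(list)
    for doc_no, start, end, left_word, right_word in nlr_entries:
        left_index[left_word].append((doc_no, start, end))
        right_index[right_word].append((doc_no, start, end))
    return dict(left_index), dict(right_index)

def choose_index_type(frequency, threshold=1000):
    """Hypothetical management rule: frequent tags use LR, the rest NLR."""
    return "LR" if frequency >= threshold else "NLR"

# Illustrative entries for one tag.
entries = [(333, 7, 10, "no", "wa"), (346, 4, 7, "no", "ga")]
left, right = nlr_to_lr(entries)
```

A single NLR list is cheap to maintain for rare tags, while splitting by neighbouring word pays off once the list is long enough that reading all of it dominates search time.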
- The document management/retrieval method of the present invention includes the following steps.
- A document index creation step of storing, when one or more documents are given, the appearance position of each word in the set of words contained in the documents.
- A tag update step of storing, when a query for adding a tag to or deleting a tag from a substring in a specific document is given, the appearance position of the tag using the tag name as a key.
- A tag LR storage step, within the tag update step, of storing the words that appear to the right and left of the input tag, and storing the appearance position of each tag in the document set using as a key the combination of the tag and the word appearing on its right, or the combination of the tag and the word appearing on its left.
- A document search step of, when a phrase consisting of tags and words is given as a search query, interpreting the search query, creating multiple keys using the left-right relationships between adjacent words and tags in the phrase, referring, based on these keys, to the appearance positions of each word stored in the document index creation step and the appearance positions of each tag stored in the tag update step, and integrating them to return a list of identifiers of the documents that contain the phrase.
- It is desirable that the method include a high-speed tag value determination step that, with an arbitrary character string as a key, enables high-speed reference to the set of tag names that may be attached to that character string; that the tag update step include a step of updating the data representing the relationship between tag names and character strings; and that the document search step include a step of, when a phrase with consecutive tag names is input as a search query, using the high-speed tag value determination step to read the appearance positions of tags narrowed down to only the words that may carry the tag name.
- The document index creation step may include a bit string storage step of storing, with high-frequency words and tag names as keys, bit strings each representing the set of documents that contain that word or tag.
- The tag update step may include a step of updating, when a tag is updated, the bit strings stored in the bit string storage step based on the added or deleted tag.
- The document search step may include a step of referring to the bit strings stored in the bit string storage step using the high-frequency words and tag names included in the search query as keys, obtaining data representing the set of documents that contain all of them, narrowing down the document set based on that data, and reading the appearance positions of words and tags.
- The tag update step may include a tag NLR index step of storing, with each tag name as a key, the appearance positions of the tag in the document set together with the words on its left and right.
- The tag update step and the document search step may include a reference selection step of selecting, when appearance positions are looked up with a tag as the key, whether the tag is stored by the tag NLR index step or by the tag LR storage step.
- The method may also include an index conversion step that, based on the tag frequency, deletes the data created in the tag NLR index step and recreates it in the tag LR storage step.
- The present invention can also be implemented as a computer program. That is, the program causes a computer to execute the following processes.
- A document index creation process of storing, when one or more documents are given, the appearance position of each word in the set of words contained in the documents.
- A tag update process of storing, when a query for adding a tag to or deleting a tag from a substring in a specific document is given, the appearance position of the tag using the tag name as a key.
- A tag LR storage process, within the tag update process, of storing the words appearing to the right and left of the input tag, and storing the appearance position of each tag in the document set using as a key the combination of the tag and the word appearing on its right, or the combination of the tag and the word appearing on its left.
- A document search process of, when a phrase consisting of tags and words is given as a search query, interpreting the search query, creating multiple keys using the left-right relationships between adjacent words and tags in the phrase, and referring to the stored appearance positions based on those keys.
- The computer may also be made to execute a high-speed tag value determination process that, with an arbitrary character string as a key, enables high-speed reference to the set of tag names that may be attached to that character string, together with a process, within the tag update process, of updating the tag name data when a tag is added.
- The computer may also be made to execute: a bit string storage process, within the document index creation process, of storing bit strings each representing the set of documents that contain a word or tag, with high-frequency words and tag names as keys; a process, within the tag update process, of updating the stored bit strings based on the added or deleted tag when a tag is updated; and a process, within the document search process, of referring to the stored bit strings using the high-frequency words and tag names included in the search query as keys, obtaining data representing the set of documents that contain all of the high-frequency words and tag names in the query, narrowing down the document set based on that data, and reading the appearance positions of words and tags.
- The computer may also be made to execute a tag NLR index process that stores, for the set of tags and with each tag name as a key, the appearance positions of the tag in the document set together with the words on its left and right.
- In that case, whether a tag is stored by the tag NLR index process or by the tag LR storage process is selected when the tag is referenced.
- The computer may further be made to execute an index conversion process that deletes the data created by the tag NLR index process and recreates it by the tag LR storage process.
- According to the present invention, at search time an index keyed on a tag and its right or left word can be referenced for the adjacent words and tags included in the query phrase. This reduces the amount of document list data to be read, so search processing can be performed at high speed.
- When a tag is updated, the update is completed with just two updates to the tag LR index storage unit, so tags can be updated at high speed with only a small amount of updating.
- FIG. 1 is a block diagram showing a first preferred embodiment of the present invention.
- FIG. 2 shows an example of a tagged document.
- FIG. 3 is a diagram showing an example in which a document with a tag is expressed in XML.
- Fig. 4 is a diagram showing the path hierarchy used in XMLDB.
- Figure 5 shows the logical structure of the index used in XMLDB.
- Figure 6 shows the range that needs to be updated when adding tags in XMLDB.
- FIG. 7 is a diagram showing an example of an inverted index.
- FIG. 8 is a diagram showing an example of the data structure of the Nextword index.
- FIG. 9 is a diagram showing an example of the Nextword index.
- Fig. 10 is a diagram showing the range that needs to be updated when adding or deleting tags in a search system that uses the Nextword index.
- FIG. 11 is a diagram showing an example of the inverted index assumed in the first embodiment of the present invention.
- FIG. 12 is a diagram illustrating an example of data in the tag LR index storage unit.
- FIG. 13 is a block diagram showing a second preferred embodiment of the present invention.
- FIG. 14 is a block diagram illustrating a configuration example of the high-speed tag value determination unit.
- FIG. 15 shows an example of a tag value table.
- FIG. 16 shows an example of a list of inquiry tasks.
- FIG. 17 shows an example of a document list column.
- Figure 18 is a flowchart of the search process.
- FIG. 19 is a diagram showing an example of a key string.
- FIG. 20 is a flowchart of processing for creating a list of inquiry tasks.
- Figure 21 is a flowchart of the query task execution process.
- Figure 22 shows a flowchart of the document list integration process.
- Fig. 23 is a diagram for explaining the positional relation check process, and shows four cases in the inquiry for each key.
- Fig. 24 is a flowchart of the positional relationship check process.
- Figure 25 illustrates the tag update process
- FIG. 26 shows an example of a list of words, document numbers, and appearance positions.
- FIG. 27 shows an example of a key string.
- FIG. 28 is a block diagram showing a third preferred embodiment of the present invention.
- FIG. 29 is a diagram illustrating an example of data stored in the bit string storage unit.
- FIG. 30 is a block diagram showing a fourth preferred embodiment of the present invention.
- FIG. 31 shows an example of the tag LR document list.
- FIG. 32 shows an example of the management table.
- Figure 33 shows a flowchart of processing when the index type is NLR.
- Figure 34 is a flowchart of the index optimization process.
Best Mode for Carrying Out the Invention:
- FIG. 1 is a block diagram showing a first preferred embodiment of the present invention, and shows a configuration example of a document management / retrieval system.
- This document management/retrieval system includes a word index storage unit 13 that stores the appearance position of each word in the document set.
- It includes a tag LR index storage unit 14 that, for a set of tags that are added to words and indicate the attributes of those words, stores the set of words that appear on the right and left of each tag, and stores the appearance position of each tag in the document set using as a key the combination of the tag and the word that appears on its right, or the combination of the tag and the word that appears on its left.
- It includes a document search unit 15 that takes a phrase consisting of tags and words as a search query and returns a list of identifiers of the documents that contain the phrase by referring to the tag LR index storage unit 14 using the left-right relationships between adjacent words and tags in the phrase.
- It also includes a tag update unit 12 that interprets a query for adding a tag to or deleting a tag from a substring in a specific document and updates the contents stored in the tag LR index storage unit 14, and a document index creation unit 11 that updates the index in the word index storage unit 13 when one or more documents are given.
- The word index storage unit 13 stores an inverted index (N) for words.
- An inverted index is data from which, using a word as the key, one can look up the document numbers and appearance positions for the documents in the document set that contain that word.
- Fig. 11 shows an example of the inverted index assumed in this embodiment.
- In this example, with the word “Yamada” as the key, the index indicates that “Yamada” appears in two documents in the document set: once in the document with document number 333, at the 7th character from the beginning, and twice in the document with document number 346, at the 4th and 20th characters from the beginning.
- The word index storage unit 13 receives a set of data consisting of a word, the document number of the document containing the word, the appearance position in the document, and the like, and stores this data as a document list using each word as a key. When the word index storage unit 13 receives a query composed of at least one word from the query execution means 152, it returns the document list of that word.
- The tag LR index storage unit 14 stores a tag LR index, consisting of a tag L index (TL) and a tag R index (TR), as an inverted index for tags and their left and right words.
- The tag L index stores the set of words that appear to the left of a tag when the tag appears, and a tag document list keyed on the tag and the word that appears to its left.
- The tag R index stores the set of words that appear to the right of a tag when the tag appears, and a tag document list keyed on the tag and the word that appears to its right.
- In this way, a tag document list can be retrieved under the condition that a given word exists on the right or left side of a given tag.
- Figure 12 shows an example of a tag LR index.
- Each entry in the tag L index and the tag R index is expressed as a reference (pointer) to a tag document list.
- For example, the tag document list for a key beginning with “[person name]” is reached through the pointer “#1256”. It shows that this pattern occurs 19859 times across all documents, and that in the document with document number 333 the [person name] tag is added to the 7th to 10th characters from the beginning.
- The tag LR index storage unit 14 receives data consisting of an instruction type, tag name, document number, start position, end position, left word, and right word from the tag update unit 12 and updates its internal tag LR index.
- the instruction type is information that identifies at least one of the two types of addition and deletion.
- the tag name is the name of the tag to be added or deleted.
- the document number is the document number of the document for which the tag is added or deleted.
- the start position and end position are the positions in the document where tags are added or deleted.
- the word on the left is the word that appears to the left of the start position.
- the word on the right is the word that appears to the right of the end position.
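The update data listed above maps naturally onto two keyed writes, one to the tag L index and one to the tag R index. The following is a minimal sketch assuming tuple keys and in-memory lists; `TagLRIndex` and its method names are hypothetical, and the sample words are illustrative.

```python
from collections import defaultdict

class TagLRIndex:
    """Tag L and R indexes keyed on (tag name, neighbouring word)."""

    def __init__(self):
        # (tag_name, word) -> [(doc_no, start, end)]
        self.left = defaultdict(list)   # word appears to the left of the tag
        self.right = defaultdict(list)  # word appears to the right of the tag

    def update(self, instruction, tag_name, doc_no, start, end, left_word, right_word):
        """Apply one add/delete instruction; exactly two keys are touched."""
        entry = (doc_no, start, end)
        targets = ((self.left, (tag_name, left_word)),
                   (self.right, (tag_name, right_word)))
        for index, key in targets:
            if instruction == "add":
                index[key].append(entry)
            elif instruction == "delete":
                if entry in index[key]:
                    index[key].remove(entry)
                if not index[key]:
                    del index[key]  # drop empty tag document lists
            else:
                raise ValueError(f"unknown instruction type: {instruction}")

# Illustrative update: tag characters 7 to 10 of document 333 as a person name.
lr = TagLRIndex()
lr.update("add", "person name", 333, 7, 10, left_word="no", right_word="wa")
```

However many other tags or words surround the tagged span, the update touches only the two entries named by the left word and the right word.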
- the tag LR index storage unit 14 receives an inquiry from the document search unit 15 including a reference destination and a reference key.
- the reference destination is data indicating either the tag L index or the tag R index.
- the reference key is specified as either “tag name” or “tag name→word”.
- The tag LR index storage unit 14 receives an inquiry with the reference destination and the reference key as input. When the reference key is “tag name”, it refers to the tag L index or the tag R index named by the reference destination based on the tag name, and returns the list of left words or the list of right words, respectively. When the reference key is “tag name→word”, it refers to the tag L index or the tag R index named by the reference destination based on the key “tag name→word”, and returns the corresponding tag document list.
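The two forms of inquiry can be sketched as one lookup function over nested dictionaries. This is an illustrative sketch: the function name `lookup`, the "L"/"R" destination labels, and the sample data are assumptions, not the storage unit's actual interface.

```python
def lookup(tag_lr, destination, tag_name, word=None):
    """Answer an inquiry against the tag L ("L") or tag R ("R") index.

    tag_lr: {"L": {tag_name: {word: [(doc_no, start, end)]}}, "R": {...}}
    With only a tag name, return the list of neighbouring words;
    with a tag name and a word, return the tag document list.
    """
    if destination not in ("L", "R"):
        raise ValueError("destination must be 'L' or 'R'")
    by_word = tag_lr[destination].get(tag_name, {})
    if word is None:
        return sorted(by_word)
    return list(by_word.get(word, []))

# Illustrative index contents.
tag_lr = {
    "L": {"person name": {"no": [(333, 7, 10)], "to": [(346, 4, 7)]}},
    "R": {"person name": {"wa": [(333, 7, 10)]}},
}
left_words = lookup(tag_lr, "L", "person name")
doc_list = lookup(tag_lr, "L", "person name", "no")
```

The tag-name-only form lets the search side enumerate candidate neighbours first, and the second form then reads just the tag document lists it actually needs.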
- The document index creation unit 11 is executed by an external program or a user. When a set of one or more documents is given, it extracts all the words contained in the documents and inputs to the word index storage unit 13 at least each word, the document number of its document, and its appearance position, expressed as the number of characters from the beginning of the body of the document.
- the tag update unit 12 is executed by an external program or user, and receives a command statement related to tag addition / deletion, and updates the index in the tag LR index storage unit 14 according to the command statement.
- the command statement related to tag addition / deletion is information consisting of command type, tag name, document number, start position, end position, target word string, left word, right word, and force.
- the document search unit 15 is executed by an external program or user and receives a phrase (search query) composed of one or more tags or words. Based on this input, the document search unit 1 5 makes an inquiry to the word index storage unit 1 3, tag LR index storage unit 1 4, and high-speed tag value determination unit 1 6, and searches at least a list of document numbers. Output as a result.
- the index can be referred to by using the bidirectionality of the index stored in the tag LR index storage unit 14 to reduce the amount of document list to be read without having the tag name as a secondary key. Therefore, the search process can be performed at high speed.
- the tag is updated, only updates are made to two locations in the tag LR index storage unit 14, and tag addition and deletion can be performed at a high speed with a small amount of update.
- FIG. 13 is a block diagram showing a second preferred embodiment of the present invention, and shows a configuration example of a document management / retrieval system.
- This document management/retrieval system holds, for arbitrary character strings, a list of the tag names that may be added to them, and allows that list to be referenced at high speed.
- the difference from the first embodiment is that a high-speed tag value determination unit 16 is provided.
- FIG. 13 also shows the details of the document search unit 15. The document search unit 15 includes query interpretation means 151, which interprets the search query and decomposes it into a plurality of conditions; inquiry execution means 152, which queries the word index storage unit 13 and the other storage units based on the conditions produced by the query interpretation means 151 and obtains document lists and tag document lists; and document list integration means 153, which compares those lists with each other and integrates them into a document list containing only documents that have the same document number and the same phrase as the search query.
- FIG. 14 is a block diagram illustrating a configuration example of the high-speed tag value determination unit 16.
- The high-speed tag value determination unit 16 includes a tag value table 161, update means 162, and determination means 163.
- The tag value table 161 is a table that stores the relationship between a tag and the word strings to which the tag is added.
- The update means 162 is called by the tag update unit 12, takes as input the tag name, the target word string (one or more words to be tagged), and the instruction type (addition or deletion), and updates the relation information in the tag value table 161.
- The determination means 163 is called by the inquiry execution means 152, takes a word string as input, refers to the tag value table 161, and returns at high speed the list of tag names that may be added to the word string.
- FIG. 15 shows an example of the tag value table 161.
- As the tag value table 161, it is possible to use a table that stores the relationship between each character string obtained by separating words into two characters (2-gram) and the list of tag names that may be added to that 2-gram (tag name string).
- This tag value table 161 can be implemented as an in-memory table in a program. In the example shown in FIG. 15, a [person name] tag or a [place name] tag may be attached to a character string containing “Yamada”. In this example, a word that is originally one character long (such as “no”) is stored as one character in the tag value table.
- The update means 162 divides the target word string input from the tag update unit 12 into 2-grams, refers to the tag value table 161 for each 2-gram, and updates the corresponding tag name string.
- The determination means 163 divides the character string input from the inquiry execution means 152 into 2-grams, refers to the tag value table 161, and returns the list of tag names that may be added to the character string.
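The 2-gram tag value table can be sketched as below. This is a hedged illustration only: the names `two_grams` and `TagValueTable` are hypothetical, the 2-grams are taken as overlapping character pairs (an assumed reading of the three strings produced for “Taro Yamada” later in the text), and the determination step assumes a tag may apply only if it is registered for every 2-gram of the input.

```python
from collections import defaultdict


def two_grams(text):
    """Overlapping 2-character strings; a 1-character word is kept as is."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


class TagValueTable:
    """Toy stand-in for tag value table 161 with update means and determination means."""

    def __init__(self):
        self._table = defaultdict(set)  # 2-gram -> tag names

    def update(self, tag, text, instruction="ADD"):
        if instruction != "ADD":
            return  # deletion leaves the table unchanged, as in the description
        for gram in two_grams(text):
            self._table[gram].add(tag)

    def judge(self, text):
        """Tag names that may be attached to text (assumed: common to all its 2-grams)."""
        grams = two_grams(text)
        if not grams:
            return set()
        return set.intersection(*[self._table.get(g, set()) for g in grams])


table = TagValueTable()
table.update("person name", "山田太郎")
```

Since deletions never remove entries, the table can only over-approximate the tags a string may carry, which is safe for its use as a pre-filter.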
- The query interpretation means 151, inquiry execution means 152, and document list integration means 153 in the document search unit 15 will now be described.
- The query interpretation means 151 is executed by an external program or a user, receives a search query, and outputs a list of inquiry tasks to the inquiry execution means 152.
- An inquiry task is data consisting of a reference destination, a reference key, and a position number.
- The reference destination is the index to be referred to at the time of inquiry. It is one of the inverted index (N) in the word index storage unit 13, the tag L index (TL) in the tag LR index storage unit 14, or the tag R index (TR) in the tag LR index storage unit 14.
- The reference key is a key for retrieving a document list from the index. If the reference destination is N, it is a single word. If the reference destination is TL or TR, it is a pair of a primary key and a secondary key expressed by a character string such as “[tag name] → word” or “[tag name] → [tag name]”.
- However, since there is no index whose secondary key is a tag name, a tag document list cannot be obtained simply by using “[tag name] → [tag name]” as a key. This is not considered at this point.
- The position number indicates the position of the reference key in the query and is created from the position number in the key string.
- FIG. 16 shows an example of a list of inquiry tasks, based on the query “[company name] no [person name]”.
- Two inquiry tasks have been created: one with position number 1, reference destination TR (the tag R index), and reference key “[company name] → no”, and one with position number 3, reference destination TL (the tag L index), and reference key “[person name] → no”.
- The inquiry execution means 152 is called by the document search unit 15 and takes a list of inquiry tasks as input.
- Based on the list of inquiry tasks, the inquiry execution means 152 refers to the word index storage unit 13, the high-speed tag value determination unit 16, and the tag LR index storage unit 14, and outputs a document list column to the document list integration means 153.
- FIG. 17 shows an example of a document list column.
- The document list column is information that associates each document list, in the set of document lists and tag document lists obtained from the word index storage unit 13 and the tag LR index storage unit 14, with its inquiry task.
- For each inquiry task, the position number, the reference key, and the document list obtained by the inquiry are associated with one another.
- The document list integration means 153 is called by the document search unit 15, takes the document list column as input, and outputs a result list in which the document lists are integrated.
- the processing in this embodiment mainly has three processes: a search process, a tag update process, and a document index process. Below, these are demonstrated in order.
- Figure 18 shows the processing flow of the search process.
- the search process starts when a user or an external program inputs a search query to the document search unit 15.
- The document search unit 15 uses the query interpretation means 151 to create a key column from the input search query (S11).
- This processing is performed using some dictionary or rule, such as morphological analysis or N-grams. For example, suppose the syntax of the search query is defined so that a tag is enclosed in “[” and “]” and written either as a tag name or as “tag name: character string to which the tag is attached”, and the other parts are written in natural language. Then this processing is performed as follows.
- The query interpretation means 151 first takes out the parts of the search query enclosed in “[” and “]” and obtains either the tag name, or the tag name and the character string to which the tag is added. Next, the query interpretation means 151 performs morphological analysis, divides the input phrase into words, and creates a key string.
- The key string consists of a word key string and a tag key string.
- A word key represents one word in the phrase.
- A tag key represents one tag name in the phrase.
- Each word key and tag key is stored together with a position number indicating its ordinal position from the beginning when the phrase is divided into words and tags.
- Figure 19 shows an example of a key string.
- It shows a key column created from the phrase “[company name: abc industry] no [person name] president”.
- This query represents a phrase made up of the string “abc industry” with the [company name] tag, the string “no”, any string with the [person name] tag, and the string “president”.
- The word “abc industry” and the tag [company name] are at position 1, and the word “no” is at position 2.
- “1” written at other positions means that there is no condition at that position.
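The query syntax described for step S11 can be illustrated with a small parser. This is a sketch under stated assumptions: `Key`, `TOKEN`, and `interpret` are hypothetical names, and whitespace splitting is swapped in for the morphological analysis that the description calls for, so it only works on pre-segmented input.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Key:
    position: int
    word: Optional[str] = None   # word key at this position, if any
    tag: Optional[str] = None    # tag key at this position, if any


# "[tag name]" or "[tag name: tagged string]", otherwise a plain word.
TOKEN = re.compile(r"\[([^\]:]+)(?::([^\]]*))?\]|([^\s\[\]]+)")


def interpret(query):
    """Build a key column; whitespace stands in for morphological analysis."""
    keys = []
    for pos, m in enumerate(TOKEN.finditer(query), start=1):
        tag, tagged_text, word = m.groups()
        if tag is not None:
            text = tagged_text.strip() if tagged_text else None
            keys.append(Key(pos, word=text or None, tag=tag.strip()))
        else:
            keys.append(Key(pos, word=word))
    return keys


key_column = interpret("[company name: abc industry] no [person name] president")
```

In this sketch a tagged string such as “abc industry” is kept as one word key sharing position 1 with its tag key, which mirrors the key column described for FIG. 19.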
- The query interpretation means 151 creates a list of inquiry tasks (task list) based on the key column (S12).
- Step S12 may be any process that creates inquiry tasks satisfying the following condition. [Condition 1] For each tag key in the key column, create at least one inquiry task with that tag as the primary key.
- FIG. 20 shows a flowchart of an example algorithm that realizes the process of creating the list of inquiry tasks (step S12 in FIG. 18).
- The query interpretation means 151 creates an inquiry task for the tag LR index storage unit 14 when there is a word to the left or right of a tag key in the key string (S121).
- The query interpretation means 151 examines the key sequence in order from the left (from position 1) and checks whether there is a word key to the right of each tag key. If there is, it creates an inquiry task with the reference destination “TR”, the reference key “tag name of that tag key → the word to the right”, and the position “position number of that tag key”, and adds it to the task list. If there is no word key to the right of the tag key, it checks whether there is a word key to the left of the tag key. If there is, it creates an inquiry task with the reference destination “TL”, the reference key “tag name of that tag key → the word to the left”, and the position “position number of that tag key”, and adds it to the task list.
- The query interpretation means 151 then creates inquiry tasks in which tags are concatenated, for the tag keys for which no inquiry task has yet been created (S122).
- The query interpretation means 151 examines the key sequence in order from the left (from position 1) and, if no inquiry task with the tag key as the primary key has yet been created, checks whether a tag key exists to the right of that tag key. If one exists, it creates an inquiry task with the reference destination “TR”, the reference key “tag name of that tag key → tag name of the right tag key”, and the position “position number of that tag key”, and adds it to the task list. If there is no tag key to the right of the tag key, it checks whether there is a tag key to the left of the tag key. If one exists, it creates an inquiry task with the reference destination “TL”, the reference key “tag name of that tag key → tag name of the left tag key”, and the position “position number of that tag key”, and adds it to the task list.
- The query interpretation means 151 creates inquiry tasks for the word keys for which no inquiry task has yet been created (S123).
- The query interpretation means 151 examines the key sequence in order from the left (from position 1) and, if no inquiry task with the word key as the primary key or secondary key has yet been created, creates an inquiry task with the reference destination “N”, the reference key “the word”, and the position “the word position”, and adds it to the task list.
- The algorithm shown in the flowchart of FIG. 20 prioritizes reference in the right direction (R index), but an algorithm that prioritizes the left direction is also conceivable. An algorithm may also be considered that reads the frequency at the head of the document list for both references and selects the smaller one.
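Steps S121 to S123 of the right-priority algorithm can be sketched as follows. The data classes `Key` and `Task` and the function `create_tasks` are hypothetical names introduced for illustration, and the sketch does not handle a tag key that has no neighbour at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Key:
    position: int
    word: Optional[str] = None
    tag: Optional[str] = None


@dataclass
class Task:
    dest: str       # reference destination: "N", "TL" or "TR"
    key: str        # reference key
    position: int   # position number


def create_tasks(keys):
    by_pos = {k.position: k for k in keys}
    tasks, tagged, used_words = [], set(), set()

    def neighbours(k):
        # Right neighbour first: this sketch prioritizes the R index.
        return (("TR", by_pos.get(k.position + 1)),
                ("TL", by_pos.get(k.position - 1)))

    for k in keys:  # S121: tag key with an adjacent word key
        if k.tag is None:
            continue
        for dest, n in neighbours(k):
            if n is not None and n.word is not None:
                tasks.append(Task(dest, f"[{k.tag}]→{n.word}", k.position))
                tagged.add(k.position)
                used_words.add(n.position)
                break
    for k in keys:  # S122: remaining tag keys with an adjacent tag key
        if k.tag is None or k.position in tagged:
            continue
        for dest, n in neighbours(k):
            if n is not None and n.tag is not None:
                tasks.append(Task(dest, f"[{k.tag}]→[{n.tag}]", k.position))
                tagged.add(k.position)
                break
    for k in keys:  # S123: word keys not yet used as a secondary key
        if k.word is not None and k.position not in used_words:
            tasks.append(Task("N", k.word, k.position))
    return tasks


tasks = create_tasks([Key(1, tag="company name"), Key(2, word="no"),
                      Key(3, tag="person name")])
```

For the query “[company name] no [person name]” this yields the two tasks described for FIG. 16, and no separate word-index task for “no” because it already serves as a secondary key.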
- FIG. 21 shows a flowchart of an example of an algorithm that implements this process. This process is performed for each inquiry task created in step S12.
- When the reference destination of the inquiry task is the word index (N), the inquiry execution means 152 searches the word index storage unit 13 with the reference key of the inquiry task, reads the corresponding document list, and holds it together with the position information (S131).
- Otherwise, the inquiry execution means 152 checks whether the secondary key in the reference key of the inquiry task is a word or a tag. If it is a word, the tag LR index storage unit 14 is queried with the reference destination and the reference key “tag name → word”, and the corresponding tag document list is read (S132). When the secondary key in the reference key of the inquiry task is a tag, the inquiry execution means 152 reads the tag document list using the tag LR index storage unit 14 and the high-speed tag value determination unit 16 (S133).
- The process of step S133 will be described in more detail.
- The inquiry execution means 152 first queries the tag LR index storage unit 14 with the reference destination and the primary key tag name, and obtains the list of left words if the reference destination is the L index or the list of right words if it is the R index (S1331).
- The inquiry execution means 152 inputs each word in the obtained word list to the high-speed tag value determination unit 16 and acquires a tag name string. It then checks whether the tag name of the secondary key is included in that tag name string. If not, the word is deleted from the left or right word list that was read (S1332).
- The inquiry execution means 152 queries the tag LR index storage unit 14 using the primary key tag name together with each remaining word in the word list as the secondary key, and combines the resulting set of tag document lists into one tag document list (S1333).
- In step S13, multiple inquiry tasks are executed, but their order may be arbitrary. Furthermore, a document number list DL may be stored from the result of one inquiry task, and when a document list or tag document list is read in a subsequent inquiry task, the appearance position or the start and end positions of entries whose document number is not included in DL need not be read, which speeds up the processing.
- It is also possible to skip step S1332 and, in step S1333, read the lists for all the words in the word list with only the primary key tag name as the condition.
- Step S133 may also be replaced with a process that refers to an inverted index using only the primary key and reads the corresponding list.
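Step S133, which resolves a “[tag name] → [tag name]” reference key through the adjacent-word list, can be sketched as below. The function name `read_tag_tag_list`, the dictionary layout of `tag_index`, and the callback `may_have_tag` (standing in for the high-speed tag value determination unit 16) are assumptions for illustration.

```python
def read_tag_tag_list(tag_index, may_have_tag, dest, primary, secondary):
    """Resolve "[primary] -> [secondary]" via the adjacent-word list (S133).

    tag_index:    {(dest, tag name): {word: [(doc, start, end), ...]}}
    may_have_tag: function word -> set of tag names that may be attached to it
    """
    # S1331: left or right word list for the primary key tag name.
    words = tag_index.get((dest, primary), {})
    # S1332: keep only words that may carry the secondary key tag.
    candidates = [w for w in words if secondary in may_have_tag(w)]
    # S1333: read and combine the tag document lists of the remaining words.
    merged = []
    for w in candidates:
        merged.extend(words[w])
    return sorted(merged)


tag_index = {("TL", "person name"): {"no": [(333, 7, 10)],
                                     "recent": [(334, 3, 6)]}}
possible_tags = {"no": {"particle"}}
result = read_tag_tag_list(tag_index,
                           lambda w: possible_tags.get(w, set()),
                           "TL", "person name", "particle")
```

The filter in S1332 is what keeps the number of tag document lists read in S1333 small when the primary tag has many distinct neighbouring words.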
- The document list integration means 153 extracts the document numbers of the documents that have the same document number in every list and in which the appearance positions of the words and tags match the key column (S14).
- FIG. 22 shows a flowchart of an example of an algorithm that implements this process.
- It is assumed that the document lists stored in the word index storage unit 13 and the tag document lists stored in the tag LR index storage unit 14 are sorted by document number and by appearance position / start position, respectively.
- The document list integration means 153 prepares M integer pointers, one for each document list, and initializes all of them to 1 (S141).
- The document list integration means 153 extracts, from each document list or tag document list, the entry at the pointer position: either a document number and appearance position, or a document number, start position, and end position (S142).
- The document list integration means 153 determines whether all M document numbers obtained in step S142 are equal (S143) and whether each appearance position is correctly adjacent according to the position numbers in the key column (S144). If those conditions are satisfied, it determines that the document is a hit and adds the document number to the output result list (S145).
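Steps S141 to S145 amount to a pointer-based intersection of M sorted lists. The sketch below is a simplified illustration: `integrate` and `positions_ok` are hypothetical names, the adjacency test of S144 is passed in as a callback, and the rule for advancing pointers is an assumption, since the text above does not spell it out.

```python
def integrate(lists, positions_ok):
    """lists: M lists of (doc, start, end), each sorted by doc, then start.
    positions_ok(entries): stand-in for the adjacency check of S144."""
    if not lists:
        return []
    pointers = [0] * len(lists)  # S141 (0-based here, 1-based in the text)
    result = []
    while all(p < len(l) for p, l in zip(pointers, lists)):
        entries = [l[p] for p, l in zip(pointers, lists)]  # S142
        docs = [e[0] for e in entries]
        if len(set(docs)) == 1:  # S143
            if positions_ok(entries) and docs[0] not in result:  # S144
                result.append(docs[0])  # S145
            # Assumed advance rule: move past the entry with the smallest start.
            i = min(range(len(entries)), key=lambda j: entries[j][1])
        else:
            # Assumed advance rule: move the list that is furthest behind.
            i = min(range(len(entries)), key=lambda j: docs[j])
        pointers[i] += 1
    return result


doc_lists = [[(333, 1, 5), (400, 2, 4)], [(333, 7, 10), (500, 1, 2)]]
hits = integrate(doc_lists, lambda entries: True)
```

Exactly one pointer advances per iteration, so the loop ends after at most the total number of entries.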
- FIG. 23 is a diagram for explaining the algorithm of step S144.
- The key sequence is examined in order from the left, and the appearance position / start position in the document obtained using each key as the primary key is compared with the end position obtained from the key one to the left, to check whether the two are adjacent.
- The method of evaluation depends on how the inquiry was made for the i-th key.
- FIG. 23 shows these four cases. To express each case, an example key column is shown in which the primary keys used for inquiries are marked with dotted ellipses and references from a primary key to a secondary key are marked with dotted arrows.
- Case A is the case where there is no inquiry using the i-th key as the primary key; the word key is used only as a secondary key.
- Case B is the case where only a tag key exists at the i-th position and an inquiry is made with it as the primary key. Therefore, the position must be checked for the inquiry whose primary key is the tag (in this example, “B → A”).
- Case C is the case where only a word key exists at the i-th position and an inquiry is made with the word key as the primary key. Therefore, the position must be checked for the inquiry that uses only this word key.
- Case D is the case where both a word key and a tag key exist at the i-th position and an inquiry is made with each as the primary key. Therefore, the positional relationship must be checked for each of these inquiries. In this algorithm, the position is checked separately for each of these cases.
- FIG. 24 is a flowchart for explaining the algorithm of step S144.
- The document list integration means 153 first initializes two variables, i to 1 and P to -1 (S14401).
- The key sequence is examined in order from the left, and the variable i represents the position in the key sequence of the key currently being examined.
- The variable P represents the start position in the document of the i-th key, as predicted from the key to its left.
- The document list integration means 153 determines what kind of inquiry was made for the i-th key (S14402). This determination is performed by looking at the primary key in the reference key of the inquiry task whose position number is i and checking whether it is a tag key or a word key. In case A, no position check is performed. If P is not the initial value (-1), the character length of the word key is added to P in preparation for the position check of the next ((i+1)-th) key (S14403).
- In case B, a position check is performed on the i-th tag key (S14404).
- The position check for the tag key is the process of determining whether the following conditions T1 and T2 are satisfied.
- Condition T1: when there are multiple inquiries using the tag key as the primary key, the start positions and end positions obtained by each inquiry must match.
- Condition T2: P is -1 (the tag key is first), or P is equal to the start position obtained with the tag key as the primary key (that is, it is adjacent to the appearance position obtained with the left key).
- In case C, a position check is performed on the i-th word key (S14406).
- The position check for the word key is the process of determining whether the following condition W is satisfied.
- Condition W: P is -1 (the word key is first), or P is equal to the appearance position obtained with the word key as the primary key (that is, it is adjacent to the appearance position obtained with the left key). If this is satisfied, the positions are considered to match, and the appearance position obtained from the word key plus the character length of the word key is substituted for P (S14407). Otherwise, it is determined that they do not match, and the processing of S144 ends.
- In case D, a position check is performed on the i-th word key and tag key (S14408).
- The position check for the word key and tag key is the process of determining whether condition TW is satisfied in addition to conditions T1, T2, and W.
- In step S1441, the document list integration means 153 adds 1 to i and checks whether i exceeds the length of the key string (S14412). If it does, it is judged that all the positional relationships are correct, and the processing of S144 ends. Otherwise, processing returns to step S14402.
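The position check of step S144 can be sketched for cases A, B, and C as follows. This is a simplified illustration with several assumptions: the function `check_positions` and its dictionary format are hypothetical, case D is omitted because condition TW is not spelled out above, and the update of P after a tag key (`end + 1`, treating the end position as inclusive) is inferred rather than stated.

```python
def check_positions(key_column):
    """key_column: one dict per key position, left to right, with
    "kind":   "A" (key only used as a secondary key),
              "B" (tag key queried as primary key),
              "C" (word key queried as primary key)
    "length": character length of the word key (cases A and C)
    "pos":    appearance position returned by the inquiry (case C)
    "spans":  (start, end) returned by each inquiry on the tag key (case B)
    """
    p = -1  # S14401: no predicted start position yet
    for key in key_column:
        if key["kind"] == "A":  # S14403
            if p != -1:
                p += key["length"]
        elif key["kind"] == "B":  # S14404
            spans = key["spans"]
            if len(set(spans)) != 1:  # condition T1
                return False
            start, end = spans[0]
            if p != -1 and p != start:  # condition T2
                return False
            p = end + 1  # assumed: the end position is inclusive
        else:  # case C, S14406
            if p != -1 and p != key["pos"]:  # condition W
                return False
            p = key["pos"] + key["length"]  # S14407
    return True


phrase = [{"kind": "B", "spans": [(1, 5)]},
          {"kind": "A", "length": 1},
          {"kind": "B", "spans": [(7, 10)]}]
matched = check_positions(phrase)
```

The sample values follow the worked example in the text: a [company name] tag at positions 1 to 5, a one-character particle, and a [person name] tag at positions 7 to 10.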
- The document search unit 15 outputs the result list obtained by the document list integration means 153 (S15).
- The document index creation process begins when one or more documents are input by an external program or a user.
- When one or more documents are input, the document index creation unit 11 reads the text of each input document and uses a morphological analysis program or an N-gram creation program to separate the text into words and create a word string. Next, the document index creation unit 11 examines the word string in order from the front and, for each word, counts the number of characters from the front of the document as its appearance position. The document index creation unit 11 then passes each word, document number, and appearance position to the word index storage unit 13.
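The document index creation process can be sketched as follows. This is an illustration under assumptions: `index_document` is a hypothetical name, whitespace splitting stands in for the morphological analysis or N-gram program, and appearance positions are counted from 1.

```python
from collections import defaultdict


def index_document(inverted_index, doc_number, text):
    """Add (document number, appearance position) for every word of text.

    Whitespace splitting is a stand-in for morphological analysis, and
    positions are 1-based character offsets (an assumption of this sketch).
    """
    offset = 0
    for word in text.split():
        found = text.index(word, offset)  # 0-based offset of this occurrence
        inverted_index[word].append((doc_number, found + 1))
        offset = found + len(word)


inverted = defaultdict(list)
index_document(inverted, 333, "abc industry president")
index_document(inverted, 334, "president president")
```

Searching from `offset` rather than from the start of the text keeps repeated words from all being mapped to their first occurrence.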
- FIG. 25 illustrates the tag update process.
- The tag update process starts when a command statement for adding or deleting a tag is input by an external program or a user and the tag update unit 12 is called.
- The command statement for tag addition/deletion consists of the command type (addition or deletion), tag name, document number, start position, end position, target character string (to be tagged), the word to the left of the tag, and the word to the right of the tag.
- The tag update unit 12 refers to the L index in the tag LR index storage unit 14 based on the tag name and the word to the left of the tagged string.
- The corresponding tag document list is updated according to the command type (S21). If the command type is addition, the document number, tag start position, and tag end position are added to the corresponding tag document list. If the command type is deletion, the corresponding tag document list is read, the entry whose document number, start position, and end position match is found, and that entry is deleted. Similarly, the R index in the tag LR index storage unit 14 is referred to based on the tag name and the word to the right of the tagged string, and the document number, tag start position, and tag end position are added or deleted (S22).
- The tag update unit 12 calls the update means 162 in the high-speed tag value determination unit 16 and inputs the command type, the tag name, and the target character string to be tagged (S23).
- The tag value table 161 is implemented as the table shown in FIG.
- The update means 162 first checks the command type. If it is addition, the update means 162 divides the tagged character string into 2-grams, refers to the tag value table 161 for each 2-gram, and checks whether the input tag name is included in the tag name column. If the tag name is not included in the tag name column, the tag name is added to it. If the command type is deletion, nothing is done. Note that when the high-speed tag value determination unit 16 is not used, as in the first embodiment, the processing of S23 is not performed.
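Steps S21 to S23 of the tag update process can be sketched as below. The function `update_tag`, the dictionary-based command format, and the plain dictionaries used for the L index, R index, and tag value table are assumptions for illustration; the command type strings “ADD” and “RM” are taken from the worked example in the text.

```python
def update_tag(l_index, r_index, tag_table, cmd):
    """cmd: dict with keys type ("ADD"/"RM"), tag, doc, start, end,
    target (tagged string), left (word to the left), right (word to the right)."""
    entry = (cmd["doc"], cmd["start"], cmd["end"])
    # S21 and S22: exactly two tag document lists are touched.
    for index, word in ((l_index, cmd["left"]), (r_index, cmd["right"])):
        doc_list = index.setdefault((cmd["tag"], word), [])
        if cmd["type"] == "ADD":
            doc_list.append(entry)
        elif entry in doc_list:
            doc_list.remove(entry)
    # S23: the tag value table only grows; deletion does nothing here.
    if cmd["type"] == "ADD" and cmd["target"]:
        text = cmd["target"]
        grams = [text[i:i + 2] for i in range(len(text) - 1)] or [text]
        for gram in grams:
            tag_table.setdefault(gram, set()).add(cmd["tag"])


l_index, r_index, tag_table = {}, {}, {}
command = dict(type="ADD", tag="person name", doc=333, start=7, end=10,
               target="山田太郎", left="の", right="社長")
update_tag(l_index, r_index, tag_table, command)
```

Whatever the number of indexed documents, one command changes one entry in the L index and one in the R index, which is the source of the small update cost claimed for this design.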
- The document index creation process will be described with an example. When the document 333 shown in FIG. 2 is input to the document index creation unit 11, the document index creation unit 11 delimits the words in the text and creates a list of words, document numbers, and appearance positions. A part of this list is shown in FIG. 26. Next, the document index creation unit 11 inputs this list into the word index storage unit 13. The word index storage unit 13 creates an inverted index based on the list shown in FIG. 26. Part of this inverted index is shown in FIG. 11.
- The tag update unit 12 makes an inquiry to the L index in the tag LR index storage unit 14 with the key “[person name] → no” and adds document number 333, start position 7, and end position 10 to the corresponding tag document list.
- Similarly, the R index in the tag LR index storage unit 14 is queried with the key “[person name] → president”, and document number 333, start position 7, and end position 10 are added to the corresponding tag document list.
- The data in the tag LR index storage unit 14 created as a result is shown in FIG.
- The tag update unit 12 inputs the tag name [person name] in the command statement, the character string “Taro Yamada”, and the command type “ADD” to the update means 162 in the high-speed tag value determination unit 16.
- The update means 162 separates the character string “Taro Yamada” into two-character units and creates the character strings “Yamada”, “Tada”, and “Taro”.
- The update means 162 refers to the tag value table 161, looks up the tag name column with “Yamada”, “Tada”, and “Taro” as keys, and adds “person name” to any tag name column that does not already include it.
- An example of the tag value table 161 created as a result of this is shown in FIG. The following is an example of deletion.
- The tag update unit 12 makes an inquiry to the L index in the tag LR index storage unit 14 with the key “[person name] → no”, reads the corresponding tag document list, and deletes the entry with document number 333, start position 7, and end position 10.
- The tag update unit 12 also inputs the tag name [person name] in the command statement, the character string “Taro Yamada”, and the command type “RM” to the update means 162 in the high-speed tag value determination unit 16. In this case, since the command type is “RM” (deletion), the update means 162 does nothing.
- the document search unit 15 operates as follows.
- The query interpretation means 151 first interprets this query and converts it into the key string shown in FIG. 27 (S11).
- The query interpretation means 151 performs the process of step S121 based on this key string and creates the inquiry tasks shown in FIG. 16 (S12).
- The inquiry execution means 152 sends these two tasks to the tag LR index storage unit 14 and creates a document list column as shown in FIG. 17.
- Based on the document list column, the document list integration means 153 creates a result list representing the set of documents in which the document numbers match and each word and tag appears as in the phrase. This process is performed as follows.
- The document list integration means 153 reads the tag document lists shown in FIG. 17 in order from the beginning.
- From the inquiry “[company name] → no”, the data of document number 333, start position 1, and end position 5 is read, and from the inquiry “[person name] → no”, the data of document number 333, start position 7, and end position 10 is read (S142).
- The document list integration means 153 confirms that the document numbers match between these data (S143) and proceeds to the processing of step S144.
- In step S144, the document list integration means 153 checks the key sequence in order from the front.
- The first key in the key column is the tag key [company name], and there is an inquiry task with [company name] as the primary key. Therefore, in step S14402 it is determined to be case B, and step S14404 is performed.
- Here the tag key is single and P is the initial value -1.
- The document list integration means 153 reads the second key in the key column.
- The document list integration means 153 reads the third key in the key column.
- The third key in the key column is [person name], and there is a corresponding inquiry task.
- The tag key [person name] at position 3 in the key column is determined to be case B in step S14402.
- The document list integration means 153 performs this processing until the condition of S147 is satisfied and outputs the finally obtained result list (S15).
- The query interpretation means 151 interprets the query (S11), converts it into a key column, and creates the following inquiry task (S12).
- The inquiry execution means 152 queries the tag LR index storage unit 14 for each inquiry task in the process of step S13. Of these, the inquiry task with reference destination “TL”, reference key “[person name] → [particle]”, and position “3” will be described.
- The system reads “no” and “recent” as the left word list shown in Figure 13, with [person name] as the primary key (S1331).
- The inquiry execution means 152 queries the high-speed tag value determination unit 16 for each word and deletes the words to which the [particle] tag cannot be attached. For example, if the tag value table in the high-speed tag value determination unit 16 is as shown in FIG. 15, the tag names for the word “recent” do not include [particle], so it is deleted (S1332).
- The inquiry execution means 152 uses the remaining word “no” to read the tag document list from the tag L index with the reference key “[person name] → no” (S1333). Subsequent steps S14 and S15 are the same as in the above example, so their description is omitted.
- As described above, search processing can be performed at high speed, and tag addition/deletion can be performed at high speed with a small amount of updating.
- In addition, the high-speed tag value determination unit 16 makes it possible to refer at high speed, using an arbitrary character string as a key, to the set of tag names that may be attached to that character string. For a query in which tags A and B are adjacent, the tag document list therefore needs to be read only for those words, among the set of words that appear to the right of A, to which tag B may be added, so a phrase with adjacent tags can be referenced at high speed.
- FIG. 28 is a block diagram showing a third preferred embodiment of the present invention, and shows a configuration example of a document management / retrieval system.
- This document management / retrieval system further includes a bit string storage unit 17 in the configuration of the second embodiment of the present invention.
- The bit string storage unit 17 stores, for each word or tag name, the relationship between that word or tag name and a bit string indicating which documents contain it. The length of this bit string equals the number of documents in the document set, and each bit corresponds to one document and indicates whether the key is included in that document (1) or not (0).
- FIG. 29 shows an example of data stored in the bit string storage unit 17.
- The Nth bit corresponds to document number N.
- In this example, the word “ha” is included in the documents with document numbers 1, 2, 3, 4, 6, and so on.
- The tag [person name] is included in the documents with document numbers 1, 2, 4, 5, and so on.
- FIG. 29 shows the logical relationship of the data managed by the bit string storage unit 17; the actual data storage format may be arbitrary.
- The bit string storage unit 17 receives a word and a document number from the document index creation unit 11 and updates the bit string keyed by the input word.
- The bit string storage unit 17 receives a tag name, document number, and instruction type from the tag update unit 12 and updates the bit string corresponding to the tag name.
- The bit string storage unit 17 is called by the inquiry execution means and receives a word or tag name as input. If a bit string corresponding to that key exists internally, it returns that bit string.
- the search process is performed as follows.
- the document search unit 1 5 interprets the query in step S 1 1 of the search process P 1 0, and then stores each word / tag name contained in the key string as a bit string Queries part 17 and retrieves each bit string.
- the document search unit 15 performs an AND operation on the obtained plurality of bit strings to create a bit string BL that represents a set including all the keys in the key string.
- the document search unit 15 performs the process of S 12 2 to create a set of inquiry tasks, and then in S 1 3 makes an inquiry to the document list / tag document list of each inquiry task.
- Step S24 is a process of updating the bit string by inputting the tag name, document number, and instruction type to the bit string storage unit 17.
- In step S24, the bit string storage unit 17 first checks the instruction type. If the instruction type is add, it reads the corresponding bit string using the tag name as a key and updates the bit of the document number to “1”. If the instruction type is delete, it does nothing.
- Step S31 is performed.
- Step S31 is a process in which the document index creation unit 11 inputs a word and a document number into the bit string storage unit 17.
- The bit string storage unit 17 reads the corresponding bit string using the word as a key, and updates the bit of the document number to “1”.
- Step S31 may also be performed only for specific words. For example, a dictionary HD of frequently used words can be prepared in advance, each word compared with HD before the processing of step S31, and S31 performed only if the word is included in HD.
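The dictionary check described above can be sketched as follows. The function name `record_word`, the set `frequent_words` standing in for the dictionary HD, and the plain dictionary of integer bit strings are assumptions made for illustration only.

```python
from typing import Dict, Set


def record_word(word: str, doc_number: int,
                bit_strings: Dict[str, int], frequent_words: Set[str]) -> bool:
    """Perform the S31-style bit update only when the word is in the dictionary HD.

    Returns True if a bit was set and False if the word was skipped.
    """
    if word not in frequent_words:
        return False
    bit_strings[word] = bit_strings.get(word, 0) | (1 << (doc_number - 1))
    return True


hd = {"ha", "no"}                      # hypothetical high-frequency dictionary HD
bits: Dict[str, int] = {}
record_word("ha", 3, bits, hd)         # stored: the bit for document 3 is set
record_word("President", 3, bits, hd)  # skipped: the word is not in HD
```

Restricting bit strings to frequent words keeps the number of stored bit strings small while still covering the keys whose document lists are the most expensive to read.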
- The query interpretation means 1 1 performs the process of S11 and produces the keys [company name], “no”, and [person name].
- The inquiry execution means 152 refers to the data (FIG. 29) stored in the bit string storage unit 17, reads the bit string corresponding to each key, and performs an AND operation. As a result, the bit string “11001010000100” is obtained.
- When reading the document list / tag document list in step S13, the inquiry execution means 152 reads only the part that applies to this document set.
- the subsequent processing is the same as the document update process in the first and second embodiments.
- At search time, the query execution means first reads the bit strings for the words and tag names in the query from the bit string storage unit and combines them with an AND operation. Because the documents containing all of those words and tag names can be found at high speed, the amount read from the document lists is reduced and searches can be performed even faster.
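The narrowing step can be illustrated with a short Python sketch. The helper names (`mask_of`, `narrow`, `document_numbers`) and the sample data are hypothetical; the data below is made up and is not the content of FIG. 29. Keys that have no stored bit string are simply skipped, which is one possible reading of "if one exists".

```python
from functools import reduce
from typing import Dict, Iterable, List, Optional


def mask_of(doc_numbers: Iterable[int]) -> int:
    """Build an integer bit string with bit N-1 set for each document number N."""
    return reduce(lambda acc, n: acc | (1 << (n - 1)), doc_numbers, 0)


def narrow(bit_strings: Dict[str, int], keys: Iterable[str]) -> Optional[int]:
    """AND the bit strings of all keys that have one; None means no narrowing is possible."""
    masks = [bit_strings[k] for k in keys if k in bit_strings]
    if not masks:
        return None
    return reduce(lambda a, b: a & b, masks)


def document_numbers(mask: int) -> List[int]:
    """List the 1-based document numbers whose bit is 1."""
    return [n + 1 for n in range(mask.bit_length()) if (mask >> n) & 1]


# Made-up example data, not the contents of FIG. 29.
example = {
    "[company name]": mask_of([1, 2, 5, 7]),
    "no": mask_of([1, 2, 3, 5]),
    "[person name]": mask_of([1, 2, 4, 5]),
}
bl = narrow(example, ["[company name]", "no", "[person name]"])
print(document_numbers(bl))
```

Only the document numbers that survive the AND need to be read from the document lists and tag document lists in step S13.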
- FIG. 30 is a block diagram showing a fourth preferred embodiment of the present invention.
- This document management/retrieval system includes a tag management unit 19 for managing tags.
- The tag management unit 19 includes a tag LR index storage unit 14; a tag NLR index storage unit 18 that stores, for a set of tags, the appearance positions of each tag in the document set together with its left and right words; conversion means 20 that converts an index in the tag NLR index storage unit 18 into an index in the tag LR index storage unit 14; and management means 21 that changes the way the index is held based on tag statistical information.
- When the tag management unit 19 receives an inquiry from the inquiry execution means 152, it passes the input data to the internal management means 21 and returns the data output by the management means 21 to the inquiry execution means 152.
- When the tag management unit 19 receives an update command from the tag update unit 12, it inputs the command to the internal management means 21.
- the tag NLR index storage unit 18 internally has a tag LR document list with each tag name as a key for a set of tags.
- the tag LR document list is data in which the left word and the right word are added to the data of the tag document list.
- Figure 31 shows an example of a tag LR document list.
- In this example, the tag [person name] appears 100001 times in the document set; one occurrence spans the 7th to 10th characters of the document with document number 333, with the word “” on its left and the word “President” on its right.
- The tag LR index storage unit 14 holds the same information as the tag LR index of FIG. 12 shown in the first embodiment.
- The conversion means 20 is called by the management means 21, receives a tag LR document list as input, and outputs an L index and an R index.
- Management means 21 has an internal management table.
- The management table stores tag names, tag frequencies, and index types. The index type indicates where the index of the corresponding tag is held, and its value is either the tag NLR index storage unit 18 (NLR) or the tag LR index storage unit 14 (LR).
- Figure 32 shows an example of the management table. This example means that the [person name] tag appears 100001 times in the document set and that its index is currently stored in the tag NLR index storage unit 18.
- When the management means 21 receives data (a command statement) including the command type, tag name, document number, start position, end position, left word, and right word, it refers to the management table based on the tag name, extracts the index type corresponding to the tag name, and passes the command statement unchanged to the corresponding index.
- When the management means 21 receives a query with the reference key and reference destination as input, it refers to the management table based on the tag name in the reference key, retrieves the index type corresponding to the tag name, and queries the corresponding index.
- The management means 21 also checks the frequency and index type of each tag in the management table at an arbitrary timing.
- If the frequency is equal to or higher than the threshold and the index type is NLR, it reads the tag LR document list corresponding to the tag name from the tag NLR index storage unit 18, creates a tag L index and a tag R index using the conversion means 20, and adds them to the tag LR index storage unit 14.
- the threshold ⁇ is an arbitrary fixed number.
- This embodiment mainly has three processes: a search process, a tag update process, and a document index process. These processes are equivalent to those of the first to third embodiments with the operation of the tag LR index storage unit 14 replaced by that of the tag management unit 19. Therefore, only the processing in the tag management unit 19 is described here: the tag update process for the tag management unit 19, the inquiry process for the tag management unit 19, and the index optimization process.
- The tag update process starts when the tag update unit 12 inputs a command statement for adding or deleting a tag to the tag management unit 19. The system first refers to the management table based on the tag name and updates the frequency corresponding to the tag name as follows: if the command type of the command statement is add, 1 is added to the frequency; if the command type is delete, 1 is subtracted from the frequency.
- Next, the system refers to the management table based on the tag name and retrieves the corresponding index type.
- If the index type is LR, the command statement is given to the tag LR index storage unit 14 and the processes of steps S21 and S22 are performed.
- If the index type is NLR, the system performs the following process. It reads the tag LR document list using the input tag name as a key. If the command type is add, it appends the document number, start position, end position, left word, and right word to the tag LR document list. If the command type is delete, it searches the tag LR document list for the entry whose document number, start position, and end position match, and deletes that entry.
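The tag update process for the tag management unit can be sketched as below. This is a simplified illustration: the names `TagManager`, `Row`, and `update_lr` are hypothetical, the LR-side processing (steps S21 and S22) is represented only by a callback, and starting new tags with index type NLR is an assumption that the text does not state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (document number, start position, end position, left word, right word)
Entry = Tuple[int, int, int, str, str]


@dataclass
class Row:
    """One row of the management table."""
    frequency: int = 0
    index_type: str = "NLR"  # assumption: a tag not seen before starts in the NLR index


class TagManager:
    def __init__(self, update_lr: Callable[[str, str, Entry], None]) -> None:
        self.table: Dict[str, Row] = {}        # tag name -> management table row
        self.nlr: Dict[str, List[Entry]] = {}  # tag name -> tag LR document list
        self._update_lr = update_lr            # stand-in for the tag LR index storage unit

    def update(self, command: str, tag: str, entry: Entry) -> None:
        if command not in ("add", "delete"):
            raise ValueError("command must be 'add' or 'delete'")
        row = self.table.setdefault(tag, Row())
        row.frequency += 1 if command == "add" else -1
        if row.index_type == "LR":
            self._update_lr(command, tag, entry)  # delegate to the LR index
            return
        entries = self.nlr.setdefault(tag, [])
        if command == "add":
            entries.append(entry)
        else:
            # A deletion matches on document number, start position and end position only.
            entries[:] = [e for e in entries if e[:3] != entry[:3]]


manager = TagManager(update_lr=lambda command, tag, entry: None)
manager.update("add", "person name", (333, 7, 10, "Mr.", "President"))
manager.update("delete", "person name", (333, 7, 10, "", ""))
```

The left and right words in the usage lines are placeholder values chosen for the example.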
- Next, the inquiry process for the tag management unit 19 will be described. This process starts when the inquiry execution means 152 makes an inquiry to the tag management unit 19 with the reference key and the reference destination as input.
- The system first refers to the management table based on the tag name and retrieves the index type.
- If the index type is LR, the tag LR index storage unit 14 is queried. This inquiry processing is the same as the inquiry to the tag LR index storage unit 14 in the first embodiment.
- Figure 3 shows a flowchart of the process when the index type is NLR. If the index type is NLR, the system reads the corresponding tag LR document list based on the tag name contained in the reference key of the query, and uses the conversion means 20 to create a tag L index and a tag R index.
- The system first creates an empty tag L index and an empty tag R index in a location that can be appended to and referenced at high speed, such as computer memory (S51).
- The system then reads the tag LR document list in order from the front, and performs the following processing each time it reads a group of five items consisting of the document number, start position, end position, left word, and right word.
- The system checks whether the tag L index contains a tag document list with the key “tag name ⁇ left word”. If it does, the document number, start position, and end position are appended to the end of that tag document list. If it does not, a new tag document list is created from the document number, start position, and end position and registered under the key “tag name ⁇ left word”. The same processing is performed for the tag R index, and the document number, start position, and end position are added to the tag R index under the key “tag name ⁇ right word” (S52).
- Finally, the system refers to the corresponding position in the created tag L index or tag R index and returns the tag document list (S53).
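Steps S51 to S53 can be sketched as follows. The function names `convert` and `lookup` are hypothetical, and because the separator character in the key is not legible in the source text, this sketch uses a `(tag name, word)` tuple as the key instead; the `side` parameter is an assumed way of expressing the reference destination.

```python
from typing import Dict, List, Tuple

Entry = Tuple[int, int, int, str, str]      # document number, start, end, left word, right word
Posting = Tuple[int, int, int]              # document number, start, end
Index = Dict[Tuple[str, str], List[Posting]]


def convert(tag: str, lr_document_list: List[Entry]) -> Tuple[Index, Index]:
    """Build an in-memory tag L index and tag R index from one tag LR document list."""
    l_index: Index = {}                     # S51: start from empty indexes
    r_index: Index = {}
    for doc, start, end, left, right in lr_document_list:   # S52: read in order
        l_index.setdefault((tag, left), []).append((doc, start, end))
        r_index.setdefault((tag, right), []).append((doc, start, end))
    return l_index, r_index


def lookup(tag: str, word: str, side: str, lr_document_list: List[Entry]) -> List[Posting]:
    """S53: return the tag document list for (tag, word) from the L or R index."""
    l_index, r_index = convert(tag, lr_document_list)
    index = l_index if side == "L" else r_index
    return index.get((tag, word), [])


sample = [
    (1, 3, 6, "Mr.", "said"),
    (2, 0, 4, "Mr.", "President"),
    (5, 7, 10, "by", "President"),
]
print(lookup("person name", "President", "R", sample))
```

Because the list is read front to back, each resulting tag document list keeps the order of the original tag LR document list.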
- Figure 34 shows a flowchart of the index optimization process.
- The index optimization process takes one row of data (tag name, frequency, index type) in the management table as input and may be executed at any time. For example, when the tag update process for the tag management unit 19 completes, it may be executed for the row of the management table that was updated by that process. Alternatively, it may be executed for every row at a fixed time, for example at 3 a.m.
- The system first checks the frequency and index type.
- If the frequency is equal to or higher than the threshold and the index type is “NLR”, the management means 21 queries the tag NLR index storage unit 18 and reads the tag LR document list corresponding to the tag name (S61).
- the management means 21 uses the conversion means 20 to create a tag L index and a tag R index from this tag LR document list (S62). Furthermore, the management means 21 adds the created tag L index and tag R index to the tag LR index storage unit 14 (S63). Next, the management means 21 refers to the inside of the management table using the same tag name, and updates the index type to “LR” (S 64). Finally, the management means 21 deletes the tag LR document list and key from the tag NLR index storage unit 18 corresponding to the tag name (S65).
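The optimization steps S61 to S65 can be put together in a short sketch. The names `optimize`, `Row`, `nlr_store`, and `lr_store` are hypothetical, the stores are plain dictionaries, and the condition that the frequency is at or above the threshold while the index type is NLR follows the worked example given for this process.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Entry = Tuple[int, int, int, str, str]      # document number, start, end, left word, right word
Posting = Tuple[int, int, int]              # document number, start, end
Index = Dict[Tuple[str, str], List[Posting]]


@dataclass
class Row:
    """One row of the management table."""
    frequency: int
    index_type: str  # "NLR" or "LR"


def optimize(tag: str, table: Dict[str, Row], nlr_store: Dict[str, List[Entry]],
             lr_store: Dict[str, Index], threshold: int) -> bool:
    """Move one tag from the NLR store to the LR store if it is frequent enough."""
    row = table[tag]
    if row.index_type != "NLR" or row.frequency < threshold:
        return False
    entries = nlr_store[tag]                                  # S61: read the tag LR document list
    l_index: Index = {}
    r_index: Index = {}
    for doc, start, end, left, right in entries:              # S62: build the L and R indexes
        l_index.setdefault((tag, left), []).append((doc, start, end))
        r_index.setdefault((tag, right), []).append((doc, start, end))
    lr_store["L"].update(l_index)                             # S63: add them to the LR store
    lr_store["R"].update(r_index)
    row.index_type = "LR"                                     # S64: update the management table
    del nlr_store[tag]                                        # S65: delete the list and its key
    return True


table = {"person name": Row(frequency=3, index_type="NLR")}
nlr_store = {"person name": [(1, 3, 6, "Mr.", "said"), (2, 0, 4, "Mr.", "President")]}
lr_store: Dict[str, Index] = {"L": {}, "R": {}}
optimize("person name", table, nlr_store, lr_store, threshold=2)
```

Performing the deletion last means the tag stays answerable from the NLR store until the LR index has been written.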
- In this embodiment the index storage destination is changed based on the tag frequency, but other possible criteria include the number of distinct left words, the number of distinct right words, the number of queries for the tag, or a number calculated by combining these.
- the index optimization process works as follows.
- The management means 21 first checks the frequency and index type. Since the tag frequency is equal to or higher than the threshold value and the index type is “NLR”, the management means 21 queries the tag NLR index storage unit 18 and obtains the tag LR document list for the person name shown in the figure (S61). The management means 21 then uses the conversion means 20 to create a tag L index and a tag R index from the tag LR document list, obtaining the index shown in FIG. 12 (S62), and stores it in the tag LR index storage unit 14 (S63).
- The management means 21 then changes the index type for the person name in the management table shown in FIG. 32 to “LR” (S64), and deletes this tag LR document list and the key “person name” from the tag NLR index storage unit 18 (S65).
- the tag NLR index and the tag LR index are switched based on the tag statistical information.
- The tag LR index is fast because it holds document lists keyed by the left and right words, but it is redundant because it creates indexes in both directions, and it therefore stores a large amount of data. By using the tag NLR index for low-frequency tags, whose document lists are short and require little reading at search time, the index is made smaller and a balance between data volume and search speed can be achieved. In other words, creating an LR index can be avoided for low-frequency tags with short document lists, which keeps searches fast while reducing the amount of data held as an index.
- the present invention can be implemented as a computer program and can be distributed via a storage medium or a network.
- The program is composed of code that causes a computer to execute the following processes.
- A document index creation process that, for the set of words included in the documents, stores appearance positions using each word as a key.
- A tag update process that stores tag appearance positions using the tag name as a key, and that also stores the words appearing to the right and left of each tag, using the combination of each tag and the word appearing on its right, or the combination of each tag and the word appearing on its left, as a key.
- A document search process that interprets the search query, creates multiple keys by using the left/right relationship between adjacent words and tags in the phrase, refers, based on these keys, to the word appearance positions stored in the document index creation process and to the tag appearance positions stored in the tag update process, and then integrates them to return a list of identifiers of the documents containing the phrase.
- It is desirable that the program further include code that causes the computer to execute: a high-speed tag value determination process that enables high-speed reference to the set of tag names that may be attached to a character string; a process that updates the data representing the relationship between tag names and character strings when a tag is added in the tag update process; and a process that, when a search query for a phrase in which tag names are consecutive is entered in the document search process, uses the high-speed tag value determination process to read tag appearance positions only for words that may carry a specific tag name.
- The program can also include code that causes the computer to execute: a bit string storage process that, using high-frequency words and tag names as keys, stores a bit string representing the set of documents that contain each word or tag; a process that, in the tag update process, updates the bit string stored by the bit string storage process based on the tag that is added or deleted; and a process that, in the document search process, refers to the stored bit strings using the high-frequency words and tag names in the query as keys, obtains data representing the set of documents that contain all of them, narrows down the document set based on that data, and then reads the appearance positions of words and tags.
- The program can also include code that causes the computer to execute: a tag NLR index process that, using each tag name as a key for a set of tags, stores the appearance positions of the tag in the document set together with its left and right words; a selection process that changes the reference destination between the data created by the tag NLR index process and the data created by the tag LR index process; and an index conversion process that, based on a frequency related to the tag, creates the data of the tag LR index process and deletes the data created by the tag NLR index process.
- the present invention is effective as a part of a system for managing and retrieving documents using tags.
- The focus of the present invention is the part that quickly determines, from a phrase including a tag, the list of document numbers representing the set of documents that contain the phrase. Therefore, by preparing, in addition to the configuration of the present invention, a document database that returns the document itself for a given document number, the invention can be used as a search engine that retrieves a document set by a phrase including a tag.
- the present invention is a technique for realizing a phrase search including a tag on the assumption that the tag is updated.
- Applications that require this technology include the field of text mining that analyzes large document sets.
- In text mining, tags are added to documents and analysis is performed using the tags.
- It is often not known in advance what kind of tagging is preferable for a document set. Therefore, by indexing a large document set in advance, tagging it with various tagging means, searching with tags and phrases containing tags, and extracting frequencies and document sets, knowledge can be extracted efficiently from the document set.
- the present invention is useful in such a case.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/741,302 US9454597B2 (en) | 2007-11-15 | 2008-11-06 | Document management and retrieval system and document management and retrieval method |
JP2009541163A JP5376163B2 (ja) | 2007-11-15 | 2008-11-06 | 文書管理・検索システムおよび文書の管理・検索方法 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2007296386 | 2007-11-15 | ||
JP2007-296386 | 2007-11-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2009063925A1 true WO2009063925A1 (ja) | 2009-05-22 |
Family
ID=40638773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2008/070630 WO2009063925A1 (ja) | 2007-11-15 | 2008-11-06 | 文書管理・検索システムおよび文書の管理・検索方法 |
Country Status (3)
Country | Link |
---|---|
US (1) | US9454597B2 (ja) |
JP (1) | JP5376163B2 (ja) |
WO (1) | WO2009063925A1 (ja) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH06215029A (ja) * | 1992-12-10 | 1994-08-05 | Xerox Corp | テキスト検索方法 |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6215898B1 (en) * | 1997-04-15 | 2001-04-10 | Interval Research Corporation | Data processing system and method |
CN1328321A (zh) * | 2000-05-31 | 2001-12-26 | 松下电器产业株式会社 | 通过语音提供信息的装置和方法 |
JP3709890B2 (ja) | 2000-10-25 | 2005-10-26 | 松下電器産業株式会社 | 文字列検索装置 |
JP3882729B2 (ja) * | 2002-09-27 | 2007-02-21 | 富士通株式会社 | 情報開示プログラム |
US20080077570A1 (en) * | 2004-10-25 | 2008-03-27 | Infovell, Inc. | Full Text Query and Search Systems and Method of Use |
-
2008
- 2008-11-06 US US12/741,302 patent/US9454597B2/en active Active
- 2008-11-06 WO PCT/JP2008/070630 patent/WO2009063925A1/ja active Application Filing
- 2008-11-06 JP JP2009541163A patent/JP5376163B2/ja active Active
Non-Patent Citations (1)
Title |
---|
HUGH E. WILLIAMS ET AL.: "Fast phrase querying with combined indexes", ACM TRANSACTIONS ON INFORMATION SYSTEMS, vol. 22, no. ISS..., 2004, pages 573 - 594 * |
Also Published As
Publication number | Publication date |
---|---|
JPWO2009063925A1 (ja) | 2011-03-31 |
JP5376163B2 (ja) | 2013-12-25 |
US9454597B2 (en) | 2016-09-27 |
US20100281030A1 (en) | 2010-11-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 08850873 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 12741302 Country of ref document: US |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2009541163 Country of ref document: JP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 08850873 Country of ref document: EP Kind code of ref document: A1 |