US20160217207A1 - Semantic structure search device and semantic structure search method - Google Patents
- Publication number
- US20160217207A1 (application US 14/995,775)
- Authority
- US
- United States
- Prior art keywords
- search
- semantic
- document
- information
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F17/30696—
- G06F17/30684—
- G06F17/30914—
Definitions
- the embodiments discussed herein are related to a semantic structure search device and a semantic structure search method.
- Analyses of natural sentences conducted in text searches utilize lexical analyses, morpheme analyses, semantic analyses, etc.
- A lexical analysis is a process of dividing a character string into words.
- A morpheme analysis is a process of dividing a character string into morphemes and assigning information such as word classes and attributes to each morpheme.
- Morphemes obtained through morpheme analyses may be treated as words.
- a semantic analysis is a process of using a result of a morpheme analysis of a natural sentence so as to obtain the semantic structure of that natural sentence.
- By using a semantic structure, which is a result of a semantic analysis, what is meant by a natural sentence can be expressed as data that can be processed by computers.
- a semantic structure includes a plurality of semantic symbols respectively representing the meanings of a plurality of words included in a morpheme analysis result, and also includes information representing the relationship between two semantic symbols. In some cases, one semantic symbol corresponds to a plurality of words.
- a semantic structure can be represented by for example a directed graph having a plurality of nodes representing a plurality of semantic symbols and also having an arc representing the relationship between two nodes. The smallest partial structure of a semantic structure is referred to as a semantic minimum unit and includes two nodes and an arc between those nodes.
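- As a minimal sketch of the directed-graph representation described above, a semantic structure can be modeled as a list of semantic minimum units, each holding a source node, a termination node and an arc. The class name, field names and the second sample unit are hypothetical; (GIVE, HANAKO, OBJECT) is the search key example that appears later in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticMinimumUnit:
    source: str       # semantic symbol of the source node
    termination: str  # semantic symbol of the termination node
    arc: str          # relationship between the two nodes


def nodes_of(structure):
    """Collect the set of semantic symbols (nodes) used by a semantic structure."""
    symbols = set()
    for unit in structure:
        symbols.add(unit.source)
        symbols.add(unit.termination)
    return symbols


# A semantic structure as a directed graph: a list of semantic minimum units.
structure = [
    SemanticMinimumUnit("GIVE", "HANAKO", "OBJECT"),
    SemanticMinimumUnit("GIVE", "TARO", "AGENT"),  # hypothetical second unit
]
```

- Representing each unit as an immutable record keeps the smallest partial structure comparable and hashable, which is convenient when units are used as search keys.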
- A semantic structure, which is a result of a semantic analysis of text data, is several tens of times larger in data volume than the original text data. Further, a semantic structure search is a complicated process, so the data resulting from a semantic analysis is sometimes compressed for use in a semantic structure search.
- An information search device that uses a semantic minimum unit as a search key for a semantic structure search of a natural sentence is also known (see Patent Document 1, for example).
- This information search device accepts a search query of a natural language sentence, conducts a semantic analysis on that natural language sentence, and specifies the semantic minimum unit that serves as a search key. Then, the information search device searches for a search target sentence including a semantic minimum unit that is identical to the search key from a searching index that has in advance stored semantic minimum units included in the search target sentence.
- An information search device that uses results obtained by a sentence-meaning-oriented search in order to efficiently realize display that is easy to understand is also known (see Patent Document 2 for example).
- This information search device compares, on the basis of match profile information, a search key sentence and the match dictionary information in accordance with an associated matching condition, and obtains positional information that represents the position in which a word meeting the matching condition appears in sentences of the match dictionary information. Then, on the basis of the obtained result of the comparison, the information search device transmits, to a terminal device, search result information in which a sentence including a word meeting the matching condition and the positional information are associated.
- An information processing program that can increase the efficiency of compression of character codes and accelerate the speed of a compression process and an expansion process is also known (See Patent Document 3 for example).
- A generation program that can construct a 2^N-branch nodeless Huffman tree in which the optimum code length is assigned to the total number of types of pieces of character information etc. is also known (see Patent Document 4 for example).
- An information generation program that can accelerate the generation of index information representing presence or absence of basic words or characters and optimize the size of index information is also known (see Patent Document 5 for example).
- Patent Document 1 Japanese Laid-open Patent Publication No. 2013-186766
- Patent Document 2 Japanese Laid-open Patent Publication No. 2010-267247
- Patent Document 3 Japanese Laid-open Patent Publication No. 2010-93414
- Patent Document 4 International Publication Pamphlet No. WO 2012/111078
- Patent Document 5 International Publication Pamphlet No. WO 2011/148511
- a non-transitory computer-readable recording medium stores a semantic structure search program.
- The semantic structure search program causes a computer to execute the following process.
- the computer generates a plurality of search semantic symbols from a search request.
- the computer specifies a position of a specific word that corresponds to the search request in a search target document, by the plurality of search semantic symbols and document semantic structure position information.
- The document semantic structure position information includes relationship information between a plurality of semantic symbols and a plurality of positions of a plurality of words in the search target document.
- the plurality of semantic symbols represent a semantic structure corresponding to the plurality of words.
- the computer outputs a search result including the specific word and the position of the specific word in the search target document.
- FIG. 1 illustrates a functional configuration of a semantic structure search device
- FIG. 2 is a flowchart illustrating a semantic structure search process
- FIG. 3 illustrates a specific example of the functional configuration of the semantic structure search device
- FIG. 4 is a flowchart of a compression process
- FIG. 5 illustrates an analysis result of a document
- FIG. 6 illustrates a first example of semantic symbol information
- FIG. 7 illustrates a second example of semantic symbol information
- FIG. 8 illustrates a role of mapping information
- FIG. 9 illustrates mapping information
- FIG. 10 illustrates bit map information
- FIG. 11A and FIG. 11B illustrate semantic codes
- FIG. 12 illustrates a flowchart of an encoding process
- FIG. 13 illustrates document semantic structure position information
- FIG. 14 illustrates a flowchart as a specific example of a semantic structure search process
- FIG. 15A through FIG. 15E illustrate logical products of two rows of bit map information
- FIG. 16 illustrates document IDs obtained through logical products of two rows
- FIG. 17 illustrates a flowchart of a semantic structure search process that conducts score calculation
- FIG. 18 illustrates a first example of scores of search formulas
- FIG. 19 illustrates a second example of scores of search formulas
- FIG. 20 illustrates a configuration of an information processing apparatus.
- the conventional semantic structure searches have the following problems.
- When a morpheme analysis result and a semantic analysis result for a document are not associated with each other, the correspondence relationship between a semantic symbol included in a semantic minimum unit and a word included in the original sentence is not known even when that semantic minimum unit has been found by a semantic structure search. Accordingly, a word corresponding to a semantic symbol that was searched for is obtained on the basis of information representing the correspondence relationship between the semantic symbol and a word included in the original sentence, and the text corresponding to the position of that word is referred to. In such a case, a process of obtaining a word corresponding to a semantic symbol is performed in addition to the semantic structure search, leading to a longer processing time.
- Patent Document 2 also discloses construction of a tree structure by using, as a partial tree node, a sentence including phrases consisting of words. However, when information of a tree structure is searched for, the period of processing time becomes longer.
- FIG. 1 illustrates a functional configuration example of a semantic structure search device according to an embodiment.
- a semantic structure search device 101 includes a storage unit 111 , a search unit 112 and an output unit 113 (output interface).
- The storage unit 111 stores document semantic structure position information 121 that includes relationship information between a plurality of semantic symbols and a plurality of positions of a plurality of words in the search target document.
- the plurality of semantic symbols represent a semantic structure corresponding to the plurality of words.
- the search unit 112 refers to the document semantic structure position information 121 so as to perform a semantic structure search process based on a search request, and the output unit 113 outputs the search result.
- FIG. 2 is a flowchart illustrating an example of a semantic structure search process performed by the semantic structure search device 101 illustrated in FIG. 1 .
- the search unit 112 generates a plurality of search semantic symbols from a search request (step 201 ).
- the search unit 112 refers to the document semantic structure position information 121 so as to specify a position of a specific word that corresponds to the search request in the search target document, by the plurality of search semantic symbols and the document semantic structure position information (step 202 ).
- the output unit 113 outputs the search result including the specific word and the position of the specific word in the search target document (step 203 ).
- By using the semantic structure search device 101 illustrated in FIG. 1 , it is possible to specify a word corresponding to a search request and the position of that word in the document in a semantic structure search.
- FIG. 3 illustrates a specific example of the semantic structure search device 101 illustrated in FIG. 1 .
- the semantic structure search device 101 illustrated in FIG. 3 includes the storage unit 111 , the search unit 112 , the output unit 113 , an analysis unit 301 , generation units 302 , 303 and 304 , and an encoding unit 305 .
- the analysis unit 301 conducts a morpheme analysis and a semantic analysis on each of a plurality of documents 311 stored in the storage unit 111 , generates an analysis result 312 including a morpheme analysis result and a semantic analysis result, and stores them in the storage unit 111 .
- the generation unit 302 generates semantic symbol information 313 representing a correspondence relationship between a word and a semantic symbol from the analysis result 312 .
- the generation unit 303 generates mapping information 314 , which represents a correspondence relationship between a word and code information from the analysis result 312 and the semantic symbol information 313 , and stores the information in the storage unit 111 .
- the generation unit 304 generates bit map information 315 , which represents presence or absence of each of a plurality of words in each document 311 from the analysis result 312 and the semantic symbol information 313 , and stores the information in the storage unit 111 .
- the encoding unit 305 encodes each document 311 by using the analysis result 312 , the semantic symbol information 313 and the mapping information 314 so as to generate the document semantic structure position information 121 for each document 311 and store the information in the storage unit 111 .
- the search unit 112 refers to the document semantic structure position information 121 , the analysis result 312 , the semantic symbol information 313 , the mapping information 314 and the bit map information 315 so as to perform a semantic structure search process based on the search request.
- FIG. 4 is a flowchart illustrating an example of a compression process performed by the semantic structure search device 101 illustrated in FIG. 3 .
- the analysis unit 301 conducts a morpheme analysis on the document 311 so as to generate a morpheme analysis result (step 401 ), and also conducts a semantic analysis on the document 311 so as to generate a semantic analysis result (step 402 ).
- the analysis unit 301 stores, in the storage unit 111 , the analysis result 312 including the morpheme analysis result and the semantic analysis result.
- the processes of steps 401 and 402 are performed for each document 311 .
- the generation unit 302 generates the semantic symbol information 313 from the analysis result 312 , and stores the information in the storage unit 111 (step 403 ).
- the generation unit 303 generates the mapping information 314 for the document 311 from the analysis result 312 and the semantic symbol information 313 , and stores the information in the storage unit 111 (step 404 ).
- the generation unit 304 generates the bit map information 315 for the document 311 from the analysis result 312 and the semantic symbol information 313 and stores the information in the storage unit 111 (step 405 ).
- the encoding unit 305 encodes the document 311 by using the analysis result 312 , the semantic symbol information 313 and the mapping information 314 so as to generate the document semantic structure position information 121 , and stores the information in the storage unit 111 (step 406 ).
- the processes in steps 404 through 406 are performed for each document 311 .
- FIG. 5 illustrates an example of a text file corresponding to the analysis result 312 of the document 311 , the text file being generated in steps 401 and 402 illustrated in FIG. 4 .
- a text file 501 illustrated in FIG. 5 includes an analysis result 502 of each sentence included in the document 311 , and the analysis result 502 of each sentence includes a morpheme analysis result 511 and a semantic analysis result 512 .
- the morpheme analysis result 511 includes a word 521 included in a sentence, a document ID 522 , a sentence ID 523 , a word position 524 in the sentence, word data length 525 , a semantic symbol 526 corresponding to the word, and attribute information 527 .
- the attribute information 527 includes information representing for example the word class of the word, whether or not the word is a categorematic word, etc.
- In some cases, one morpheme obtained through a morpheme analysis may be treated as one word, and in other cases, a compound word consisting of a plurality of morphemes may be treated as one word.
- The document ID 522 and the sentence ID 523 of “GYOUMU” (meaning “business” in English) are “7502” and “4”, respectively.
- The word position 524 of “GYOUMU” represents the position corresponding to “6” bytes from the beginning of the sentence, while the data length 525 represents “4” bytes.
- The semantic symbol 526 of “GYOUMU” is “BUSINESS= GYOUMU”, and the attribute information 527 is “:N: IW)”. “N” in the attribute information 527 represents a noun, and “IW” represents a categorematic word.
- the semantic analysis result 512 includes a semantic minimum unit 531 included in a sentence, a document ID 532 , a sentence ID 533 , a phrase 534 corresponding to the semantic minimum unit, a starting position 535 of a phrase in the sentence and an ending position 536 of the phrase.
- the semantic minimum unit 531 includes a source node, a termination node and an arc starting from the source node to the termination node.
- the semantic analysis result 512 further includes a starting position 537 of a word corresponding to the source node, an ending position 538 of the word corresponding to the source node, a starting position 539 of a word corresponding to the termination node and an ending position 540 of the word corresponding to the termination node.
- The termination node is “ENTREPRENEUR”, and the arc is “--<CONCERN>-->”.
- the document ID 532 , the sentence ID 533 and the phrase 534 of this semantic minimum unit are “7502”, “4” and “JIGYOUSHA NO DETASENTA TO INTANETTO”, respectively.
- the starting position 535 of “JIGYOUSHA NO DETASENTA TO INTANETTO” represents the position corresponding to “42” bytes from the beginning of the sentence and the ending position 536 represents the position corresponding to “78” bytes from the beginning of the sentence.
- the starting position 537 of word “INTANETTO” corresponding to the source node represents the position corresponding to “64” bytes from the beginning of the sentence and the ending position 538 represents the position corresponding to “78” bytes from the beginning of the sentence.
- the starting position 539 of word “JIGYOUSHA” corresponding to the termination node represents the position corresponding to “42” bytes from the beginning of the sentence and the ending position 540 represents the position corresponding to “48” bytes from the beginning of the sentence.
- FIG. 6 illustrates a first example of a word code dictionary corresponding to the semantic symbol information 313 that is generated in step 403 illustrated in FIG. 4 .
- the word code dictionary illustrated in FIG. 6 represents a correspondence relationship between a dictionary ID and an entry.
- An entry includes a word, attribute information and a semantic symbol, and a dictionary ID represents a word code added to an entry.
- By using a word code dictionary like this, it is possible to compress the document 311 by replacing a word specified by an entry with a word code specified by a dictionary ID.
- For example, in the entry “HONYAKU+noun+abstract thing+TRANSLATION”, “HONYAKU” is a word, “noun” and “abstract thing” are attribute information, and “TRANSLATION” is a semantic symbol.
- Of the attribute information, “noun” represents a word class and “abstract thing” represents a category of a word.
- the word code dictionary illustrated in FIG. 6 also represents a correspondence relationship between a word and a semantic symbol.
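- The following is a minimal sketch of how such a word code dictionary could be held in memory and used to replace entries with word codes. The entry format follows the example above; the dictionary ID values, the second entry and all function names are assumptions made for illustration only.

```python
# Hypothetical word code dictionary: dictionary ID -> entry.
# Entry format: "word+word class+category+semantic symbol" (as in the example).
word_code_dictionary = {
    5023: "HONYAKU+noun+abstract thing+TRANSLATION",
    8653: "HONYAKUSHA+noun+person+TRANSLATOR",  # hypothetical entry
}


def parse_entry(entry):
    """Split an entry into its word, attribute information and semantic symbol."""
    word, word_class, category, symbol = entry.split("+")
    return {
        "word": word,
        "word_class": word_class,
        "category": category,
        "semantic_symbol": symbol,
    }


def build_reverse_index(dictionary):
    """Map each entry back to its dictionary ID (word code)."""
    return {entry: dict_id for dict_id, entry in dictionary.items()}


def encode(entries, reverse_index):
    """Replace each entry with its word code."""
    return [reverse_index[entry] for entry in entries]


reverse_index = build_reverse_index(word_code_dictionary)
codes = encode(["HONYAKU+noun+abstract thing+TRANSLATION"], reverse_index)
```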
- FIG. 7 illustrates a second example of a word code dictionary corresponding to the semantic symbol information 313 .
- the word code dictionary illustrated in FIG. 7 has a configuration in which semantic symbols have been removed from the entries illustrated in FIG. 6 .
- In this case, a word dictionary representing a correspondence relationship between a combination of a word and attribute information and a semantic symbol is used together with the word code dictionary.
- FIG. 8 illustrates a role of the mapping information 314 generated in step 404 illustrated in FIG. 4 .
- the mapping information 314 represents a correspondence relationship between code information (intra-document semantic symbol ID) representing a semantic symbol that corresponds to each word included in each document 311 and the dictionary ID of a word code dictionary.
- For example, intra-document semantic symbol ID “0” of document ID “0” is associated with dictionary ID “5023”, while intra-document semantic symbol ID “1” is associated with dictionary ID “7025”.
- Intra-document semantic symbol ID “2” is associated with dictionary ID “8653”.
- Also, intra-document semantic symbol ID “0” of document ID “1” is associated with dictionary ID “7025”, while intra-document semantic symbol ID “1” is associated with dictionary ID “8653”.
- intra-document semantic symbol IDs can be expressed by data of a length shorter than that of the dictionary ID.
- By using the mapping information 314 , it is possible to increase the rate of compression based on encoding of words. For example, when an integer equal to or greater than zero is used as an intra-document semantic symbol ID, the maximum number can be suppressed to several hundred through several thousand.
- the generation unit 303 sequentially reads the analysis result 312 of each document 311 , assigns an intra-document semantic symbol ID to each semantic symbol and associates the intra-document semantic symbol IDs with dictionary IDs so as to generate the mapping information 314 .
- the generation unit 303 allocates a memory space for the mapping information 314 .
- the generation unit 303 uses an operator “new” so as to secure the array below.
- MAX_DOCWORD defines the maximum value for the total number of semantic symbols included in one document 311 and mapping_dic[i][j] represents the dictionary ID corresponding to the semantic symbol of intra-document semantic symbol ID “j” appearing in the document 311 with document ID “i”.
- the generation unit 303 secures the following array by an operator “new”.
- mapping_dic_index [i] represents the total number of semantic symbols appearing in the document 311 with document ID “i”.
- the generation unit 303 reads the analysis result 312 of one document 311 , generates a semantic symbol list of that document 311 , and assigns intra-document semantic symbol IDs in that document 311 in accordance with the order of the dictionary IDs.
- For example, when the semantic symbols appearing in the document 311 with document ID “0” are “WORK”, “TRANSLATOR” and “TRANSLATION”, the corresponding dictionary IDs are “7025”, “8653” and “5023”, respectively, according to the word code dictionary illustrated in FIG. 8.
- When these dictionary IDs are sorted in ascending order, the result is “5023”, “7025” and “8653”. Accordingly, intra-document semantic symbol IDs “0”, “1” and “2” are assigned to “TRANSLATION”, “WORK” and “TRANSLATOR”, respectively.
- In this case, mapping_dic_index[0] = 3.
- This process is repeated until the generation of mapping_dic[i][j] and mapping_dic_index[i] has been completed for all of the documents 311 .
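- The assignment of intra-document semantic symbol IDs can be sketched as follows. This is an illustration only: it uses Python lists instead of arrays secured with the operator “new”, the function name build_mapping is hypothetical, and the sample dictionary IDs are those given for document IDs “0” and “1” in the description of FIG. 8.

```python
def build_mapping(doc_dictionary_ids):
    """Build mapping_dic and mapping_dic_index.

    doc_dictionary_ids[i] holds the dictionary IDs of the semantic symbols
    appearing in the document with document ID i (in any order).
    """
    mapping_dic = []
    mapping_dic_index = []
    for ids in doc_dictionary_ids:
        ordered = sorted(set(ids))        # ascending order of dictionary IDs
        mapping_dic.append(ordered)       # list position = intra-document ID
        mapping_dic_index.append(len(ordered))
    return mapping_dic, mapping_dic_index


docs = [
    [7025, 8653, 5023],  # document 0: WORK, TRANSLATOR, TRANSLATION
    [7025, 8653],        # document 1
]
mapping_dic, mapping_dic_index = build_mapping(docs)
```

- With this layout, mapping_dic[i][j] is the dictionary ID of intra-document semantic symbol ID j in document i, and mapping_dic_index[i] is the number of semantic symbols in document i.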
- the generation unit 303 outputs the generated two arrays to the following two files corresponding to the mapping information 314 .
- mapping_dic[i] [j] is output to file map.dic as below.
- mapping_dic_index [i] is output to file map.idx as below.
- FIG. 9 illustrates an example of the mapping information 314 generated in the above manner.
- “OFFSET” of file map.idx represents the document ID
- the “CONTENT” of file map.idx represents the offset of file map.dic.
- “OFFSET” of file map.dic represents the position corresponding to the intra-document semantic symbol ID of each semantic symbol
- “CONTENT” of file map.dic represents the dictionary ID.
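- A sketch of how the two arrays could be flattened into the contents of map.idx and map.dic is given below. It assumes that the content of map.idx is the starting offset of each document within map.dic, which matches the offsets used in the lookup examples of this description; the function name and the dictionary IDs of document 2 other than “35” are hypothetical.

```python
def flatten_mapping(mapping_dic):
    """Return (map_idx, map_dic) as flat lists.

    map_idx[i] is the offset in map_dic at which document i starts;
    map_dic holds the dictionary IDs of all documents back to back.
    """
    map_idx = []
    map_dic = []
    for ids in mapping_dic:
        map_idx.append(len(map_dic))  # starting offset of this document
        map_dic.extend(ids)
    return map_idx, map_dic


mapping_dic = [
    [5023, 7025, 8653],  # document 0
    [7025, 8653],        # document 1
    [3, 12, 20, 35],     # document 2 (hypothetical values except 35)
]
map_idx, map_dic = flatten_mapping(mapping_dic)
```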
- FIG. 10 illustrates an example of the bit map information 315 generated in step 405 in FIG. 4 .
- the bit map information 315 illustrated in FIG. 10 represents presence or absence of a word registered in the word code dictionary in the sentence specified by a document ID.
- Logic “1” represents the presence of such a word and logic “0” represents the absence of such a word.
- For example, the word corresponding to dictionary ID “1088” is not included in the document 311 with document ID “0”, and is included in the document 311 with document ID “1”.
- By using bit map information 315 like this, it is possible to narrow down, at a high speed, the documents 311 including a specified word from among many documents 311 .
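- As an illustration, bit map information of this kind can be sketched with one integer per dictionary ID, where bit d of the integer indicates the presence of the word in the document with document ID d. The storage format, function names and sample data other than dictionary ID “1088” are assumptions.

```python
def build_bitmap(doc_dictionary_ids):
    """Return {dictionary ID: row}, where bit d of row is 1 if document d
    contains the word registered under that dictionary ID."""
    bitmap = {}
    for doc_id, ids in enumerate(doc_dictionary_ids):
        for dict_id in ids:
            bitmap[dict_id] = bitmap.get(dict_id, 0) | (1 << doc_id)
    return bitmap


def contains(bitmap, dict_id, doc_id):
    """Logic 1 (True) means present, logic 0 (False) means absent."""
    return bool((bitmap.get(dict_id, 0) >> doc_id) & 1)


docs = [
    [5023, 7025, 8653],  # document 0
    [1088, 7025, 8653],  # document 1
]
bitmap = build_bitmap(docs)
```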
- the encoding unit 305 uses intra-document semantic symbol IDs so as to encode the semantic structures of the document 311 , and thereby generates the document semantic structure position information 121 .
- The semantic structures of the document 311 are expressed by semantic minimum units, and semantic minimum units are categorized into the following three patterns.
- Node 1 and node 2 of pattern 1 represent the source node and the termination node, respectively, NIL of pattern 2 indicates that a termination node does not exist, and NIL of pattern 3 indicates that a source node does not exist.
- Because there are three patterns, a pattern type can be expressed by a code of two bits. Also, when the total number of the types of words included in one document 311 is equal to or smaller than 32768, node 1 and node 2 can each be expressed by a code of 15 bits or shorter. However, because the same word can appear a plurality of times in one sentence, it is desirable, in order to distinguish such words, to add a code of 4 bits representing the ordinal number of the appearance of that word counting from the beginning of the sentence. Also, when the total number of types of arcs used in semantic minimum units is equal to or smaller than 256, an arc can be expressed by a code of 1 byte (8 bits) or shorter.
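- The bit widths stated above can be checked with a short calculation. The helper name bits_needed is hypothetical; the counts (3 patterns, 32768 word types, 256 arc types, a 4-bit order) are taken from the text.

```python
def bits_needed(count):
    """Minimum number of bits that gives each of `count` values a distinct code."""
    return max(1, (count - 1).bit_length())


pattern_bits = bits_needed(3)      # three pattern types
node_bits = bits_needed(32768)     # word types per document
arc_bits = bits_needed(256)        # arc types
order_bits = 4                     # ordinal number of a repeated word

# Total size of a unit code holding two nodes, in bits.
two_node_total = pattern_bits + 2 * (order_bits + node_bits) + arc_bits
```

- Under these counts the two-node unit code adds up to 48 bits, that is, six bytes.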
- FIG. 11A illustrates an example of a unit code representing the semantic minimum unit of pattern 1
- FIG. 11B illustrates an example of a unit code representing the semantic minimum units of patterns 2 and 3 .
- In the unit code illustrated in FIG. 11A, the first two bits of the first byte represent a pattern type.
- The next three bits of the first byte and the first one bit of the second byte represent the order of node 1 (the ordinal number of its appearance, counting from the beginning of the sentence, among a plurality of the same words).
- The last three bits of the first byte and the first one bit of the fourth byte represent the order of node 2 .
- The remaining seven bits of the second byte and all the bits of the third byte represent the intra-document semantic symbol ID of node 1 .
- The remaining seven bits of the fourth byte and all the bits of the fifth byte represent the intra-document semantic symbol ID of node 2 .
- All the bits in the sixth byte represent an arc type. Note that when the order of a node can be represented by three bits, so that the first one bit of the second byte and the first one bit of the fourth byte are not needed, these bits do not have to be used. For the sake of simplicity, the embodiments described below assume that these two bits are not used.
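- A sketch of packing and unpacking the six-byte unit code of FIG. 11A is shown below. It follows the simplified case in which the first bit of the second and fourth bytes is not used, and it assumes that “first” bits are the most significant bits of each byte. The function names and the order value of node 2 in the sample are hypothetical.

```python
def pack_pattern1(pattern, order1, order2, id1, id2, arc):
    """Pack a two-node semantic minimum unit into a 6-byte unit code."""
    if not (0 <= pattern < 4 and 0 <= order1 < 8 and 0 <= order2 < 8):
        raise ValueError("pattern needs 2 bits, each order needs 3 bits")
    if not (0 <= id1 < 1 << 15 and 0 <= id2 < 1 << 15 and 0 <= arc < 256):
        raise ValueError("IDs need 15 bits, arc needs 8 bits")
    return bytes([
        (pattern << 6) | (order1 << 3) | order2,  # byte 1
        id1 >> 8,                                 # byte 2: upper 7 bits of ID 1
        id1 & 0xFF,                               # byte 3: lower 8 bits of ID 1
        id2 >> 8,                                 # byte 4: upper 7 bits of ID 2
        id2 & 0xFF,                               # byte 5: lower 8 bits of ID 2
        arc,                                      # byte 6: arc type
    ])


def unpack_pattern1(code):
    """Inverse of pack_pattern1."""
    b1, b2, b3, b4, b5, b6 = code
    return (b1 >> 6, (b1 >> 3) & 0x7, b1 & 0x7,
            ((b2 & 0x7F) << 8) | b3, ((b4 & 0x7F) << 8) | b5, b6)


code = pack_pattern1(pattern=0, order1=5, order2=3, id1=2, id2=29, arc=21)
```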
- In the unit code illustrated in FIG. 11B, the first two bits of the first byte represent a pattern type.
- The next six bits of the first byte represent the order of node 1 .
- The second and third bytes represent the intra-document semantic symbol ID of node 1 .
- The fourth byte represents an arc type.
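- The four-byte unit code of FIG. 11B can be sketched in the same way. The sketch assumes most-significant-bit-first ordering and assumes that patterns 2 and 3 are encoded as pattern type values 1 and 2; neither the function names nor these values are confirmed by the text.

```python
def pack_single_node(pattern, order1, id1, arc):
    """Pack a one-node semantic minimum unit (pattern 2 or 3) into 4 bytes."""
    if not (0 <= pattern < 4 and 0 <= order1 < 64):
        raise ValueError("pattern needs 2 bits, order needs 6 bits")
    if not (0 <= id1 < 1 << 16 and 0 <= arc < 256):
        raise ValueError("ID needs 16 bits, arc needs 8 bits")
    return bytes([
        (pattern << 6) | order1,  # byte 1: pattern type and order of node 1
        id1 >> 8,                 # byte 2: upper 8 bits of the ID
        id1 & 0xFF,               # byte 3: lower 8 bits of the ID
        arc,                      # byte 4: arc type
    ])


def unpack_single_node(code):
    """Inverse of pack_single_node."""
    b1, b2, b3, b4 = code
    return (b1 >> 6, b1 & 0x3F, (b2 << 8) | b3, b4)


single = pack_single_node(pattern=1, order1=5, id1=2, arc=21)
```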
- FIG. 12 is a flowchart illustrating an example of an encoding process in which one semantic minimum unit is encoded so as to generate a unit code.
- the encoding unit 305 refers to a semantic analysis result included in the analysis result 312 , determines the pattern type of the semantic minimum unit (step 1201 ), and determines the arc type (step 1202 ).
- the encoding unit 305 obtains the orders of nodes included in the semantic minimum unit (step 1203 ). In the case of pattern 1 , the encoding unit 305 obtains the orders of node 1 and node 2 , and in the case of patterns 2 and 3 , the encoding unit 305 obtains the order of node 1 .
- the encoding unit 305 obtains intra-document semantic symbol IDs of the nodes (step 1204 ). In the case of pattern 1 , the encoding unit 305 obtains the intra-document semantic symbol IDs of nodes 1 and 2 , and in the case of patterns 2 and 3 , the encoding unit 305 obtains the intra-document semantic symbol ID of node 1 .
- the encoding unit 305 refers to the morpheme analysis result included in the analysis result 312 so as to obtain the word, the attribute information and the semantic symbol corresponding to a node included in the semantic minimum unit.
- the encoding unit 305 refers to the semantic symbol information 313 (word code dictionary) so as to obtain the dictionary ID corresponding to the combination of the word, the attribute information and the semantic symbol.
- the encoding unit 305 obtains the intra-document semantic symbol ID from the mapping information 314 on the basis of the document ID and the dictionary ID.
- For example, for document ID “0” and dictionary ID “5023”, the encoding unit 305 refers to the position of offset “0” of map.idx so as to obtain content “0”.
- Next, the encoding unit 305 refers to the position of offset “0” of map.dic so as to detect that the content in that position is identical to dictionary ID “5023”. In this case, because dictionary ID “5023” has been detected without shifting the referred-to position of map.dic, the encoding unit 305 determines the intra-document semantic symbol ID to be “0”.
- Similarly, for document ID “2” and dictionary ID “35”, the encoding unit 305 refers to the position of offset “2” of map.idx so as to obtain content “5”.
- Next, the encoding unit 305 refers to the position of offset “5” of map.dic and shifts the referred-to position rightward from that position one by one so as to detect that the content in the position of offset “8” is identical to dictionary ID “35”. In such a case, because dictionary ID “35” has been detected by shifting the referred-to position of map.dic by three, the encoding unit 305 determines the intra-document semantic symbol ID to be “3”.
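- The lookup described above can be sketched as a linear scan over map.dic starting at the offset read from map.idx. The function name is hypothetical, the files are modeled as in-memory lists, and the dictionary IDs of document 2 other than “35” are made-up sample values.

```python
def intra_document_symbol_id(map_idx, map_dic, doc_id, dictionary_id):
    """Return the intra-document semantic symbol ID of dictionary_id in doc_id."""
    start = map_idx[doc_id]
    # The document's region ends where the next document starts.
    end = map_idx[doc_id + 1] if doc_id + 1 < len(map_idx) else len(map_dic)
    for position in range(start, end):
        if map_dic[position] == dictionary_id:
            return position - start  # number of shifts from the start offset
    raise KeyError(f"dictionary ID {dictionary_id} not in document {doc_id}")


map_idx = [0, 3, 5]
map_dic = [5023, 7025, 8653,   # document 0
           7025, 8653,         # document 1
           3, 12, 20, 35]      # document 2 (hypothetical except 35)
```

- Because the dictionary IDs of each document are stored in ascending order, a binary search over the same region would also be possible; the linear scan mirrors the one-by-one shifting described in the text.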
- FIG. 13 illustrates an example of the document semantic structure position information 121 generated by an encoding process illustrated in FIG. 12 .
- Unit codes 1301 through 1303 illustrated in FIG. 13 correspond to the unit codes illustrated in FIG. 11A .
- pattern type “0” of the unit code 1301 indicates that it is the semantic minimum unit of pattern 1
- order “5” of node 1 indicates that it is the fifth word counting from the beginning of the sentence
- order “8” of node 2 indicates that it is the eighth word counting from the beginning of the sentence.
- the intra-document semantic symbol ID of node 1 is “2”
- the intra-document semantic symbol ID of node 2 and the arc type are “29” and “21”, respectively.
- In the document semantic structure position information 121 , the unit codes 1301 through 1303 are grouped by the sentence ID of the sentence to which they belong.
- FIG. 14 is a flowchart illustrating a specific example of a semantic structure search process performed by the semantic structure search device 101 illustrated in FIG. 3 .
- the analysis unit 301 conducts a morpheme analysis on a search request described in a form of a natural sentence so as to generate a morpheme analysis result (step 1401 ), and also conducts a semantic analysis on the search request so as to generate a semantic analysis result (step 1402 ).
- the analysis unit 301 stores in the storage unit 111 the morpheme analysis result and the semantic analysis result conducted on the search request.
- The search unit 112 generates a search key including a plurality of search semantic symbols expressing a semantic structure of the search request from the result of the semantic analysis conducted on the search request, and stores the search key in the storage unit 111 (step 1403 ).
- the search unit 112 refers to the semantic symbol information 313 , and specifies combinations, corresponding to a plurality of search semantic symbols included in the search key, of search words, attribute information thereof and semantic symbols thereof (step 1404 ). Then, the search unit 112 refers to the bit map information 315 so as to specify at least one search target document including the specified search words (step 1405 ).
- the search unit 112 refers to the semantic symbol information 313 and the mapping information 314 so as to encode the search key, and generates search code information (step 1406 ).
- the search unit 112 searches for a unit code that is identical to the search code information from the document semantic structure position information 121 of the search target document (step 1407 ). Then, on the basis of the order of the node included in a detected unit code, the position of the search word in the search target document is specified (step 1408 ). The processes in steps 1407 and 1408 are executed for each search target document.
- the output unit 113 outputs a search result that represents the search target document in which a unit code identical to the search code information has been detected, each search word used for the search and the position of each search word in the search target document (step 1409 ).
- the search unit 112 can generate search keys that represent the following semantic minimum units.
- “GIVE” in the search key of (GIVE, HANAKO, OBJECT) represents the semantic symbol of the source node corresponding to “AGE”, “HANAKO” in the same search key represents the semantic symbol of the termination node corresponding to “HANAKO”, and “OBJECT” represents the arc.
- the search unit 112 obtains, from the semantic symbol information 313 , dictionary IDs and entries respectively corresponding to “GIVE” and “HANAKO”, which are included in (GIVE, HANAKO, OBJECT). Thereby, six types of pieces of information as below for example are obtained.
- the search unit 112 combines the dictionary ID corresponding to “GIVE” and the dictionary ID corresponding to “HANAKO” so as to generate search formulas as below.
- the search unit 112 refers to the bit map information 315 illustrated in FIG. 10 so as to obtain a set of document IDs that satisfy the search formulas.
- a document ID satisfying (1088 AND 200291) can be obtained by calculating the logical product of the row of “1088” and the row of “200291” in the bit map information 315 as illustrated in FIG. 15A .
- the document ID of the document 311 including “AGE” and “HANAKO” is “3”.
- the document ID satisfying (2183 AND 200291) can be obtained by calculating the logical product of the rows of “2183” and “200291” in the bit map information 315 as illustrated in FIG. 15B .
- the document IDs of the documents 311 including “ATAE” and “HANAKO” are “24” and “522”, respectively.
- the document ID satisfying (4021 AND 200291) can be obtained by calculating the logical product of the rows of “4021” and “200291” in the bit map information 315 as illustrated in FIG. 15C .
- the document ID of the document 311 including “ZOUYO” and “HANAKO” is “9283”.
- the document ID satisfying (5911 AND 200291) can be obtained by calculating the logical product of the rows of “5911” and “200291” in the bit map information 315 as illustrated in FIG. 15D .
- the document 311 including “KIFU” and “HANAKO” does not exist.
- the document ID satisfying (9827 AND 200291) can be obtained by calculating the logical product of the rows of “9827” and “200291” in the bit map information 315 as illustrated in FIG. 15E .
- the document ID of the document 311 including “TEWATASHI” and “HANAKO” is “82”.
- FIG. 16 illustrates the document IDs obtained through the logical products illustrated in FIG. 15A through FIG. 15E .
- the set of the document IDs satisfying the search formulas is (3, 24, 522, 9283, 82), which means that the number of the search target documents has been narrowed to five.
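The narrowing by logical products described above can be sketched as follows. This is a minimal illustration, not the device's actual data layout: each row of the bit map is modeled as a Python integer whose bit d is set when document d contains the word, the helper names (`row`, `doc_ids`, `narrow`) are hypothetical, and the extra document IDs 7, 40, 100 and 615 are invented so that the AND operations have something to filter out. Only the dictionary IDs and the resulting document IDs come from the text.

```python
def row(ids):
    """Build a bit map row as an int: bit d is set when document d contains the word."""
    bits = 0
    for d in ids:
        bits |= 1 << d
    return bits


def doc_ids(bits):
    """List the document IDs whose bits are set, in ascending order."""
    ids = []
    d = 0
    while bits:
        if bits & 1:
            ids.append(d)
        bits >>= 1
        d += 1
    return ids


# Toy bit map: dictionary ID -> row. Document IDs 7, 40, 100 and 615 are invented.
bitmap = {
    1088: row([3, 7]),            # "AGE"
    2183: row([24, 40, 522]),     # "ATAE"
    4021: row([9283]),            # "ZOUYO"
    5911: row([100]),             # "KIFU"
    9827: row([82, 615]),         # "TEWATASHI"
    200291: row([3, 24, 82, 522, 9283]),  # "HANAKO"
}

# Search formulas: each pair means (first dictionary ID AND second dictionary ID).
formulas = [(1088, 200291), (2183, 200291), (4021, 200291),
            (5911, 200291), (9827, 200291)]


def narrow(bitmap, formulas):
    """Return per-formula document IDs and the union of all matching documents."""
    hits = {}
    union = 0
    for a, b in formulas:
        both = bitmap[a] & bitmap[b]  # logical product of the two rows
        hits[(a, b)] = doc_ids(both)
        union |= both
    return hits, doc_ids(union)


hits, targets = narrow(bitmap, formulas)
print(hits[(2183, 200291)])  # [24, 522]
print(targets)               # [3, 24, 82, 522, 9283]
```

The union is printed in ascending order, so it lists the same five documents as the set given in the text, only sorted.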
- the search unit 112 replaces the dictionary IDs of the search formulas with intra-document semantic symbol IDs corresponding to the document IDs of the search target documents so as to generate search code information for each search target document.
- When search target documents are different, a piece of search code information is replaced with a different piece of search code information even if the same search formulas are used.
- the search unit 112 refers to the position of offset “3” in map.idx so as to obtain content “11”.
- the search unit 112 refers to the position of offset “11” in map.dic and shifts the referred-to position rightwardly from that position one by one so as to detect that the content in the position of offset “15” is identical to dictionary ID “1088”.
- the search unit 112 determines the intra-document semantic symbol ID to be “4”.
- the search unit 112 refers to the position of offset “3” in map.idx so as to obtain content “11”.
- the search unit 112 refers to the position of offset “11” in map.dic and shifts the referred-to position rightwardly from that position one by one so as to detect that the content in the position of offset “22” is identical to dictionary ID “200291”. In such a case, because dictionary ID “200291” has been detected just by shifting the referred-to position by eleven in map.dic, the search unit 112 determines the intra-document semantic symbol ID to be “11”.
- the search unit 112 refers to the position of offset “24” in map.idx so as to obtain content “1690”.
- the search unit 112 refers to the position of offset “1690” in map.dic and shifts the referred-to position rightwardly from that position one by one so as to detect that the content in the position of offset “1694” is identical to dictionary ID “2183”. In such a case, because dictionary ID “2183” has been detected just by shifting the referred-to position by four in map.dic, the search unit 112 determines the intra-document semantic symbol ID to be “4”.
- the search unit 112 refers to the position of offset “24” in map.idx so as to obtain content “1690”.
- the search unit 112 refers to the position of offset “1690” in map.dic and shifts the referred-to position rightwardly from that position one by one so as to detect that the content in the position of offset “1705” is identical to dictionary ID “200291”. In such a case, because dictionary ID “200291” has been detected just by shifting the referred-to position by fifteen in map.dic, the search unit 112 determines the intra-document semantic symbol ID to be “15”.
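The map.idx and map.dic lookups walked through above can be sketched as follows. This is a sketch under stated assumptions: map.idx is modeled as a dictionary from document ID to a start offset, map.dic as a flat list of dictionary IDs with 0 as filler, and the scan is assumed to stop where the next document's entries begin (the text does not say how the end of a document's entries is marked). The function names are hypothetical; the offsets and dictionary IDs are the ones quoted in the text.

```python
def intra_doc_symbol_id(map_idx, map_dic, doc_id, dictionary_id):
    """Return how far the dictionary ID lies from the document's start offset, or None."""
    start = map_idx[doc_id]
    # Assumption: a document's entries end where the next document's entries begin.
    later_starts = [s for s in map_idx.values() if s > start]
    end = min(later_starts) if later_starts else len(map_dic)
    for offset in range(start, end):
        if map_dic[offset] == dictionary_id:
            return offset - start  # the shift is the intra-document semantic symbol ID
    return None


def search_code(map_idx, map_dic, doc_id, formula):
    """Replace each dictionary ID of a search formula with its intra-document ID."""
    return tuple(intra_doc_symbol_id(map_idx, map_dic, doc_id, d) for d in formula)


# Offsets taken from the text; every other cell of map.dic is filler (0).
map_idx = {3: 11, 24: 1690}
map_dic = [0] * 1706
map_dic[15] = 1088      # "AGE" in document 3
map_dic[22] = 200291    # "HANAKO" in document 3
map_dic[1694] = 2183    # "ATAE" in document 24
map_dic[1705] = 200291  # "HANAKO" in document 24

print(search_code(map_idx, map_dic, 3, (1088, 200291)))   # (4, 11)
print(search_code(map_idx, map_dic, 24, (2183, 200291)))  # (4, 15)
```

The same dictionary ID "200291" maps to 11 in document 3 and to 15 in document 24, which is why search code information has to be generated per search target document.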
- search code information as below is generated from the search formulas.
- FIG. 13 illustrates the document semantic structure position information 121 with document ID “3”
- the search unit 112 uses search code information (4, 11) for document ID “3” so as to search the document semantic structure position information 121 illustrated in FIG. 13 .
- the search unit 112 detects the unit code 1303 including intra-document semantic symbol ID “4” of node 1 and intra-document semantic symbol ID “11” of node 2 so as to obtain order “3” of node 1 and order “1” of node 2 .
- search word “AGE” corresponding to intra-document semantic symbol ID “4” is the third “AGE” counting from the beginning of the sentence
- search word “HANAKO” corresponding to intra-document semantic symbol ID “11” is the first “HANAKO” counting from the beginning of the sentence.
- the search unit 112 refers to the morpheme analysis result (which corresponds to the morpheme analysis result 511 illustrated in FIG. 5 ) included in the analysis result 312 so as to search the sentence with the sentence ID corresponding to the unit code 1303 for word “AGE”, which corresponds to “AGE+VERB+GIVE”. Then, the search unit 112 specifies the position, in the search target document, (which corresponds to the word position 524 illustrated in FIG. 5 ) of the third “AGE” counting from the beginning of the sentence among detected words “AGE” as the position of search word “AGE” in the search target document.
- the search unit 112 refers to the morpheme analysis result included in the analysis result 312 so as to search for word “HANAKO”, which corresponds to “HANAKO+NOUN+HANAKO” in the sentence with the sentence ID that corresponds to the unit code 1303 . Then, the search unit 112 specifies the position of the first “HANAKO” counting from the beginning of the sentence among detected words “HANAKO” as the position of search word “HANAKO” in the search target document.
- the output unit 113 may for example conduct emphasized display for the texts of “AGE” and “HANAKO” existing in the specified positions in the search target document.
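The unit code search and the position lookup described above can be sketched as follows. This is only an illustration: the `UnitCode` field names, the sentence ID, the pattern type and arc type given to the second unit code, and the word positions in the toy morpheme analysis result are all invented. The intra-document semantic symbol IDs (4, 11) and the orders (3, 1) come from the text, and "order" is read as "n-th occurrence of the word in the sentence", as in the explanation of the third “AGE”.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitCode:
    pattern: int  # pattern type of the semantic minimum unit
    order1: int   # order of node 1
    order2: int   # order of node 2
    id1: int      # intra-document semantic symbol ID of node 1
    id2: int      # intra-document semantic symbol ID of node 2
    arc: int      # arc type


# Position information for one document: sentence ID -> unit codes (sentence ID invented).
position_info = {
    1: [UnitCode(0, 5, 8, 2, 29, 21), UnitCode(0, 3, 1, 4, 11, 21)],
}

# Toy morpheme analysis result: sentence ID -> (word, position in document), invented.
morphemes = {
    1: [("HANAKO", 0), ("AGE", 7), ("AGE", 15), ("AGE", 23)],
}


def nth_position(sentence, word, n):
    """Return the position of the n-th (1-based) occurrence of word, or None."""
    seen = 0
    for w, pos in sentence:
        if w == word:
            seen += 1
            if seen == n:
                return pos
    return None


def locate(position_info, morphemes, search_code, words):
    """Return (sentence ID, position of word 1, position of word 2) per matching unit code."""
    results = []
    for sentence_id, units in position_info.items():
        for u in units:
            if (u.id1, u.id2) == tuple(search_code):
                sentence = morphemes[sentence_id]
                results.append((sentence_id,
                                nth_position(sentence, words[0], u.order1),
                                nth_position(sentence, words[1], u.order2)))
    return results


print(locate(position_info, morphemes, (4, 11), ("AGE", "HANAKO")))  # [(1, 23, 0)]
```

Because the comparison is a plain equality test on the two encoded IDs, each unit code is visited once, which matches the linear cost the text attributes to comparing encoded semantic minimum units.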
- the total number of combinations of two semantic symbols included in a semantic minimum unit is N*N, and accordingly the calculation amount and the data amount of the search results in a semantic structure search using a conventional semantic minimum unit is in the order of N*N.
- the semantic minimum unit of a search target document and the semantic minimum unit of a search request are encoded together, making it enough to just compare the encoded semantic minimum units, and accordingly the calculation amount and the data amount of search results are in the order of N.
- FIG. 17 is a flowchart illustrating the semantic structure search process illustrated in FIG. 14 to which a process of calculating the score of a search target document has been added.
- the processes in steps 1701 through 1708 illustrated in FIG. 17 are similar to those in steps 1401 through 1408 illustrated in FIG. 14 .
- In step 1709 , the search unit 112 calculates the score of a search target document by using the score of a search key. Then, in step 1710 , the output unit 113 ranks search target documents in accordance with the scores and outputs the search results.
- FIG. 18 illustrates a first example of scores of search formulas that are used as scores of search keys.
- a score thereof is calculated in advance.
- As score S of a search formula, the following equation, for example, can be used.
- idf 1 represents the inverse document frequency of a search word that corresponds to the first dictionary ID included in the search formula
- idf 2 represents the inverse document frequency of a search word that corresponds to the second dictionary ID included in the search formula.
- N 1 represents the number of times that the search word corresponding to the first dictionary ID appears in the search request
- N 2 represents the number of times that the search word corresponding to the second dictionary ID appears in the search request.
- FIG. 19 illustrates a second example of scores of search formulas that are used as scores of search keys.
- the following equation for example is used.
- idf 11 represents the inverse document frequency of a semantic symbol that corresponds to the first dictionary ID included in the search formula
- idf 12 represents the inverse document frequency of a semantic symbol that corresponds to the second dictionary ID included in the search formula.
- N 11 represents the number of times that the semantic symbol corresponding to the first dictionary ID appears in the search request
- N 12 represents the number of times that the semantic symbol corresponding to the second dictionary ID appears in the search request.
- All of the search formulas illustrated in FIG. 19 are generated from the same search key (GIVE, HANAKO, OBJECT). Accordingly, when scores S of the search formulas are calculated by using the inverse document frequency of the semantic symbol and the number of times that it appears, the scores S of all the search formulas will have the same value.
- Score DS of the search target document is calculated by for example the following equation, which uses score S of a search formula.
- S at the right-hand side in equation (3) represents the score of a search formula identical to the document semantic structure position information 121 in the search target document
- P represents the number of times that the search formula is found to be identical
- summation symbol Σ represents the summation of the value of S*P for a plurality of search formulas.
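The document score described above, the sum of S*P over the search formulas, can be sketched as follows. The function name and the sample scores and match counts are invented for illustration; only the form of the sum is taken from the text.

```python
def document_score(formula_scores, match_counts):
    """Sum S * P over the search formulas that matched in the document.

    formula_scores: search formula -> score S
    match_counts:   search formula -> number of matches P in the document
    """
    return sum(formula_scores[f] * p for f, p in match_counts.items())


# Invented sample values: two search formulas with scores 2.0 and 1.5.
formula_scores = {(1088, 200291): 2.0, (2183, 200291): 1.5}
# In this hypothetical document the first formula matched 3 times, the second twice.
match_counts = {(1088, 200291): 3, (2183, 200291): 2}

print(document_score(formula_scores, match_counts))  # 9.0
```

Search formulas that never match simply do not appear in `match_counts` and contribute nothing to the sum.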
- the configuration of the semantic structure search device 101 in FIG. 1 or FIG. 3 is just an example, and the constituents can partially be omitted or changed in accordance with the purposes or conditions of the semantic structure search device 101 .
- the analysis unit 301 , the generation unit 302 , the generation unit 303 and the generation unit 304 illustrated in FIG. 3 can be omitted.
- the encoding unit 305 illustrated in FIG. 3 can be omitted.
- FIG. 2 , FIG. 4 , FIG. 12 , FIG. 14 and FIG. 17 are just examples and the processes can partially be omitted or changed in accordance with the configurations or conditions of the semantic structure search device 101 .
- the compression process illustrated in FIG. 4 can be omitted.
- step 1405 and step 1705 can be omitted.
- a search key including three or more semantic symbols can be used instead of a search key including two semantic symbols.
- the analysis result 312 illustrated in FIG. 5 , the semantic symbol information 313 illustrated in FIG. 6 and FIG. 7 , the mapping information 314 illustrated in FIG. 9 and the bit map information 315 illustrated in FIG. 10 are just examples, and information in a different data configuration can be used in accordance with the configuration or conditions of the semantic structure search device 101 .
- the semantic symbols illustrated in FIG. 11 and the document semantic structure position information 121 illustrated in FIG. 13 are just examples, and information in a different data configuration can be used in accordance with the configurations or conditions of the semantic structure search device 101 .
- the attribute information and semantic symbols can be omitted in the morpheme analysis result 511 illustrated in FIG. 5 and the bit map information 315 illustrated in FIG. 10 .
- the attribute information can be omitted in the semantic symbol information 313 illustrated in FIG. 6 and FIG. 7 .
- Equations (1) through (3) are just examples, and scores of search target documents may be calculated by using a different equation.
- the semantic structure search device 101 illustrated in FIG. 1 and FIG. 3 can be implemented by using for example an information processing apparatus (computer) as illustrated in FIG. 20 .
- the information processing apparatus illustrated in FIG. 20 includes a central processing unit (CPU) 2001 , a memory 2002 , an input device 2003 , an output device 2004 , an auxiliary storage device 2005 , a medium driving device 2006 and a network connection device 2007 . These constituents are connected to each other via a bus 2008 .
- the memory 2002 is for example a semiconductor memory such as a Read Only Memory (ROM), a Random Access Memory (RAM), a flash memory, etc., and stores a program and data used for the processing.
- the memory 2002 can be used as the storage unit 111 illustrated in FIG. 1 and FIG. 3 .
- the CPU 2001 executes a program by utilizing for example the memory 2002 , and thereby operates as the search unit 112 , the analysis unit 301 , the generation unit 302 , the generation unit 303 , the generation unit 304 and the encoding unit 305 illustrated in FIG. 1 and FIG. 3 .
- the input device 2003 is for example a keyboard, a pointing device, etc. and is used for inputting instructions or information from the operator or the user. Instructions from the operator or the user may be a search request of a semantic structure search.
- the output device 2004 is for example a display device, a printer, a speaker, etc., and is used for outputting queries or instructions for the operator or the user and for outputting process results.
- the output device 2004 can be used as the output unit 113 illustrated in FIG. 1 and FIG. 3 .
- Process results can be search results of a semantic structure search.
- the auxiliary storage device 2005 is for example a magnetic disk device, an optical disk device, a magneto-optical disk device, a tape device, etc.
- the auxiliary storage device 2005 may be a hard disk drive or a flash memory.
- the information processing apparatus can store a program and data in the auxiliary storage device 2005 and load them onto the memory 2002 so as to use them.
- the auxiliary storage device 2005 can be used as the storage unit 111 illustrated in FIG. 1 and FIG. 3 .
- the medium driving device 2006 drives a portable recording medium 2009 so as to access information recorded in it.
- the portable recording medium 2009 is for example a memory device, a flexible disk, an optical disk, a magneto-optical disk, etc.
- the portable recording medium 2009 may be a Compact Disk Read Only Memory (CD-ROM), a Digital Versatile Disk (DVD), a Universal Serial Bus (USB) memory, etc.
- a computer-readable recording medium that stores a program and data used for the processes is a physical (non-transitory) recording medium such as the memory 2002 , the auxiliary storage device 2005 and the portable recording medium 2009 .
- the network connection device 2007 is a communication interface that is connected to a communication network such as a Local Area Network, a Wide Area Network, etc. so as to conduct data conversion accompanying communications.
- the information processing apparatus can receive a program and data from an external device via the network connection device 2007 and load them onto the memory 2002 to use them.
- the information processing apparatus can receive a search request from a user terminal via the network connection device 2007 so as to send a search result to the user terminal.
- the network connection device 2007 can be used as the output unit 113 illustrated in FIG. 1 and FIG. 3 .
- the information processing apparatus does not have to include all the constituents illustrated in FIG. 20 , and the constituents can partially be omitted in accordance with the purposes or conditions.
- the input device 2003 and the output device 2004 can be omitted.
- the medium driving device 2006 or the network connection device 2007 can be omitted.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
A non-transitory computer-readable recording medium stores a semantic structure search program. The semantic structure search program causes a computer to execute the following process. The computer generates a plurality of search semantic symbols from a search request. Next, the computer specifies a position of a specific word that corresponds to the search request in a search target document, by the plurality of search semantic symbols and document semantic structure position information. The document semantic structure position information includes relationship information between a plurality of semantic symbols and a plurality of positions of a plurality of words in the search target document. The plurality of semantic symbols represent a semantic structure corresponding to the plurality of words. Thereafter, the computer outputs a search result including the specific word and the position of the specific word in the search target document.
Description
- This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2015-008936, filed on Jan. 20, 2015, the entire contents of which are incorporated herein by reference.
- The embodiments discussed herein are related to a semantic structure search device and a semantic structure search method.
- In recent years, the importance of text searches has been increasing with the explosive growth in text data volumes. In particular, as research on semantic processing for secretarial function application software etc. has become active, searches for semantic structures of natural sentences have become increasingly important.
- Analyses of natural sentences conducted in text searches utilize lexical analyses, morpheme analyses, semantic analyses, etc. A lexical analysis is a process of dividing a character string into words, while a morpheme analysis is a process of dividing a character string into morphemes and assigning information such as word classes, attributes, etc. to each morpheme. Morphemes obtained through morpheme analyses may be treated as words.
- A semantic analysis is a process of using a result of a morpheme analysis of a natural sentence so as to obtain the semantic structure of that natural sentence. By using a semantic structure, which is a result of semantic analyses, what is meant by a natural sentence can be expressed as data, which is processed by computers.
- A semantic structure includes a plurality of semantic symbols respectively representing the meanings of a plurality of words included in a morpheme analysis result, and also includes information representing the relationship between two semantic symbols. In some cases, one semantic symbol corresponds to a plurality of words. A semantic structure can be represented by for example a directed graph having a plurality of nodes representing a plurality of semantic symbols and also having an arc representing the relationship between two nodes. The smallest partial structure of a semantic structure is referred to as a semantic minimum unit and includes two nodes and an arc between those nodes.
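The directed-graph view of a semantic structure described above can be sketched as follows. This is a minimal illustration with hypothetical class and method names. The triple order (source symbol, termination symbol, arc) follows the search key (GIVE, HANAKO, OBJECT) used later in this document; the second arc (GIVE, TARO, AGENT) is invented.

```python
from collections import namedtuple

# A semantic minimum unit: two nodes (semantic symbols) and the arc between them.
MinimumUnit = namedtuple("MinimumUnit", ["source", "target", "arc"])


class SemanticStructure:
    """Directed graph whose nodes are semantic symbols and whose arcs are relationships."""

    def __init__(self):
        self.nodes = set()
        self.arcs = []

    def add_arc(self, source, target, arc):
        """Add one directed arc, registering both end nodes."""
        self.nodes.update((source, target))
        self.arcs.append(MinimumUnit(source, target, arc))

    def minimum_units(self):
        """Every arc, together with its two end nodes, is one semantic minimum unit."""
        return list(self.arcs)

    def contains(self, key):
        """Check whether a (source, target, arc) search key matches a minimum unit."""
        return MinimumUnit(*key) in self.arcs


structure = SemanticStructure()
structure.add_arc("GIVE", "HANAKO", "OBJECT")
structure.add_arc("GIVE", "TARO", "AGENT")  # invented second arc

print(structure.minimum_units()[0])                     # MinimumUnit(source='GIVE', target='HANAKO', arc='OBJECT')
print(structure.contains(("GIVE", "HANAKO", "OBJECT")))  # True
```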
- By conducting a morpheme analysis and a semantic analysis on text data included in a plurality of documents, it is possible to realize a semantic structure search that searches the plurality of documents by using the semantic structure of a search request written as a natural sentence.
- However, a semantic structure, which is a result of a semantic analysis of text data, is several tens of times larger in data volume than the original text data. Further, a semantic structure search is a complicated process, sometimes leading to a situation where data that is the result of a semantic analysis is to be compressed for a semantic structure search.
- An information search device that uses a semantic minimum unit as a search key for a semantic structure search of a natural sentence is also known (see Patent Document 1, for example). This information search device accepts a search query of a natural language sentence, conducts a semantic analysis on that natural language sentence, and specifies the semantic minimum unit that serves as a search key. Then, the information search device searches for a search target sentence including a semantic minimum unit that is identical to the search key from a searching index that has in advance stored semantic minimum units included in the search target sentence.
- An information search device that uses results obtained by a sentence-meaning-oriented search in order to efficiently realize display that is easy to understand is also known (see Patent Document 2, for example). This information search device compares, on the basis of match profile information, a search key sentence and the match dictionary information in accordance with an associated matching condition, and obtains positional information that represents the position in which a word meeting the matching condition appears in sentences of the match dictionary information. Then, on the basis of the obtained result of the comparison, the information search device transmits, to a terminal device, search result information in which a sentence including a word meeting the matching condition and the positional information are associated.
- An information processing program that can increase the efficiency of compression of character codes and accelerate the speed of a compression process and an expansion process is also known (see Patent Document 3, for example). A generation program that can construct a 2^N-branch nodeless Huffman tree in which the optimum code length is assigned to the total number of types of pieces of character information etc. is also known (see Patent Document 4, for example). An information generation program that can accelerate the generation of index information representing presence or absence of basic words or characters and optimize the size of index information is also known (see Patent Document 5, for example).
- Patent Document 1: Japanese Laid-open Patent Publication No. 2013-186766
- Patent Document 2: Japanese Laid-open Patent Publication No. 2010-267247
- Patent Document 3: Japanese Laid-open Patent Publication No. 2010-93414
- Patent Document 4: International Publication Pamphlet No. WO 2012/111078
- Patent Document 5: International Publication Pamphlet No. WO 2011/148511
- According to an aspect of the embodiments, a non-transitory computer-readable recording medium stores a semantic structure search program. The semantic structure search program causes the computer to execute the following process.
- (1) The computer generates a plurality of search semantic symbols from a search request.
- (2) The computer specifies a position of a specific word that corresponds to the search request in a search target document, by the plurality of search semantic symbols and document semantic structure position information. The document semantic structure position information includes relationship information between a plurality of semantic symbols and a plurality of positions of a plurality of words in the search target document. The plurality of semantic symbols represent a semantic structure corresponding to the plurality of words.
- (3) The computer outputs a search result including the specific word and the position of the specific word in the search target document.
- The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
- It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
- FIG. 1 illustrates a functional configuration of a semantic structure search device;
- FIG. 2 is a flowchart illustrating a semantic structure search process;
- FIG. 3 illustrates a specific example of the functional configuration of the semantic structure search device;
- FIG. 4 is a flowchart of a compression process;
- FIG. 5 illustrates an analysis result of a document;
- FIG. 6 illustrates a first example of semantic symbol information;
- FIG. 7 illustrates a second example of semantic symbol information;
- FIG. 8 illustrates a role of mapping information;
- FIG. 9 illustrates mapping information;
- FIG. 10 illustrates bit map information;
- FIG. 11A and FIG. 11B illustrate semantic codes;
- FIG. 12 illustrates a flowchart of an encoding process;
- FIG. 13 illustrates document semantic structure position information;
- FIG. 14 illustrates a flowchart as a specific example of a semantic structure search process;
- FIG. 15A through FIG. 15E illustrate logical products of two rows of bit map information;
- FIG. 16 illustrates document IDs obtained through logical products of two rows;
- FIG. 17 illustrates a flowchart of a semantic structure search process that conducts score calculation;
- FIG. 18 illustrates a first example of scores of search formulas;
- FIG. 19 illustrates a second example of scores of search formulas; and
- FIG. 20 illustrates a configuration of an information processing apparatus.
- Hereinafter, the embodiments will be explained in detail by referring to the drawings.
- The conventional semantic structure searches have the following problems.
- Because a morpheme analysis result and a semantic analysis result for a document are not associated, even when the semantic minimum unit has been searched for in a semantic structure search, the correspondence relationship between a semantic symbol included in the semantic minimum unit and a word included in the original sentence is not known. Accordingly, a word corresponding to a semantic symbol that was searched for is obtained on the basis of information representing the corresponding relationship between the semantic symbol and a word included in the original sentence, and the text corresponding to the position of that word is referred to. In such a case, a process of obtaining a word corresponding to a semantic symbol is performed in addition to the semantic structure search, leading to a longer period of processing time.
- If, as a preprocess for a semantic structure search, information is generated that includes both the correspondence relationship between the morpheme analysis result and the original sentence and the correspondence relationship between the semantic structure analysis result and the morpheme analysis result, the generated information becomes immense, which lengthens the processing time and defeats the purpose. Accordingly, it is desirable to associate the semantic structure and the original sentence efficiently in a semantic structure search.
- Patent Document 2 also discloses construction of a tree structure by using, as a partial tree node, a sentence including phrases consisting of words. However, when information of a tree structure is searched for, the period of processing time becomes longer.
- Note that these problems arise not only when the semantic minimum unit is searched for from a semantic structure but also when a partial structure including three or more semantic symbols is searched for from a semantic structure.
-
FIG. 1 illustrates a functional configuration example of a semantic structure search device according to an embodiment. A semanticstructure search device 101 includes astorage unit 111, asearch unit 112 and an output unit 113 (output interface). - The
storage unit 111 stores document semanticstructure position information 121 that includes a relationship information between a plurality of semantic symbols and a plurality of positions of a plurality of words in the search target document. The plurality of semantic symbols represent a semantic structure corresponding to the plurality of words. Thesearch unit 112 refers to the document semanticstructure position information 121 so as to perform a semantic structure search process based on a search request, and theoutput unit 113 outputs the search result. -
FIG. 2 is a flowchart illustrating an example of a semantic structure search process performed by the semanticstructure search device 101 illustrated inFIG. 1 . First, thesearch unit 112 generates a plurality of search semantic symbols from a search request (step 201). Next, thesearch unit 112 refers to the document semanticstructure position information 121 so as to specify a position of a specific word that corresponds to the search request in the search target document, by the plurality of search semantic symbols and the document semantic structure position information (step 202). Then, theoutput unit 113 outputs the search result including the specific word and the position of the specific word in the search target document (step 203). - By using the semantic
structure search device 101 illustrated inFIG. 1 , it is possible to specify a word corresponding to a search request and the position of that word in the document in a semantic structure search. -
FIG. 3 illustrates a specific example of the semantic structure search device 101 illustrated in FIG. 1. The semantic structure search device 101 illustrated in FIG. 3 includes the storage unit 111, the search unit 112, the output unit 113, an analysis unit 301, generation units 302 through 304, and an encoding unit 305.
- The analysis unit 301 conducts a morpheme analysis and a semantic analysis on each of a plurality of documents 311 stored in the storage unit 111, generates an analysis result 312 including a morpheme analysis result and a semantic analysis result, and stores the analysis result 312 in the storage unit 111. The generation unit 302 generates, from the analysis result 312, semantic symbol information 313 representing a correspondence relationship between a word and a semantic symbol.
- The generation unit 303 generates, from the analysis result 312 and the semantic symbol information 313, mapping information 314, which represents a correspondence relationship between a word and code information, and stores the information in the storage unit 111. The generation unit 304 generates, from the analysis result 312 and the semantic symbol information 313, bit map information 315, which represents presence or absence of each of a plurality of words in each document 311, and stores the information in the storage unit 111. The encoding unit 305 encodes each document 311 by using the analysis result 312, the semantic symbol information 313 and the mapping information 314 so as to generate the document semantic structure position information 121 for each document 311, and stores the information in the storage unit 111.
- The search unit 112 refers to the document semantic structure position information 121, the analysis result 312, the semantic symbol information 313, the mapping information 314 and the bit map information 315 so as to perform a semantic structure search process based on the search request.
- First, by referring to FIG. 4 through FIG. 13, explanations will be given for a compression process that compresses the data of the document 311. -
FIG. 4 is a flowchart illustrating an example of a compression process performed by the semantic structure search device 101 illustrated in FIG. 3. First, the analysis unit 301 conducts a morpheme analysis on the document 311 so as to generate a morpheme analysis result (step 401), and also conducts a semantic analysis on the document 311 so as to generate a semantic analysis result (step 402). Then, the analysis unit 301 stores, in the storage unit 111, the analysis result 312 including the morpheme analysis result and the semantic analysis result. The processes of steps 401 and 402 are performed for each document 311.
- Next, the generation unit 302 generates the semantic symbol information 313 from the analysis result 312, and stores the information in the storage unit 111 (step 403).
- Next, the generation unit 303 generates the mapping information 314 for the document 311 from the analysis result 312 and the semantic symbol information 313, and stores the information in the storage unit 111 (step 404). Next, the generation unit 304 generates the bit map information 315 for the document 311 from the analysis result 312 and the semantic symbol information 313, and stores the information in the storage unit 111 (step 405). Then, the encoding unit 305 encodes the document 311 by using the analysis result 312, the semantic symbol information 313 and the mapping information 314 so as to generate the document semantic structure position information 121, and stores the information in the storage unit 111 (step 406). The processes in steps 404 through 406 are performed for each document 311. -
FIG. 5 illustrates an example of a text file corresponding to the analysis result 312 of the document 311, the text file being generated in steps 401 and 402 illustrated in FIG. 4. A text file 501 illustrated in FIG. 5 includes an analysis result 502 of each sentence included in the document 311, and the analysis result 502 of each sentence includes a morpheme analysis result 511 and a semantic analysis result 512.
- The morpheme analysis result 511 includes a word 521 included in a sentence, a document ID 522, a sentence ID 523, a word position 524 in the sentence, a word data length 525, a semantic symbol 526 corresponding to the word, and attribute information 527. The attribute information 527 includes information representing, for example, the word class of the word and whether or not the word is a categorematic word. In some cases, one morpheme obtained through a morpheme analysis may be treated as one word, and in other cases, a compound word consisting of a plurality of morphemes may be treated as one word.
- For example, the document ID 522 and the sentence ID 523 of “GYOUMU” (meaning “business” in English) are “7502” and “4”, respectively. The word position 524 of “GYOUMU” represents the position corresponding to “6” bytes from the beginning of the sentence, while the data length 525 represents “4” bytes. The semantic symbol 526 of “GYOUMU” is “BUSINESS=GYOUMU”, and the attribute information 527 is “:N: IW)”. “N” in the attribute information 527 represents a noun, and “IW” represents a categorematic word.
- The semantic analysis result 512 includes a semantic minimum unit 531 included in a sentence, a document ID 532, a sentence ID 533, a phrase 534 corresponding to the semantic minimum unit, a starting position 535 of the phrase in the sentence and an ending position 536 of the phrase. The semantic minimum unit 531 includes a source node, a termination node and an arc running from the source node to the termination node. The semantic analysis result 512 further includes a starting position 537 of a word corresponding to the source node, an ending position 538 of the word corresponding to the source node, a starting position 539 of a word corresponding to the termination node and an ending position 540 of the word corresponding to the termination node.
- For example, the source node of semantic minimum unit “INTERNET=7--<CONCERN>-->ENTREPRENEUR” is “INTERNET=7”, the termination node is “ENTREPRENEUR”, and the arc is “--<CONCERN>-->”.
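As an illustration of the notation in the example above, the following is a minimal C sketch that splits the textual form of a pattern with both nodes present into its source node, arc and termination node. The type `MinimumUnit`, the function `parse_minimum_unit` and the 64-byte buffers are hypothetical names and sizes chosen for this sketch; the source does not specify how the text file 501 is parsed, nor how a missing node is written.

```c
#include <string.h>

/* Hypothetical container for one parsed semantic minimum unit. */
typedef struct {
    char source[64];      /* semantic symbol of the source node */
    char arc[64];         /* arc label without the "--<" and ">-->" markers */
    char termination[64]; /* semantic symbol of the termination node */
} MinimumUnit;

/* Splits "SRC--<ARC>-->DST" into its three parts.
 * Returns 0 on success, or -1 if a marker is missing or a part does not
 * fit into the fixed-size buffers. */
int parse_minimum_unit(const char *text, MinimumUnit *out)
{
    const char *open = strstr(text, "--<");
    if (open == NULL)
        return -1;
    const char *close = strstr(open + 3, ">-->");
    if (close == NULL)
        return -1;

    size_t src_len = (size_t)(open - text);
    size_t arc_len = (size_t)(close - (open + 3));
    const char *dst = close + 4;
    size_t dst_len = strlen(dst);

    if (src_len >= sizeof out->source || arc_len >= sizeof out->arc ||
        dst_len >= sizeof out->termination)
        return -1;

    memcpy(out->source, text, src_len);
    out->source[src_len] = '\0';
    memcpy(out->arc, open + 3, arc_len);
    out->arc[arc_len] = '\0';
    memcpy(out->termination, dst, dst_len + 1); /* includes the terminator */
    return 0;
}
```

Applied to “INTERNET=7--<CONCERN>-->ENTREPRENEUR”, the sketch yields the source node “INTERNET=7”, the arc label “CONCERN” and the termination node “ENTREPRENEUR”.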
- The document ID 532, the sentence ID 533 and the phrase 534 of this semantic minimum unit are “7502”, “4” and “JIGYOUSHA NO DETASENTA TO INTANETTO”, respectively. The starting position 535 of “JIGYOUSHA NO DETASENTA TO INTANETTO” represents the position corresponding to “42” bytes from the beginning of the sentence and the ending position 536 represents the position corresponding to “78” bytes from the beginning of the sentence.
- The starting position 537 of word “INTANETTO” corresponding to the source node represents the position corresponding to “64” bytes from the beginning of the sentence and the ending position 538 represents the position corresponding to “78” bytes from the beginning of the sentence. The starting position 539 of word “JIGYOUSHA” corresponding to the termination node represents the position corresponding to “42” bytes from the beginning of the sentence and the ending position 540 represents the position corresponding to “48” bytes from the beginning of the sentence. -
FIG. 6 illustrates a first example of a word code dictionary corresponding to the semantic symbol information 313 that is generated in step 403 illustrated in FIG. 4. The word code dictionary illustrated in FIG. 6 represents a correspondence relationship between a dictionary ID and an entry. An entry includes a word, attribute information and a semantic symbol, and a dictionary ID represents a word code assigned to an entry. By referring to such a word code dictionary, it is possible to compress the document 311 by replacing a word specified by an entry with a word code specified by a dictionary ID.
- For example, in entry “HONYAKU+noun+abstract thing+TRANSLATION”, which corresponds to dictionary ID “5023”, “HONYAKU” is a word, “noun” and “abstract thing” are attribute information, and “TRANSLATION” is a semantic symbol. “noun” represents a word class, and “abstract thing” represents a category of a word. The word code dictionary illustrated in FIG. 6 also represents a correspondence relationship between a word and a semantic symbol.
- By using a combination of a word, attribute information and a semantic symbol as an entry of a word code dictionary, it is possible to distinguish a plurality of identical words having different meanings.
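The role of the three-part entry can be sketched in C as follows. This is only an illustration under assumptions: the struct `DictEntry`, the function `find_dictionary_id`, the linear search and the return value -1 for "not found" are hypothetical and are not taken from the source, which does not describe how the word code dictionary is stored or searched.

```c
#include <string.h>

/* One word code dictionary entry: a dictionary ID plus the combination of
 * word, attribute information and semantic symbol (hypothetical layout). */
typedef struct {
    int dictionary_id;
    const char *word;
    const char *attributes;
    const char *semantic_symbol;
} DictEntry;

/* Returns the dictionary ID of the entry whose three fields all match,
 * or -1 if no entry matches. Matching on all three fields is what lets
 * identical words with different meanings receive different IDs. */
int find_dictionary_id(const DictEntry *dict, size_t count,
                       const char *word, const char *attributes,
                       const char *semantic_symbol)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(dict[i].word, word) == 0 &&
            strcmp(dict[i].attributes, attributes) == 0 &&
            strcmp(dict[i].semantic_symbol, semantic_symbol) == 0)
            return dict[i].dictionary_id;
    }
    return -1;
}
```

With a dictionary holding the entry “HONYAKU+noun+abstract thing+TRANSLATION” under dictionary ID “5023”, a lookup with exactly those three fields returns 5023, while a lookup with the same word but a different semantic symbol returns a different ID or -1.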
-
FIG. 7 illustrates a second example of a word code dictionary corresponding to the semantic symbol information 313. The word code dictionary illustrated in FIG. 7 has a configuration in which semantic symbols have been removed from the entries illustrated in FIG. 6. In such a case, as the semantic symbol information 313, a word dictionary representing a correspondence relationship between a combination of a word and attribute information and a semantic symbol is used together with a word code dictionary. By referring to the word code dictionary illustrated in FIG. 7 and a word dictionary, it is possible to uniquely determine a correspondence relationship between a dictionary ID and a semantic symbol. -
FIG. 8 illustrates a role of the mapping information 314 generated in step 404 illustrated in FIG. 4. The mapping information 314 represents a correspondence relationship between code information (intra-document semantic symbol ID) representing a semantic symbol that corresponds to each word included in each document 311 and the dictionary ID of a word code dictionary.
- For example, intra-document semantic symbol ID “0” of document ID “0” is associated with dictionary ID “5023” while intra-document semantic symbol ID “1” is associated with dictionary ID “7025”. Intra-document semantic symbol ID “2” is associated with dictionary ID “8653”.
- Also, intra-document semantic symbol ID “0” of document ID “1” is associated with dictionary ID “7025” while intra-document semantic symbol ID “1” is associated with dictionary ID “8653”.
- Even when the number of entries in a word code dictionary is immense, the total number of semantic symbols corresponding to words included in one document is limited, and accordingly intra-document semantic symbol IDs can be expressed by data shorter than the dictionary ID. In view of this, by replacing dictionary IDs with intra-document semantic symbol IDs by using the mapping information 314, it is possible to increase the rate of compression based on encoding of words. For example, when an integer equal to or greater than zero is used as an intra-document semantic symbol ID, the maximum value can be kept to between several hundred and several thousand.
- Next, an example of a process of generating the mapping information 314 will be explained. The generation unit 303 sequentially reads the analysis result 312 of each document 311, assigns an intra-document semantic symbol ID to each semantic symbol and associates the intra-document semantic symbol IDs with dictionary IDs so as to generate the mapping information 314.
- First, the generation unit 303 allocates a memory space for the mapping information 314. In the case of C++, for example, the generation unit 303 uses the operator “new” so as to secure the array below.
- #define MAX_DOCWORD 1024
  unsigned int mapping_dic[number of documents][MAX_DOCWORD];
- MAX_DOCWORD defines the maximum value for the total number of semantic symbols included in one document 311, and mapping_dic[i][j] represents the dictionary ID corresponding to the semantic symbol of intra-document semantic symbol ID “j” appearing in the document 311 with document ID “i”.
- Next, the generation unit 303 secures the following array by the operator “new”.
- int mapping_dic_index[number of documents];
- mapping_dic_index[i] represents the total number of semantic symbols appearing in the document 311 with document ID “i”.
- Next, the generation unit 303 reads the analysis result 312 of one document 311, generates a semantic symbol list of that document 311, and assigns intra-document semantic symbol IDs in that document 311 in ascending order of the dictionary IDs.
- For example, when the semantic symbols appearing in the document 311 with document ID “0” are “WORK”, “TRANSLATOR” and “TRANSLATION”, the corresponding dictionary IDs are “7025”, “8653” and “5023”, respectively, according to the word code dictionary illustrated in FIG. 8. When these dictionary IDs are sorted in ascending order, the result is “5023”, “7025” and “8653”. Accordingly, the following intra-document semantic symbol IDs are assigned to the respective semantic symbols.
- mapping_dic[0][0]=5023;
- mapping_dic[0][1]=7025;
- mapping_dic[0][2]=8653;
- Also, the total number of the semantic symbols appearing in the document 311 with document ID “0” is “3”, which results in mapping_dic_index[0]=3.
- When the generation of mapping_dic[i][j] and mapping_dic_index[i] has been completed for all of the documents 311, the generation unit 303 outputs the two generated arrays to the following two files corresponding to the mapping information 314.
- (1) File map.dic
- The content of mapping_dic[i][j] is output to file map.dic as below.
-
for(i=0;i<number of documents;i++) {
    /* write only the entries in use for document i */
    fwrite(mapping_dic[i],sizeof(unsigned int),mapping_dic_index[i],fp_map_dic);
}
- (2) File map.idx
- The content of mapping_dic_index[i] is output to file map.idx, in the form of cumulative offsets, as below.
-
int loc=0;
for(i=0;i<number of documents;i++) {
    /* loc is the offset in map.dic at which document i starts */
    fwrite(&loc,sizeof(int),1,fp_map_idx);
    loc+=mapping_dic_index[i];
}
-
FIG. 9 illustrates an example of the mapping information 314 generated in the above manner. “OFFSET” of file map.idx represents the document ID, and “CONTENT” of file map.idx represents the offset in file map.dic. “OFFSET” of file map.dic represents the position corresponding to the intra-document semantic symbol ID of each semantic symbol, and “CONTENT” of file map.dic represents the dictionary ID.
- When, for example, document ID “n” and intra-document semantic symbol ID “x” have been given, offset “m” of map.dic corresponding to document ID “n” is obtained by referring to the position of offset “n” of map.idx and reading content “m”. Then, by referring to the position of offset “m+x” of map.dic and reading the content, the dictionary ID corresponding to intra-document semantic symbol ID “x” can be obtained.
- When document ID “1” and intra-document semantic symbol ID “1” have been given, offset “3” of map.dic corresponding to document ID “1” is obtained by referring to the position of offset “1” of map.idx and reading content “3”. Then, by referring to the position of offset “3+1=4” of map.dic and reading the content, the dictionary ID “8653” corresponding to intra-document semantic symbol ID “1” can be obtained.
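The construction of the two tables and the lookup described above can be sketched in C as follows. This is a minimal in-memory sketch, not the described implementation: the functions `add_document` and `lookup_dictionary_id`, the use of `qsort`, and the flat arrays `map_dic` and `map_idx` standing in for the files map.dic and map.idx are assumptions made for illustration.

```c
#include <stdlib.h>

/* Comparison callback for qsort: ascending unsigned dictionary IDs. */
static int compare_ids(const void *a, const void *b)
{
    unsigned int x = *(const unsigned int *)a;
    unsigned int y = *(const unsigned int *)b;
    return (x > y) - (x < y);
}

/* Appends one document to the flattened tables.
 * ids     : distinct dictionary IDs appearing in the document (sorted in place)
 * map_dic : flattened dictionary IDs, standing in for file map.dic
 * map_idx : map_idx[d] is the offset of document d in map_dic (file map.idx)
 * doc_id  : ID of the document being added; documents are added in ID order
 * used    : number of entries already stored in map_dic
 * Returns the new number of entries stored in map_dic. */
size_t add_document(unsigned int *ids, size_t count,
                    unsigned int *map_dic, size_t *map_idx,
                    size_t doc_id, size_t used)
{
    /* Intra-document semantic symbol IDs follow ascending dictionary ID. */
    qsort(ids, count, sizeof ids[0], compare_ids);
    map_idx[doc_id] = used;
    for (size_t j = 0; j < count; j++)
        map_dic[used + j] = ids[j];
    return used + count;
}

/* Dictionary ID for intra-document semantic symbol ID x of document n:
 * read offset m from map_idx, then read map_dic at offset m + x. */
unsigned int lookup_dictionary_id(const unsigned int *map_dic,
                                  const size_t *map_idx,
                                  size_t n, size_t x)
{
    return map_dic[map_idx[n] + x];
}
```

Adding document “0” with dictionary IDs 7025, 8653 and 5023 and then document “1” with 7025 and 8653 gives offsets 0 and 3, and looking up document ID “1” with intra-document semantic symbol ID “1” reads position 3+1=4 and returns 8653, in line with the example above.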
-
FIG. 10 illustrates an example of the bit map information 315 generated in step 405 in FIG. 4. The bit map information 315 illustrated in FIG. 10 represents presence or absence, in the document specified by a document ID, of a word registered in the word code dictionary. Logic “1” represents the presence of such a word and logic “0” represents the absence of such a word.
- For example, a word corresponding to dictionary ID “1088” is not included in the document 311 with document ID “0”, and is included in the document 311 with document ID “1”. By using such bit map information 315, it is possible to narrow down, at a high speed, the documents 311 including a specified word from among many documents 311.
- Next, explanations will be given for an example of the encoding process in step 406 in FIG. 4. The encoding unit 305 uses intra-document semantic symbol IDs so as to encode the semantic structures of the document 311, and thereby generates the document semantic structure position information 121.
- The semantic structures of the document 311 are expressed by semantic minimum units, and semantic minimum units are categorized into the following three patterns.
- Pattern 1: (node 1, node 2, arc)
- Pattern 2: (node 1, NIL, arc)
- Pattern 3: (NIL, node 1, arc)
- Node 1 and node 2 of pattern 1 represent the source node and the termination node, respectively, NIL of pattern 2 indicates that a termination node does not exist, and NIL of pattern 3 indicates that a source node does not exist.
- A pattern type can be expressed by a code of two bits. Also, when the total number of the types of words included in one document 311 is equal to or smaller than 32768, node 1 and node 2 can each be expressed by a code of 15 bits or shorter. However, because the same word can appear a plurality of times in one sentence, in order to distinguish such words, it is desirable to add a code of 4 bits representing the ordinal number of the occurrence of the word counting from the beginning of the sentence. Also, when the total number of types of arcs used in the semantic minimum units is equal to or smaller than 256, it is possible to express an arc by using a code of 1 byte (8 bits) or shorter. -
FIG. 11A illustrates an example of a unit code representing the semantic minimum unit of pattern 1, and FIG. 11B illustrates an example of a unit code representing the semantic minimum units of patterns 2 and 3.
- In the unit code illustrated in FIG. 11A, the first two bits of the first byte represent a pattern type. The next three bits of the first byte and the first one bit of the second byte represent the order of node 1 (the ordinal number of the occurrence of the word, among a plurality of the same words, counting from the beginning of a sentence). The last three bits of the first byte and the first one bit of the fourth byte represent the order of node 2. The remaining seven bits of the second byte and all the bits of the third byte represent the intra-document semantic symbol ID of node 1, the remaining seven bits of the fourth byte and all the bits of the fifth byte represent the intra-document semantic symbol ID of node 2, and all the bits in the sixth byte represent an arc type. Note that when the order of a node can be represented by three bits, so that the first one bit of the second byte and the first one bit of the fourth byte are not needed, these bits need not be used. In the embodiments described below, a case where neither the first one bit of the second byte nor the first one bit of the fourth byte is used is explained for the sake of simplicity.
- In the unit code illustrated in FIG. 11B, the first two bits of the first byte represent a pattern type, the next six bits of the first byte represent the order of node 1, the second and third bytes represent the intra-document semantic symbol ID of node 1, and the fourth byte represents an arc type.
- By using a unit code as described above, it is possible to represent a semantic minimum unit by using four or six bytes while holding a link between the symbol information of a node in a semantic minimum unit and a word in an original sentence.
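The six-byte layout described for FIG. 11A can be sketched in C as follows. The sketch rests on two assumptions that the text does not settle: "first" bits are taken to be the most significant bits of a byte, and the extra bit in the second (or fourth) byte is taken to be the most significant bit of a 4-bit order, so that it stays zero whenever the order fits in three bits. The names `UnitCode`, `pack_unit_code` and `unpack_unit_code` are hypothetical.

```c
#include <stdint.h>

/* Fields of a pattern-1 unit code (two nodes and an arc). */
typedef struct {
    unsigned pattern; /* 2 bits: pattern type */
    unsigned order1;  /* up to 4 bits: occurrence order of node 1 */
    unsigned order2;  /* up to 4 bits: occurrence order of node 2 */
    unsigned id1;     /* 15 bits: intra-document semantic symbol ID of node 1 */
    unsigned id2;     /* 15 bits: intra-document semantic symbol ID of node 2 */
    unsigned arc;     /* 8 bits: arc type */
} UnitCode;

/* Packs the fields into six bytes following the FIG. 11A description. */
void pack_unit_code(const UnitCode *u, uint8_t out[6])
{
    out[0] = (uint8_t)(((u->pattern & 0x3u) << 6) |
                       ((u->order1 & 0x7u) << 3) |
                       (u->order2 & 0x7u));
    out[1] = (uint8_t)((((u->order1 >> 3) & 0x1u) << 7) |
                       ((u->id1 >> 8) & 0x7Fu));
    out[2] = (uint8_t)(u->id1 & 0xFFu);
    out[3] = (uint8_t)((((u->order2 >> 3) & 0x1u) << 7) |
                       ((u->id2 >> 8) & 0x7Fu));
    out[4] = (uint8_t)(u->id2 & 0xFFu);
    out[5] = (uint8_t)(u->arc & 0xFFu);
}

/* Recovers the fields from six bytes packed by pack_unit_code. */
void unpack_unit_code(const uint8_t in[6], UnitCode *u)
{
    u->pattern = (unsigned)(in[0] >> 6);
    u->order1  = ((unsigned)(in[0] >> 3) & 0x7u) | ((unsigned)(in[1] >> 7) << 3);
    u->order2  = ((unsigned)in[0] & 0x7u) | ((unsigned)(in[3] >> 7) << 3);
    u->id1     = ((unsigned)(in[1] & 0x7Fu) << 8) | in[2];
    u->id2     = ((unsigned)(in[3] & 0x7Fu) << 8) | in[4];
    u->arc     = in[5];
}
```

Because the order and the intra-document semantic symbol ID occupy fixed bit positions, a search can compare the ID fields of a stored unit code directly and then read the order fields to locate the matching word occurrence.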
-
FIG. 12 is a flowchart illustrating an example of an encoding process in which one semantic minimum unit is encoded so as to generate a unit code. First, the encoding unit 305 refers to a semantic analysis result included in the analysis result 312, determines the pattern type of the semantic minimum unit (step 1201), and determines the arc type (step 1202).
- Next, the encoding unit 305 obtains the orders of the nodes included in the semantic minimum unit (step 1203). In the case of pattern 1, the encoding unit 305 obtains the orders of node 1 and node 2, and in the case of patterns 2 and 3, the encoding unit 305 obtains the order of node 1.
- Next, the encoding unit 305 obtains the intra-document semantic symbol IDs of the nodes (step 1204). In the case of pattern 1, the encoding unit 305 obtains the intra-document semantic symbol IDs of nodes 1 and 2, and in the case of patterns 2 and 3, the encoding unit 305 obtains the intra-document semantic symbol ID of node 1.
- To obtain these IDs, the encoding unit 305 refers to the morpheme analysis result included in the analysis result 312 so as to obtain the word, the attribute information and the semantic symbol corresponding to a node included in the semantic minimum unit. Next, the encoding unit 305 refers to the semantic symbol information 313 (word code dictionary) so as to obtain the dictionary ID corresponding to the combination of the word, the attribute information and the semantic symbol. Then, the encoding unit 305 obtains the intra-document semantic symbol ID from the mapping information 314 on the basis of the document ID and the dictionary ID.
- When, for example, the intra-document semantic symbol ID corresponding to document ID “0” and dictionary ID “5023” is to be obtained by using the mapping information 314 illustrated in FIG. 9, the encoding unit 305 refers to the position of offset “0” of map.idx so as to obtain content “0”. Next, the encoding unit 305 refers to the position of offset “0” of map.dic and detects that the content in that position is identical to dictionary ID “5023”. In this case, because dictionary ID “5023” has been detected without shifting the referred-to position in map.dic, the encoding unit 305 determines the intra-document semantic symbol ID to be “0”.
- Also, when the intra-document semantic symbol ID corresponding to document ID “2” and dictionary ID “35” is to be obtained, the encoding unit 305 refers to the position of offset “2” of map.idx so as to obtain content “5”. Next, the encoding unit 305 refers to the position of offset “5” of map.dic and shifts the referred-to position to the right from that position one by one until it detects that the content in the position of offset “8” is identical to dictionary ID “35”. In such a case, because dictionary ID “35” has been detected by shifting the referred-to position in map.dic by three, the encoding unit 305 determines the intra-document semantic symbol ID to be “3”. -
FIG. 13 illustrates an example of the document semantic structure position information 121 generated by the encoding process illustrated in FIG. 12. Unit codes 1301 through 1303 illustrated in FIG. 13 correspond to the unit code illustrated in FIG. 11A.
- For example, pattern type “0” of the unit code 1301 indicates that it is a semantic minimum unit of pattern 1, order “5” of node 1 indicates that it is the fifth word counting from the beginning of the sentence, and order “8” of node 2 indicates that it is the eighth word counting from the beginning of the sentence. The intra-document semantic symbol ID of node 1 is “2”, and the intra-document semantic symbol ID of node 2 and the arc type are “29” and “21”, respectively.
- These unit codes 1301 through 1303 are grouped in the document semantic structure position information 121 by the sentence ID of the sentence to which each unit code belongs.
- As described above, by encoding a word included in a morpheme analysis result and a semantic symbol included in a semantic analysis result of the document 311 in a single series of encoding, it is possible to include correspondence relationships between words and semantic symbols in the document semantic structure position information 121 effectively. This makes it possible to directly access words in the original sentence from semantic symbols in the document semantic structure position information 121 that is in a compressed state.
- Next, by referring to FIG. 14 through FIG. 19, explanations will be given for a semantic structure search that searches for a semantic structure of the document 311. -
FIG. 14 is a flowchart illustrating a specific example of a semantic structure search process performed by the semantic structure search device 101 illustrated in FIG. 3. First, the analysis unit 301 conducts a morpheme analysis on a search request described in the form of a natural sentence so as to generate a morpheme analysis result (step 1401), and also conducts a semantic analysis on the search request so as to generate a semantic analysis result (step 1402). Then, the analysis unit 301 stores, in the storage unit 111, the morpheme analysis result and the semantic analysis result obtained for the search request.
- Next, the search unit 112 generates, from the result of the semantic analysis conducted on the search request, a search key including a plurality of search semantic symbols expressing a semantic structure of the search request, and stores the search key in the storage unit 111 (step 1403).
- Next, the search unit 112 refers to the semantic symbol information 313, and specifies the combinations of search words, attribute information thereof and semantic symbols thereof that correspond to the plurality of search semantic symbols included in the search key (step 1404). Then, the search unit 112 refers to the bit map information 315 so as to specify at least one search target document including the specified search words (step 1405).
- Next, the search unit 112 refers to the semantic symbol information 313 and the mapping information 314 so as to encode the search key, and generates search code information (step 1406).
- Next, the search unit 112 searches the document semantic structure position information 121 of the search target document for a unit code that is identical to the search code information (step 1407). Then, on the basis of the order of the node included in a detected unit code, the search unit 112 specifies the position of the search word in the search target document (step 1408). The processes in steps 1407 and 1408 are executed for each search target document.
- Next, the output unit 113 outputs a search result that represents the search target document in which a unit code identical to the search code information has been detected, each search word used for the search and the position of each search word in the search target document (step 1409).
- When a search request of “TARO WA HANAKO NI HON O AGETA (Taro gave Hanako a book)” has been input, the search unit 112 can generate search keys that represent the following semantic minimum units.
- (GIVE, HANAKO, OBJECT)
- (GIVE, TARO, SUBJECT)
- (GIVE, BOOK, TARGET)
- “GIVE” in the search key of (GIVE, HANAKO, OBJECT) represents the semantic symbol of the source node corresponding to “AGE”, “HANAKO” in the search key of (GIVE, HANAKO, OBJECT) represents the semantic symbol of the termination node corresponding to “HANAKO”, and “OBJECT” represents the arc.
- When the document 311 is to be searched by using this search key, the search unit 112 obtains, from the semantic symbol information 313, the dictionary IDs and entries respectively corresponding to “GIVE” and “HANAKO”, which are included in (GIVE, HANAKO, OBJECT). Thereby, for example, the six pieces of information below are obtained.
- Dictionary ID:1088:AGE+VERB+GIVE
- Dictionary ID:2183:ATAE+VERB+GIVE
- Dictionary ID:4021:ZOUYO+INFLECTED FORM OF NOUN+GIVE
- Dictionary ID:5911:KIFU+INFLECTED FORM OF NOUN+GIVE
- Dictionary ID:9827:TEWATASHI+VERB+GIVE
- Dictionary ID:200291:HANAKO+NOUN+HANAKO
- Then, the
search unit 112 combines the dictionary ID corresponding to “GIVE” and the dictionary ID corresponding to “HANAKO” so as to generate search formulas as below. - (1088 AND 200291)OR
- (2183 AND 200291)OR
- (4021 AND 200291)OR
- (5911 AND 200291)OR
- (9827 AND 200291)
- Next, the search unit 112 refers to the bit map information 315 illustrated in FIG. 10 so as to obtain a set of document IDs that satisfy the search formulas. For example, a document ID satisfying (1088 AND 200291) can be obtained by calculating the logical product of the row of “1088” and the row of “200291” in the bit map information 315 as illustrated in FIG. 15A. In such a case, the document ID of the document 311 including “AGE” and “HANAKO” is “3”.
- The document IDs satisfying (2183 AND 200291) can be obtained by calculating the logical product of the rows of “2183” and “200291” in the bit map information 315 as illustrated in FIG. 15B. In such a case, the document IDs of the documents 311 including “ATAE” and “HANAKO” are “24” and “522”.
- The document ID satisfying (4021 AND 200291) can be obtained by calculating the logical product of the rows of “4021” and “200291” in the bit map information 315 as illustrated in FIG. 15C. In such a case, the document ID of the document 311 including “ZOUYO” and “HANAKO” is “9283”.
- The document ID satisfying (5911 AND 200291) can be obtained by calculating the logical product of the rows of “5911” and “200291” in the bit map information 315 as illustrated in FIG. 15D. In such a case, no document 311 including “KIFU” and “HANAKO” exists.
- The document ID satisfying (9827 AND 200291) can be obtained by calculating the logical product of the rows of “9827” and “200291” in the bit map information 315 as illustrated in FIG. 15E. In such a case, the document ID of the document 311 including “TEWATASHI” and “HANAKO” is “82”. -
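The logical product of two bitmap rows can be sketched in C as follows. The storage layout is an assumption of this sketch, since the source only shows rows of logic “1” and “0”: each row is held as an array of 64-bit words, and bit d of a row (bit d % 64 of word d / 64) is 1 when the document with document ID d contains the word of that row. The function name `and_rows` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Computes the logical product of two bitmap rows and collects the IDs of
 * the documents whose bit is set in both rows.
 * row_a, row_b : rows of equal length, `words` 64-bit words each
 * out          : receives matching document IDs in ascending order
 * max_out      : capacity of out; matches beyond it are not stored
 * Returns the number of document IDs stored in out. */
size_t and_rows(const uint64_t *row_a, const uint64_t *row_b, size_t words,
                size_t *out, size_t max_out)
{
    size_t found = 0;
    for (size_t w = 0; w < words; w++) {
        uint64_t both = row_a[w] & row_b[w]; /* 64 documents at once */
        for (unsigned bit = 0; bit < 64; bit++) {
            if (((both >> bit) & 1u) && found < max_out)
                out[found++] = w * 64 + bit;
        }
    }
    return found;
}
```

A full search formula is then the union of the results for each conjunct, for example the results for (1088 AND 200291), (2183 AND 200291) and so on. Since one AND instruction covers 64 documents, candidate documents can be narrowed down without reading any document body.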
FIG. 16 illustrates the document IDs obtained through the logical products illustrated in FIG. 15A through FIG. 15E. As a result, the set of the document IDs satisfying the search formulas is (3, 24, 522, 9283, 82), which means that the number of the search target documents has been narrowed to five.
- Next, similarly to step 1204 illustrated in FIG. 12, the search unit 112 replaces the dictionary IDs of the search formulas with intra-document semantic symbol IDs corresponding to the document IDs of the search target documents so as to generate search code information for each search target document. In such a case, different search target documents yield different pieces of search code information even when the same search formulas are used.
- When, for example, the intra-document semantic symbol ID corresponding to document ID “3” and dictionary ID “1088” is to be obtained by using the mapping information 314 illustrated in FIG. 9, the search unit 112 refers to the position of offset “3” in map.idx so as to obtain content “11”. Next, the search unit 112 refers to the position of offset “11” in map.dic and shifts the referred-to position to the right from that position one by one until it detects that the content in the position of offset “15” is identical to dictionary ID “1088”. In such a case, because dictionary ID “1088” has been detected by shifting the referred-to position by four in map.dic, the search unit 112 determines the intra-document semantic symbol ID to be “4”.
- When the intra-document semantic symbol ID corresponding to document ID “3” and dictionary ID “200291” is to be obtained, the search unit 112 refers to the position of offset “3” in map.idx so as to obtain content “11”. Next, the search unit 112 refers to the position of offset “11” in map.dic and shifts the referred-to position to the right from that position one by one until it detects that the content in the position of offset “22” is identical to dictionary ID “200291”. In such a case, because dictionary ID “200291” has been detected by shifting the referred-to position by eleven in map.dic, the search unit 112 determines the intra-document semantic symbol ID to be “11”.
- Also, when the intra-document semantic symbol ID corresponding to document ID “24” and dictionary ID “2183” is to be obtained, the search unit 112 refers to the position of offset “24” in map.idx so as to obtain content “1690”. Next, the search unit 112 refers to the position of offset “1690” in map.dic and shifts the referred-to position to the right from that position one by one until it detects that the content in the position of offset “1694” is identical to dictionary ID “2183”. In such a case, because dictionary ID “2183” has been detected by shifting the referred-to position by four in map.dic, the search unit 112 determines the intra-document semantic symbol ID to be “4”.
- Also, when the intra-document semantic symbol ID corresponding to document ID “24” and dictionary ID “200291” is to be obtained, the search unit 112 refers to the position of offset “24” in map.idx so as to obtain content “1690”. Next, the search unit 112 refers to the position of offset “1690” in map.dic and shifts the referred-to position to the right from that position one by one until it detects that the content in the position of offset “1705” is identical to dictionary ID “200291”. In such a case, because dictionary ID “200291” has been detected by shifting the referred-to position by fifteen in map.dic, the search unit 112 determines the intra-document semantic symbol ID to be “15”.
- In the above manner, search code information as below is generated from the search formulas.
- (1088 AND 200291)
- →search code information:(4,11) for document ID “3”
- (2183 AND 200291)
- →search code information:(4,15) for document ID “24”
- →search code information:(5,14) for document ID “522”
- (4021 AND 200291)
- →search code information:(4,10) for document ID “9283”
- (9827 AND 200291)
- →search code information:(23,106) for document ID “82”
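The conversion from a dictionary ID to an intra-document semantic symbol ID, which produces each number in the search code information above, can be sketched in C as a linear scan. The function name `find_intra_document_id` is hypothetical, and the sketch assumes that `map_idx` carries one extra trailing entry equal to the total length of `map_dic`, so that the end of each document's range is known; the source does not state how the end of a range is detected.

```c
#include <stddef.h>

/* Returns the intra-document semantic symbol ID of dictionary_id in
 * document n, or -1 if the document does not contain that dictionary ID.
 * map_dic : flattened dictionary IDs (content of map.dic)
 * map_idx : start offset of each document in map_dic (content of map.idx),
 *           with one extra trailing entry holding the total length of
 *           map_dic (an assumption of this sketch). */
long find_intra_document_id(const unsigned int *map_dic, const size_t *map_idx,
                            size_t n, unsigned int dictionary_id)
{
    size_t begin = map_idx[n];
    size_t end = map_idx[n + 1];
    for (size_t pos = begin; pos < end; pos++) {
        if (map_dic[pos] == dictionary_id)
            return (long)(pos - begin); /* number of positions shifted */
    }
    return -1;
}
```

Calling the function once for the dictionary ID of each node of a search formula, with the document ID of one search target document, gives the pair used as search code information for that document. Because the dictionary IDs of a document are stored in ascending order, a binary search over the same range would be a possible alternative to the linear scan.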
- When, for example, FIG. 13 illustrates the document semantic structure position information 121 with document ID “3”, the search unit 112 uses search code information (4, 11) for document ID “3” so as to search the document semantic structure position information 121 illustrated in FIG. 13. Then, the search unit 112 detects the unit code 1303 including intra-document semantic symbol ID “4” of node 1 and intra-document semantic symbol ID “11” of node 2 so as to obtain order “3” of node 1 and order “1” of node 2.
- It is thereby determined that search word “AGE” corresponding to intra-document semantic symbol ID “4” is the third “AGE” counting from the beginning of the sentence, and that search word “HANAKO” corresponding to intra-document semantic symbol ID “11” is the first “HANAKO” counting from the beginning of the sentence.
- Then, the search unit 112 refers to the morpheme analysis result (which corresponds to the morpheme analysis result 511 illustrated in FIG. 5) included in the analysis result 312 so as to search the sentence with the sentence ID corresponding to the unit code 1303 for word “AGE”, which corresponds to “AGE+VERB+GIVE”. Then, the search unit 112 specifies the position in the search target document (which corresponds to the word position 524 illustrated in FIG. 5) of the third “AGE” counting from the beginning of the sentence among detected words “AGE” as the position of search word “AGE” in the search target document.
- Also, the search unit 112 refers to the morpheme analysis result included in the analysis result 312 so as to search the sentence with the sentence ID that corresponds to the unit code 1303 for word “HANAKO”, which corresponds to “HANAKO+NOUN+HANAKO”. Then, the search unit 112 specifies the position of the first “HANAKO” counting from the beginning of the sentence among detected words “HANAKO” as the position of search word “HANAKO” in the search target document.
- When the output unit 113 outputs the search result, the output unit 113 may, for example, display the texts of “AGE” and “HANAKO” at the specified positions in the search target document with emphasis.
- For N semantic symbols, the total number of combinations of two semantic symbols included in a semantic minimum unit is N*N, and accordingly the calculation amount and the data amount of the search results in a semantic structure search using a conventional semantic minimum unit are on the order of N*N. For example, when N=50, the total number of combinations is 50*50=2,500, while when N=5,000,000, the total number of combinations is 5,000,000*5,000,000=25,000,000,000,000.
- In contrast to this, according to the semantic structure search process illustrated in FIG. 14, the semantic minimum unit of a search target document and the semantic minimum unit of a search request are encoded together, so that it is sufficient to compare the encoded semantic minimum units, and accordingly the calculation amount and the data amount of search results are on the order of N.
- Also, in a conventional semantic structure search, a vast amount of data that represents correspondence relationships between morpheme analysis results and original sentences and a vast amount of data that represents correspondence relationships between semantic structure analysis results and morpheme analysis results are used for performing a search, which requires a large memory capacity.
- In contrast to this, the total number of words registered in a word code dictionary used as the semantic symbol information 313 is about several million and the total number of semantic symbols registered in the document semantic structure position information 121 is about several hundred. Accordingly, the amount of data of the document semantic structure position information 121 is reduced by about four digits from the data amount of the semantic symbol information 313, and thereby the reduction by sixteen digits (16=4*4) is expected for combinations of two semantic symbols.
- When a search is conducted for text data compressed by using the conventional LZ77 coding, all compressed data is expanded once and thereafter the search is conducted for the expanded data, which leads to reduced processing speeds. In contrast to this, in the semantic structure search process illustrated in FIG. 14, encoded semantic minimum units are compared without expanding the document 311 in a compressed state, which leads to accelerated processing speeds.
- FIG. 17 is a flowchart illustrating the semantic structure search process illustrated in FIG. 14 to which a process of calculating the score of a search target document has been added. The processes in steps 1701 through 1708 illustrated in FIG. 17 are similar to those in steps 1401 through 1408 illustrated in FIG. 14.
- In step 1709, the search unit 112 calculates the score of a search target document by using the score of a search key. Then, in step 1710, the output unit 113 ranks search target documents in accordance with the scores and outputs the search results.
- FIG. 18 illustrates a first example of scores of search formulas that are used as scores of search keys. In this example, for each search formula, a score thereof is calculated in advance. For calculations of score S of a search formula, the following equation, for example, can be used.
- S=idf1*N1+idf2*N2 (1)
- idf1 represents the inverse document frequency of a search word that corresponds to the first dictionary ID included in the search formula, and idf2 represents the inverse document frequency of a search word that corresponds to the second dictionary ID included in the search formula. N1 represents the number of times that the search word corresponding to the first dictionary ID appears in the search request, and N2 represents the number of times that the search word corresponding to the second dictionary ID appears in the search request.
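As a worked illustration of equation (1), the following Python sketch evaluates score S for one search formula. The function name `formula_score` and the numeric values are hypothetical and chosen only for the arithmetic; the description does not specify how the inverse document frequencies themselves are computed, so they are passed in as given values.

```python
def formula_score(idf1, n1, idf2, n2):
    """Equation (1): S = idf1*N1 + idf2*N2."""
    return idf1 * n1 + idf2 * n2


# Hypothetical values: the first search word has inverse document frequency
# 2.0 and appears once in the search request; the second has inverse
# document frequency 3.5 and appears twice.
score = formula_score(idf1=2.0, n1=1, idf2=3.5, n2=2)
print(score)  # 9.0
```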
- FIG. 19 illustrates a second example of scores of search formulas that are used as scores of search keys. In this example, for calculations of score S of a search formula, the following equation, for example, is used.
- S=idf11*N11+idf12*N12 (2)
- idf11 represents the inverse document frequency of a semantic symbol that corresponds to the first dictionary ID included in the search formula, and idf12 represents the inverse document frequency of a semantic symbol that corresponds to the second dictionary ID included in the search formula. N11 represents the number of times that the semantic symbol corresponding to the first dictionary ID appears in the search request, and N12 represents the number of times that the semantic symbol corresponding to the second dictionary ID appears in the search request.
- All of the search formulas illustrated in FIG. 19 are generated from the same search key (GIVE, HANAKO, OBJECT). Accordingly, when scores S of the search formulas are calculated by using the inverse document frequency of the semantic symbol and the number of times that it appears, the scores S of all the search formulas will have the same value.
- Score DS of the search target document is calculated by, for example, the following equation, which uses score S of a search formula.
- DS=Σ(S*P) (3)
- S on the right-hand side of equation (3) represents the score of a search formula that matches the document semantic structure position information 121 in the search target document, P represents the number of times that the search formula matches, and summation symbol Σ represents the summation of the values of S*P over a plurality of search formulas. By ranking search target documents in descending order of score DS and outputting them, it is possible to present search target documents in order of importance as search results.
- The configuration of the semantic
structure search device 101 in FIG. 1 or FIG. 3 is just an example, and the constituents can partially be omitted or changed in accordance with the purposes or conditions of the semantic structure search device 101. For example, when the analysis result 312, the semantic symbol information 313, the mapping information 314 and the bit map information 315 are generated by using an external device, the analysis unit 301, the generation unit 302, the generation unit 303 and the generation unit 304 illustrated in FIG. 3 can be omitted. Also, when the document semantic structure position information 121 is generated by using an external device, the encoding unit 305 illustrated in FIG. 3 can be omitted.
- The flowcharts illustrated in FIG. 2, FIG. 4, FIG. 12, FIG. 14 and FIG. 17 are just examples, and the processes can partially be omitted or changed in accordance with the configurations or conditions of the semantic structure search device 101. For example, when the analysis result 312, the semantic symbol information 313, the mapping information 314, the bit map information 315 and the document semantic structure position information 121 are generated by using an external device, the compression process illustrated in FIG. 4 can be omitted.
- When search target documents are not narrowed in the semantic structure search process illustrated in FIG. 14 or FIG. 17, the processes in step 1405 and step 1705 can be omitted. In the semantic structure search processes illustrated in FIG. 14 and FIG. 17, a search key including three or more semantic symbols can be used instead of a search key including two semantic symbols.
- The analysis result 312 illustrated in FIG. 5, the semantic symbol information 313 illustrated in FIG. 6 and FIG. 7, the mapping information 314 illustrated in FIG. 9 and the bit map information 315 illustrated in FIG. 10 are just examples, and information in a different data configuration can be used in accordance with the configuration or conditions of the semantic structure search device 101. Also, the semantic symbols illustrated in FIG. 11 and the document semantic structure position information 121 illustrated in FIG. 13 are just examples, and information in a different data configuration can be used in accordance with the configurations or conditions of the semantic structure search device 101.
- For example, when it is not necessary to distinguish among a plurality of identical words having different meanings, the attribute information and semantic symbols can be omitted in the morpheme analysis result 511 illustrated in FIG. 5 and the bit map information 315 illustrated in FIG. 10. In such a case, the attribute information can be omitted in the semantic symbol information 313 illustrated in FIG. 6 and FIG. 7.
- Equations (1) through (3) are just examples, and scores of search target documents may be calculated by using a different equation.
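The document scoring of equation (3) and the ranking of step 1710 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names `document_score` and `rank_documents`, the dictionary-based data layout, and all score and match-count values are hypothetical. Only the search formulas and document IDs are borrowed from the examples in the description.

```python
def document_score(formula_scores, match_counts):
    """Equation (3): DS = sum of S*P over the search formulas that matched.

    formula_scores maps a search formula to its score S;
    match_counts maps a search formula to the number of matches P."""
    return sum(formula_scores[formula] * p for formula, p in match_counts.items())


def rank_documents(formula_scores, matches_by_document):
    """Return (document ID, DS) pairs in descending order of DS."""
    scored = [
        (doc_id, document_score(formula_scores, counts))
        for doc_id, counts in matches_by_document.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores S for two search formulas.
formula_scores = {
    "(1088 AND 200291)": 3.0,
    "(2183 AND 200291)": 5.0,
}
# Hypothetical match counts P per document.
matches_by_document = {
    3: {"(1088 AND 200291)": 2},
    24: {"(2183 AND 200291)": 1, "(1088 AND 200291)": 1},
}

ranking = rank_documents(formula_scores, matches_by_document)
print(ranking)  # [(24, 8.0), (3, 6.0)]
```

With these values, document ID “24” obtains DS = 5.0*1 + 3.0*1 = 8.0 and document ID “3” obtains DS = 3.0*2 = 6.0, so document ID “24” is output first.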
- The semantic structure search device 101 illustrated in FIG. 1 and FIG. 3 can be implemented by using, for example, an information processing apparatus (computer) as illustrated in FIG. 20.
- The information processing apparatus illustrated in FIG. 20 includes a central processing unit (CPU) 2001, a memory 2002, an input device 2003, an output device 2004, an auxiliary storage device 2005, a medium driving device 2006 and a network connection device 2007. These constituents are connected to each other via a bus 2008.
- The memory 2002 is, for example, a semiconductor memory such as a Read Only Memory (ROM), a Random Access Memory (RAM), a flash memory, etc., and stores a program and data used for the processing. The memory 2002 can be used as the storage unit 111 illustrated in FIG. 1 and FIG. 3.
- The CPU 2001 (processor) executes a program by utilizing, for example, the memory 2002, and thereby operates as the search unit 112, the analysis unit 301, the generation unit 302, the generation unit 303, the generation unit 304 and the encoding unit 305 illustrated in FIG. 1 and FIG. 3.
- The input device 2003 is, for example, a keyboard, a pointing device, etc. and is used for inputting instructions or information from the operator or the user. Instructions from the operator or the user may be a search request of a semantic structure search.
- The output device 2004 is, for example, a display device, a printer, a speaker, etc., and is used for outputting queries or instructions for the operator or the user and for outputting process results. The output device 2004 can be used as the output unit 113 illustrated in FIG. 1 and FIG. 3. Process results can be search results of a semantic structure search.
- The
auxiliary storage device 2005 is, for example, a magnetic disk device, an optical disk device, a magneto-optical disk device, a tape device, etc. The auxiliary storage device 2005 may be a hard disk drive or a flash memory. The information processing apparatus can store a program and data in the auxiliary storage device 2005 and load them onto the memory 2002 so as to use them. The auxiliary storage device 2005 can be used as the storage unit 111 illustrated in FIG. 1 and FIG. 3.
- The medium driving device 2006 drives a portable recording medium 2009 so as to access information recorded in it. The portable recording medium 2009 is, for example, a memory device, a flexible disk, an optical disk, a magneto-optical disk, etc. The portable recording medium 2009 may be a Compact Disk Read Only Memory (CD-ROM), a Digital Versatile Disk (DVD), a Universal Serial Bus (USB) memory, etc. The operator or the user may store a program or data in the portable recording medium 2009 and load them onto the memory 2002 so as to use them.
- As described above, a computer-readable recording medium that stores a program and data used for the processes is a physical (non-transitory) recording medium such as the memory 2002, the auxiliary storage device 2005 and the portable recording medium 2009.
- The network connection device 2007 is a communication interface that is connected to a communication network such as a Local Area Network, a Wide Area Network, etc. so as to conduct data conversion accompanying communications. The information processing apparatus can receive a program and data from an external device via the network connection device 2007 and load them onto the memory 2002 to use them.
- The information processing apparatus can receive a search request from a user terminal via the network connection device 2007 and send a search result to the user terminal. In such a case, the network connection device 2007 can be used as the output unit 113 illustrated in FIG. 1 and FIG. 3.
- Note that it is not necessary for the information processing apparatus to include all the constituents illustrated in FIG. 20, and the constituents can partially be omitted in accordance with the purposes or conditions. For example, when the information processing apparatus receives a search request from the user terminal via a communication network, the input device 2003 and the output device 2004 can be omitted. Also, when the portable recording medium 2009 or a communication network is not used, the medium driving device 2006 or the network connection device 2007 can be omitted.
- When the information processing apparatus is a mobile terminal having a telephone call function such as a smartphone, it can include a device for telephone calls such as a microphone or a speaker, and can also include an imaging device such as a camera.
- All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Claims (12)
1. A non-transitory computer-readable recording medium having stored therein a semantic structure search program that causes a computer to execute a process comprising:
generating a plurality of search semantic symbols from a search request;
specifying a position of a specific word that corresponds to the search request in a search target document, by the plurality of search semantic symbols and document semantic structure position information, the document semantic structure position information including a relationship information between a plurality of semantic symbols and a plurality of positions of a plurality of words in the search target document, the plurality of semantic symbols representing a semantic structure corresponding to the plurality of words; and
outputting a search result including the specific word and the position of the specific word in the search target document.
2. The non-transitory computer-readable recording medium according to claim 1, wherein
the document semantic structure position information represents a correspondence relationship between code information that is generated for the search target document and that represents the plurality of semantic symbols and the plurality of positions of the plurality of words in the search target document, and
the specifying the position of the specific word in the search target document generates search code information for searching the search target document from the plurality of search semantic symbols by referring to semantic symbol information representing a correspondence relationship between a word and a semantic symbol and to mapping information representing a correspondence relationship between a word and code information, obtains a position corresponding to code information that is identical to the search code information from the document semantic structure position information, and specifies the obtained position as the position of the specific word in the search target document.
3. The non-transitory computer-readable recording medium according to claim 2, wherein
the process further comprises:
specifying a plurality of search words corresponding to the plurality of search semantic symbols by referring to the semantic symbol information; and
specifying a document including the plurality of search words by referring to bit map information that represents presence or absence of each of a plurality of words in each of a plurality of documents, and
the specifying the position of the specific word in the search target document uses the specified document as the search target document.
4. The non-transitory computer-readable recording medium according to claim 3, wherein
the semantic symbol information includes attribute information added to a word,
the bit map information represents presence or absence of each of the plurality of words in each of the plurality of documents for each combination of each of the plurality of words, attribute information and a semantic symbol,
the specifying the plurality of search words specifies a plurality of combinations of the plurality of search words corresponding to the plurality of search semantic symbols, attribute information of the plurality of search words and a plurality of semantic symbols of the plurality of search words by referring to the semantic symbol information, and
the specifying the document including the plurality of search words specifies a document corresponding to the specified combinations by referring to the bit map information.
5. A semantic structure search device comprising:
a memory configured to store document semantic structure position information including a relationship information between a plurality of semantic symbols and a plurality of positions of a plurality of words in a search target document, the plurality of semantic symbols representing a semantic structure corresponding to the plurality of words;
a processor configured to generate a plurality of search semantic symbols from a search request, and to specify a position of a specific word that corresponds to the search request in the search target document, by the plurality of search semantic symbols and the document semantic structure position information; and
an output interface configured to output a search result including the specific word and the position of the specific word in the search target document.
6. The semantic structure search device according to claim 5, wherein
the document semantic structure position information represents a correspondence relationship between code information that is generated for the search target document and that represents the plurality of semantic symbols and the plurality of positions of the plurality of words in the search target document,
the memory further stores semantic symbol information representing a correspondence relationship between a word and a semantic symbol and mapping information representing a correspondence relationship between a word and code information, and
the processor generates search code information for searching the search target document from the plurality of search semantic symbols by referring to the semantic symbol information and the mapping information, obtains a position corresponding to code information that is identical to the search code information from the document semantic structure position information, and specifies the obtained position as the position of the specific word in the search target document.
7. The semantic structure search device according to claim 6, wherein
the memory further stores bit map information that represents presence or absence of each of a plurality of words in each of a plurality of documents, and
the processor specifies a plurality of search words corresponding to the plurality of search semantic symbols by referring to the semantic symbol information, specifies a document including the plurality of search words by referring to the bit map information, and uses the specified document as the search target document.
8. The semantic structure search device according to claim 7, wherein
the semantic symbol information includes attribute information added to a word,
the bit map information represents presence or absence of each of the plurality of words in each of the plurality of documents for each combination of each of the plurality of words, attribute information and a semantic symbol, and
the processor specifies a plurality of combinations of the plurality of search words corresponding to the plurality of search semantic symbols, attribute information of the plurality of search words and a plurality of semantic symbols of the plurality of search words by referring to the semantic symbol information, specifies a document corresponding to the specified combinations by referring to the bit map information, and uses the specified document as the search target document.
9. A semantic structure search method comprising:
generating a plurality of search semantic symbols from a search request by a processor;
specifying, by the processor, a position of a specific word that corresponds to the search request in a search target document, by the plurality of search semantic symbols and document semantic structure position information, the document semantic structure position information including a relationship information between a plurality of semantic symbols and a plurality of positions of a plurality of words in the search target document, the plurality of semantic symbols representing a semantic structure corresponding to the plurality of words; and
outputting a search result including the specific word and the position of the specific word in the search target document.
10. The semantic structure search method according to claim 9, wherein
the document semantic structure position information represents a correspondence relationship between code information that is generated for the search target document and that represents the plurality of semantic symbols and the plurality of positions of the plurality of words in the search target document, and
the specifying the position of the specific word in the search target document generates search code information for searching the search target document from the plurality of search semantic symbols by referring to semantic symbol information representing a correspondence relationship between a word and a semantic symbol and to mapping information representing a correspondence relationship between a word and code information, obtains a position corresponding to code information that is identical to the search code information from the document semantic structure position information, and specifies the obtained position as the position of the specific word in the search target document.
11. The semantic structure search method according to claim 10, wherein
the method further comprises:
specifying a plurality of search words corresponding to the plurality of search semantic symbols by referring to the semantic symbol information; and
specifying a document including the plurality of search words by referring to bit map information that represents presence or absence of each of a plurality of words in each of the plurality of documents, and
the specifying the position of the specific word in the search target document uses the specified document as the search target document.
12. The semantic structure search method according to claim 11, wherein
the semantic symbol information includes attribute information added to a word,
the bit map information represents presence or absence of each of the plurality of words in each of the plurality of documents for each combination of each of the plurality of words, attribute information and a semantic symbol,
the specifying the plurality of search words specifies a plurality of combinations of the plurality of search words corresponding to the plurality of search semantic symbols, attribute information of the plurality of search words and a plurality of semantic symbols of the plurality of search words by referring to the semantic symbol information, and
the specifying the document including the plurality of search words specifies a document corresponding to the specified combination by referring to the bit map information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/785,656 US11334609B2 (en) | 2015-01-20 | 2020-02-10 | Semantic structure search device and semantic structure search method |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2015-008936 | 2015-01-20 | ||
JP2015008936A JP6447161B2 (en) | 2015-01-20 | 2015-01-20 | Semantic structure search program, semantic structure search apparatus, and semantic structure search method |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/785,656 Division US11334609B2 (en) | 2015-01-20 | 2020-02-10 | Semantic structure search device and semantic structure search method |
Publications (1)
Publication Number | Publication Date |
---|---|
US20160217207A1 (en) | 2016-07-28 |
Family
ID=56434110
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/995,775 Abandoned US20160217207A1 (en) | 2015-01-20 | 2016-01-14 | Semantic structure search device and semantic structure search method |
US16/785,656 Active 2036-05-10 US11334609B2 (en) | 2015-01-20 | 2020-02-10 | Semantic structure search device and semantic structure search method |
Country Status (2)
Country | Link |
---|---|
US (2) | US20160217207A1 (en) |
JP (1) | JP6447161B2 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9906238B2 (en) * | 2016-07-25 | 2018-02-27 | Fujitsu Limited | Encoding device, encoding method and search method |
US20180101580A1 (en) * | 2016-10-07 | 2018-04-12 | Fujitsu Limited | Non-transitory computer-readable recording medium, encoded data searching method, and encoded data searching apparatus |
US20180285443A1 (en) * | 2017-03-29 | 2018-10-04 | Fujitsu Limited | Non-transitory computer readable medium, encode device, and encode method |
US10678820B2 (en) | 2018-04-12 | 2020-06-09 | Abel BROWARNIK | System and method for computerized semantic indexing and searching |
US10747946B2 (en) * | 2015-07-24 | 2020-08-18 | Fujitsu Limited | Non-transitory computer-readable storage medium, encoding apparatus, and encoding method |
US10922343B2 (en) | 2016-10-21 | 2021-02-16 | Fujitsu Limited | Data search device, data search method, and recording medium |
US11487817B2 (en) | 2017-03-28 | 2022-11-01 | Fujitsu Limited | Index generation method, data retrieval method, apparatus of index generation |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6161084A (en) * | 1997-03-07 | 2000-12-12 | Microsoft Corporation | Information retrieval utilizing semantic representation of text by identifying hypernyms and indexing multiple tokenized semantic structures to a same passage of text |
US6289353B1 (en) * | 1997-09-24 | 2001-09-11 | Webmd Corporation | Intelligent query system for automatically indexing in a database and automatically categorizing users |
US20050131886A1 (en) * | 2000-06-22 | 2005-06-16 | Hapax Limited | Method and system for information extraction |
US20090063473A1 (en) * | 2007-08-31 | 2009-03-05 | Powerset, Inc. | Indexing role hierarchies for words in a search index |
US20120204104A1 (en) * | 2009-10-11 | 2012-08-09 | Patrick Sander Walsh | Method and system for document presentation and analysis |
US20130086086A1 (en) * | 2010-05-28 | 2013-04-04 | Fujitsu Limited | Information generating computer product, apparatus, and method; and information search computer product, apparatus, and method |
US20140114649A1 (en) * | 2006-10-10 | 2014-04-24 | Abbyy Infopoisk Llc | Method and system for semantic searching |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4057681B2 (en) | 1997-09-10 | 2008-03-05 | 富士通株式会社 | Document information storage device, document information storage method, document information search device, document information search method, recording medium on which document information storage program is recorded, and recording medium on which document information search program is recorded |
JP2008176489A (en) | 2007-01-17 | 2008-07-31 | Toshiba Corp | Text discrimination device and text discrimination method |
JP5062131B2 (en) | 2008-10-06 | 2012-10-31 | 富士通株式会社 | Information processing program, information processing apparatus, and information processing method |
US8880537B2 (en) * | 2009-10-19 | 2014-11-04 | Gil Fuchs | System and method for use of semantic understanding in storage, searching and providing of data or other content information |
JP5493779B2 (en) * | 2009-11-30 | 2014-05-14 | 富士ゼロックス株式会社 | Information search program and information search apparatus |
JP4967037B2 (en) * | 2010-02-08 | 2012-07-04 | 株式会社エヌ・ティ・ティ・データ | Information search device, information search method, terminal device, and program |
EP2677662B1 (en) | 2011-02-14 | 2019-02-20 | Fujitsu Limited | Huffman tree generation program, device, and method |
JP5915274B2 (en) * | 2012-03-09 | 2016-05-11 | 富士通株式会社 | Information search method, program, and information search apparatus |
JP6152711B2 (en) * | 2013-06-04 | 2017-06-28 | 富士通株式会社 | Information search apparatus and information search method |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6161084A (en) * | 1997-03-07 | 2000-12-12 | Microsoft Corporation | Information retrieval utilizing semantic representation of text by identifying hypernyms and indexing multiple tokenized semantic structures to a same passage of text |
US6289353B1 (en) * | 1997-09-24 | 2001-09-11 | Webmd Corporation | Intelligent query system for automatically indexing in a database and automatically categorizing users |
US20050131886A1 (en) * | 2000-06-22 | 2005-06-16 | Hapax Limited | Method and system for information extraction |
US20140114649A1 (en) * | 2006-10-10 | 2014-04-24 | Abbyy Infopoisk Llc | Method and system for semantic searching |
US20090063473A1 (en) * | 2007-08-31 | 2009-03-05 | Powerset, Inc. | Indexing role hierarchies for words in a search index |
US20120204104A1 (en) * | 2009-10-11 | 2012-08-09 | Patrick Sander Walsh | Method and system for document presentation and analysis |
US20130086086A1 (en) * | 2010-05-28 | 2013-04-04 | Fujitsu Limited | Information generating computer product, apparatus, and method; and information search computer product, apparatus, and method |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10747946B2 (en) * | 2015-07-24 | 2020-08-18 | Fujitsu Limited | Non-transitory computer-readable storage medium, encoding apparatus, and encoding method |
US9906238B2 (en) * | 2016-07-25 | 2018-02-27 | Fujitsu Limited | Encoding device, encoding method and search method |
US20180101580A1 (en) * | 2016-10-07 | 2018-04-12 | Fujitsu Limited | Non-transitory computer-readable recording medium, encoded data searching method, and encoded data searching apparatus |
US10942934B2 (en) * | 2016-10-07 | 2021-03-09 | Fujitsu Limited | Non-transitory computer-readable recording medium, encoded data searching method, and encoded data searching apparatus |
US10922343B2 (en) | 2016-10-21 | 2021-02-16 | Fujitsu Limited | Data search device, data search method, and recording medium |
US11487817B2 (en) | 2017-03-28 | 2022-11-01 | Fujitsu Limited | Index generation method, data retrieval method, apparatus of index generation |
US20180285443A1 (en) * | 2017-03-29 | 2018-10-04 | Fujitsu Limited | Non-transitory computer readable medium, encode device, and encode method |
US11055328B2 (en) * | 2017-03-29 | 2021-07-06 | Fujitsu Limited | Non-transitory computer readable medium, encode device, and encode method |
US10678820B2 (en) | 2018-04-12 | 2020-06-09 | Abel BROWARNIK | System and method for computerized semantic indexing and searching |
Also Published As
Publication number | Publication date |
---|---|
US20200233887A1 (en) | 2020-07-23 |
US11334609B2 (en) | 2022-05-17 |
JP6447161B2 (en) | 2019-01-09 |
JP2016134037A (en) | 2016-07-25 |
Similar Documents
Publication | Title |
---|---|
US11334609B2 (en) | Semantic structure search device and semantic structure search method |
US10467271B2 (en) | Search apparatus and search method |
US10740562B2 (en) | Search apparatus, encoding method, and search method based on morpheme position in a target document |
US10747946B2 (en) | Non-transitory computer-readable storage medium, encoding apparatus, and encoding method |
US9916314B2 (en) | File extraction method, computer product, file extracting apparatus, and file extracting system |
US9906238B2 (en) | Encoding device, encoding method and search method |
CN113986950A (en) | SQL statement processing method, device, equipment and storage medium |
US20210342534A1 (en) | Sentence structure vectorization device, sentence structure vectorization method, and storage medium storing sentence structure vectorization program |
US20140358522A1 (en) | Information search apparatus and information search method |
US11487817B2 (en) | Index generation method, data retrieval method, apparatus of index generation |
US10915559B2 (en) | Data generation method, information processing device, and recording medium |
US10803243B2 (en) | Method, device, and medium for restoring text using index which associates coded text and positions thereof in text data |
JP6838471B2 (en) | Index generator, data search program, index generator, data search device, index generation method, and data search method |
JP2019159743A (en) | Correspondence generation program, correspondence generation device, correspondence generation method, and translation program |
US9871536B1 (en) | Encoding apparatus, encoding method and search method |
Bharathi et al. | A plain-text incremental compression (pic) technique with fast lookup ability |
Islam et al. | Short text compression for smart devices |
KR100205956B1 (en) | Language code translation device and method |
Lin et al. | Text Compression for Myanmar Information Retrieval |
KR20220007170A (en) | Communication server device, communication device(s), and method of operation thereof |
JP2012159875A (en) | Compound word generation device, compound word generation method and compound word generation program |
KR20090104376A (en) | Inverted Index data generation method |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: FUJITSU LIMITED, JAPAN. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: OKURA, SEIJI; KATAOKA, MASAHIRO; IDEUCHI, MASAO; SIGNING DATES FROM 20160104 TO 20160105; REEL/FRAME: 037493/0158 |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |