WO2002005131A1 - Search device - Google Patents

Search device

Info

Publication number
WO2002005131A1
Authority
WO
WIPO (PCT)
Prior art keywords
character string
input
search
unit
concept
Prior art date
Application number
PCT/JP2001/005796
Other languages
English (en)
Japanese (ja)
Inventor
Takashi Miyake
Original Assignee
Iiga Co., Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iiga Co., Ltd
Priority to AU2001269434A
Publication of WO2002005131A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/3332: Query translation
    • G06F16/3338: Query expansion

Definitions

  • the present invention provides, for example, a method that replaces words that can be
  • the present invention relates to a search device for searching for synonyms.
  • It relates to a cable device.
  • the present invention relates to a concept relationship extracting device that analyzes a relationship between a concept represented by a character string and a concept represented by another character string.
  • When searching data distributed on a network such as the Internet or data stored in a database, a user inputs a keyword to search for related data. If the keyword does not find the desired data, the user often searches again using other keywords that are similar to it. Thinking of such similar keywords took a lot of time, a lot of thought, and a lot of experience.
  • a synonym dictionary is used to automatically provide synonyms.
  • words to be synonymous are registered in advance.
  • This synonym dictionary contains synonyms selected by its creator, but there may be cases where not enough synonyms are registered. Another problem is that registering synonyms requires a great deal of effort. Also, the creator of the synonym dictionary may not be familiar with all fields, so synonyms for specialized fields, special fields, and private fields may not be registered in the synonym dictionary.
  • a synonym dictionary is created using a probabilistic model that uses the co-occurrence and appearance probability of two words, the probability of the relationship between two words, etc.
  • An object of the present invention is to provide, for example, a search device for searching for a word (character string) similar to an input word (character string).
  • Another object of the present invention is to perform automatic processing for associating character strings based on, for example, the possibility of replacement and the possibility of combination.
  • A search device according to the present invention searches for a character string in a dictionary in which character strings are registered as registered character strings, and is characterized by having:
  • an input unit for inputting a character string as an input character string, and a search unit for searching the dictionary for a replaceable character string that can be replaced with the input character string input by the input unit and outputting that character string.
  • The search unit is characterized by having:
  • a primary search unit that uses the input character string input by the input unit to search the dictionary, as a primary character string, for a registered character string that contains the input character string as a part;
  • a secondary search unit that takes the remainder of the primary character string found by the primary search unit, excluding the input character string, as a connectable character string, and searches the dictionary, as a secondary character string, for a registered character string that contains the connectable character string as a part; and
  • an output unit that outputs the remainder of the secondary character string found by the secondary search unit, excluding the connectable character string, as a replaceable character string that can be replaced with the input character string.
  • The primary search unit is characterized by searching for the primary character string by at least one of: a forward match search, which finds a registered character string whose beginning matches the input character string; a backward match search, which finds a registered character string whose end matches the input character string; and an intermediate match search, which finds a registered character string whose middle matches the input character string.
  • The secondary search unit is characterized by searching for the secondary character string as follows. If the primary search unit found the primary character string by a forward match search, the secondary search unit uses a backward match search, which finds a registered character string whose end matches the connectable character string. If the primary search unit found the primary character string by a backward match search, the secondary search unit uses a forward match search, which finds a registered character string whose beginning matches the connectable character string.
  • If the primary search unit found the primary character string by an intermediate match search, the secondary search unit uses a both-ends match search, which finds a registered character string whose front and back both match the connectable character strings.
  • the input unit sequentially inputs the registered character strings registered in the dictionary as input character strings
  • the search unit searches the dictionary for a replaceable character string of the registered character string registered in the dictionary and outputs it.
  • the search apparatus further includes an association unit that stores information indicating that the replaceable character string output by the search unit is a replaceable character string with a registered character string in a dictionary.
  • the search device is further characterized by having a loop unit that searches for replaceable character strings of a replaceable character string by feeding the replaceable character string output by the search unit back to the input unit as an input character string.
  • the secondary search unit has a counting unit that counts the number of appearances of the same secondary character string when the same secondary character string appears.
  • the output unit includes a sorting unit that outputs replaceable character strings in descending order of their number of occurrences.
  • the search device inputs a search keyword for searching at least one of Internet data and database data as an input character string, and outputs a keyword similar to the search keyword as a replaceable character string. It is characterized by the following.
  • a search device according to the present invention includes: an input character string set that holds a set of input character strings;
  • a sampling string set that holds a set of sampling strings for searching for a character string that can be replaced with the input string
  • a replaceable character string list set that holds a set of replaceable character strings searched and output for each character string in the set of character strings stored in the input character string set;
  • a replaceable character string search unit that searches for a replaceable character string that can be replaced with a certain input character string from the set of sampled character strings held in the sampled character string set and outputs it as a replaceable character string list;
  • a character string is sequentially input from the set of character strings held in the input character string set, and the input character string is sequentially given as an input character string to the replaceable character string search unit.
  • a concept relation extracting device includes: a plurality of character string input units for inputting a plurality of character strings;
  • a sentence example search unit that searches a sentence example database for a sentence example in which a plurality of character strings input by the multiple character string input unit are combined;
  • a concept relationship analysis unit that analyzes the relationship between the concepts represented by the plurality of character strings.
  • the concept relation analysis unit regards a character string as a single concept when the character string is not combined with any sentence example, or when the character string is combined with exactly one sentence example and that sentence example is not combined with any other character string.
  • the conceptual relationship analysis unit determines that there is an inclusive relation (vertical relation) between the plurality of character strings combined with the certain sentence example.
  • the concept relation analysis unit considers the number of sentence examples in which the character strings are combined as the sentence example hit count, and regards a character string with a small sentence example hit number as a character string of a concept higher than a character string with a large sentence example hit number. It is characterized by the following.
  • the concept relation analysis unit expresses the relation of concepts in a hierarchical structure,
  • the concept relation extracting device further checks, from the relations of the concepts analyzed by the concept relation analysis unit, whether there is a sentence example that is commonly combined with character strings at the same level of the hierarchy. If there is, the device asks the user to input a character string that can be combined with that common sentence example, adds the character string input by the user as a concept shared by those character strings, and modifies the hierarchical structure that expresses the relationship between the concepts.
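The single-concept rule and the hit-count rule described above can be illustrated with a minimal sketch. The sentence examples below are hypothetical (the sentence example database of FIG. 11 is not reproduced in this text), as are the names `analyze_concepts`, `singles`, and `parent`. The proper-subset test used to pick a parent is an assumption that refines the hit-count rule: a word whose sentence examples are a proper subset of another word's necessarily has fewer hits.

```python
def analyze_concepts(words, examples):
    """Sketch of the concept relation analysis.

    `examples` maps a sentence example to the set of words that combine with it
    (hypothetical data). Returns (hits, singles, parent).
    """
    word_set = set(words)
    # Sentence examples each word combines with, and the resulting hit counts.
    combined = {w: {e for e, ws in examples.items() if w in ws} for w in words}
    hits = {w: len(es) for w, es in combined.items()}

    # Single concepts: no sentence example at all, or exactly one sentence
    # example that no other input word combines with.
    singles = set()
    for w in words:
        if hits[w] == 0:
            singles.add(w)
        elif hits[w] == 1:
            (only,) = combined[w]
            if examples[only] & word_set == {w}:
                singles.add(w)

    # Assumed refinement: a is above w when a's examples are a proper subset
    # of w's examples; the nearest such ancestor becomes the parent.
    parent = {}
    for w in words:
        if w in singles:
            continue
        ancestors = [a for a in words
                     if a not in singles and combined[a] < combined[w]]
        if ancestors:
            parent[w] = max(ancestors, key=hits.get)
    return hits, singles, parent


words = ["creatures", "trees", "animals", "mammals", "birds", "metals", "cars"]
examples = {  # hypothetical sentence examples
    "X is alive": {"creatures", "trees", "animals", "mammals", "birds"},
    "X has leaves": {"trees"},
    "X moves": {"animals", "mammals", "birds"},
    "X nurses its young": {"mammals"},
    "X flies": {"birds"},
    "X rusts": {"metals"},
}
hits, singles, parent = analyze_concepts(words, examples)
```

With this toy data, "cars" and "metals" come out as single concepts, and "creatures", which has the fewest hits among the related words, sits at the top of the hierarchy.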
  • a search method according to the present invention is a search method for searching a character string from a dictionary in which the character string is registered as a registered character string,
  • a search program according to the present invention is a search program for searching for a character string from a dictionary in which the character string is registered as a registered character string.
  • the concept relation extracting method includes: a plurality of character string input steps for inputting a plurality of character strings;
  • a concept relationship extraction program includes: a multiple character string input process for inputting a plurality of character strings; a sentence example search process that searches the sentence example database for a sentence example in which the plurality of character strings input by the multiple character string input process are combined; and a concept relationship analysis process that uses the sentence examples found by the sentence example search process to analyze the relationship between the concepts represented by the plurality of character strings.
  • FIG. 1 is a configuration diagram of a personal computer 51 on which a search device according to Embodiment 1 of the present invention operates.
  • FIG. 2 is a configuration diagram of the search program 65 according to the first embodiment of the present invention.
  • FIG. 3 is an operation diagram of the search program 65 according to the first embodiment of the present invention.
  • FIG. 4 is a diagram showing a specific example of Embodiment 1 of the present invention.
  • FIG. 5 is a diagram showing an example of registered words in the dictionary 63 according to the first embodiment of the present invention.
  • FIG. 6 is a diagram showing a chain of synonyms according to the first embodiment of the present invention.
  • FIG. 7 is a cooperative operation diagram according to the second embodiment of the present invention.
  • FIG. 8 is a diagram showing a specific example of Embodiment 2 of the present invention.
  • FIG. 9 is a diagram showing a relationship example of the concept of Embodiment 3 of the present invention.
  • FIG. 10 is an operation diagram of a conceptual relationship extraction program 91 according to the third embodiment of the present invention.
  • FIG. 11 is a diagram showing a specific example of a sentence example database 64 according to the third embodiment of the present invention.
  • FIG. 12 is a diagram showing a correspondence table output by the sentence example search unit 95 according to the third embodiment of the present invention.
  • FIG. 13 is a diagram showing a conceptual relationship based on a hierarchical structure created by a conceptual relationship analysis unit 97 according to the third embodiment of the present invention.
  • FIG. 14 is an operation diagram of the search / concept relation extracting apparatus according to the fourth embodiment of the present invention.
  • FIG. 15 is a diagram showing a correspondence table output by the sentence example search unit 95 according to the fourth embodiment of the present invention.
  • FIG. 16 is a diagram showing a conceptual relationship based on a hierarchical structure created by a conceptual relationship analysis unit 97 according to the fourth embodiment of the present invention.
  • FIG. 17 is an operation diagram of the search / concept relation extracting apparatus according to the fifth embodiment of the present invention.
  • FIG. 18 is a diagram illustrating a conceptual relationship by a hierarchical structure before self-learning, which is created by a conceptual relationship analysis unit 97 according to the fifth embodiment of the present invention.
  • FIG. 19 is a diagram showing a conceptual relationship by a hierarchical structure before self-learning created by a conceptual relationship analyzing unit 97 of the fifth embodiment of the present invention.
  • FIG. 1 is a diagram showing a personal computer 51 in which the search device of this embodiment is realized.
  • the personal computer 51 is composed of a CPU 53, a bus 55, a memory 57, a communication board 59, a magnetic disk 61, a compact disk drive 69, a flexible disk drive 71, a keypad 73, and a mouse 75.
  • the magnetic disk 61 stores a dictionary 63, a search program 65, and a browser 67.
  • the dictionary 63 is a word dictionary in which words are simply registered here. Specifically, words are stored in a text file format as character strings.
  • In the following, characters are represented by uppercase letters, and a character string (word) is written as a combination of one or more letters.
  • the search program 65 is a search device described below realized by software.
  • the search program 65 is loaded into the memory 57 and executed by the CPU 53, thereby executing the operation of the search device described below.
  • Browser 67 is an Internet search program. Browser 67 is also loaded into memory 57 and executed by CPU 53.
  • the memory 57 is a volatile memory configured by a random access memory or the like.
  • the communication board 59 executes a protocol for connecting to the Internet 77 or the LAN 79. For example, a modem, Ethernet board, or TCP/IP board is used.
  • the personal computer 51 is connected to the Internet 77 or the LAN 79 by the communication board 59, and can communicate with the personal computer 81 or the server computer 83.
  • Although the dictionary 63, the search program 65, and the browser 67 are described as stored on the magnetic disk 61, they may instead be stored on a compact disk (CD) in the compact disk drive 69 or on a flexible disk (FD) in the flexible disk drive 71. Further, the dictionary 63, the search program 65, and the browser 67 may be stored in the personal computer 81 or the server computer 83 and loaded to the personal computer 51 via a network, or accessed from the personal computer 51 via the network.
  • the search program 65 can be created entirely by software. For example, it can be created using C language or Visual Basic language. However, the search program 65 may exist as firmware in a read-only memory (not shown). Alternatively, some or all of them may exist as hardware. If it exists as hardware, it will be implemented as an LSI and will work with other circuits. Alternatively, the search program 65 may operate as a software driver for an operating system or a window system. In this case, the search program 65 operates by a system call from the operating system or the window system, or by interrupt processing.
  • FIG. 2 is a diagram showing the configuration and operation of the search program 65.
  • the search program 65 includes a word input section 21, a search section 23, an association section 37, and a loop section 39.
  • the search unit 23 includes a primary search unit 25, a secondary search unit 27, and an output unit 31.
  • the word input unit 21 inputs a word as an input word 11. This word is, for example, a search keyword for the Internet or a keyword for the database entered by the user into the browser 67.
  • the search unit 23 searches the dictionary 63 for a replaceable word that can be replaced with the input word 11 input by the word input unit 21 and outputs it.
  • the primary search unit 25 uses the input word 11 input by the word input unit 21 to search the dictionary 63, as primary words 13, for registered words that contain the input word 11 as a part.
  • the secondary search unit 27 takes the remainder of each primary word 13 found by the primary search unit 25, excluding the input word 11, as a connectable word, and searches the dictionary 63, as secondary words 17, for registered words that contain the connectable word as a part.
  • the output unit 31 outputs, as the replaceable word 19 that can be replaced with the input word 11, the remainder excluding the connectable word from the secondary word 17 searched by the secondary search unit 27.
  • FIG. 3 is a diagram illustrating the operation of the search unit 23 described above.
  • FIG. 4 is a diagram showing a specific example of the operation.
  • FIG. 5 is a diagram showing an example of words registered in the dictionary 63.
  • words as shown in FIG. 5 are registered in the dictionary 63 in advance.
  • the solid and dashed arrows in FIG. 5 will be described later.
  • In FIG. 3, it is assumed that AB is input as the input word 11.
  • the primary search unit 25 searches the dictionary 63 for words containing AB using the AB.
  • the primary search unit 25 performs a forward match search, a backward match search, and an intermediate match search using AB. As primary words 13, a forward match word ABLM, a backward match word OPAB, and an intermediate match word CDABEF are found.
  • the secondary search unit 27 receives these primary words 13 and generates connectable words 15 consisting of the remainder of each word excluding AB.
  • AB is removed from ABLM to generate LM, and AB is removed from OPAB to generate OP.
  • AB is removed from CDABEF to generate CD and EF.
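The generation of connectable words from the three kinds of primary match can be sketched as follows. This is an illustrative sketch only; the function name `split_primary` and the tuple layout of its return value are assumptions, not part of the source.

```python
def split_primary(registered: str, word: str):
    """Classify how `word` occurs in `registered` and return the leftover parts.

    Returns (kind, parts), where kind is "forward", "backward" or "middle",
    or None when `registered` does not properly contain `word`.
    """
    if registered == word or word not in registered:
        return None
    if registered.startswith(word):           # forward match:  AB|LM
        return "forward", (registered[len(word):],)
    if registered.endswith(word):             # backward match: OP|AB
        return "backward", (registered[:-len(word)],)
    i = registered.find(word)                 # middle match:   CD|AB|EF
    return "middle", (registered[:i], registered[i + len(word):])


primary_words = ["ABLM", "OPAB", "CDABEF"]
connectable = {w: split_primary(w, "AB") for w in primary_words}
```

Recording the kind of match alongside the leftover parts lets the secondary search pick the opposite kind of match for each connectable word.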
  • the secondary search unit 27 searches the dictionary 63 again using these connectable words 15.
  • the secondary search unit 27 performs a backward match search when the primary search unit 25 found a forward match word.
  • When the primary search unit 25 found a backward match word, it performs a forward match search.
  • When the primary search unit 25 found an intermediate match word, it performs a both-ends match search that matches both the front and the rear.
  • While searching, the secondary search unit 27 uses the counting unit 29 to count the number of occurrences whenever the same matched word appears again, and outputs the secondary words 17.
  • Here, it is assumed that the secondary search unit 27 detects XYZLM once and GHLM twice as backward match words. It is also assumed that OPUVW is detected seven times and OPIJ twice as forward match words, and that CDSTEF is detected four times and CDKLEF twice as both-ends match words.
  • the output unit 31 inputs these secondary words 17.
  • The sorting unit 33 of the output unit 31 sorts the secondary words 17 using the counts from the counting unit 29 and outputs the words with more appearances first.
  • the weight lowering unit 35 of the output unit 31 changes the weight of a word having a large number of occurrences by the tf·idf method or another method.
  • This example shows a case where the number of occurrences of OPUVW has been changed to 7, which is the lowest number of occurrences, because the number of occurrences is seven.
  • the tf·idf method refers to a method of obtaining a score using the frequency of a word and prioritizing the searched words.
  • the score obtained by the tf·idf method is determined by the following formula.
  • tf is an abbreviation for “Term Frequency”.
  • N is the total number of documents.
  • n is the number of files containing the keyword.
  • idf is an abbreviation for “Inverse Document Frequency”.
  • idf is the “weighting” of tf.
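The formula referred to above did not survive in this text. Assuming the standard definition of tf·idf, which is consistent with the symbols N and n defined here, the score of a keyword i could be written as follows. The logarithm is the conventional choice and is an assumption, not something taken from the source.

```latex
\mathrm{score}_i = tf_i \cdot idf_i, \qquad idf_i = \log\frac{N}{n_i}
```

Under this form, a keyword that appears in almost every document (n close to N) receives an idf near zero, which is how a very frequent word has its weight lowered.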
  • the output unit 31 outputs the six words shown in (a) of FIG. 6 as replaceable words 19 that may replace the input word 11 AB. These words are interchangeable with AB and are synonyms.
  • When the search program 65 outputs the replaceable words 19, the browser 67 can search the Internet again using, instead of the input word 11, a word with a large weight (frequency of occurrence) among the replaceable words 19.
  • FIG. 4 shows a specific example of FIG. 3 described above. Here, “Aichi” is input as the input word 11, a prefix search is performed, and “Aichi Prefecture”, “Aichi Medical University”, and “Aichi Mikan” are found by the primary search unit 25 as primary words 13 with “Aichi” at the front.
  • the secondary search unit 27 generates the connectable words 15 “prefecture”, “medical university”, and “mikan” from these three words and performs a secondary search by backward match. As a result, “Yamanashi Prefecture”, “Shizuoka Medical University”, and “Shizuoka Mikan” are found as secondary words 17.
  • the output unit 31 sorts the secondary words 17 by the number of appearances and outputs “Shizuoka” and “Yamanashi” as the replaceable words 19. In this way, for the word “Aichi”, the two words “Shizuoka” and “Yamanashi” are found as replaceable words, that is, as synonyms.
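The “Aichi” example can be reproduced with a small sketch restricted to the forward-match primary search and backward-match secondary search used in FIG. 4. The function name `replaceable_words` and the English renderings of the registered words are assumptions for illustration; the actual contents of the dictionary 63 are not given in this text.

```python
from collections import Counter


def replaceable_words(dictionary, word):
    """Forward-match primary search, backward-match secondary search, then rank."""
    # Primary search: registered words that start with `word`; keep the remainder
    # as the connectable string.
    connectable = [reg[len(word):] for reg in dictionary
                   if reg.startswith(word) and reg != word]
    counts = Counter()
    for tail in connectable:
        # Secondary search: registered words that end with the connectable string.
        for reg in dictionary:
            if reg.endswith(tail) and reg != tail:
                candidate = reg[:-len(tail)]
                if candidate != word:          # skip the input word itself
                    counts[candidate] += 1     # counting unit
    # Sorting unit: most frequent replaceable word first.
    return counts.most_common()


dictionary = [
    "Aichi Prefecture", "Aichi Medical University", "Aichi Mikan",
    "Yamanashi Prefecture", "Shizuoka Medical University", "Shizuoka Mikan",
]
print(replaceable_words(dictionary, "Aichi"))
# prints [('Shizuoka', 2), ('Yamanashi', 1)]
```

“Shizuoka” ranks first because it is reached through two different connectable strings, while “Yamanashi” is reached through only one.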
  • the associating unit 37 shown in FIG. 2 receives the replaceable words 19 output from the output unit 31 and stores in the dictionary 63 the fact that the input word 11 can be replaced with each replaceable word 19.
  • In FIG. 5, a solid arrow points from AB to each replaceable word 19 that can replace AB, as shown in FIG. 6 (a).
  • These arrows are added by the association unit 37.
  • Specifically, each arrow can be implemented as a pointer, held with AB, to the address at which the replaceable word is stored.
  • Alternatively, the associations can be stored as relations in a relational database.
  • If a word output as a replaceable word 19 does not exist in the dictionary 63, the associating unit 37 newly registers the word in the dictionary 63 and stores the association that it is a synonym of the input word 11. Thus, each time an input word 11 is input and replaceable words 19 are output, the associating unit 37 keeps registering synonyms in the dictionary 63.
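One way the relational-database option could look is sketched below using Python's built-in `sqlite3` module. The table names `word` and `synonym`, the column names, and the function `associate` are hypothetical; the source does not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one table of registered words, one table of associations.
conn.execute("CREATE TABLE word (text TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE synonym (
    word        TEXT NOT NULL REFERENCES word(text),
    replaceable TEXT NOT NULL REFERENCES word(text),
    PRIMARY KEY (word, replaceable))""")


def associate(conn, input_word, replaceable_words):
    """Register any missing words, then store input_word -> replaceable links."""
    for w in (input_word, *replaceable_words):
        conn.execute("INSERT OR IGNORE INTO word VALUES (?)", (w,))
    conn.executemany("INSERT OR IGNORE INTO synonym VALUES (?, ?)",
                     [(input_word, r) for r in replaceable_words])
    conn.commit()


associate(conn, "AB", ["XYZ", "GH", "UVW"])
rows = conn.execute(
    "SELECT replaceable FROM synonym WHERE word = ? ORDER BY replaceable",
    ("AB",)).fetchall()
```

`INSERT OR IGNORE` makes the operation safe to repeat, which suits an association unit that runs every time a word is searched.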
  • In this way, the dictionary 63 can hold synonyms for all of its registered words. For example, if 10,000 words are registered in the dictionary 63, the 10,000 words are input to the search program 65 one after another, the replaceable words 19 of each word are output, and the association unit 37 stores the synonym relations. The dictionary 63 thereby changes from a word dictionary in which words are simply registered into a synonym dictionary.
  • This operation may be performed each time a new word is registered in the dictionary 63, or may be performed periodically, for example, once a week or once a month.
  • Registration of new words in the dictionary 63 can be done, for example, by running a robot program or agent program that retrieves data from homepages on the Internet for a specific technical field, a specific private field, or a field in which a particular user is especially interested. In this way, the dictionary 63 becomes a synonym dictionary for specialized and private fields. With automatic word registration by the robot program or agent program and automatic synonym registration by the search program 65, the dictionary 63 automatically grows in vocabulary and has synonyms registered automatically.
  • the loop unit 39 inputs the replaceable word 19 and causes the search program 65 to input it again as the input word 11. That is, the replaceable word 19 is fed back as the input word 11.
  • In FIG. 3, when XYZ is output as a replaceable word 19, this XYZ is input again to the search program 65 as the input word 11.
  • In this way, the associating unit 37 can register a word as a synonym of one word and, at the same time, as a secondary synonym of another word.
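The feedback performed by the loop unit can be sketched as a breadth-first traversal in which depth 1 corresponds to a direct synonym and depth 2 to a secondary synonym. The function `synonym_chain`, the `max_depth` cut-off, and the toy mapping `direct` are assumptions for illustration; in the device, the lookup would be the search unit itself.

```python
from collections import deque


def synonym_chain(find_replaceable, start, max_depth=2):
    """Feed replaceable words back as inputs; return {word: depth from start}."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if depth[word] == max_depth:
            continue                      # do not expand beyond the cut-off
        for nxt in find_replaceable(word):
            if nxt not in depth:          # first visit fixes the shortest depth
                depth[nxt] = depth[word] + 1
                queue.append(nxt)
    del depth[start]
    return depth


# Hypothetical direct-synonym lookup standing in for the search unit.
direct = {"AB": ["XYZ", "GH"], "XYZ": ["QR", "AB"]}
levels = synonym_chain(lambda w: direct.get(w, []), "AB")
```

Tracking visited words keeps the loop from cycling when two words are replaceable with each other, as AB and XYZ are in this toy mapping.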
  • the above-described search device is based on the essence of human recognition. That is, when a human judges whether a word is a synonym of another, the human considers whether one can replace the other, and regards them as synonyms when it can.
  • the above-described search device automatically obtains a replaceable word for the input word by the search program 65.
  • the algorithm is very simple, using prefix and suffix searches, but it embodies in a search program the essence of how humans judge synonyms, namely whether a word can be replaced. No other algorithm exists that searches for synonyms based on this essence of human perception.
  • the above-described search device can be adapted to Japanese or to other languages such as English, French, German, and Chinese. In other words, the target language does not matter. Also, as described above, if the search device is started automatically, the dictionary can be kept up to date at all times. It is also possible to extract as replaceable words 19 not only synonyms, but also words that have a meaningful hierarchical relationship (inclusive relation), for example “elastic body” as a superordinate word of “rubber”.
  • FIG. 7 is a diagram illustrating a search device according to the second embodiment.
  • Here, a set is a set of character strings such as characters, words, clauses, phrases, or sentences. These sets are data stored in a storage device of a computer, and are stored as files, as mail, as web pages (homepages), or as databases.
  • control unit and the replaceable character string search unit described below may be realized by software, may be realized by hardware, or may be realized by a combination of software and hardware.
  • RDBMS Relational Database Management System
  • the implementation of the replaceable string list set is an efficient implementation of the association entity.
  • Examples of the method of collecting the character string sets held by the input character string set and the sampling character string set include the following examples, but are not limited thereto.
  • the existence of a character string as a set is the operating condition of the system.
  • Examples of the start timing of the generation operation of the replaceable character string set list are as follows, but the present invention is not limited to them.
  • An activation trigger is generated for the control unit. This trigger is generated by the user, by a change notification from the input character string set, by a change notification from the sampling character string set, or by periodic startup.
  • The control unit acquires input character strings from the input character string set. This acquisition is repeated for each character string in the input character string set; that is, loop processing is performed for each element (character string) of the set.
  • the control unit provides the character string obtained from the input character string set to the replaceable character string search unit.
  • the replaceable character string search unit searches for a character string list that can be replaced with the input character string by the operation described below and outputs the list to the control unit.
  • the replaceable character string search unit performs the same search as in the first embodiment on the sampled character string set.
  • the sampled character string set may be a dictionary similar to that in the first embodiment, or may be a database from which a subset of character strings matching the input character string by prefix, middle, or backward match can be extracted.
  • For example, the sampling character string set may be a database that accepts SQL statements or other queries.
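If the sampling character string set is held in an SQL database, the three kinds of match could be expressed with `LIKE` patterns, as in the following sketch using Python's built-in `sqlite3` module. The table name `sample`, the function `like_search`, and the sample rows are hypothetical. Note that SQLite's `LIKE` is case-insensitive for ASCII by default, and that input containing `%` or `_` would need escaping, which this sketch omits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (text TEXT)")   # hypothetical sampling set
conn.executemany("INSERT INTO sample VALUES (?)",
                 [("ABLM",), ("OPAB",), ("CDABEF",), ("XYZLM",)])


def like_search(conn, word, kind):
    """Return sampled strings matching `word` by forward, backward or middle match."""
    patterns = {
        "forward": word + "%",            # word at the beginning
        "backward": "%" + word,           # word at the end
        "middle": "_%" + word + "_%",     # at least one character on each side
    }
    rows = conn.execute(
        "SELECT text FROM sample WHERE text LIKE ? ORDER BY text",
        (patterns[kind],))
    return [r[0] for r in rows]


forward = like_search(conn, "AB", "forward")
backward = like_search(conn, "AB", "backward")
middle = like_search(conn, "AB", "middle")
```

The leading and trailing `_` in the middle pattern keep it from also returning the forward and backward matches.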
  • the replaceable character string search unit performs a secondary search similar to that of the first embodiment on the sampling character string set to obtain a replaceable character string.
  • the replaceable character string search unit assigns priorities using the number of appearances or the frequency of occurrence of the replaceable character strings obtained by the secondary search, creates a replaceable character string list, and outputs it to the control unit.
  • FIG. 8 is a diagram illustrating an example of frequency (ranking). It is assumed that five words are held in the sampling character string set as shown in FIG. 8 (a). Then, as shown in FIG. 8 (b), when the primary search is performed for the word “Busai”, two words, “Busai Railway” and “Busai Department Store”, are found, which yields the two connectable character strings “railway” and “department store”. A secondary search using these two connectable character strings yields the words “Budo Railway”, “Buto Department Store”, and “Koshisan Department Store”. Then, as shown in FIG. 8 (c), “Buto” and “Koshizo” are obtained as replaceable character strings.
  • the control unit adds the replaceable character string list obtained from the replaceable character string search unit to the replaceable character string list set.
  • the control unit extracts all character strings from the input character string set in order, obtains a replaceable character string list, and outputs it to the replaceable character string list set.
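  • the primary search, secondary search, and frequency ranking described above can be sketched roughly as follows. This is an illustration under assumptions, not the disclosed implementation: the function name, the use of simple substring matching, and the sample strings ("Alpha", "Beta", "Gamma") are all hypothetical.

```python
from collections import Counter


def replaceable_strings(word, sample_set):
    """Return (candidate, count) pairs ranked by how often each candidate appears."""
    # Primary search: strings that contain the input word.
    primary = [s for s in sample_set if word in s and s != word]
    # Connectable strings: what remains after removing the input word.
    connectable = {s.replace(word, "", 1) for s in primary}
    connectable.discard("")

    counts = Counter()
    for conn_str in connectable:
        # Secondary search: strings that contain a connectable string.
        for s in sample_set:
            if conn_str in s:
                candidate = s.replace(conn_str, "", 1)
                # Exclude the input word itself and empty leftovers.
                if candidate and candidate != word:
                    counts[candidate] += 1
    # Higher counts first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


SAMPLE_SET = [
    "Alpha Railway",
    "Alpha Department Store",
    "Beta Railway",
    "Beta Department Store",
    "Gamma Department Store",
]
```

  • with this sample set, a candidate that shares both connectable strings with the input word is ranked above one that shares only a single connectable string, which is the kind of ordering the frequency (ranking) step is meant to produce.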
  • the configuration of the concept relation extracting device according to the third embodiment can be realized using a personal computer 51 as shown in FIG.
  • however, the dictionary 63 in FIG. 1 is replaced by a sentence example database 64, and the search program 65 is replaced by a concept relation extraction program 91.
  • FIG. 10 is a diagram showing the configuration and operation of the concept relation extraction program 91.
  • the concept relationship extraction program 91 includes a multiple word input unit 93, a sentence example search unit 95, a concept relationship analysis unit 97, and a concept relationship output unit 99.
  • the multiple word input section 93 inputs a plurality of words into the memory 57.
  • the sentence example search unit 95 searches the sentence example database 64 for a sentence example in which a plurality of words input by the multiple word input unit 93 are combined. It is assumed that a plurality of sentence examples are registered and stored in the sentence example database 64 in advance.
  • the sentence example database 64 may be stored in a magnetic disk or an optical disk, or may exist in another computer or server connected via a network.
  • the concept relationship analysis unit 97 analyzes the relationship between the concepts of a plurality of words using the sentence examples searched by the sentence example search unit 95.
  • the concept relationship output unit 99 outputs the concept relationship analyzed by the concept relationship analysis unit 97 to a screen of a display device or a network.
  • the operation of the concept relation extraction program 91 will be described using a
  • suppose that the multiple word input unit 93 receives seven words as input words: “creatures”, “trees”, “animals”, “mammals”, “birds”, “metals”, and “cars”. It is also assumed that the sentence example database 64 stores 12 sentences in advance as shown in FIG.
  • the sentence example search unit 95 searches the sentence example database 64 for sentence examples in which each of the seven words input by the multiple word input unit 93 is used.
  • the sentence example search unit 95 converts the search result into a sentence-word correspondence table as shown in FIG.
  • the correspondence table is temporarily stored in the memory.
  • in the correspondence table, the sentence numbers and the sentence examples are arranged vertically.
  • the seven input words are arranged horizontally.
  • the × mark indicates that no sentence example combining the word with the sentence was found.
  • the ○ mark indicates that a sentence example combining the word with the sentence was found.
  • the number of sentence examples to which the word was connected is indicated as “number of sentence hits”.
  • the words in the correspondence table are sorted horizontally and stored in ascending order of the number of sentence example hits. For example, “car” is stored at the left end because its number of sentence example hits is 0. The two words “mammals” and “birds” have 3 sentence example hits, which is the maximum, so they are stored at the right end.
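  • building the sentence-word correspondence table and sorting the words by the number of sentence example hits can be sketched as follows. The representation is an assumption made for illustration: each sentence is modelled as a predicate such as "have DNA", and a word is treated as connected to it when the string "word predicate" is present in a hypothetical sentence example database.

```python
def build_table(words, predicates, database):
    """Build a word-to-row table of booleans and sort words by hit count."""
    # True corresponds to the "found" mark, False to the "not found" mark.
    table = {
        w: [f"{w} {p}" in database for p in predicates]
        for w in words
    }
    # Number of sentence example hits per word.
    hits = {w: sum(row) for w, row in table.items()}
    # Ascending order of hits; sorted() is stable, so ties keep input order.
    order = sorted(words, key=lambda w: hits[w])
    return table, hits, order


PREDICATES = ["have DNA", "leave offspring", "drink water"]
DATABASE = {
    "creatures have DNA",
    "creatures leave offspring",
    "animals have DNA",
    "animals leave offspring",
    "animals drink water",
}
```

  • in this sketch a word with no hits ends up at the left end of the order and the word with the most hits at the right end, mirroring the layout described for the correspondence table.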
  • Y is connected to sentence 6, and no other word is connected to sentence 6, so Y is a single concept.
  • A, B, C, D, and E are grouped as a set having common parts, and A, which has the smallest number of sentence example hits among A, B, C, D, and E, is positioned at the top of the group (the concept with the highest level of abstraction).
  • D is connected to sentence 1, sentence 3, and sentence 4, but nothing else is connected to all of sentence 1, sentence 3, and sentence 4, so it follows that D has no concrete concept under it. D is included in the group of A and the group of C. Since C is a concrete concept of A, D is positioned directly below C.
  • E is connected to sentences 1, 3, and 5, but nothing other than E is connected to all of sentences 1, 3, and 5, so it follows that E has no concrete concept under it. E is included in the group of A and the group of C, but since C is a concrete concept of A, E is positioned directly below C.
  • a character string having a small number of sentence example hits is regarded as a character string of a higher concept than a character string having a large number of sentence example hits. If the numbers of sentence example hits are the same, the character strings are regarded as being at the same level.
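  • one possible reading of the analysis above is sketched below. It assumes that a word is a higher concept of another word when its set of connected sentences is a non-empty proper subset of the other word's set, and that the direct parent is the candidate with the most connected sentences. This subset rule, the function name, and the sample data are interpretations introduced for illustration, not the stated procedure of the original.

```python
def build_hierarchy(hits):
    """Map each word to its direct higher concept, or None if it has none.

    `hits` maps a word to the set of sentence numbers it is connected to.
    """
    parents = {}
    for word, sentences in hits.items():
        # Higher-concept candidates: non-empty proper subsets of this word's set.
        candidates = [
            other
            for other, other_sentences in hits.items()
            if other_sentences and other_sentences < sentences
        ]
        # The closest candidate is the one with the most sentences;
        # the word itself is used only to break ties deterministically.
        parents[word] = max(
            candidates, key=lambda o: (len(hits[o]), o), default=None
        )
    return parents


HITS = {
    "creature": {1, 2},
    "animal": {1, 2, 3},
    "mammal": {1, 2, 3, 5},
    "bird": {1, 2, 3, 4},
    "tree": {1, 2, 6},
    "car": set(),
}
```

  • words with identical sentence sets get no parent-child link between them under this rule, which is consistent with treating equal hit counts as the same level.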
  • the conceptual relationship output unit 99 outputs the conceptual relationship 101 as shown in FIG. 13 to the screen of the display device.
  • the above is the configuration and operation of the concept relationship extraction program 91 shown in FIG.
  • the above-mentioned correspondence table may be stored in the memory 57 or the magnetic disk 61 and may be reused later.
  • the concept relation 101 output by the concept relation output unit 99 is stored in the memory 57 or the magnetic disk 61.
  • the information may be displayed on a display device. Alternatively, it may be sent to another computer via a network.
  • in the correspondence table described above, the ○ and × marks show only whether a sentence example combining the word and the sentence exists, but the correspondence table may instead be created based on the number of occurrences. That is, the × mark corresponds to a value of 0, and the ○ mark is replaced by the number of times the combination actually occurred. Weighting may then be performed using these counts in order to determine the hierarchical relationship between concepts. For example, using the tf·idf method, the importance of sentence examples that occur frequently may be reduced, and the importance of sentence examples that occur infrequently may be increased. That is, the concept relationship may be determined based on the ratio of the sentence examples.
  • Embodiment 4.
  • FIG. 14 is a configuration diagram showing a search / conceptual relation extracting apparatus having both the search program 65 shown in the first embodiment and the conceptual relation extraction program 91 shown in the third embodiment.
  • the search program 65 receives the input word 11.
  • the search program 65 searches the dictionary 63 or the sentence example database 64 and outputs the replaceable word 19 of the input word 11 according to the procedure described in the first embodiment.
  • for example, the words “creature”, “animal”, “mammal”, “bird”, “tree”, “human”, “dog”, “dove”, and “cherry” are output as the replaceable words 19.
  • the concept relation extraction program 91 inputs these replaceable words 19 and outputs a concept relation 101 by searching the dictionary 63 or the sentence example database 64.
  • the internal configuration of the concept relation extraction program 91 is the same as that of FIG. 10 described above. That is, the multiple word input unit 93 inputs the replaceable word 19.
  • the sentence example search unit 95 searches the dictionary 63 and the sentence example database 64 to create, for example, a correspondence table shown in FIG. 15 (a).
  • the correspondence table shown in Fig. 15 (a) is not sorted by the number of sentence example hits, so the sentence example search unit 95 sorts the words in ascending order of the number of sentence example hits and completes the correspondence table as shown in FIG. 15 (b).
  • Conceptual relationship analysis section 97 performs the same operation as that described in the third embodiment, and obtains the conceptual relationship of replaceable word 19 as a hierarchical structure shown in FIG.
  • the concept relation output unit 99 outputs the hierarchical structure shown in FIG. 16 as the concept relation 101.
  • a specific use example of the fourth embodiment is as described below.
  • the search program 65 can search for a replaceable word and present it to the user.
  • in the first embodiment, the replaceable words are simply displayed, whereas in the fourth embodiment, the hierarchical relation between the concepts of the replaceable words can be shown. For example, when the conceptual relationship 101 shown in FIG. 16 is displayed to the user, the user can know that the word “creature” exists as a superordinate concept of “animal”. Similarly, the user can know that words such as “mammals” and “birds” exist as subordinate concepts of “animals”.
  • the concept relation 101 may not only be displayed to the user, but the concept relation 101 may be output to the dictionary 63 or the sentence example database 64 and stored.
  • the storage of the concept relation 101 can be realized, for example, by attaching pointers between words that designate the superordinate / subordinate relationship.
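  • storing the superordinate / subordinate relationship with pointers between words could look like the following sketch. The class name `ConceptNode` and the helper functions are hypothetical; the original does not specify a data structure beyond the use of pointers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison avoids recursing through the links
class ConceptNode:
    word: str
    parent: Optional["ConceptNode"] = field(default=None, repr=False)
    children: List["ConceptNode"] = field(default_factory=list, repr=False)


def link(parent, child):
    """Record that `child` is a subordinate concept of `parent`."""
    child.parent = parent
    parent.children.append(child)


def ancestors(node):
    """Follow the parent pointers and return the superordinate words in order."""
    result = []
    current = node.parent
    while current is not None:
        result.append(current.word)
        current = current.parent
    return result
```

  • keeping both a parent pointer and a child list makes it cheap to move toward either more abstract or more concrete words from any entry.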
  • the concept relationship extracting program of FIG. 17 has a self-learning unit 103 added to the concept relationship extracting program of FIG. 14 (or FIG. 10).
  • Figures 14 to 16 described above show the case where all replaceable words are included in one hierarchical structure, but analyzing the conceptual relationship does not necessarily yield a single hierarchical structure. For example, if the sentence example database 64 does not contain the two sentences “creatures have DNA” and “creatures have descendants”, the number of sentence example hits for “creatures” in Fig. 15 (b) will be 0. In this case, two conceptual relationships 101 are generated: a hierarchical structure below “animals” and a hierarchical structure below “trees”.
  • suppose that the descriptions of “mammals”, namely “Mammals have DNA”, “Mammals leave offspring”, “Mammals drink water”, and “Mammals give birth to children”, do not exist in the sentence example database 64. In this case, the tree obtained as a result of the conceptual relationship analysis is one in which “human” and “dog” are placed directly under “animal”, as shown in Figure 18.
  • from this tree it can be seen that sentence 1, sentence 2, and sentence 3 are descriptions (combined sentences) that characterize “animal”, which is a superordinate concept of “human”, “dog”, and “bird”, and that “dog” and “bird” have nothing in common beyond being “animals”. However, among the sentences common to “human” and “dog” there is sentence 5, which is not connected to “animal”, so it can be inferred that some abstract concept exists under “animal” of which “human” and “dog” are concrete concepts.
  • the self-learning unit 103 therefore asks the user, “What is the concept that can be connected to sentence 1, sentence 2, sentence 3, and sentence 5?” Alternatively, since the superordinate concept of “human” and “dog” is already known to be “animal”, the question to the user may be “What is the concept that is an ‘animal’ and is connected to sentence 5?”
  • based on the user's answer, the self-learning unit 103 adds the sentences “Mammals have DNA”, “Mammals leave offspring”, “Mammals drink water”, and “Mammals give birth to children”, corresponding to sentence 1, sentence 2, sentence 3, and sentence 5, to the sentence example database 64.
  • the self-learning unit 103 corrects the tree structure, placing “human” and “dog” under “mammal” and “mammal” under “animal”.
  • suppose instead that the only sentence that does not exist in the sentence example database 64 is “Mammals give birth to children”, and that the other descriptions of “mammals”, “Mammals have DNA”, “Mammals leave offspring”, and “Mammals drink water”, do exist. In this case, “animal” and “mammal” are connected to exactly the same sentences, so they are treated as synonyms (or words at the same hierarchy level).
  • the tree structure obtained in this case is as shown in FIG.
  • the self-learning unit 103 examines the common part of “human”, “dog”, and “bird” in the same manner as above, and as a result asks “What is the concept that can be connected to sentence 1, sentence 2, sentence 3, and sentence 5?” or “What is the concept that is an ‘animal’ and is connected to sentence 5?” If the user answers “mammal” to the query, the self-learning unit 103 adds a new sentence corresponding to sentence 5, “Mammals give birth to children”, to the sentence example database 64, so that the tree becomes as shown in Figure 16.
  • the procedure of the self-learning function of the self-learning unit 103 is summarized as follows.
  • the self-learning unit 103 may execute the self-learning function only when the hierarchical structure does not become one as shown in (Example 1).
  • the self-learning unit 103 may execute the self-learning function only when there is a synonymous word in the hierarchical structure as shown in (Example 2).
  • the self-learning unit 103 may execute the self-learning function even when the hierarchical structure is uniquely determined as shown in (Example 3).
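  • the core of the self-learning function, detecting sentences shared by sibling words but not by their superordinate word and then registering the user's answer, can be sketched as follows. The function names, the pairwise comparison of siblings, and the sample sentence numbers are assumptions made for illustration; the original does not give the procedure at this level of detail.

```python
from itertools import combinations


def propose_question(parent, children, hits):
    """Return the sentence numbers to ask the user about, or None.

    A question is proposed when two children share a sentence that the
    parent is not connected to, which hints at a missing intermediate concept.
    """
    for a, b in combinations(sorted(children), 2):
        extra = (hits[a] & hits[b]) - hits[parent]
        if extra:
            return sorted(hits[parent] | extra)
    return None


def learn(answer, sentence_ids, hits):
    """Register the user's answer as a word connected to the given sentences."""
    hits.setdefault(answer, set()).update(sentence_ids)


# Hypothetical data: sentence 5 is shared by "human" and "dog" only.
HITS = {
    "animal": {1, 2, 3},
    "human": {1, 2, 3, 5, 7},
    "dog": {1, 2, 3, 5, 8},
    "bird": {1, 2, 3, 4},
}
```

  • after `learn` registers the answer, rerunning the conceptual relationship analysis on the updated data is what would place the new word between the parent and the two children.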
  • the terms indicating operations, such as “input”, “output”, “register”, “search”, and “acquire”, indicate operations of the computer or of the program (software) executed by the computer, and mean operations on a memory, disk, network, screen, database, and the like.
  • the program may be recorded on a recording medium, or may be transmitted online as a signal and distributed.
  • a synonym is automatically searched for as a replaceable word, so the user can find synonyms linked by a hidden relationship that the user could not easily find unaided. In other words, relationships that preconceptions would otherwise hide may be found.
  • if the user wants to know why a word was output as a synonym, an interactive interface (not shown) may be provided so that the user can ask the system why the word was found as a synonym. In that case, the interactive interface displays the primary word 13, the connectable word 15, and the secondary word 17 shown in Fig. 3, that is, the progress of the search, so that the user can be told why the word was found as a synonym.
  • a synonym dictionary can be automatically constructed as an application example of the present invention.
  • synonym dictionaries are manually compiled and require a great deal of effort. For this reason, it was not possible for individuals to own a thesaurus.
  • a personal private synonym dictionary can be automatically created.
  • synonym dictionaries for specialized fields and special fields can be created automatically and updated automatically.
  • a basic database for artificial intelligence can be constructed using a search device.
  • Synonym dictionaries provide clues for computers to know the meaning of words.
  • This synonym dictionary can be automatically constructed and used as a basic database for artificial intelligence.
  • according to the present invention, it is possible to input a plurality of character strings (a plurality of words) and to know the inclusion relation of the character strings (words).
  • the input given to a search engine can be changed to a keyword with a higher degree of abstraction or to a keyword with a higher degree of concreteness.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a search device that automatically performs processing that relates one character string to another character string according to whether they can be replaced and connected. Words containing an input word (11) are retrieved as primary words (13). Secondary words (17) are retrieved using connectable words (15), which are obtained by excluding the input word (11) from the primary words (13). Replaceable words (19) are obtained by excluding the connectable words (15) from the secondary words (17). The replaceable words (19) are defined as synonyms of the input word (11) and registered in a dictionary.
PCT/JP2001/005796 2000-07-06 2001-07-04 Searching device WO2002005131A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001269434A AU2001269434A1 (en) 2000-07-06 2001-07-04 Searching device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2000204568A JP2005099884A (ja) 2000-07-06 2000-07-06 検索装置
JP2000-204568 2000-07-06

Publications (1)

Publication Number Publication Date
WO2002005131A1 true WO2002005131A1 (fr) 2002-01-17

Family

ID=18701809

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2001/005796 WO2002005131A1 (fr) 2000-07-06 2001-07-04 Searching device

Country Status (3)

Country Link
JP (1) JP2005099884A (fr)
AU (1) AU2001269434A1 (fr)
WO (1) WO2002005131A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5339628B2 (ja) * 2010-01-20 2013-11-13 株式会社Kddi研究所 未知語を含む文章を分類するための文章分類プログラム、方法及び文章解析サーバ
JP7091685B2 (ja) * 2018-02-08 2022-06-28 富士通株式会社 検索処理プログラム、検索処理方法及び検索処理装置

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08147307A (ja) * 1994-11-22 1996-06-07 Gijutsu Kenkyu Kumiai Shinjoho Shiyori Kaihatsu Kiko 意味知識獲得装置
JPH11143875A (ja) * 1997-11-10 1999-05-28 Nec Corp 単語自動分類装置及び単語自動分類方法
JP2000137718A (ja) * 1998-11-04 2000-05-16 Nippon Telegr & Teleph Corp <Ntt> 単語の類似性判別方法および単語の類似性判別プログラムを記録した記録媒体

Also Published As

Publication number Publication date
JP2005099884A (ja) 2005-04-14
AU2001269434A1 (en) 2002-01-21

Similar Documents

Publication Publication Date Title
KR100666064B1 (ko) 인터랙티브 검색 쿼리 개선 시스템 및 방법
Alwaneen et al. Arabic question answering system: a survey
US20040117352A1 (en) System for answering natural language questions
US20150081277A1 (en) System and Method for Automatically Classifying Text using Discourse Analysis
US20070106499A1 (en) Natural language search system
CN112395395B (zh) 文本关键词提取方法、装置、设备及存储介质
Skusa et al. Extraction of biological interaction networks from scientific literature
KR100396826B1 (ko) 정보검색에서 질의어 처리를 위한 단어 클러스터 관리장치 및 그 방법
CN111611356A (zh) 信息查找方法、装置、电子设备及可读存储介质
CN112989208B (zh) 一种信息推荐方法、装置、电子设备及存储介质
JP2011118689A (ja) 検索方法及びシステム
CN111325018A (zh) 一种基于web检索和新词发现的领域词典构建方法
JPH0520362A (ja) 文書テキスト間の連鎖自動作成システム
Subhashini et al. Shallow NLP techniques for noun phrase extraction
Galvez et al. Term conflation methods in information retrieval: Non‐linguistic and linguistic approaches
JP2003150624A (ja) 情報抽出装置および情報抽出方法
WO2000026839A9 (fr) Modele evolue destine a l'extraction automatique des informations relatives au savoir-faire et aux connaissances depuis un document electronique
JP2004355550A (ja) 自然文検索装置、その方法及びプログラム
CN1114165C (zh) 中文文本中的字词分割方法
Rondon et al. Never-ending multiword expressions learning
KR20030006201A (ko) 홈페이지 자동 검색을 위한 통합형 자연어 질의-응답시스템
WO2002005131A1 (fr) Searching device
JP2005202924A (ja) 対訳判断装置、方法及びプログラム
JP3856388B2 (ja) 類義性計算方法、類義性計算プログラム、類義性計算プログラムを記録したコンピュータ読み取り可能な記録媒体
Biricik et al. A turkish automatic question answering system with question multiplexing: Ben bilirim

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP