WO2018088027A1 - Document search device, document search method, and computer program - Google Patents

Document search device, document search method, and computer program

Info

Publication number
WO2018088027A1
WO2018088027A1 PCT/JP2017/033316 JP2017033316W WO2018088027A1 WO 2018088027 A1 WO2018088027 A1 WO 2018088027A1 JP 2017033316 W JP2017033316 W JP 2017033316W WO 2018088027 A1 WO2018088027 A1 WO 2018088027A1
Authority
WO
WIPO (PCT)
Prior art keywords
search
words
document
document data
code
Prior art date
Application number
PCT/JP2017/033316
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
羽翔 毛
Original Assignee
株式会社野村総合研究所
Priority date
Filing date
Publication date
Application filed by 株式会社野村総合研究所
Priority to CN201780069191.9A (patent CN109923538B)
Publication of WO2018088027A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities

Definitions

  • This invention relates to a data processing technique, and more particularly to a technique for retrieving a document.
  • N-gram full-text search technology is widely used (see, for example, Patent Document 1).
  • Some search engines also provide a neighborhood (proximity) search, which retrieves documents in which a plurality of keywords specified by the user appear within a range specified by the user (hereinafter also referred to as the “neighborhood range”).
  • the N-gram search engine can perform neighborhood search based on the number of characters, but cannot perform neighborhood search based on the number of words.
  • the neighborhood range which is a neighborhood search condition, can be specified by the number of characters, but not by the number of words. Although the number of characters can be different for each word, there is a great need for specifying the neighborhood range by the number of words.
  • the present invention has been made in view of these problems, and a main object thereof is to provide a technique for realizing a word number-based neighborhood search while using an N-gram search engine.
  • A document search apparatus according to one aspect of the present invention includes a first document storage unit that stores a plurality of document data, each obtained by converting a plurality of mutually different words described in an original document into mutually different fixed-length codes.
  • The apparatus also includes a reception unit that accepts a search request that designates a plurality of words and specifies, by a number of words, a range in which the plurality of words should exist, and an acquisition unit that acquires the fixed-length code corresponding to each of the plurality of words specified by the search request.
  • The apparatus further includes a derivation unit that derives a character-count-based range according to the word-count-based range specified by the search request and the fixed code length, and a search unit that executes a neighborhood search conditioned on the codes of the plurality of words acquired by the acquisition unit and the character-count-based range derived by the derivation unit, and extracts document data satisfying the condition from among the plurality of document data stored in the first document storage unit.
  • Another aspect of the present invention is a document search method.
  • This method is executed by a computer capable of accessing a document storage unit that stores a plurality of document data, each of which is obtained by converting a plurality of different words described in an original document into different fixed-length codes.
  • The computer executes a step of receiving a search request that designates a plurality of words and specifies, by a number of words, a range in which the plurality of words should exist, a step of acquiring a fixed-length code corresponding to each of the plurality of words specified in the search request, and a step of deriving a character-count-based range according to the word-count-based range specified in the search request and the fixed code length
  • A search engine included in the document search apparatus executes document search by the N-gram method. N-gram search engines do not recognize word breaks. For example, a search for “eat” hits “beat”, “heat”, and “beaten” as well as “eat” itself.
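  • The behaviour described above can be illustrated with a minimal sketch of position-based 2-gram matching. The function names `build_index` and `find` and the choice of N = 2 are assumptions made for illustration; the description does not specify how the engine is implemented.

```python
from collections import defaultdict


def build_index(text, n=2):
    """Map every n-character substring of text to its start positions."""
    index = defaultdict(list)
    for pos in range(len(text) - n + 1):
        index[text[pos:pos + n]].append(pos)
    return index


def find(index, keyword, n=2):
    """Return start positions of keyword, using only the n-gram index.

    The keyword is split into overlapping n-grams, and a hit is reported
    where those n-grams occur at consecutive positions. Word boundaries
    play no role. Keywords shorter than n are not handled in this sketch.
    """
    grams = [keyword[i:i + n] for i in range(len(keyword) - n + 1)]
    if not grams:
        return []
    starts = []
    for pos in index.get(grams[0], []):
        if all((pos + k) in index.get(gram, []) for k, gram in enumerate(grams)):
            starts.append(pos)
    return starts
```

  • With the text "he has beaten the heat", a search for "eat" reports positions inside "beaten" and "heat", even though "eat" never appears as a separate word.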
  • the N-gram search engine does not consider the number of words in the neighborhood search. For example, “eat pathologically you are discovering” and “eat it much fast you can hunt yourself” have the same number of characters between “eat” and “you”, but the number of words between them is 1 and 3.
  • An N-gram search engine can perform neighborhood searches based on the number of characters, but cannot perform neighborhood searches based on the number of words.
  • the user designates a plurality of keywords and a range in which the plurality of keywords should exist (that is, a neighborhood range) as a condition.
  • This range can be said to indicate the degree of proximity of a plurality of keywords, and is hereinafter referred to as “distance”.
  • the distance specified by the number of characters, in other words, the neighborhood range based on the number of characters is called “character distance”, and the distance specified by the number of words, in other words, the neighborhood range based on the number of words is called “word distance”.
  • An N-gram search engine performs neighborhood search based on the number of characters. For example, when a neighborhood search is executed on condition of the keyword “eat”, the keyword “you”, and the character distance “15”, a document including both “eat” and “you” within a range of 15 characters is hit.
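  • A character-distance neighborhood search of this kind can be sketched as follows. This is only an illustration under one assumption that the description does not spell out: the "range" is measured from the first character of the earliest keyword to the last character of the latest keyword. The names `occurrences` and `char_neighborhood_hit` are hypothetical.

```python
from itertools import product


def occurrences(text, keyword):
    """Start offsets of every (possibly overlapping) occurrence of keyword."""
    found, start = [], text.find(keyword)
    while start != -1:
        found.append(start)
        start = text.find(keyword, start + 1)
    return found


def char_neighborhood_hit(text, keywords, char_distance):
    """True if one occurrence of each keyword fits in a window of
    char_distance characters (keywords included in the window)."""
    if not keywords:
        return False
    spans = [[(p, p + len(k)) for p in occurrences(text, k)] for k in keywords]
    # Try every combination of one occurrence per keyword.
    return any(
        max(end for _, end in combo) - min(start for start, _ in combo) <= char_distance
        for combo in product(*spans)
    )
```

  • Trying every combination of occurrences is exponential in the worst case; a production engine would use a sliding window over sorted positions instead.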
  • The N-gram search engine of the embodiment hits only when the whole keyword matches as one contiguous character string. For example, in the case of 3-gram, “aboveboard” is not hit in a search using the keyword “discarded”, even though both strings contain the 3-gram “ard”.
  • a neighborhood search based on the number of words is realized using an N-gram search engine.
  • For example, the phrase “computer in car” is composed, in the original Japanese, of the three words “car”, “ni” (a particle), and “computer”. A neighborhood search with the keywords “car” and “computer” and the word distance “3” should hit documents that contain this phrase.
  • the word distance “2” means that two keywords are adjacent to each other.
  • the word distance “N (N is 3 or more)” means that a plurality of keywords are present in N words including the keywords.
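  • The definition of word distance given above can be written down directly as a reference check on a list of words. This sketch is not the mechanism of the embodiment (which works on coded text through a character-based engine); it only makes the intended semantics concrete. The function name is hypothetical.

```python
def word_neighborhood_hit(words, keywords, word_distance):
    """True if all keywords occur inside some run of word_distance
    consecutive words, counting the keywords themselves."""
    for start in range(len(words)):
        window = words[start:start + word_distance]
        if all(keyword in window for keyword in keywords):
            return True
    return False
```

  • For the three-word sequence "car", "ni", "computer", the keywords "car" and "computer" are hit with word distance 3 but not with word distance 2, because they are not adjacent.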
  • the document search apparatus stores a document obtained by converting each word of the original document into a fixed-length code as a plurality of documents to be searched.
  • the document search device converts the keyword specified in the search request into a fixed-length code.
  • the character distance is derived based on the word distance specified in the search request and the fixed code length.
  • the document search apparatus inputs a keyword code and a character distance to an N-gram search engine, and performs a neighborhood search based on the number of characters.
  • This makes it possible to realize a neighborhood search based on the number of words while using an existing N-gram search engine.
  • In other words, a neighborhood search based on the number of words can be realized using an existing search engine that performs a neighborhood search based on the number of characters.
  • FIG. 1 shows a document search system 10 of the first embodiment.
  • the document search system 10 includes a document search device 12 and a user terminal 14.
  • the document search device 12 is an information processing device that provides a document search service such as keyword search or neighborhood search.
  • the document search device 12 may be a server that provides a patent document search service via the Internet.
  • the user terminal 14 is an information processing apparatus operated by a person who uses the document search service provided by the document search apparatus 12 (hereinafter referred to as the “user”).
  • the user terminal 14 may be, for example, a PC, a smartphone, or a tablet terminal.
  • the document search device 12 and the user terminal 14 are connected via a communication network 16 including a LAN, a WAN, the Internet, and the like. Although one user terminal 14 is illustrated in FIG. 1, a plurality of user terminals 14 operated by a plurality of users may actually be connected to the document search device 12.
  • FIG. 2 is a block diagram showing a functional configuration of the document search device 12 of FIG.
  • the document search device 12 includes a control unit 20, a storage unit 22, and a communication unit 24.
  • the control unit 20 executes various data processing related to document search.
  • the storage unit 22 is a storage area that stores data that is referred to or updated by the control unit 20.
  • the communication unit 24 communicates with an external device according to a predetermined communication protocol.
  • the control unit 20 transmits / receives data to / from the user terminal 14 via the communication unit 24.
  • each block of the control unit 20 may be implemented as a computer program, and the computer program may be installed in the storage of the document search device 12.
  • the function of each block of the control unit 20 may be exhibited by the CPU of the document search device 12 reading the computer program into the main memory and executing it.
  • the storage unit 22 may be realized by the main memory or storage of the document search device 12.
  • the document search device 12 may be realized by a plurality of devices such as a web server, an application server, and a database server cooperating via a communication network.
  • the storage unit 22 includes a dictionary storage unit 26, an original document storage unit 28, and a coded document storage unit 30.
  • the original document storage unit 28 stores a plurality of original document data.
  • the original document data is document data before a word described in the document is converted into a code.
  • the original document data is document data having contents at the time of creation.
  • the original document data includes documents written in various languages such as Japanese and English.
  • the original document data may include published patent gazettes and patent gazettes, each together with its management data (such as various numbers).
  • the dictionary storage unit 26 stores correspondence between a plurality of words that can be included in the original document data and a plurality of codes.
  • This word is a morpheme and includes words defined in various languages such as Japanese and English.
  • FIG. 3 schematically illustrates a configuration example of the dictionary storage unit 26.
  • the dictionary storage unit 26 stores dictionary data in which a plurality of different words in various languages are associated with different fixed-length codes.
  • the code can be said to be a unique ID of each of a plurality of words.
  • the code of the embodiment is 11-byte fixed-length data, specifically, a character string with 11 characters fixed.
  • a special value (in other words, a delimiter) that is not used except for the head is set at the head (for example, the first byte) of each code.
  • the special value may be a value (a sequence of bits) that is not used in the 2nd to 11th bytes of the code. In the embodiment, the special value at the beginning of the code is indicated by “#”.
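  • A minimal sketch of such a dictionary is shown below. The description fixes only the code length (11 characters) and the leading "#". Filling the remaining ten characters with uppercase hexadecimal digits is inferred from the sample codes in the description, and assigning sequential numbers to words is purely an assumption. `make_code` and `build_dictionary` are hypothetical names.

```python
CODE_LENGTH = 11   # fixed code length of the embodiment
DELIMITER = "#"    # special value used only at the head of a code


def make_code(word_id):
    """Format a non-negative word number as '#' plus ten hex digits."""
    if not 0 <= word_id < 16 ** (CODE_LENGTH - 1):
        raise ValueError("word_id does not fit in the fixed code length")
    # Hex digits never contain '#', so the delimiter stays unique to the head.
    return DELIMITER + format(word_id, "010X")


def build_dictionary(words):
    """Associate each distinct word with a distinct fixed-length code."""
    dictionary = {}
    for word in words:
        if word not in dictionary:
            dictionary[word] = make_code(len(dictionary) + 1)
    return dictionary
```

  • Because every code has the same length, the number of characters between two codes in coded text is always a multiple of the code length, which is what later allows a word count to be converted into a character count.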
  • The coded document storage unit 30 stores a plurality of coded document data, that is, document data in which a plurality of different words described in the original document data have been converted into different fixed-length codes (hereinafter also referred to as “coded document data”).
  • The coded document data may include, for example, coded data of a published patent gazette, in which the plurality of words described in the original published patent gazette are converted into codes, and coded data of a patent gazette, in which the plurality of words described in the original patent gazette are converted into codes.
  • An example of coded document data will be described with reference to FIG.
  • When the original document data includes the character string “My invention” (three words “I”, “No”, and “Invention” in the original Japanese), the character string is recorded as “#0024F76DA7#0024F76DD8#0024F76DA6” in the coded document data.
  • Each of the plurality of coded document data stored in the coded document storage unit 30 is associated with the original document data before conversion stored in the original document storage unit 28.
  • Each of the plurality of pieces of encoded document data may include an identifier of the corresponding original document data, or may include an address (that is, a pointer) on the memory of the corresponding original document data.
  • the control unit 20 includes a document conversion unit 32, a search request reception unit 34, a code acquisition unit 36, a character distance derivation unit 38, a search instruction unit 40, a search execution unit 42, and a search result providing unit 44.
  • the control unit 20 may include a function of a known web server.
  • the document conversion unit 32 generates coded document data from the original document data stored in the original document storage unit 28. For example, the document conversion unit 32 executes a known morphological analysis process to identify a plurality of words described in the original document data.
  • the document conversion unit 32 refers to the dictionary data in the dictionary storage unit 26 and generates coded document data by replacing a plurality of words described in the original document data with a fixed-length code corresponding to each word.
  • the document conversion unit 32 stores the generated encoded document data in the encoded document storage unit 30.
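  • The conversion performed by the document conversion unit 32 can be sketched as below. Splitting on whitespace is only a stand-in for the morphological analysis mentioned in the description, and `encode_document` is a hypothetical name. Words missing from the dictionary are not handled here and raise `KeyError`.

```python
def encode_document(text, dictionary):
    """Replace every word of text with its fixed-length code.

    text       -- original document text (words separated by spaces here)
    dictionary -- mapping from word to fixed-length code
    """
    words = text.split()  # stand-in for morphological analysis
    return "".join(dictionary[word] for word in words)
```

  • The codes are concatenated without separators, so a document of N words always becomes a string of N × 11 characters.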
  • the search request reception unit 34 receives a search request transmitted from the user terminal 14.
  • the search request of the embodiment is a query message that requests a neighborhood search. Specifically, in the search request, a plurality of words are specified, and a range in which the plurality of words should exist is specified by the number of words. In other words, the search request of the embodiment specifies a plurality of keywords and a word distance between these keywords.
  • the code acquisition unit 36 refers to the dictionary data in the dictionary storage unit 26 and acquires a code corresponding to each of a plurality of words specified in the search request received by the search request reception unit 34.
  • The character distance deriving unit 38 derives the character distance that serves as the condition for the neighborhood search according to the word distance specified in the search request received by the search request receiving unit 34 and the fixed code length (11 characters in the embodiment). Specifically, the character distance deriving unit 38 derives the result of (word distance × code length) as the character distance.
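  • The derivation is a single multiplication. It works because every word in the coded document occupies exactly one code length, so N consecutive words span exactly N × 11 characters. The function name below is hypothetical.

```python
CODE_LENGTH = 11  # fixed code length of the embodiment


def derive_char_distance(word_distance, code_length=CODE_LENGTH):
    """Convert a word-count range into a character-count range."""
    return word_distance * code_length
```

  • For example, a word distance of 3 becomes a character distance of 33, which is exactly the length of three concatenated codes.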
  • the search instruction unit 40 and the search execution unit 42 cooperate with each other to function as a search unit that executes various search processes including neighborhood search.
  • The search instruction unit 40 inputs, to the search execution unit 42, a search instruction that designates the plurality of codes acquired by the code acquisition unit 36 as keywords and designates the character distance derived by the character distance deriving unit 38.
  • the search execution unit 42 is a search engine that executes document search processing by the N-gram method in accordance with the search instruction input from the search instruction unit 40.
  • the search execution unit 42 executes a neighborhood search based on the character distance regardless of whether or not a neighborhood search based on the word distance is requested from the user terminal 14.
  • the search execution unit 42 executes a neighborhood search on the condition of a plurality of keywords specified by the search instruction and the character distance.
  • the search execution unit 42 extracts coded document data that satisfies the above conditions from a plurality of coded document data stored in the coded document storage unit 30.
  • the search execution unit 42 extracts original document data corresponding to the encoded document data extracted from the encoded document storage unit 30 (that is, hit the neighborhood search) from the original document storage unit 28.
  • the search execution unit 42 outputs the original document data extracted from the original document storage unit 28 and / or the encoded document data extracted from the encoded document storage unit 30 to the search result providing unit 44.
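  • Putting the pieces together, the search path for a request whose words are all in the dictionary can be sketched as follows. Plain substring search stands in for the N-gram engine, whitespace splitting stands in for morphological analysis, and the window is assumed to run from the start of the earliest code to the end of the latest one. All function names are hypothetical.

```python
from itertools import product

CODE_LENGTH = 11  # fixed code length of the embodiment


def encode(text, dictionary):
    """Replace each whitespace-separated word with its fixed-length code."""
    return "".join(dictionary[word] for word in text.split())


def positions(text, key):
    """Start offsets of every occurrence of key in text."""
    found, start = [], text.find(key)
    while start != -1:
        found.append(start)
        start = text.find(key, start + 1)
    return found


def char_hit(text, keys, char_distance):
    """True if one occurrence of every key fits in a char_distance window."""
    if not keys:
        return False
    spans = [[(p, p + len(k)) for p in positions(text, k)] for k in keys]
    return any(
        max(end for _, end in combo) - min(start for start, _ in combo) <= char_distance
        for combo in product(*spans)
    )


def word_distance_search(originals, dictionary, keywords, word_distance):
    """Return the IDs of original documents hit by a word-distance search."""
    coded = {doc_id: encode(text, dictionary) for doc_id, text in originals.items()}
    keys = [dictionary[word] for word in keywords]   # code acquisition
    char_distance = word_distance * CODE_LENGTH      # character distance derivation
    # The shared document ID plays the role of the link back to the original.
    return [doc_id for doc_id, text in coded.items()
            if char_hit(text, keys, char_distance)]
```

  • In this sketch the coded documents are regenerated on every call for brevity; in the embodiment they are generated in advance and kept in the coded document storage unit 30.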
  • the search result providing unit 44 transmits the original document data extracted by the search execution unit 42 to the user terminal 14 that is the search request source. For example, identification information (such as various management numbers) indicating the original document data or at least a part of the text of the original document data may be transmitted to the user terminal 14.
  • The search result providing unit 44 may transmit information on the coded document data extracted by the search execution unit 42 (for example, a management number common to the original document data) to the user terminal 14, together with the original document data or instead of the original document data.
  • The portions of the original document data in which the plurality of keywords indicated by the search instruction appear may be highlighted.
  • The coded document data stored in the coded document storage unit 30 is data in which the codes associated in advance with the words recorded in the original document data are recorded.
  • When a word recorded in the original document data has no corresponding code in the dictionary data (such a word is referred to as a “specific word”), the document conversion unit 32 decomposes the specific word into a plurality of single-character words. A single-character word is a word composed of one character, and includes, for example, “a”, “b”, “a”, “i”, “day”, “month”, and the like (the latter examples are single characters in the original Japanese).
  • In the dictionary data of the dictionary storage unit 26, a plurality of codes corresponding to a plurality of single-character words are stored in advance.
  • The document conversion unit 32 refers to the dictionary data in the dictionary storage unit 26 and records the code associated with each of the plurality of single-character words in the coded document data.
  • Similarly, when the words specified in the search request include a specific word, that is, a word that does not have a corresponding code, the code acquisition unit 36 decomposes it into a plurality of single-character words and acquires the code of each single-character word.
  • In this case, the character distance deriving unit 38 expands the word distance specified in the search request according to the number of single-character words and derives the character distance from the new word distance. For example, when the new word distance is “7”, “77”, which is the product of the new word distance and the fixed code length “11”, is derived as the character distance.
  • This prevents the word distance specified by the user from being effectively shortened as a result of decomposing the specific word into a plurality of single-character words, so a decrease in search accuracy can be avoided.
  • The search instruction unit 40 inputs, to the search execution unit 42, a neighborhood search instruction whose conditions are the codes of the one or more words that have a corresponding code among the plurality of words specified in the search request, the codes of the plurality of single-character words, and the character distance derived by the character distance deriving unit 38.
  • the search execution unit 42 executes neighborhood search based on these conditions, and extracts coded document data satisfying these conditions from the coded document storage unit 30.
  • the search instruction unit 40 further causes the search execution unit 42 to execute a new neighborhood search using one or more coded document data extracted by the neighborhood search as a population. Specifically, a new neighborhood search is further executed on condition that the codes of a plurality of single character words are adjacent. For example, when a certain specific word is decomposed into four single-character words, a new neighborhood search based on the word distance “4” (character distance “44”) is further executed.
  • the search execution unit 42 executes a new neighborhood search according to the search instruction and narrows down the results of the previous neighborhood search.
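  • The two distances involved when a specific word is decomposed can be computed as in the sketch below. The word distance for the first search grows by the number of single-character words minus one for each specific word, and each specific word gets its own adjacency distance for the narrowing search. `derive_distances` and its parameters are hypothetical names.

```python
CODE_LENGTH = 11  # fixed code length of the embodiment


def derive_distances(keywords, known_words, word_distance):
    """Return (first_char_distance, narrowing_char_distances).

    keywords      -- words specified in the search request
    known_words   -- set of words that have a code in the dictionary
    word_distance -- word distance specified by the user
    """
    expanded = word_distance
    narrowing = []
    for word in keywords:
        if word not in known_words:          # specific word
            singles = list(word)             # one single-character word per character
            expanded += len(singles) - 1     # one word became len(singles) words
            narrowing.append(len(singles) * CODE_LENGTH)
    return expanded * CODE_LENGTH, narrowing
```

  • With a user-specified word distance of 10 and one specific word of three characters, the first search uses word distance 12 (character distance 132) and the narrowing search uses character distance 33.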
  • FIG. 4 is a flowchart showing the operation of the document search apparatus 12 of the first embodiment. This figure shows the operation when generating coded document data.
  • Original document data to be included in the search population (for example, a newly published patent publication) is stored in the original document storage unit 28.
  • the document conversion unit 32 waits until new original document data is stored in the original document storage unit 28 (N in S10).
  • When the document conversion unit 32 detects that new original document data is stored in the original document storage unit 28 (Y in S10), it executes morphological analysis processing on the character string described in the new original document data and extracts a plurality of words described in the new original document data (S11).
  • the document conversion unit 32 refers to the dictionary data and acquires a code corresponding to each word extracted in S11 (S12). When a code corresponding to at least one word (referred to as “specific word”) included in the original document data is not defined in the dictionary data (Y in S14), the document conversion unit 32 decomposes the specific word into a plurality of single-character words and acquires the codes corresponding to the single-character words (S16). If the specific word does not exist in the original document data, that is, if all the words of the original document data are defined in the dictionary data (N in S14), S16 is skipped. The document conversion unit 32 stores the encoded document data obtained by converting each word (which may include a single character word) of the original document data into a fixed-length code in the encoded document storage unit 30 (S18).
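  • Steps S11 to S18 can be sketched as a single function. Whitespace splitting again stands in for morphological analysis, the dictionary is assumed to contain a code for every single character that can occur, and `convert_document` is a hypothetical name.

```python
def convert_document(text, dictionary):
    """Convert original text into coded document data (S11 to S18)."""
    codes = []
    for word in text.split():              # S11: extract words
        code = dictionary.get(word)        # S12: look up the code
        if code is None:                   # Y in S14: specific word
            # S16: decompose into single-character words and code each one
            codes.extend(dictionary[char] for char in word)
        else:
            codes.append(code)
    return "".join(codes)                  # S18: the caller stores this result
```

  • A specific word of k characters therefore occupies k codes in the coded document, which is why the word distance has to be expanded at search time.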
  • FIG. 5 is also a flowchart showing the operation of the document search apparatus 12 of the first embodiment.
  • the figure shows the operation at the time of search.
  • the search request reception unit 34 stands by until a search request transmitted from the user terminal 14 is received (N in S20).
  • When the search request receiving unit 34 receives a search request (specifically, a neighborhood search request specifying a plurality of words and a word distance) (Y in S20), the code acquisition unit 36 refers to the dictionary data and identifies a plurality of codes corresponding to the plurality of words specified in the search request (S22).
  • When a code corresponding to at least one of the specified words (a specific word) is not defined in the dictionary data (Y in S24), the code acquisition unit 36 sets a special flag provided in a predetermined area of the memory to on (S26).
  • The code acquisition unit 36 decomposes the specific word into a plurality of single character words and acquires codes corresponding to the single character words (S28).
  • the character distance deriving unit 38 expands the word distance specified in the search request according to the number of single character words (S30). Specifically, the word distance is increased by the difference between the number of single-character words and the number of specific words. If the specific word is not included in the word specified in the search request, that is, if all the specified words are defined in the dictionary data (N in S24), S26 to S30 are skipped.
  • the character distance deriving unit 38 derives a character distance for neighborhood search based on the word distance specified in the search request or the word distance expanded in S30 and the fixed code length (S32).
  • The search instruction unit 40 inputs, to the search execution unit 42, a neighborhood search instruction that designates as keywords the codes of the plurality of words specified in the search request (and, if there is a specific word, the codes of the plurality of single-character words obtained by decomposing it) and that further designates the derived character distance.
  • The search execution unit 42 executes a character-count-based neighborhood search by the N-gram method, using the plurality of coded document data stored in the coded document storage unit 30 as a population (S34).
  • the search execution unit 42 extracts the coded document data hit in the neighborhood search from the coded document storage unit 30 and extracts the original document data corresponding to the extracted coded document data from the original document storage unit 28.
  • If the special flag is on (Y in S36), the search instruction unit 40 causes the search execution unit 42 to further execute a narrowing search, that is, a new neighborhood search that uses the one or more coded document data extracted in the neighborhood search of S34 as a population, on condition that the codes of the plurality of single-character words are adjacent (S38).
  • the search instruction unit 40 turns off the special flag (S40). If the special flag is off (N in S36), S38 and S40 are skipped.
  • the search result providing unit 44 transmits information related to the encoded document data extracted by the search execution unit 42 and / or information related to the original document data to the user terminal 14 as a vicinity search result (S42).
  • the search result providing unit 44 transmits the result of the refinement search to the user terminal 14 as a neighborhood search result.
  • the result of the neighborhood search based on the word distance designated by the user is presented to the user.
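  • The search-time flow of FIG. 5 can be sketched end to end as below. Substring search stands in for the N-gram engine, a non-empty list of decomposed groups plays the role of the special flag, and the dictionary is assumed to hold a code for every single character. All names are hypothetical.

```python
from itertools import product

CODE_LENGTH = 11  # fixed code length of the embodiment


def positions(text, key):
    """Start offsets of every occurrence of key in text."""
    found, start = [], text.find(key)
    while start != -1:
        found.append(start)
        start = text.find(key, start + 1)
    return found


def char_hit(text, keys, char_distance):
    """True if one occurrence of every key fits in a char_distance window."""
    if not keys:
        return False
    spans = [[(p, p + len(k)) for p in positions(text, k)] for k in keys]
    return any(
        max(end for _, end in combo) - min(start for start, _ in combo) <= char_distance
        for combo in product(*spans)
    )


def search_flow(coded_docs, dictionary, keywords, word_distance):
    """Return IDs of coded documents hit by the two-stage search."""
    keys, groups = [], []
    for word in keywords:                                   # S22
        if word in dictionary:
            keys.append(dictionary[word])
        else:                                               # Y in S24 (flag on, S26)
            parts = [dictionary[char] for char in word]     # S28
            keys.extend(parts)
            groups.append(parts)
            word_distance += len(parts) - 1                 # S30
    char_distance = word_distance * CODE_LENGTH             # S32
    hits = [doc_id for doc_id, text in coded_docs.items()
            if char_hit(text, keys, char_distance)]         # S34
    for parts in groups:                                    # Y in S36
        limit = len(parts) * CODE_LENGTH                    # adjacency condition
        hits = [doc_id for doc_id in hits
                if char_hit(coded_docs[doc_id], parts, limit)]  # S38
    return hits                                             # reported in S42
```

  • The narrowing stage matters: a document in which the single characters of the specific word are scattered can pass the first search and is only rejected by the adjacency check.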
  • For example, the document search device 12 performs a first neighborhood search on the condition of the five words “governor of Tokyo”, “small”, “pond”, “Mr.”, and “excursion” and the character distance 132, and extracts the coded document data that satisfies this condition.
  • Next, the document search device 12 executes a second neighborhood search, using that coded document data as a population, under the conditions of “small”, “pond”, and “Mr.” and the word distance 3 (character distance 33).
  • The second neighborhood search with the word distance 3 is performed under the condition that “small”, “pond”, and “Mr.” are adjacent to each other.
  • As a result, coded document data (and original document data) in which “governor of Tokyo”, “small”, “pond”, “Mr.”, and “excursion” are described within the word distance 12 (substantially the word distance 10), and in which “small”, “pond”, and “Mr.” are adjacent, is extracted.
  • If the search execution unit 42 supports specifying the order of keywords, it is desirable to further specify, in the second neighborhood search, that the three single-character words “small”, “pond”, and “Mr.” appear in this order. This makes it easier to obtain search results that are more suitable for the search conditions specified by the user.
  • As an alternative to the second neighborhood search, the one or more original document data corresponding to the one or more coded document data obtained by the first neighborhood search may be used as a population, and those containing the keyword “Mr. Koike” may be extracted from them by a normal search.
  • If the search engine allows a wild card (here, “*”) to be specified between keywords, a search instruction that designates the search phrase “#(Governor of Tokyo)*#(small)#(pond)#(Mr.)*#(excursion)” (where #(word) indicates the code of a word) and the character distance 132 (word distance 12) may be input to the search execution unit 42. Also in this case, it becomes easier to obtain a search result that further matches the search condition specified by the user.
  • According to the document search device 12 of the first embodiment, a neighborhood search specified by the number of words can be realized using an existing N-gram search engine that supports neighborhood search specified by the number of characters but not by the number of words. Therefore, it is not necessary to modify an existing search engine or purchase a new search engine in order to cope with a neighborhood search in which the number of words is specified.
  • the number of words is more intuitive for the user than the number of characters, and is highly convenient as a search condition.
  • a neighborhood search for specifying the number of words which is convenient for users in English-speaking countries, can be realized using an N-gram search engine.
  • In the document search device 12, a delimiter character used only at the head is set at the head of each of the plurality of codes corresponding to the plurality of words, so that a match is always determined from the beginning of a code when searching for the code. This prevents a match from being determined from the middle of a code (i.e., from the middle of a word) and prevents a match from being determined across one code and the next. Further, according to the document search device 12, a neighborhood search can be realized even when the user designates a word for which no corresponding code exists.
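  • The effect of the head delimiter can be seen in a small sketch. Four-character code bodies are used here only to keep the example short (the embodiment uses 11-character codes), and the helper names are hypothetical.

```python
def concat(code_bodies, delimiter=""):
    """Concatenate code bodies, each optionally prefixed by a delimiter."""
    return "".join(delimiter + body for body in code_bodies)


def hits(coded_text, query_body, delimiter=""):
    """Substring search for one code, as a character-based engine would do."""
    return (delimiter + query_body) in coded_text
```

  • Without a delimiter, the bodies "0012" and "3400" concatenate to "00123400", which contains "1234" across the boundary of the two codes and produces a false hit. With "#" at the head of every code, the query "#1234" can only match at the start of a code.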
  • the document search apparatus 12 according to the second embodiment is different from the first embodiment in the processing when the code corresponding to the word is not defined in the dictionary data.
  • the configuration of the document search system 10 in the second embodiment and the functional blocks of the document search device 12 are the same as those in the first embodiment (FIGS. 1 and 2).
  • the description of the configuration overlapping with that of the first embodiment will be omitted as appropriate, and differences from the first embodiment will be mainly described.
  • In the second embodiment, when the code corresponding to a word in the original document data (referred to as a “specific word”) is not defined in the dictionary data, the document conversion unit 32 skips code conversion of the specific word and performs code conversion for the next word. That is, the document conversion unit 32 does not record a code for the specific word in the coded document data. In other words, only the words that are associated with a code in the dictionary data, among the words recorded in the original document data, are recorded in the coded document data.
  • When the codes corresponding to some of the plurality of words specified in the search request (referred to as specific words) are undefined in the dictionary data, the code acquisition unit 36 skips code conversion of the specific words and performs code conversion for the next word.
  • The search instruction unit 40 causes the search execution unit 42 to perform a neighborhood search on the condition of the codes of the words for which a corresponding code is defined, among the plurality of words specified in the search request, and the character distance obtained by converting the word distance specified in the search request.
  • the search execution unit 42 extracts one or more original document data corresponding to one or more encoded document data satisfying the above conditions from the original document storage unit 28 as the execution result of the neighborhood search.
  • The search instruction unit 40 inputs, to the search execution unit 42, an instruction for a narrowing search that uses the one or more original document data extracted by the neighborhood search as a population and designates the specific word as a keyword.
  • the search execution unit 42 performs a narrow search (here, a general keyword search), and extracts original document data including a specific word from one or more original document data extracted by the proximity search.
  • the search result providing unit 44 transmits the result of the narrow search by the search execution unit 42 to the user terminal 14 that is the request source for the proximity search.
  • FIG. 6 is a flowchart showing the operation of the document search device 12 of the second embodiment. This figure corresponds to FIG. 4 and shows the operation at the time of generating coded document data. S50 to S52 in the figure are the same as S10 to S12 in FIG.
  • When a code corresponding to at least one word (referred to as a “specific word”) included in the original document data is not defined in the dictionary data (Y in S54), the document conversion unit 32 skips the processing of the specific word, so that no code relating to the specific word is stored in the coded document data (S56). If no specific word exists in the original document data, that is, if all the words of the original document data are associated with codes (N in S54), S56 is skipped.
  • the document conversion unit 32 stores the encoded document data obtained by converting each word (excluding specific words) of the original document data into a fixed-length code in the encoded document storage unit 30 (S58).
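The encoding flow of FIG. 6 can be illustrated with a minimal sketch. The dictionary contents, the 4-character code format beginning with `#`, and the function name `encode_document` are assumptions made for illustration only; the source does not specify them.

```python
# Hypothetical dictionary data: each defined word maps to a fixed-length code.
DICTIONARY = {"governor": "#001", "visited": "#002", "school": "#003"}


def encode_document(words, dictionary):
    """Convert words to fixed-length codes, skipping words with no code.

    Returns the coded document data (codes concatenated in order) and the
    list of skipped "specific words" (corresponds roughly to S54 to S58).
    """
    codes = []
    specific_words = []
    for word in words:
        code = dictionary.get(word)
        if code is None:
            # Code undefined: skip this word and move on to the next one.
            specific_words.append(word)
            continue
        codes.append(code)
    return "".join(codes), specific_words


encoded, skipped = encode_document(
    ["governor", "koike", "visited", "school"], DICTIONARY
)
```

In this sketch the undefined word leaves no trace in the coded document data, which is why the second embodiment later needs a separate keyword search on the original document data.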
  • FIG. 7 is also a flowchart showing the operation of the document search apparatus 12 of the second embodiment. This figure corresponds to FIG. 5 and shows the operation at the time of retrieval. S60 and S62 in the figure are the same as S20 and S22 in FIG.
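  • When a code corresponding to a word specified in the search request (a specific word) is undefined in the dictionary data (Y in S64), the code acquisition unit 36 sets a special flag provided in a predetermined area of the memory to on (S66).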
  • the search instruction unit 40 excludes the specific word from the proximity search target (S68). If the specific word is not included in the word specified in the search request, that is, if codes corresponding to all the specified words are defined in the dictionary data (N in S64), S66 and S68 are skipped.
  • the character distance deriving unit 38 derives a character distance for neighborhood search based on the word distance specified in the search request and the fixed code length (S70).
  • The search instruction unit 40 inputs to the search execution unit 42 a neighborhood search instruction that specifies, as keywords, the codes of the remaining words, excluding the specific word, among the plurality of words specified in the search request, and that further specifies the character distance derived in S70.
  • The search execution unit 42 executes a character-count-based neighborhood search process using the N-gram method, with the plurality of coded document data stored in the coded document storage unit 30 as the population.
  • the search execution unit 42 extracts the original document data corresponding to the coded document data extracted by the proximity search process from the original document storage unit 28 (S72).
  • When the special flag is on (Y in S74), the search instruction unit 40 inputs to the search execution unit 42 a narrow search instruction that uses the one or more original document data extracted in the neighborhood search process as the population and specifies the specific word as a keyword.
  • the search execution unit 42 extracts original document data including a specific word from one or more original document data extracted by the proximity search process (S76).
  • the search instruction unit 40 turns off the special flag (S78). If the special flag is off from the beginning (N in S74), S76 and S78 are skipped.
  • The search result providing unit 44 transmits information related to the coded document data extracted by the search execution unit 42 and/or information related to the original document data to the user terminal 14 as a neighborhood search result (S80).
  • the search result providing unit 44 transmits the result of the refinement search to the user terminal 14 as a neighborhood search result.
  • the result of the neighborhood search based on the word distance designated by the user is presented to the user.
  • When the document search device 12 extracts one or more original document data as a result of the neighborhood search, it executes a narrow search that uses the extracted original document data as the population, namely a keyword search with the keyword “Mr. Koike” as the condition. As a result, original document data in which “Governor of Tokyo” and “Excursion” appear within a word distance of 10 and which also contains “Mr. Koike” is extracted.
  • The document search device 12 of the second embodiment has the same effects as the document search device 12 of the first embodiment. For example, even when the user specifies a word for which no corresponding code exists, the document search device 12 of the second embodiment can present to the user search results that match the user-specified conditions as closely as possible, although some noise may occur.
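The two-stage search of FIG. 7 (a neighborhood search on the codes of the defined words, followed by a narrow search on the specific word) can be sketched as follows. This is only an illustration under stated assumptions: documents are represented as word lists, the character distance is assumed to be the word distance multiplied by the fixed code length, a brute-force scan stands in for the N-gram search engine, and all names and data are hypothetical.

```python
from itertools import product

CODE_LENGTH = 4  # hypothetical fixed code length, e.g. "#001"


def positions(text, code):
    """Return the start offsets of every occurrence of code in text."""
    found = []
    start = text.find(code)
    while start != -1:
        found.append(start)
        start = text.find(code, start + 1)
    return found


def within_char_distance(encoded, codes, char_distance):
    """True if one occurrence of each code fits within char_distance characters."""
    if not codes:
        return True
    occurrence_lists = [positions(encoded, code) for code in codes]
    if any(not occurrences for occurrences in occurrence_lists):
        return False
    return any(
        max(combo) - min(combo) <= char_distance
        for combo in product(*occurrence_lists)
    )


def search(documents, dictionary, keywords, word_distance):
    """Neighborhood search on defined words, then narrow search on specific words."""
    defined = [w for w in keywords if w in dictionary]
    specific = [w for w in keywords if w not in dictionary]
    # Assumed conversion from word distance to character distance (cf. S70).
    char_distance = word_distance * CODE_LENGTH
    keyword_codes = [dictionary[w] for w in defined]
    hits = []
    for words in documents:
        encoded = "".join(dictionary[w] for w in words if w in dictionary)
        if within_char_distance(encoded, keyword_codes, char_distance):
            hits.append(words)
    # Narrow search: keep only original documents containing every specific word.
    return [words for words in hits if all(w in words for w in specific)]


dictionary = {"governor": "#001", "excursion": "#002",
              "school": "#003", "visited": "#004"}
documents = [
    ["koike", "governor", "visited", "school", "excursion"],
    ["governor", "visited", "school", "excursion"],
    ["koike", "visited", "school"],
]
results = search(documents, dictionary, ["governor", "excursion", "koike"], 10)
```

Only the first document survives both stages here: the second passes the neighborhood search but lacks the specific word, and the third lacks one of the defined keywords.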
  • any of a plurality of special values that are not used except for the head may be set at the head of the fixed-length code defined by the dictionary data in the dictionary storage unit 26.
  • a value within a certain range may be set as a special value (delimiter).
  • For example, the range from U+9000 to U+9FFF in Unicode may be defined as the special values, and any special value in this range may be set as the first character of the code (here, 10 characters long).
  • Values in the range U+1000 to U+8FFF may be set as the second through tenth characters of the code.
  • Because the delimiter indicating the beginning does not appear in the middle of the code, the same effect as the delimiter in the embodiment is obtained. For example, matching is always determined from the beginning of a code, which prevents a match from being determined starting in the middle of a code.
  • the beginning of the code (for example, the first character) can also be used for identifying the code. That is, the code needs to be unique for each word, but since a different value can be set from the special value group within a predetermined range at the beginning of the code, the code length can be made shorter than in the embodiment. Thereby, the size of dictionary data and coded document data can be reduced.
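The head-range scheme above can be sketched as follows. The numbering of words by an integer `word_id`, the function names, and the short code length of 3 used in the example are assumptions for illustration; only the two Unicode ranges come from the description.

```python
HEAD_START, HEAD_END = 0x9000, 0x9FFF  # special values, used only at the head
BODY_START, BODY_END = 0x1000, 0x8FFF  # values for the remaining characters


def make_code(word_id, length=3):
    """Build a fixed-length code: one head character followed by body characters."""
    head_count = HEAD_END - HEAD_START + 1
    body_count = BODY_END - BODY_START + 1
    # The head character itself carries part of the identifier.
    head = chr(HEAD_START + word_id % head_count)
    rest = word_id // head_count
    body = []
    for _ in range(length - 1):
        body.append(chr(BODY_START + rest % body_count))
        rest //= body_count
    if rest:
        raise ValueError("word_id is too large for this code length")
    return head + "".join(body)


def is_head(character):
    """True if the character belongs to the special head range."""
    return HEAD_START <= ord(character) <= HEAD_END


stream = make_code(1) + make_code(5000)
boundaries = [i for i, ch in enumerate(stream) if is_head(ch)]
```

Because head characters occur only at code boundaries, any match of a complete code against the stream necessarily starts at a boundary, and because the head character varies per word, fewer body characters are needed to keep codes unique.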
  • the coded document data stored in the coded document storage unit 30 may be obtained by converting a plurality of words that are related to each other described in the original document data into a common code.
  • a common code may be assigned to a plurality of words related to each other.
  • The same code may be assigned to a plurality of words having the same base form but different conjugated forms. In English and similar languages, the same code may be assigned to the base form, past tense, past participle, and plural form of a word.
  • the same code may be assigned to a plurality of words having the same stem but different prefixes or suffixes.
  • The document conversion unit 32 may refer to the dictionary data and convert a plurality of words that are spelled differently but related to each other into the same code.
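A minimal sketch of dictionary data in which related words share a common code is shown below. The word groups, the code format, and the function name are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical groups of related word forms; each group receives one code.
RELATED_FORMS = {
    "buy": ["buy", "buys", "bought", "buying"],
    "book": ["book", "books"],
}


def build_dictionary(related_forms):
    """Assign the same fixed-length code to every form in a group."""
    dictionary = {}
    for number, forms in enumerate(related_forms.values(), start=1):
        code = f"#{number:03d}"
        for form in forms:
            dictionary[form] = code
    return dictionary


dictionary = build_dictionary(RELATED_FORMS)
```

With such a dictionary, a search for "buy" also matches documents containing "bought", since both are converted to the same code.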
  • the undefined word (called a specific word) is decomposed into a plurality of single-character words.
  • The document search device 12 of the first embodiment may further include a notification unit that transmits to the user terminal 14, and causes it to display, a message prompting the user to make the word distance used as the search condition longer than the current value (in other words, the value specified at the time of the search request).
  • the transmission timing of this message may be when the neighborhood search result is transmitted or when the specific word is detected by the code acquisition unit 36.
  • the document search device 12 may automatically increase the word distance as a search condition from a user-specified value without presenting a message to the user.
  • the value added to the user-specified value may be determined based on at least one of the number of keywords specified as the search condition, the number of specific words, the number of single-character words, and the user-specified value. Further, an appropriate added value (in other words, an added value determination algorithm) may be determined by the knowledge of the developer or an experiment using the document search device 12.
  • The document search device 12 of the second embodiment may further include a notification unit that transmits to the user terminal 14, and causes it to display, a message prompting the user to make the word distance used as the search condition shorter than the current value (in other words, the value specified at the time of the previous search request). The transmission timing of this message may be when the neighborhood search result is transmitted or when the specific word is detected by the code acquisition unit 36.
  • the document search device 12 of the second embodiment may automatically reduce the word distance as a search condition from a user-specified value without presenting a message to the user.
  • the value to be subtracted from the user-specified value may be determined based on at least one of the number of keywords specified as the search condition, the number of specific words, and the user-specified value.
  • Further, an appropriate subtraction value (in other words, a subtraction value determination algorithm) may be determined by the developer's knowledge or by experiments using the document search device 12.
  • The document search device 12 of the first and second embodiments may further include a notification unit that transmits to the user terminal 14, and causes it to display, a message asking the user to confirm whether to change an undefined word (referred to as a specific word) to another word. When the user terminal 14 replies that the specific word is not to be changed, the document search device 12 may execute the processing for the specific word described in the first and second embodiments. As a further modification, the document search device 12 may automatically convert a specific word into another word whose code is defined by the dictionary data (herein referred to as a “defined word”) without presenting a message to the user.
  • The document search device 12 may hold a table that defines the correspondence between specific words and defined words and, when a specific word is detected, may identify the defined word corresponding to it and convert the specific word into the code of that defined word.
  • the table may be a table in which specific words having similar meanings and / or similar spellings are associated with defined words.
  • The document search device 12 may execute the processing for the specific word described in the first and second embodiments.
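The table-based substitution of a specific word by a defined word can be sketched as below. The table contents, the return convention, and the name `resolve_code` are assumptions for illustration; when no substitute is found, the caller would fall back to the specific-word processing of the embodiments.

```python
def resolve_code(word, dictionary, substitutes):
    """Return (code, word_used); code is None when no code can be found.

    substitutes is a hypothetical table mapping specific words to
    defined words with a similar meaning or spelling.
    """
    if word in dictionary:
        return dictionary[word], word
    defined_word = substitutes.get(word)
    if defined_word in dictionary:
        # Use the code of the corresponding defined word.
        return dictionary[defined_word], defined_word
    # No substitute available: leave the word to the specific-word handling.
    return None, word


dictionary = {"governor": "#001"}
substitutes = {"govenor": "governor"}
substituted = resolve_code("govenor", dictionary, substitutes)
unresolved = resolve_code("koike", dictionary, substitutes)
```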
  • the document search device 12 may convert document data (hereinafter also referred to as “input document data”) input from the outside into a format more suitable for search.
  • the original document storage unit 28 of the document search device 12 may hold the converted document data (hereinafter also referred to as “search-type document data”) as original document data. That is, the original document data may include both input document data and search-type document data.
  • Input document data 1: "I bought a book from that shopper, when I was a little girl."
  • Search-type document data 1 (after converting input document data 1): "I buy a book from that shop, when I be a little girl."
  • Input document data 2: "The chef cooked a special food at a national event while we were devoted in eating."
  • Search-type document data 2 (after converting input document data 2): "The chef cook a special food at a nation event while we were devot in eat."
  • The search-type document data may be obtained by converting verbs included in the input document data (past tense, past participle, present participle, etc.) into their base forms.
  • The search-type document data may also be obtained by converting plural nouns included in the input document data into the singular, or by converting nouns into more general nouns.
  • The document search device 12 may further include a search-type document generation unit that acquires input document data input from an external device, converts the input document data into search-type document data by referring to a table in which pre-conversion words and post-conversion words are associated in advance, and stores the converted search-type document data in the original document storage unit 28.
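The table-driven conversion can be sketched with the first example above. The table entries and the function name are hypothetical, and the regular-expression word matching is a simplification chosen for this sketch rather than a method stated in the description.

```python
import re

# Hypothetical table associating pre-conversion words with post-conversion words.
CONVERSION_TABLE = {"bought": "buy", "shopper": "shop", "was": "be"}


def to_search_form(text, table):
    """Replace each word found in the table, leaving other text unchanged."""
    def replace(match):
        word = match.group(0)
        return table.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", replace, text)


converted = to_search_form(
    "I bought a book from that shopper, when I was a little girl.",
    CONVERSION_TABLE,
)
```

Punctuation and words absent from the table pass through untouched, so the converted text keeps the same word order and word count as the input.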
  • After executing a neighborhood search using keywords that include the single-character words obtained by decomposing a specific word, the document search device 12 need not execute a further neighborhood search (narrow search) on the condition that the codes of the plurality of single-character words are adjacent. In other words, S38 in FIG. 5 may be skipped. Alternatively, the neighborhood search on the condition that the codes of the plurality of single-character words are adjacent may be executed before the neighborhood search using keywords including the single-character words.
  • the document search device 12 of the second embodiment does not have to execute a narrow search using a specific word as a keyword after executing a proximity search using a keyword excluding the specific word.
  • S76 in FIG. 7 may be skipped.
  • a search using a specific word as a keyword may be executed prior to a neighborhood search using a keyword excluding the specific word. For example, after extracting original document data including a specific word, neighborhood search using a keyword excluding the specific word may be executed using coded document data corresponding to the extracted original document data as a population.
  • This invention can be applied to an apparatus for searching for documents.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
PCT/JP2017/033316 2016-11-08 2017-09-14 文書検索装置、文書検索方法およびコンピュータプログラム WO2018088027A1 (ja)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201780069191.9A CN109923538B (zh) 2016-11-08 2017-09-14 文本检索装置、文本检索方法以及计算机程序

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2016217884A JP6787755B2 (ja) 2016-11-08 2016-11-08 文書検索装置
JP2016-217884 2016-11-08

Publications (1)

Publication Number Publication Date
WO2018088027A1 (ja) 2018-05-17

Family

ID=62110263

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/033316 WO2018088027A1 (ja) 2016-11-08 2017-09-14 文書検索装置、文書検索方法およびコンピュータプログラム

Country Status (3)

Country Link
JP (1) JP6787755B2
CN (1) CN109923538B
WO (1) WO2018088027A1

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102336751B1 (ko) 2018-03-27 2021-12-07 미쯔비시 케미컬 아쿠아·솔루션즈 가부시키가이샤 헤더가 부착된 산기 장치 및 막 분리 활성 오니 장치
WO2020213776A1 (ko) * 2019-04-19 2020-10-22 한국과학기술원 토론 상황 시 객관적이고 구체적이고 정보가 풍부한 근거 문장 검색에 특화된 자질 추출기
CN113656277A (zh) * 2020-05-12 2021-11-16 阿里巴巴集团控股有限公司 一种日志存储方法及装置和智能音箱及云端服务器

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5787386A (en) * 1992-02-11 1998-07-28 Xerox Corporation Compact encoding of multi-lingual translation dictionaries
US20020165707A1 (en) * 2001-02-26 2002-11-07 Call Charles G. Methods and apparatus for storing and processing natural language text data as a sequence of fixed length integers
JP2005242416A (ja) * 2004-02-24 2005-09-08 Shogakukan Inc 自然言語文の検索方法および検索装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6883001B2 (en) * 2000-05-26 2005-04-19 Fujitsu Limited Document information search apparatus and method and recording medium storing document information search program therein
JP2010287052A (ja) * 2009-06-11 2010-12-24 Fujitsu Ltd 検索システムおよび記憶媒体
JP5737079B2 (ja) * 2011-08-31 2015-06-17 カシオ計算機株式会社 テキスト検索装置、テキスト検索プログラム、及びテキスト検索方法

Also Published As

Publication number Publication date
JP6787755B2 (ja) 2020-11-18
JP2018077611A (ja) 2018-05-17
CN109923538A (zh) 2019-06-21
CN109923538B (zh) 2023-09-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17869004

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17869004

Country of ref document: EP

Kind code of ref document: A1