WO2020037794A1 - Index building method for english geographical name, and query method and apparatus therefor - Google Patents

Index building method for english geographical name, and query method and apparatus therefor

Info

Publication number
WO2020037794A1
Authority
WO
WIPO (PCT)
Prior art keywords
english
place name
index
query
candidate
Prior art date
Application number
PCT/CN2018/109938
Other languages
French (fr)
Chinese (zh)
Inventor
张雪英
叶鹏
杜咪
Original Assignee
南京师范大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京师范大学
Publication of WO2020037794A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/907 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/909 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location

Definitions

  • The invention relates to the field of natural language processing, and in particular to a method for querying English place name dictionaries over large-scale place name data.
  • Place name dictionary query is the basic operation behind applications such as place name spell checking, fuzzy matching and optical recognition, and provides them with place name knowledge support. As global integration accelerates, place name information is exchanged internationally at increasing speed and used with increasing frequency. As one of the most widely used languages in the world, English often serves as a standard for translating, storing and managing place names across languages. At the same time, the explosive growth of data and the rapid development of information storage technology have made large-scale place name data sets increasingly common. How to query an English place name dictionary efficiently in a large-scale data environment has therefore become an important technical challenge for improving many place name services and applications.
  • The inverted file, a simple and efficient way of indexing document data, is a basic technology of modern search engine retrieval systems and has gradually been introduced into dictionary lookup mechanisms.
  • The word-level index is the usual way of organizing inverted files to support phrase or proximity queries.
  • The N-gram index is the most commonly used word-level index structure. Although the N-gram structure improves query recall to a certain extent, the large number of tokens it generates usually increases the space occupied by the index and slows both index construction and query processing.
  • Index entries formed from morphemes require similarity calculation in fuzzy queries, and every index entry must be compared with the query conditions for similarity.
  • This query mode greatly increases the complexity of the operating mechanism and is difficult to adapt to the application requirements of large-scale data environments.
  • The technical problems to be solved by the present invention include how to efficiently return completely accurate or closest query results when the place name entered in an English place name query is inaccurate or incomplete, and related technical problems arising from it.
  • The invention provides a method for establishing an English place name index, which is applied to user equipment.
  • The method includes: S1) counting a plurality of feature values of all English place name phrases stored in the English place name dictionary text, the feature values including the total number of letters, the number of letter radicals, the total number of words, and the first-letter codes of the words; S2) generating a corresponding multidimensional feature statistical vector from the feature values of each English place name phrase; S3) establishing an inverted index file using, as index entries, the multidimensional feature statistical vector of each English place name phrase together with its position mapping information in the inverted table, wherein each index entry corresponds to one inverted chain.
  • The feature values of all place name phrases stored in the English place name dictionary are counted in turn: the total number of letters, the number of letter radicals, the total number of words, and the first-letter codes of the words.
  • The total number of letters is the sum of all letters contained in the place name phrase;
  • Letter radicals follow the pictographic idea of Chinese characters: each English letter is taken to be composed of some of the six radicals "|", "—", "/", "\", "(" and ")".
  • The radical composition of each letter is shown in Table 1 below. Clearly, the more identical characters two strings share, the more similar they are considered.
  • The first-letter code converts the first letter of each word in a place name phrase into numeric form.
  • The conversion rule maps A to Z, in order, to the codes "01" to "26": A is coded as "01", B as "02", and so on. During conversion, the first letter is always converted to uppercase.
  • radical "|" is represented by number 1
  • radical "—" is represented by number 2
  • radical "/" is represented by number 3
  • radical "\" is represented by number 4
  • radical "(" is represented by number 5
  • radical ")" is represented by number 6.
  • In the index dictionary, each index entry records, in order, the total number of letters, the number of letter radicals, the total number of words, the first-letter codes of the words, and the inverted table position information.
  • $f_{cn}$ represents the total number of letters and is recorded as a 1-dimensional vector.
  • $f_{ar}$ represents the letter radical counts; it holds the counts of the 6 radicals and is recorded as a 6-dimensional vector.
  • $f_{wn}$ represents the total number of words and is recorded as a 1-dimensional vector.
  • $f_{iw}$ represents the first-letter codes.
  • The first-letter codes of the first 4 words in the place name phrase are recorded; when there are fewer than 4 words, the missing positions are padded with the code "00". This is recorded as a 4-dimensional vector.
  • These vectors are combined according to formulas (1), (2) and (3) to form the 12-dimensional vector $d_i$. As an index entry, $d_i$ fully characterizes the textual features of the English place name string and serves as the entry point for English place name queries.
  • $f_{ar} = [f_{ar1}, f_{ar2}, \ldots, f_{ar6}]$ (2)
  • Each index entry appearing in the dictionary corresponds to one inverted chain.
  • The inverted chain uses the data structure $(tf, \langle p_1, p_2, \ldots, p_f \rangle)$ to record the hit information of the index entry in the place name dictionary.
  • $tf$ represents the number of occurrences of the index entry in the place name dictionary
  • $p_i$ represents the position offset of each occurrence in the place name dictionary. All hits are arranged in order to form the corresponding inverted chain.
  • The invention also provides an English place name query method, which is applied to user equipment.
  • The English place name query method includes: obtaining a search keyword entered by the user on the user equipment; searching the English place name database, according to a pre-built index file, for a candidate place name set related to the search keyword, wherein the index file stored on the user equipment is constructed according to the English place name index establishment method described in the first aspect; and returning the candidate place name set to the user equipment for display.
  • The selection process of the candidate place name set is:
  • $f_{cn}$ represents the total number of letters
  • $f_{ar}$ represents the number of letter radicals
  • $f_{wn}$ represents the total number of words
  • $f_{iw}$ represents the first-letter codes.
  • $k_{cn}$ represents the threshold for the total number of letters
  • $k_{ar}$ represents the threshold for the number of letter radicals
  • $k_{wn}$ represents the threshold for the total number of words
  • $k_{iw}$ represents the threshold for the first-letter code dimension.
  • The index information in $d_i$ is parsed in reverse, and the place name data at the relevant storage locations in the place name dictionary is queried according to the corresponding position offsets $\langle p_1, p_2, \ldots, p_f \rangle$ in the inverted chain.
  • The results of all place name queries are merged to form the candidate place name set.
  • The above English place name query method may further include the following steps: after the candidate place name set is obtained, calculating a similarity value between each English place name in the set and the search keyword; and sorting the English place names in the candidate place name set in descending order of similarity value and returning the sorted result to the user equipment for display.
  • sim(P, W) is the order similarity value between P and W, and len(N), len(P) and len(W) are the string lengths of N, P and W, respectively.
  • The candidate place names are sorted by order similarity, and the sorted result is returned to the user as the final query result.
  • The present invention uses multidimensional statistical text features of place names, such as the numbers of words and letters, to perform place name queries along the main line of "multidimensional feature statistics - inverted index generation - candidate place name query - similarity ranking".
  • In index generation, for each place name record, features such as the total number of letters, the number of letter radicals, the total number of words and the first-letter codes of the words are extracted, and the vector composed of these multidimensional features is used as the index to construct the corresponding inverted index structure.
  • In candidate place name search and order similarity ranking, the query request is normalized and its multidimensional features are extracted.
  • The candidate place name set is then queried in the inverted index, and the candidates are ranked by similarity from high to low.
  • The ranked result is returned to the user. According to the experiments reported, the proposed English place name query method based on a feature-statistics inverted index not only maintains high operating efficiency in a large-scale data environment, but can also accurately find the target place name when the place name expression is inaccurate, giving users a better experience.
  • FIG. 1 is a flowchart of a method for establishing an English place name index according to the present invention.
  • FIG. 2 is a flowchart of an English place name query method according to the present invention.
  • FIG. 3 is a flowchart of a method for querying English place names in a preferred embodiment of the present invention.
  • FIG. 4 is a schematic diagram of an English place name query device according to the present invention.
  • FIG. 5 is a schematic diagram of an English place name query device in a preferred embodiment of the present invention.
  • FIG. 6 is a graphic flowchart of a query method for an English place name dictionary according to the present invention.
  • FIG. 7 is a schematic diagram of an inverted index structure in the method for establishing an English place name index according to the present invention.
  • Letter radicals refer to the six characters "|", "—", "/", "\", "(" and ")", some of which make up each English letter.
  • The number of letter radicals refers to the count of each of these radicals over all letters in an English place name (the first letter of each word in the place name must be capitalized).
  • The first-letter code converts the first letter of each word in a place name phrase into numeric form.
  • The conversion rule maps A to Z, in order, to the codes "01" to "26": A is coded as "01", B as "02", and so on.
  • This embodiment provides a method for establishing an English place name index, which is applied to user equipment.
  • The method includes the following steps:
  • The feature values include the total number of letters, the number of letter radicals, the total number of words, and the first-letter codes of the words;
  • An inverted index file is established using, as index entries, the multidimensional feature statistical vector of each English place name phrase together with its position mapping information in the inverted table, where each index entry corresponds to one inverted chain.
  • The multidimensional feature statistical vector is $d_i = [f_{cn}, f_{ar}, f_{wn}, f_{iw}]$, where $d_i$ represents the multidimensional feature statistical vector of an English place name, $f_{cn}$ represents the total number of letters, $f_{ar}$ represents the number of letter radicals, $f_{wn}$ represents the total number of words, and $f_{iw}$ represents the first-letter codes; $f_{ar}$ contains the counts of the 6 radicals, and $f_{iw}$ contains the first-letter codes of the first 4 words in the English place name phrase.
  • The method for establishing an English place name index may further include: when querying an English place name according to the index entries, comparing the search keyword with each index entry, and taking the index entry as a candidate for the query when it meets the following conditions:
  • $qf_{cn}$ indicates the total number of letters in the search keyword
  • $qf_{ar}$ indicates the number of letter radicals in the search keyword
  • $qf_{wn}$ indicates the total number of words in the search keyword
  • $qf_{iw}$ indicates the first-letter codes in the search keyword
  • $k_{cn}$ represents the threshold for the total number of letters
  • $k_{ar}$ represents the threshold for the number of letter radicals
  • $k_{wn}$ represents the threshold for the total number of words
  • $k_{iw}$ represents the threshold for the first-letter code dimension.
  • This embodiment provides an English place name query method, which is applied to user equipment.
  • The English place name query method includes the following steps:
  • The English place name query method may further include:
  • A method for calculating the similarity value between each English place name in the candidate place name set and the search keyword is:
  • P is the character string of the search keyword
  • W is the character string of the English place name
  • sim(P, W) is the order similarity value between P and W
  • len(N), len(P) and len(W) are the string lengths of N, P and W, respectively, and N represents the sequence of characters that appear in the same order in both P and W.
  • This embodiment provides an English place name query device 300, which is applied to user equipment and specifically includes a receiving module 310, a search module 320, and a display module 330.
  • The receiving module 310 is configured to obtain a search keyword entered by the user on the user equipment; the search module 320 is configured to search the English place name database, according to a pre-established index file, for a candidate place name set related to the search keyword, wherein the index file stored on the user equipment is constructed according to the English place name index establishing method of claim 1 or 2; the display module 330 is configured to display the candidate place name set returned to the user equipment.
  • The English place name query device further includes a similarity calculation module 410 and a ranking module 420.
  • The similarity calculation module 410 is configured to calculate, after the candidate place name set is obtained, the similarity value between each English place name in the set and the search keyword; the ranking module 420 is configured to sort the English place names in the candidate place name set by similarity value and return the result to the user equipment; the display module displays the sorted result.
  • The formula used by the similarity calculation module to calculate the similarity value between each English place name in the candidate place name set and the search keyword includes:
  • P is the character string of the search keyword
  • W is the character string of the English place name
  • sim(P, W) is the order similarity value between P and W
  • len(N), len(P) and len(W) are the string lengths of N, P and W, respectively, and N represents the sequence of characters that appear in the same order in both P and W.
  • Step 11: The feature values of all place name phrases stored in the English place name dictionary are counted in turn: the total number of letters, the number of letter radicals, the total number of words, and the first-letter codes of the words. Taking the place name "Aalders Lang Brook" as an example, the total number of letters is 16. As for the letter radical counts, the numbers of the six radicals "|", "—", "/", "\", "(" and ")" are 9, 8, 3, 2, 12 and 9, respectively.
  • Step 12: Build the index dictionary file.
  • Each index entry records, in order, the total number of letters, the number of letter radicals, the total number of words, the first-letter codes of the words, and the inverted table position information. Taking the place name "Aalders Lang Brook" as an example: the total number of letters is 16, the letter radical counts are 9, 8, 3, 2, 12, 9, the total number of words is 3, and the first-letter codes are 1, 12, 2, 0, so the multidimensional feature vector is expressed as [16, [9,8,3,2,12,9], 3, [1,12,2,0]].
  • The index entry structure in the index dictionary file is ([16, [9,8,3,2,12,9], 3, [1,12,2,0]], <1001>).
  • Step 13: Build the inverted file.
  • Each index entry that appears in the dictionary corresponds to one inverted chain.
  • The inverted chain uses the data structure $(tf, \langle p_1, p_2, \ldots, p_f \rangle)$ to record the hit information of the index entry in the place name dictionary.
  • The corresponding inverted table position mapping information is <1001>; that is, position 1001 in the inverted chain file stores the storage positions of all phrases in the English place name dictionary that have this multidimensional feature vector.
  • The record at position 1001 of the inverted chain file is (<5>, <7>, ..., <125>, ...), which indicates that the related place name phrases are stored at positions 5, 7, ..., 125 and so on in the English place name dictionary.
  • Step 21: The submitted query place name is first normalized, that is, each word in the place name phrase is converted to initial-capital form. Taking the query of the place name "Alders Langbrook" as an example, it needs to be converted to "Alders Lang Lang".
  • $Q = [qf_{cn}, qf_{ar}, qf_{wn}, qf_{iw}]$.
  • Step 23: Compare Q with the index entries in the index dictionary.
  • An index entry $d_i$ that meets the conditions is taken as a candidate $qd_i$.
  • $f_{cn}$ represents the total number of letters
  • $f_{ar}$ represents the number of letter radicals
  • $f_{wn}$ represents the total number of words
  • $f_{iw}$ represents the first-letter codes.
  • $k_{cn}$ represents the threshold for the total number of letters
  • $k_{ar}$ represents the threshold for the number of letter radicals
  • $k_{wn}$ represents the threshold for the total number of words
  • $k_{iw}$ represents the threshold for the first-letter code dimension.
  • Step 24: Parse the index information in the candidate $qd_i$ in reverse, and query the place name data at the relevant storage locations in the place name dictionary according to the corresponding position offsets $\langle p_1, p_2, \ldots, p_f \rangle$ in the inverted chain. The results of all place name queries are merged to form the candidate place name set.
  • The index entry ([16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2, 0]], <1001>) is the candidate $qd_i$; all inverted chain mapping position information in $qd_i$ is parsed, and the related record <1001> is found in the inverted chain.
  • Step 31: Determine the degree of similarity between place names by counting the proportion of identical characters shared by the two strings.
  • $P = p_1 p_2 \ldots p_n$
  • $W = w_1 w_2 \ldots w_m$
  • N is the sequence of characters that appear in the same order in both P and W.
  • The determination of N is based on two principles: (1) the principle of identical partial order.
  • The order in P is $ls_1 ls_2 ls_3$
  • The order in W is $ls_2 ls_1 ls_3$.
  • $N = ls_1 ls_3$.
  • The formula for the place name similarity of P and W is shown in formula (5).
  • sim(P, W) is the order similarity value between P and W, and len(N), len(P) and len(W) are the string lengths of N, P and W, respectively. That is, the similarity between "Aalders Lang Brook" and "Lang Aalders Brook" is 12/16 = 0.75.
  • Step 32: Sort by similarity. Based on the similarity results of step 31, the place names $C_q$ in the candidate place name set C are sorted by similarity from high to low, and the top n $C_q$ are used as the query result.
  • This embodiment uses 115,000 English place name records as an example to construct an English place name dictionary, and extracts 5,409 place names as standard place names. A test set is constructed by artificially adding errors to the standard place names.
  • The error types cover several kinds of inaccurate description (such as extra letters, missing letters, wrong letters, and transposed letters);
  • The accuracy of the comparison is divided into 5 levels (as shown in the table). The definition of accuracy is given in formula (6):
  • A represents the number of characters in the query place name P that are correct compared with the target place name C
  • N represents the number of characters in the query place name P
  • accu(P, C) represents the accuracy of P.
  • The content in brackets is the target place name corresponding to the test place name, given in standard place name form.
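
The order similarity and ranking of Steps 31 and 32 can be sketched in Python as follows. This is an illustration under stated assumptions, not the formula of the patent: N is approximated by a case-sensitive longest common subsequence over letters only (spaces ignored), and the similarity is taken as len(N) divided by the longer of len(P) and len(W). Formula (5) is not reproduced in the text above, so both choices are guesses that merely agree with the 12/16 example. The names `lcs_length`, `letters`, `order_similarity` and `rank` are hypothetical.

```python
def lcs_length(p, w):
    """Length of the longest common subsequence of p and w (dynamic programming)."""
    prev = [0] * (len(w) + 1)
    for a in p:
        cur = [0]
        for j, b in enumerate(w, 1):
            # Extend the match on equal characters, otherwise keep the best so far.
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def letters(name):
    """Keep only the letters of a place name (assumption: spaces do not count)."""
    return "".join(ch for ch in name if ch.isalpha())


def order_similarity(query, candidate):
    """Assumed form of sim(P, W): len(N) / max(len(P), len(W))."""
    p, w = letters(query), letters(candidate)
    longest = max(len(p), len(w))
    return lcs_length(p, w) / longest if longest else 0.0


def rank(query, candidates, top_n=None):
    """Sort candidates from most to least similar and keep the top n (all if None)."""
    ranked = sorted(candidates,
                    key=lambda c: order_similarity(query, c),
                    reverse=True)
    return ranked[:top_n]


print(order_similarity("Aalders Lang Brook", "Lang Aalders Brook"))  # 0.75
```

Under these assumptions the shared sequence for "Aalders Lang Brook" and "Lang Aalders Brook" has 12 letters out of 16, which reproduces the value 0.75 given in Step 31.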

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An English gazetteer query method, belonging to the field of natural language processing, comprising: performing geographical name query by utilizing text features such as the total number of letters, the number of letter radicals, the total number of words, and the initial letter coding in a geographical name according to the mainline of "multidimensional feature statistics-inverted index generation-candidate geographical name query-similarity ranking", to obtain a feature statistics-inverted index-based English gazetteer query method. The method maintains high operation efficiency under the large-scale data environment, and can accurately query the target geographical name in the case of inaccurate query geographical name expression, so that the user can obtain better user experience.

Description

Index establishment method for English place names and query method and device thereof
Technical field
The invention relates to the field of natural language processing, and in particular to a method for querying English place name dictionaries over large-scale place name data.
Background
Place name dictionary query is the basic operation behind applications such as place name spell checking, fuzzy matching and optical recognition, and provides them with place name knowledge support. As global integration accelerates, place name information is exchanged internationally at increasing speed and used with increasing frequency. As one of the most widely used languages in the world, English often serves as a standard for translating, storing and managing place names across languages. At the same time, the explosive growth of data and the rapid development of information storage technology have made large-scale place name data sets increasingly common. How to query an English place name dictionary efficiently in a large-scale data environment has therefore become an important technical challenge for improving many place name services and applications.
Conventional dictionary query methods generally obtain query records by sequential traversal or binary search, but their operating efficiency is linearly related to the size of the data, so they can hardly meet practical needs on massive data. The inverted file, a simple and efficient way of indexing document data, is a basic technology of modern search engine retrieval systems and has gradually been introduced into dictionary lookup mechanisms. The word-level index is the usual way of organizing inverted files to support phrase or proximity queries, and the N-gram index is the most commonly used word-level index structure. Although the N-gram structure improves query recall to a certain extent, the large number of tokens it generates usually increases the space occupied by the index and slows both index construction and query processing. In addition, index entries formed from morphemes require similarity calculation in fuzzy queries, and every index entry must be compared with the query conditions for similarity. This query mode greatly increases the complexity of the operating mechanism and is difficult to adapt to the application requirements of large-scale data environments.
Therefore, to meet practical application requirements in different scenarios, how to efficiently return completely accurate or closest query results when the place name entered in an English place name query is inaccurate or incomplete is a problem that those skilled in the art currently need to study and solve.
Summary of the Invention
Technical problem
The technical problems to be solved by the present invention include how to efficiently return completely accurate or closest query results when the place name entered in an English place name query is inaccurate or incomplete, and related technical problems arising from it.
Summary of the Invention
Summary of technical contributions: Mining the text features contained in English place names and combining them with the dictionary query mechanism is the key to improving query performance. The present invention uses text features of place names, namely the total number of letters, the number of letter radicals, the total number of words and the first-letter codes of the words, to perform place name queries along the main line of "multidimensional feature statistics - inverted index generation - candidate place name query - similarity ranking", and proposes an English place name dictionary query method based on a feature-statistics inverted index.
Technical solutions
First aspect
The invention provides a method for establishing an English place name index, which is applied to user equipment. The method includes: S1) counting a plurality of feature values of all English place name phrases stored in the English place name dictionary text, the feature values including the total number of letters, the number of letter radicals, the total number of words, and the first-letter codes of the words; S2) generating a corresponding multidimensional feature statistical vector from the feature values of each English place name phrase; S3) establishing an inverted index file using, as index entries, the multidimensional feature statistical vector of each English place name phrase together with its position mapping information in the inverted table, wherein each index entry corresponds to one inverted chain.
The process and principle of the above English place name index establishment method are described in detail below.
First, regarding the feature statistics, the feature values of all place name phrases stored in the English place name dictionary are counted in turn: the total number of letters, the number of letter radicals, the total number of words, and the first-letter codes of the words. Among them: (1) the total number of letters is the sum of all letters contained in the place name phrase; (2) letter radicals follow the pictographic idea of Chinese characters: each English letter is taken to be composed of some of the six radicals "|", "—", "/", "\", "(" and ")", and the radical composition of each letter is shown in Table 1 below. Clearly, the more identical characters two strings share, the more similar they are considered. However, English has many letters, and recording the frequency of every letter in the index entry would take too much storage space and hinder comparison between strings. Expressing each letter with a fixed set of radicals simplifies comparison at query time while still implicitly recording letter frequency features; (3) the total number of words is the number of words contained in the place name phrase; (4) the first-letter code converts the first letter of each word in the place name phrase into numeric form. The conversion rule maps A to Z, in order, to the codes "01" to "26": A is coded as "01", B as "02", and so on. During conversion, the first letter is always converted to uppercase.
Figure PCTCN2018109938-appb-000001
Figure PCTCN2018109938-appb-000002
Table 1
In Table 1, the radical "|" is denoted by number 1, "—" by number 2, "/" by number 3, "\" by number 4, "(" by number 5, and ")" by number 6.
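The three features that do not depend on Table 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, words are assumed to be separated by whitespace, and an initial outside A to Z is rejected, a case the text does not discuss.

```python
def letter_total(name: str) -> int:
    """Total number of letters in the place-name phrase (spaces are ignored)."""
    return sum(1 for ch in name if ch.isalpha())


def word_total(name: str) -> int:
    """Total number of whitespace-separated words in the phrase."""
    return len(name.split())


def initial_code(word: str) -> str:
    """Map the upper-cased first letter of a word to "01".."26" (A..Z)."""
    first = word[0].upper()
    # Assumption: only the 26 basic Latin letters are valid initials.
    if len(first) != 1 or not ("A" <= first <= "Z"):
        raise ValueError(f"unsupported initial letter in {word!r}")
    return f"{ord(first) - ord('A') + 1:02d}"


name = "Aalders Lang Brook"
print(letter_total(name))                         # 16
print(word_total(name))                           # 3
print([initial_code(w) for w in name.split()])    # ['01', '12', '02']
```

The radical counts are left out here because they depend on the letter-to-radical mapping of Table 1.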
Second, regarding the composition of index entries: in the index dictionary, each index entry records, in order, the total number of letters, the letter radical counts, the total number of words, the word initial-letter codes, and the position in the inverted table. Here f_cn denotes the total number of letters, stored as a 1-dimensional vector. f_ar denotes the letter radical counts, one count for each of the 6 radicals, stored as a 6-dimensional vector. f_wn denotes the total number of words, stored as a 1-dimensional vector. f_iw denotes the initial-letter codes: this method records the codes of the first 4 words of the place-name phrase, padding with the code "00" when there are fewer than 4 words, stored as a 4-dimensional vector. These vectors are combined according to formulas (1), (2) and (3) into a 12-dimensional vector d_i. As the index entry, d_i characterizes the textual properties of the English place name string and serves as the entry point for English place name queries.
d_i = [f_cn, f_ar, f_wn, f_iw]                        (1)
f_ar = [f_ar1, f_ar2, …, f_ar6]                       (2)
f_iw = [f_iw1, f_iw2, …, f_iw4]                       (3)
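Formulas (1) to (3) amount to concatenating four groups of numbers into one 12-dimensional vector. The sketch below shows one way to hold and flatten such a vector. The class and function names are hypothetical, and the sample values are those the document gives for "Aalders Lang Brook".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureVector:
    f_cn: int                # total number of letters (1 dimension)
    f_ar: tuple[int, ...]    # counts of the 6 radicals (6 dimensions)
    f_wn: int                # total number of words (1 dimension)
    f_iw: tuple[int, ...]    # initial codes of the first 4 words (4 dimensions)

    def flat(self) -> tuple[int, ...]:
        """Concatenate the groups as in formula (1)."""
        return (self.f_cn, *self.f_ar, self.f_wn, *self.f_iw)


def make_vector(f_cn, f_ar, f_wn, initials, slots=4):
    """Build d_i, padding missing initial codes with 0 (the code "00")."""
    if len(f_ar) != 6:
        raise ValueError("f_ar must hold exactly 6 radical counts")
    kept = tuple(initials[:slots])
    padded = kept + (0,) * (slots - len(kept))
    return FeatureVector(f_cn, tuple(f_ar), f_wn, padded)


d = make_vector(16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2])
print(d.flat())   # (16, 9, 8, 3, 2, 12, 9, 3, 1, 12, 2, 0)
```

A frozen dataclass is hashable, so such a vector could be used directly as a dictionary key when building the inverted index.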
Third, regarding construction of the inverted chain file: each index entry in the dictionary corresponds to one inverted chain. The inverted chain uses a hit-record data structure (tf, <p_1, p_2, …, p_f>) to record where the index entry occurs in the place name dictionary, where tf is the number of occurrences of the index entry in the place name dictionary and p_i is the position offset of each occurrence. All hit records, arranged in order, form the corresponding inverted chain.
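The hit record (tf, <p_1, …, p_f>) can be sketched with an in-memory dictionary. This is only an illustration under the assumption that feature vectors are hashable tuples and that a position offset is simply the entry's index in the dictionary. The on-disk layout of the inverted chain file is not specified here.

```python
from collections import defaultdict


def build_inverted_chains(vectors):
    """Map each distinct feature vector to (tf, [p_1, ..., p_f]).

    vectors[i] is the feature vector of the dictionary entry at offset i.
    """
    positions = defaultdict(list)
    for offset, vec in enumerate(vectors):
        positions[vec].append(offset)   # offsets arrive in increasing order
    return {vec: (len(ps), ps) for vec, ps in positions.items()}


# Toy vectors standing in for the 12-dimensional ones.
chains = build_inverted_chains([(3, 1), (5, 2), (3, 1)])
print(chains[(3, 1)])   # (2, [0, 2])
```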
Second aspect
The invention further provides an English place name query method, applied to user equipment, comprising: obtaining a search keyword entered by a user on the user equipment; searching, according to an index file built in advance in an English place name database, for a set of candidate place names related to the search keyword, wherein the index file stored on the user equipment is built according to the English place name index building method of the first aspect; and returning the set of candidate place names to the user equipment for display.
The process and principles of the above English place name query method are described in detail below.
The set of candidate place names is selected as follows:
First, the submitted query place name is normalized, that is, each word in the place-name phrase is converted to initial-capital form.
Second, following the feature statistics rules used when building the index, the feature values of the query place name are counted and organized into a vector Q = [qf_cn, qf_ar, qf_wn, qf_iw].
Third, Q is compared with the index entries in the index dictionary; an index entry d_i is a candidate when formula (4) is satisfied.
Figure PCTCN2018109938-appb-000003
In the formula, f_cn denotes the total number of letters, f_ar the letter radical counts, f_wn the total number of words, and f_iw the initial-letter codes. k_cn denotes the threshold for the total-letters dimension, k_ar the threshold for the radical-count dimension, k_wn the threshold for the total-words dimension, and k_iw the threshold for the initial-letter-code dimension.
Fourth, the index information in d_i is resolved in reverse: according to the corresponding position offsets <p_1, p_2, …, p_f> in the inverted chain, the place name data at the relevant storage positions in the place name dictionary are retrieved. All retrieved place name data are merged to form the set of candidate place names.
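Formula (4) survives here only as a figure reference, so its exact form cannot be reproduced. The sketch below therefore rests on stated assumptions: each feature group of Q is taken to differ from the index entry by no more than its threshold, using an absolute difference for the scalar groups, a summed absolute difference for the radical counts, and a count of mismatching codes for the initials. The function name and threshold values are hypothetical. The two vectors are the ones the document gives for "Aalders Lang Brook" and the query "Alders langbrook".

```python
def is_candidate(q, d, k_cn, k_ar, k_wn, k_iw):
    """Return True if index entry d passes the (assumed) threshold test for query q.

    q and d are (f_cn, f_ar, f_wn, f_iw) with f_ar and f_iw as sequences.
    """
    q_cn, q_ar, q_wn, q_iw = q
    d_cn, d_ar, d_wn, d_iw = d
    return (
        abs(q_cn - d_cn) <= k_cn
        # Assumption: radical counts are compared by summed absolute difference.
        and sum(abs(a - b) for a, b in zip(q_ar, d_ar)) <= k_ar
        and abs(q_wn - d_wn) <= k_wn
        # Assumption: initial codes are compared by number of mismatches.
        and sum(a != b for a, b in zip(q_iw, d_iw)) <= k_iw
    )


entry = (16, (9, 8, 3, 2, 12, 9), 3, (1, 12, 2, 0))   # "Aalders Lang Brook"
query = (15, (9, 8, 3, 2, 10, 9), 3, (1, 12, 2, 0))   # "Alders Lang Brook"

# Hypothetical thresholds, chosen only for illustration.
print(is_candidate(query, entry, k_cn=2, k_ar=3, k_wn=0, k_iw=0))   # True
print(is_candidate(query, entry, k_cn=0, k_ar=3, k_wn=0, k_iw=0))   # False
```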
In some preferred solutions, the above English place name query method may further include the following steps: after the set of candidate place names is obtained, calculating the similarity value between each English place name in the candidate set and the search keyword; and sorting the English place names in the candidate set in descending order of similarity value and returning the sorted result to the user equipment for display.
Order-similarity ranking of the candidate place name set proceeds as follows:
First, order similarity calculation. For every place-name phrase in the candidate set, its order similarity to the query place name is computed. Suppose there are two place name strings P = p_1 p_2 … p_n and W = w_1 w_2 … w_m, and let N denote the characters that appear in the same order in both P and W. Whether characters are in the same order is judged by two principles. (1) Local order principle: N is composed of local similar items ls_i, and there may be several ls_i between P and W. If P contains a substring q_i = p_j p_{j+1} … p_k that is identical to a substring w_s w_{s+1} … w_t of W, the substring satisfies the local order principle and is taken as a local similar item ls_i. (2) Global order principle: the ls_i that occur in the same order in P and W make up N. The place name similarity of P and W is computed by formula (5).
Figure PCTCN2018109938-appb-000004
In the formula, sim(P, W) is the order similarity value between P and W, and len(N), len(P) and len(W) are the string lengths of N, P and W, respectively.
Second, order similarity ranking. The candidate place names are sorted by order similarity from high to low, and the sorted list is returned to the user as the final query result.
Technical effect
The invention draws on multi-dimensional statistical text features of place names, such as the number of words and the number of letters, and performs place name queries along the pipeline "multi-dimensional feature statistics, inverted index generation, candidate place name query, similarity ranking". During index generation, the total number of letters, the letter radical counts, the total number of words and the word initial-letter codes are extracted from each place name record, and the vector formed by these features is used as the index entry to build the corresponding inverted index structure. During candidate lookup and order-similarity ranking, the query request is normalized and its multi-dimensional features are extracted; the resulting feature vector is used to look up the set of candidate place names in the inverted index, and the candidates are returned to the user sorted by similarity from high to low. Experiments show that the proposed English place name dictionary query method based on a feature-statistics inverted index not only maintains high running efficiency on large-scale data, but can also find the target place name fairly accurately when the query place name is expressed inaccurately, giving users a better experience.
Brief description of the drawings
FIG. 1 is a flowchart of an English place name index building method according to the invention.
FIG. 2 is a flowchart of an English place name query method according to the invention.
FIG. 3 is a flowchart of the English place name query method according to the invention in a preferred embodiment.
FIG. 4 is a schematic diagram of an English place name query apparatus according to the invention.
FIG. 5 is a schematic diagram of the English place name query apparatus according to the invention in a preferred embodiment.
FIG. 6 is a graphical flowchart of an English place name dictionary query method according to the invention.
FIG. 7 is a schematic diagram of the inverted index structure in the English place name index building method according to the invention.
Detailed description
The following specific embodiments illustrate how the invention may be implemented. Those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification.
Explanation of technical terms
Letter radicals: the six characters "|", "—", "/", "\", "(" and ")" (the radicals) are used to describe the composition of uppercase and lowercase letters, that is, any uppercase or lowercase letter can be composed of some of these six characters. If the numbers 1 to 6 are used to denote "|", "—", "/", "\", "(" and ")" in turn, the radicals of any letter can be written as a string of digits. For example, "L" can be composed of "|" and "—", so the radical digit string of "L" is "12".
Letter radical counts: after every letter of an English place name has been replaced by the numbers of its radicals (the first letter of each word in the place name must be uppercase), the number of occurrences of each radical is counted.
Letter code: the first letter of each word in a place-name phrase is converted into a numeric code, mapping A through Z to "01" through "26" in order, so that A is coded as "01", B as "02", and so on.
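The radical counting defined above can be sketched as a table lookup followed by a tally. Table 1 is available here only as a figure reference, so the mapping below contains just the one entry the text states explicitly ("L" is "12"). A real table would need an entry for every uppercase and lowercase letter. The names `PARTIAL_TABLE` and `radical_counts` are hypothetical.

```python
# Only "L" -> "12" is stated in the text; all other letters are deliberately absent.
PARTIAL_TABLE = {"L": "12"}


def radical_counts(name, table):
    """Count occurrences of radicals 1..6 over all letters of the name.

    table maps a letter to its radical digit string, e.g. "L" -> "12".
    A letter missing from the table raises KeyError, so gaps are not hidden.
    """
    counts = [0] * 6
    for ch in name:
        if not ch.isalpha():
            continue                       # skip spaces and punctuation
        for digit in table[ch]:
            counts[int(digit) - 1] += 1    # radical k is stored at index k - 1
    return counts


print(radical_counts("L L", PARTIAL_TABLE))   # [2, 2, 0, 0, 0, 0]
```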
Embodiment 1
Referring to FIG. 1, this embodiment provides an English place name index building method, applied to user equipment, comprising the following steps:
S11: counting a plurality of feature values of all English place-name phrases stored in the English place name dictionary text, the feature values including the total number of letters, the letter radical counts, the total number of words and the word initial-letter codes;
S12: generating a corresponding multi-dimensional feature statistics vector from the feature values of each English place-name phrase;
S13: building an inverted index file using, as index entries, the multi-dimensional feature statistics vector of each English place-name phrase together with its position mapping information in the inverted table, wherein each index entry corresponds to one inverted chain.
Specifically, the multi-dimensional feature statistics vector is d_i = [f_cn, f_ar, f_wn, f_iw], where d_i denotes the multi-dimensional feature statistics vector of an English place name, f_cn the total number of letters, f_ar the letter radical counts, f_wn the total number of words, and f_iw the initial-letter codes; f_ar holds the counts of the 6 radicals, and f_iw holds the initial-letter codes of the first 4 words of the English place-name phrase.
Specifically, the English place name index building method may further include: when querying English place names according to the index entries, comparing the search keyword with the index entries, and taking an index entry as a query candidate when it satisfies the following condition:
Figure PCTCN2018109938-appb-000005
where qf_cn denotes the total number of letters in the search keyword, qf_ar the letter radical counts of the search keyword, qf_wn the total number of words in the search keyword, and qf_iw the initial-letter codes of the search keyword; k_cn denotes the threshold for the total-letters dimension, k_ar the threshold for the radical-count dimension, k_wn the threshold for the total-words dimension, and k_iw the threshold for the initial-letter-code dimension.
Embodiment 2
Referring to FIG. 2, this embodiment provides an English place name query method, applied to user equipment, comprising the following steps:
S21: obtaining a search keyword entered by a user on the user equipment;
S22: searching, according to an index file built in advance in the English place name database, for a set of candidate place names related to the search keyword, wherein the index file stored on the user equipment is built according to the English place name index building method of Embodiment 1;
S23: returning the set of candidate place names to the user equipment for display.
As a preferred embodiment, referring to FIG. 3, after the set of candidate place names is obtained, the English place name query method may further include:
S31: calculating the similarity value between each English place name in the candidate set and the search keyword;
S32: sorting the English place names in the candidate set in descending order of similarity value, and returning the sorted result to the user equipment for display.
Specifically, the similarity value between each English place name in the candidate set and the search keyword is calculated as:
Figure PCTCN2018109938-appb-000006
where P denotes the string of the search keyword, W denotes the string of an English place name, sim(P, W) is the order similarity value between P and W, len(N), len(P) and len(W) are the string lengths of N, P and W respectively, and N denotes the characters that appear in the same order in P and W.
Embodiment 3
Referring to FIG. 4, this embodiment provides an English place name query apparatus 300, applied to user equipment, comprising a receiving module 310, a search module 320 and a display module 330. The receiving module 310 obtains the search keyword entered by the user on the user equipment. The search module 320 searches, according to an index file built in advance in the English place name database, for a set of candidate place names related to the search keyword, wherein the index file stored on the user equipment is built according to the English place name index building method of claim 1 or 2. The display module 330 displays the set of candidate place names returned to the user equipment.
In a preferred solution, referring to FIG. 5, the English place name query apparatus further includes a similarity calculation module 410 and a sorting module 420. After the set of candidate place names is obtained, the similarity calculation module 410 calculates the similarity value between each English place name in the candidate set and the search keyword. The sorting module 420 sorts the English place names in the candidate set in descending order of similarity value and returns the result to the user equipment, where the display module displays the sorted result.
Specifically, the formula used in the similarity calculation module to calculate the similarity value between each English place name in the candidate set and the search keyword is:
Figure PCTCN2018109938-appb-000007
where P denotes the string of the search keyword, W denotes the string of an English place name, sim(P, W) is the order similarity value between P and W, len(N), len(P) and len(W) are the string lengths of N, P and W respectively, and N denotes the characters that appear in the same order in P and W.
To help those skilled in the art understand the invention more clearly, the place name "Aalders Lang Brook" is used as an example below, with reference to FIG. 1, to explain the principles of the above embodiments in detail. For ease of exposition, the explanation follows the logical order of index generation, candidate place name lookup, and order-similarity ranking.
(1) Index generation process:
Step 11: The feature values of all place-name phrases stored in the English place name dictionary are counted in turn, including the total number of letters, the letter radical counts, the total number of words and the word initial-letter codes. Taking the place name "Aalders Lang Brook" as an example, its total number of letters is 16. The six radicals "|", "—", "/", "\", "(" and ")" occur 9, 8, 3, 2, 12 and 9 times, respectively. The total number of words is 3. The word initial-letter codes are 1, 12, 2 and 0.
Step 12: Build the index dictionary file. In the index dictionary, each index entry records, in order, the total number of letters, the letter radical counts, the total number of words, the word initial-letter codes and the position in the inverted table. For the place name "Aalders Lang Brook", the total number of letters is 16, the letter radical counts are 9, 8, 3, 2, 12, 9, the total number of words is 3, and the word initial-letter codes are 1, 12, 2, 0, so the multi-dimensional feature vector is [16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2, 0]]. Together with its position mapping <1001> into the inverted table, the index entry in the index dictionary file has the structure ([16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2, 0]], <1001>).
Step 13: Build the inverted chain file. Each index entry in the dictionary corresponds to one inverted chain, which uses a hit-record data structure (tf, <p_1, p_2, …, p_f>) to record where the index entry occurs in the place name dictionary. Taking the multi-dimensional feature vector [16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2, 0]] as an example, its inverted table position mapping is <1001>, meaning that position 1001 of the inverted chain file stores the storage positions of all phrases in the English place name dictionary that have this multi-dimensional feature vector. For example, if the record at position 1001 of the inverted chain file is (<5>, <7>, …, <125>, …), the corresponding place-name phrases are stored at positions 5, 7, …, 125, and so on in the English place name dictionary.
(2) Candidate place name lookup process:
Step 21: The submitted query place name is first normalized, that is, each word in the place-name phrase is converted to initial-capital form. For example, the query place name "Alders langbrook" is converted to "Alders Lang Brook".
Step 22: Following the feature statistics rules used when building the index, the feature values of the query place name are counted and organized into a vector Q = [qf_cn, qf_ar, qf_wn, qf_iw]. For the query place name "Alders langbrook", the total number of letters is 15, the letter radical counts are 9, 8, 3, 2, 10, 9, the total number of words is 3, and the word initial-letter codes are 1, 12, 2, 0, so the multi-dimensional statistics vector is [15, [9, 8, 3, 2, 10, 9], 3, [1, 12, 2, 0]].
Step 23: Q is compared with the index entries in the index dictionary; when formula (4) is satisfied, the index entry d_i becomes a candidate qd_i.
Figure PCTCN2018109938-appb-000008
In the formula, f_cn denotes the total number of letters, f_ar the letter radical counts, f_wn the total number of words, and f_iw the initial-letter codes. k_cn denotes the threshold for the total-letters dimension, k_ar the threshold for the radical-count dimension, k_wn the threshold for the total-words dimension, and k_iw the threshold for the initial-letter-code dimension.
Step 24: The index information in each candidate qd_i is resolved in reverse: according to the corresponding position offsets <p_1, p_2, …, p_f> in the inverted chain, the place name data at the relevant storage positions in the place name dictionary are retrieved. All retrieved place name data are merged to form the set of candidate place names. Taking the query place name "Alders langbrook" as an example, suppose step 23 finds the index entry ([16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2, 0]], <1001>) as a candidate qd_i. All inverted chain mapping positions in qd_i are parsed, and the corresponding record <1001> is looked up in the inverted chain file. The dictionary storage positions (<5>, <7>, …, <125>, …) contained in record <1001> are then used to look up the corresponding place-name phrases in the English place name dictionary file, and all of these place names form the candidate place name set C.
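The two-level resolution in step 24 (index entry to inverted chain record, then record to dictionary entries) can be sketched with plain dictionaries. The position numbers 1001, 5, 7 and 125 come from the worked example. Everything else is a hypothetical stand-in: the variable names, the flattened-tuple key, and the placeholder strings, since the text does not say which names are stored at those positions.

```python
def resolve_candidates(candidate_vectors, index_dictionary, chain_file, gazetteer):
    """Follow index entry -> inverted chain record -> dictionary entries.

    Results from all candidate entries are merged, skipping repeated offsets.
    """
    seen = set()
    names = []
    for vec in candidate_vectors:
        chain_position = index_dictionary[vec]       # e.g. 1001
        for offset in chain_file[chain_position]:    # e.g. 5, 7, 125
            if offset not in seen:
                seen.add(offset)
                names.append(gazetteer[offset])
    return names


vec = (16, 9, 8, 3, 2, 12, 9, 3, 1, 12, 2, 0)
index_dictionary = {vec: 1001}
chain_file = {1001: [5, 7, 125]}
# Placeholder entries; the actual names at these positions are not given.
gazetteer = {5: "name at 5", 7: "name at 7", 125: "name at 125"}

print(resolve_candidates([vec], index_dictionary, chain_file, gazetteer))
```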
(3) Order similarity ranking process:
Step 31: The similarity between place names is judged by the proportion of characters that appear in the same order in the two strings. Suppose there are two place name strings P = p_1 p_2 … p_n and W = w_1 w_2 … w_m, and let N denote the characters that appear in the same order in both P and W. Whether characters are in the same order is judged by two principles. (1) Local order principle: N is composed of local similar items ls_i, and there may be several ls_i between P and W. If P contains a substring q_i = p_j p_{j+1} … p_k that is identical to a substring w_s w_{s+1} … w_t of W, the substring satisfies the local order principle and is taken as a local similar item ls_i. (2) Global order principle: the ls_i that occur in the same order in P and W make up N. For example, let P = "Aalders Lang Brook" and W = "Lang Aalders Brook". By the local order principle, "Aalders", "Lang" and "Brook" are the local similar items ls_1, ls_2 and ls_3, respectively. Their order in P is ls_1 ls_2 ls_3, and their order in W is ls_2 ls_1 ls_3. Taking the order in the query place name P as the reference, the items that satisfy the global order principle are ls_1 ls_3, so N = ls_1 ls_3. The place name similarity of P and W is computed by formula (5).
Figure PCTCN2018109938-appb-000009
In the formula, sim(P, W) is the order similarity value between P and W, and len(N), len(P) and len(W) are the string lengths of N, P and W, respectively. Thus the similarity between "Aalders Lang Brook" and "Lang Aalders Brook" is 12/16 = 0.75.
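The worked example can be reproduced with a small dynamic-programming sketch. Several points are assumptions rather than statements of the text: whole words are used as the local similar items, N is chosen as the common ordered subsequence of words with the greatest total letter count, lengths ignore spaces, and formula (5), which is available here only as a figure reference, is taken to be len(N) / max(len(P), len(W)). That form agrees with the 12/16 result above, but so would 2·len(N) / (len(P) + len(W)), so it should not be read as the confirmed formula.

```python
def common_ordered_length(p_items, w_items):
    """Greatest total length of items that appear in the same order in both lists."""
    n, m = len(p_items), len(w_items)
    # best[i][j] = best total length using p_items[i:] and w_items[j:]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best[i][j] = max(best[i + 1][j], best[i][j + 1])
            if p_items[i] == w_items[j]:
                best[i][j] = max(best[i][j], len(p_items[i]) + best[i + 1][j + 1])
    return best[0][0]


def order_similarity(p, w):
    """Assumed form of sim(P, W): len(N) / max(len(P), len(W)), spaces excluded."""
    p_items, w_items = p.split(), w.split()
    len_p = sum(len(item) for item in p_items)
    len_w = sum(len(item) for item in w_items)
    longest = max(len_p, len_w)
    if longest == 0:
        return 0.0
    return common_ordered_length(p_items, w_items) / longest


print(order_similarity("Aalders Lang Brook", "Lang Aalders Brook"))   # 0.75
```

Working at word level keeps the sketch close to the example, where the local similar items are exactly the three words. Misspelled words would need substring-level items, which this sketch does not attempt.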
Step 32: Order similarity ranking. Based on the similarity values computed in step 31, the place names C_q in the candidate set C are sorted by similarity from high to low, and the top n place names C_q are returned as the query result.
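The ranking in step 32 is a descending sort followed by truncation to the top n. In the sketch below the scoring function is passed in as a parameter. The demonstration uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in scorer; it is not the order similarity of formula (5). The candidate list is hypothetical.

```python
from difflib import SequenceMatcher


def top_n(query, candidates, score, n):
    """Return the n candidates with the highest score(query, candidate).

    sorted() is stable, so candidates with equal scores keep their input order.
    """
    ranked = sorted(candidates, key=lambda name: score(query, name), reverse=True)
    return ranked[:n]


def stand_in_score(query, name):
    # Stand-in only: generic string similarity, not the patent's sim(P, W).
    return SequenceMatcher(None, query, name).ratio()


candidates = ["Lang Aalders Brook", "Aalders Lang Brook", "Zebra Hill"]
print(top_n("Alders Lang Brook", candidates, stand_in_score, 2))
```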
Experimental analysis
To verify the technical effect of the invention, this embodiment builds an English place name dictionary from 115,000 English place name records and extracts 5,409 of them as standard place names. A test set is constructed by deliberately introducing errors into the standard place names. The error types cover several kinds of inaccurate description (for example: extra letters; missing letters; wrong letters; transposed letters), and the test names are divided into 5 levels according to their accuracy relative to the original standard place names (as shown in Table 2). Accuracy is defined by formula (6):
Figure PCTCN2018109938-appb-000010
In the formula, A is the number of characters in the query place name P that are correct compared with the target place name C, N is the number of characters in the query place name P, and accu(P, C) is the accuracy of P.
Figure PCTCN2018109938-appb-000011
Table 2 Breakdown of the test set in this embodiment
Note: the text in parentheses is the target place name corresponding to the test place name, that is, its standard form.
In addition, the query performance of the invention in the experiment, for query place names of different accuracy levels, is shown in Table 3 below:
Figure PCTCN2018109938-appb-000012
Figure PCTCN2018109938-appb-000013
Table 3 Evaluation metrics for the experimental results
The experimental results show that the proposed English place name dictionary query method based on a feature-statistics inverted index not only maintains high running efficiency on large-scale data, but can also find the target place name fairly accurately when the query place name is expressed inaccurately.
The above embodiments merely illustrate the principles and effects of the invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall still be covered by the claims of the invention.

Claims (9)

  1. A method for building an English place name index, applied to user equipment, characterized in that the method comprises:
    computing a plurality of feature values for every English place-name phrase stored in an English gazetteer text, the feature values comprising the total number of letters, the letter radical counts, the total number of words, and the first-letter codes of the words;
    generating a corresponding multi-dimensional feature statistics vector from the feature values of each English place-name phrase;
    building an inverted index file using, as index entries, the multi-dimensional feature statistics vector of each English place-name phrase together with its position mapping information in the inverted list, wherein each index entry corresponds to one inverted chain.
  2. The method for building an English place name index according to claim 1, characterized in that the multi-dimensional feature statistics vector is:
    $d_i = [f_{cn}, f_{ar}, f_{wn}, f_{iw}]$,
    where $d_i$ denotes the multi-dimensional feature statistics vector of an English place name, $f_{cn}$ the total number of letters, $f_{ar}$ the letter radical counts, $f_{wn}$ the total number of words, and $f_{iw}$ the first-letter codes; $f_{ar}$ comprises the counts for six radicals, and $f_{iw}$ comprises the first-letter codes of the first four words of the English place-name phrase.
  3. The method for building an English place name index according to claim 2, characterized by further comprising:
    when querying an English place name by means of the index entries, comparing the search keyword with the index entries, and taking an index entry as a candidate for the query when it satisfies the following conditions;
    the conditions comprise:
    Figure PCTCN2018109938-appb-100001
    where $qf_{cn}$ denotes the total number of letters in the search keyword, $qf_{ar}$ its letter radical counts, $qf_{wn}$ its total number of words, and $qf_{iw}$ its first-letter codes; $k_{cn}$, $k_{ar}$, $k_{wn}$, and $k_{iw}$ denote the thresholds for the letter-total, letter-radical, word-total, and first-letter-code dimensions, respectively.
  4. A method for querying English place names, applied to user equipment, characterized in that the method comprises:
    obtaining a search keyword entered by a user on the user equipment;
    looking up a set of candidate place names related to the search keyword according to an index file built in advance in an English place name database, wherein the index file stored on the user equipment is built by the method for building an English place name index according to claim 1 or 2;
    returning the set of candidate place names to the user equipment for display.
  5. The method for querying English place names according to claim 4, characterized by further comprising, after the set of candidate place names is obtained:
    calculating a similarity value between each English place name in the set of candidate place names and the search keyword;
    sorting the English place names in the set of candidate place names in descending order of similarity value, and returning the sorted result to the user equipment for display.
  6. The method for querying English place names according to claim 4 or 5, characterized in that the similarity value between each English place name in the set of candidate place names and the search keyword is calculated as:
    Figure PCTCN2018109938-appb-100002
    where P denotes the character string of the search keyword, W denotes the character string of an English place name, sim(P, W) is the sequential similarity value between P and W, len(N), len(P), and len(W) denote the string lengths of N, P, and W respectively, and N denotes the characters that occur in the same order in both P and W.
  7. An apparatus for querying English place names, applied to user equipment, characterized by comprising:
    a receiving module, configured to obtain a search keyword entered by a user on the user equipment;
    a lookup module, configured to look up a set of candidate place names related to the search keyword according to an index file built in advance in an English place name database, wherein the index file stored on the user equipment is built by the method for building an English place name index according to claim 1 or 2;
    a display module, configured to display the set of candidate place names returned to the user equipment.
  8. The apparatus for querying English place names according to claim 7, characterized by further comprising:
    a similarity calculation module, configured to calculate, after the set of candidate place names is obtained, a similarity value between each English place name in the set of candidate place names and the search keyword;
    a sorting module, configured to sort the English place names in the set of candidate place names in descending order of similarity value and to return the sorted result to the user equipment;
    wherein the display module displays the sorted result.
  9. The apparatus for querying English place names according to claim 7 or 8, characterized in that the formula used by the similarity calculation module to calculate the similarity value between each English place name in the set of candidate place names and the search keyword comprises:
    Figure PCTCN2018109938-appb-100003
    where P denotes the character string of the search keyword, W denotes the character string of an English place name, sim(P, W) is the sequential similarity value between P and W, len(N), len(P), and len(W) denote the string lengths of N, P, and W respectively, and N denotes the characters that occur in the same order in both P and W.
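The similarity ranking recited in the claims can be illustrated with a short sketch. The similarity formula itself survives here only as an image placeholder, so the code below rests on two loudly stated assumptions: N is taken to be the longest common subsequence of P and W (one reading of "characters that occur in the same order"), and the three lengths are combined as 2·len(N) / (len(P) + len(W)). Neither is confirmed by the text; the function names `lcs_length`, `sim`, and `rank` are hypothetical.

```python
def lcs_length(p, w):
    """Length of the longest common subsequence of p and w (assumed len(N))."""
    prev = [0] * (len(w) + 1)
    for ch in p:
        cur = [0]
        for j, wc in enumerate(w, start=1):
            if ch == wc:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sim(p, w):
    """ASSUMED combination of len(N), len(P), len(W); not taken from the source."""
    if not p and not w:
        return 1.0
    return 2 * lcs_length(p, w) / (len(p) + len(w))


def rank(query, candidate_names):
    """Sort candidate place names by descending similarity to the query."""
    q = query.lower()
    return sorted(candidate_names,
                  key=lambda name: sim(q, name.lower()),
                  reverse=True)
```

Under these assumptions, an order-sensitive measure rewards candidates that preserve the query's character sequence, so a transposition typo costs only one matched character rather than disqualifying the correct name.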
PCT/CN2018/109938 2018-08-20 2018-10-12 Index building method for english geographical name, and query method and apparatus therefor WO2020037794A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810945986.8A CN109165331A (en) 2018-08-20 2018-08-20 A kind of index establishing method and its querying method and device of English place name
CN201810945986.8 2018-08-20

Publications (1)

Publication Number Publication Date
WO2020037794A1 true WO2020037794A1 (en) 2020-02-27

Family

ID=64896023

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/109938 WO2020037794A1 (en) 2018-08-20 2018-10-12 Index building method for english geographical name, and query method and apparatus therefor

Country Status (3)

Country Link
CN (1) CN109165331A (en)
AU (1) AU2018102145A4 (en)
WO (1) WO2020037794A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309151A (en) * 2019-06-18 2019-10-08 精硕科技(北京)股份有限公司 A kind of index establishing method, device and computer readable storage medium
CN110275970B (en) * 2019-06-21 2022-05-06 北京达佳互联信息技术有限公司 Image retrieval method, device, server and storage medium
CN113011174B (en) * 2020-12-07 2023-08-11 红塔烟草(集团)有限责任公司 Method for identifying purse string based on text analysis
CN113268972B (en) * 2021-05-14 2022-01-11 东莞理工学院城市学院 Intelligent calculation method, system, equipment and medium for appearance similarity of two English words

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101840406A (en) * 2009-03-20 2010-09-22 富士通株式会社 Place name searching device and system
CN101930435A (en) * 2009-10-27 2010-12-29 深圳市北科瑞声科技有限公司 Method and system for retrieving organization names
US20180060936A1 (en) * 2016-08-29 2018-03-01 BloomReach, Inc. Search Ranking
CN108205578A (en) * 2016-12-20 2018-06-26 北大方正集团有限公司 Index generation method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001043236A (en) * 1999-07-30 2001-02-16 Matsushita Electric Ind Co Ltd Synonym extracting method, document retrieving method and device to be used for the same
CN100452048C (en) * 2006-06-02 2009-01-14 凌阳科技股份有限公司 Method for enquiring electronic dictionary word with letter index table and system thereof
CN101794307A (en) * 2010-03-02 2010-08-04 光庭导航数据(武汉)有限公司 Vehicle navigation POI (Point of Interest) search engine based on internetwork word segmentation idea
CN107133311A (en) * 2017-04-28 2017-09-05 安徽博约信息科技股份有限公司 Network information ownership place index marker method based on regional code

Also Published As

Publication number Publication date
CN109165331A (en) 2019-01-08
AU2018102145A4 (en) 2019-11-21

Similar Documents

Publication Publication Date Title
US10515090B2 (en) Data extraction and transformation method and system
WO2020037794A1 (en) Index building method for english geographical name, and query method and apparatus therefor
US8745077B2 (en) Searching and matching of data
CN1728142B (en) Phrase identification method and device in an information retrieval system
US9110980B2 (en) Searching and matching of data
US8055498B2 (en) Systems and methods for building an electronic dictionary of multi-word names and for performing fuzzy searches in the dictionary
US20050197829A1 (en) Word collection method and system for use in word-breaking
US20120016663A1 (en) Identifying related names
Hull et al. An integrated algorithm for text recognition: comparison with a cascaded algorithm
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
CN110888946A (en) Entity linking method based on knowledge-driven query
US8682900B2 (en) System, method and computer program product for documents retrieval
CN111767733A (en) Document security classification discrimination method based on statistical word segmentation
JP4486324B2 (en) Similar word search device, method, program, and information search system
Priya et al. Hybrid optimization algorithm using N gram based edit distance
Liang Spell checkers and correctors: A unified treatment
KR101359039B1 (en) Analysis device and method for analysis of compound nouns
WO2006058476A1 (en) Prime number replacing character string search technology
CN112784227A (en) Dictionary generating system and method based on password semantic structure
US12019701B2 (en) Computer architecture for string searching
Pathak et al. Extracting context of math formulae contained inside scientific documents
CN110175268B (en) Longest matching resource mapping method
CN116126893B (en) Data association retrieval method and device and related equipment
WO2022165634A1 (en) Apparatus and method for type matching of a text sample
Kumaran et al. Lexequal: Supporting multiscript matching in database systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18930916

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18930916

Country of ref document: EP

Kind code of ref document: A1
