CN111753515B - Address information extraction and matching method for realizing entity positioning - Google Patents


Info

Publication number
CN111753515B
Authority
CN
China
Prior art keywords
text
address
level
label
address text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010590590.3A
Other languages
Chinese (zh)
Other versions
CN111753515A (en
Inventor
曾伟英
霍智杰
霍凯亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Kejie Communication Information Technology Co ltd
Original Assignee
Guangdong Kejie Communication Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Kejie Communication Information Technology Co ltd
Priority to CN202010590590.3A
Publication of CN111753515A
Application granted
Publication of CN111753515B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An address information extraction and matching method for realizing entity positioning comprises: constructing a first conditional random field, which includes determining the states of the first conditional random field according to administrative level keywords; passing the address text through state jumps according to the states of the first conditional random field; dividing the address text into a plurality of sub-texts according to the state jumps; adding text labels to the divided sub-texts at the corresponding administrative levels; constructing and storing a tag library from the text labels; constructing a second conditional random field, which includes adding new text labels to the second address text according to the text labels in the tag library; and performing fuzzy matching on the address text, which includes acquiring a weight value for each text label in the tag library and, according to the weight values, matching the address text in the tag library whose text labels are most similar to the input address text. The method solves the problem that two original data sources cannot be directly associated because their address information was entered with different writing habits.

Description

Address information extraction and matching method for realizing entity positioning
Technical Field
The invention relates to the technical field of text matching, in particular to an address information extraction and matching method for realizing entity positioning.
Background
Geographic information is currently the most commonly used public information resource; it is closely tied to the public's daily life and is also a basic resource for government administration. A text address is a geographical location described in text, such as "Beijing City, Chaoyang District, Aster Road". However, data mining on data sets that contain address text often runs into the problem that most of the address information in the original data is not recorded in a standardized form, which creates a bottleneck when correlation analysis is carried out on massive volumes of address text.
Disclosure of Invention
To address the shortcomings described in the background, the invention provides an address information extraction and matching method for realizing entity positioning. The method extracts labels that agree with conventional understanding from massive volumes of address text, makes it easy to associate data that must be joined by address, and solves the problem that two original data sources cannot be directly associated because their address information was entered with different writing habits.
To achieve this purpose, the invention adopts the following technical scheme:
The address information extraction and matching method for realizing entity positioning handles a first address text containing administrative level keywords and a second address text not containing administrative level keywords, and comprises the following specific steps:
constructing a first conditional random field applicable to the first address text, comprising:
determining the states of the first conditional random field according to the administrative level keywords;
passing the address text through state jumps according to the states of the first conditional random field;
dividing the address text into a plurality of sub-texts according to the state jumps;
for address text that is successfully segmented, adding text labels to the segmented sub-texts at the corresponding administrative levels;
constructing and storing a tag library according to the text labels;
constructing a second conditional random field applicable to the second address text, comprising:
adding new text labels to the second address text according to the text labels in the tag library;
performing fuzzy matching on the address text, comprising:
acquiring a weight value for each text label in the tag library;
and, according to the weight values, matching the address text in the tag library whose text labels are most similar to the input address text.
Preferably, the first address text is graded according to its administrative level keywords, and the address text of each level corresponds to one state of the first conditional random field;
address texts at the same level are arranged side by side, and lower-level address texts are arranged after higher-level address texts.
Preferably, during state jumps, the high-level state corresponding to high-level address text jumps to the low-level state corresponding to low-level address text, and the jump is irreversible;
when a high-level state jumps to a low-level state, all the low-level states in the column containing that low-level state are passed through;
the states of the single lowest level may jump to each other.
Preferably, in each jump, the address text is segmented using the administrative level keyword corresponding to the level state, and the segmented address text enters the next lower-level state to be segmented again;
the path with the largest number of jumps is selected and determined to be the optimal segmentation path, wherein jumps that cross level states are not counted in the number of jumps.
Preferably, in address text that is successfully segmented, the sub-text and the administrative level vocabulary corresponding to each level state are used as text labels.
Preferably, a dictionary is established, the text labels are added to the dictionary according to preset rules, and the dictionary is stored as a two-dimensional data table.
Preferably, the second address text is split character by character; each split character is combined with the character that follows it, and the combination is matched against the tag library to judge whether a text label for the combination exists; if so, the combination is retained; if not, the combination is not retained;
after a combination is retained, it is combined with the next character to form a new combination, which is matched against the tag library to judge whether a text label for the new combination exists; if so, the new combination is retained and combination continues with the next character; if not, the new combination is not retained;
and so on, until the split characters can no longer be combined.
Preferably, after the input address text is segmented, weight statistics are computed over all text labels; each text label corresponds to a weight value, and the weight value is proportional to the importance of the text label.
Preferably, the similarity between each text label in the input address text and the text labels in the tag library is calculated, the similarities are averaged using the weight values as weights, and the tag-library text label with the largest resulting value is the most similar to the input address text.
Advantageous effects
The method extracts labels that agree with conventional understanding from massive volumes of address text, makes it easy to associate data that must be joined by address, and solves the problem that two original data sources cannot be directly associated because their address information was entered with different writing habits.
Drawings
FIG. 1 is a diagram of a model structure of one embodiment of the present invention;
FIG. 2 is a flow chart of one embodiment of the present invention;
FIG. 3 is a conditional random field state jump diagram of one embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further described below by the specific embodiments with reference to the accompanying drawings.
In the description of the present invention, it should be understood that the terms "upper," "lower," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate or are based on the orientation or positional relationship shown in the drawings, merely to facilitate description of the present invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.
The invention relates to an address information extraction and matching method for realizing entity positioning. As shown in FIG. 1 and FIG. 2, the method handles a first address text containing administrative level keywords and a second address text not containing administrative level keywords, and comprises the following specific steps:
constructing a first conditional random field applicable to the first address text, comprising:
determining the states of the first conditional random field according to the administrative level keywords;
passing the address text through state jumps according to the states of the first conditional random field;
dividing the address text into a plurality of sub-texts according to the state jumps;
for address text that is successfully segmented, adding text labels to the segmented sub-texts at the corresponding administrative levels;
constructing and storing a tag library according to the text labels;
constructing a second conditional random field applicable to the second address text, comprising:
adding new text labels to the second address text according to the text labels in the tag library;
performing fuzzy matching on the address text, comprising:
acquiring a weight value for each text label in the tag library;
and, according to the weight values, matching the address text in the tag library whose text labels are most similar to the input address text.
The method is divided into two main parts: the first is the construction of the conditional random fields used for segmentation and information extraction; the second is the construction of the tag library after text extraction. The conditional random fields are further divided into a first conditional random field and a second conditional random field, which segment two different types of text: the first address text, which contains administrative level keywords, and the second address text, which does not. An example of the first address text is "Guangdong Province, Foshan City, AB District, CD Third Road, EF Primary School"; an example of the second address text is "Guangdong Foshan CD Third Road EF Primary School". The first conditional random field is used to segment a massive volume of text in order to construct the tag library, and the second conditional random field is then constructed on the basis of the tag library. When a new text is input, the two conditional random fields are used for segmentation and extraction, fuzzy matching is performed on the text based on the Levenshtein distance algorithm, and an approximate text is returned, that is, a text a person would recognize as similar. The method solves a problem common in everyday data mining: addresses entered in different data sets are written inconsistently, so address texts that semantically refer to the same address cannot be associated because of differing writing habits.
Preferably, the first address text is graded according to its administrative level keywords, and the address text of each level corresponds to one state of the first conditional random field;
address texts at the same level are arranged side by side, and lower-level address texts are arranged after higher-level address texts.
As shown in FIG. 3, the states of the first conditional random field are determined as follows. In China, all addresses can be classified by administrative level, such as autonomous region, special administrative region, province, city, district, county, town, and subdistrict office, and the classification is made according to the administrative level keywords. For example, "district" and "county" are peers, so they are arranged side by side; "subdistrict office" and "town" are peers of each other but rank below "district" and "county", so they are arranged side by side after "district" and "county". The other levels form their own states in the same way, and the first conditional random field is built on these states. In actual use, states corresponding to administrative levels can be added or removed according to actual needs, and every first address text to be extracted passes through the first conditional random field.
Preferably, during state jumps, the high-level state corresponding to high-level address text jumps to the low-level state corresponding to low-level address text, and the jump is irreversible;
when a high-level state jumps to a low-level state, all the low-level states in the column containing that low-level state are passed through;
the states of the single lowest level may jump to each other.
State jumps of the first conditional random field: the first conditional random field must jump from front to back, that is, from high-level address text to low-level address text, and a jump from high to low must pass through every possible state in the target column. For example, the "district" column can jump only to the "subdistrict" column: a jump from the "district" column must pass through both "subdistrict" and "town", and a jump from the "subdistrict" column must pass through both "residents' committee" and "village". Only the lowest-level state, "number", has a loop onto itself.
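The jump rule can be illustrated with a small sketch. This is not the patent's implementation: the column layout, the state names, and the helper `can_jump` are hypothetical, and cross-level jumps are simply allowed here (the text says only that they are not counted).

```python
# Hypothetical level table: each inner tuple is one column of peer states,
# ordered from the highest administrative level to the lowest.
LEVEL_COLUMNS = [
    ("province", "autonomous region", "special administrative region"),
    ("city",),
    ("district", "county"),
    ("subdistrict", "town"),
    ("residents' committee", "village"),
    ("number",),
]

# Map every state to the index of the column it belongs to.
COLUMN_OF = {
    state: index
    for index, column in enumerate(LEVEL_COLUMNS)
    for state in column
}
LOWEST = len(LEVEL_COLUMNS) - 1


def can_jump(src: str, dst: str) -> bool:
    """Return True if a jump from state src to state dst is allowed."""
    a, b = COLUMN_OF[src], COLUMN_OF[dst]
    if b > a:
        return True  # jumps go from a higher level to a lower one only
    return a == b == LOWEST  # only the lowest level may loop on itself
```

With this table, `can_jump("province", "city")` holds, while `can_jump("city", "province")` and the peer jump `can_jump("district", "county")` do not.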
Preferably, in each jump, the address text is segmented using the administrative level keyword corresponding to the level state, and the segmented address text enters the next lower-level state to be segmented again;
the path with the largest number of jumps is selected and determined to be the optimal segmentation path, wherein jumps that cross level states are not counted in the number of jumps.
In this method, each administrative level is regarded as a state, and jumps can only go from a high level to a low level and are irreversible. Each jump segments the address text using the corresponding administrative level vocabulary, and the segmented sub-text enters a new state to be segmented further. If segmentation fails at a level, for example because the text omits the "district" information and goes directly from "city" to "subdistrict office", the cross-level jump is not counted. After the first conditional random field has segmented the text, the path with the largest number of jumps is selected as the optimal segmentation path.
For example, the address text "Guangdong Province, Foshan City, AB District, CD Third Road, EF Primary School" passes not only through the state jump path "province", "city", "district", "road", but also through the path "autonomous region", "city", "county", and so on. At the "autonomous region" state, however, the text cannot be segmented by "autonomous region", so that state is passed through directly and the text enters the "city" state. The path through "autonomous region" therefore necessarily has fewer successful segmentations than the path through "province", so it cannot be the optimal path.
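The path selection described above can be sketched as a plain state walk (not a trained conditional random field). The keyword columns, the function names, and the Chinese address string are assumptions: the string is a reconstruction of the patent's fictional example, and only four columns are modelled.

```python
from itertools import product

# Hypothetical keyword columns, highest level first; each entry is
# (keyword as written in the address, level name used as the label).
LEVEL_COLUMNS = [
    [("省", "province"), ("自治区", "autonomous region")],
    [("市", "city")],
    [("区", "district"), ("县", "county")],
    [("路", "road")],
]


def segment_along(path, text):
    """Split text with the keywords of one path; return (labels, remainder)."""
    labels, rest = [], text
    for keyword, level in path:
        pos = rest.find(keyword)
        if pos <= 0:
            # Keyword absent (or nothing precedes it): the state is passed
            # through and this cross-level jump is not counted.
            continue
        labels.append((rest[:pos], level))
        rest = rest[pos + len(keyword):]
    return labels, rest


def best_segmentation(text):
    """Try one state per column and keep the path with the most splits."""
    results = (segment_along(path, text) for path in product(*LEVEL_COLUMNS))
    return max(results, key=lambda result: len(result[0]))


labels, rest = best_segmentation("广东省佛山市AB区CD三路EF小学")
```

Here the "province, city, district, road" path splits four times, while any path through "autonomous region" or "county" splits only three times, so the former is kept.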
Preferably, in address text that is successfully segmented, the sub-text and the administrative level vocabulary corresponding to each level state are used as text labels.
Preferably, a dictionary is established, the text labels are added to the dictionary according to preset rules, and the dictionary is stored as a two-dimensional data table.
For address text that is successfully segmented, text labels can be added at the corresponding administrative levels. For example, for the address text "Guangdong Province, Foshan City, AB District, CD Third Road, EF Primary School", the segment at the "province" level is "Guangdong", the segment at the "city" level is "Foshan", and so on.
Building the tag library: a blank dictionary is created (a dictionary being a data structure), and the extracted label texts are added as key-value pairs. For example, for "Guangdong Province, Foshan City, AB District, CD Third Road, EF Primary School", the segmented sub-text "Guangdong" corresponds to {"text": "Guangdong", "grade": "province", "count": "+1"} (where +1 means the count is accumulated on top of its existing value); the remaining labels are handled in the same way.
Expanding the tag library: when the label text "Guangdong" appears again in a later address, the count of "Guangdong" is incremented by one through its key; if a label text has not appeared before, a new key-value pair is added.
Storing the tag library: the updated dictionary is periodically stored as a two-dimensional data table, which allows efficient access to its data, in preparation for the subsequent construction of the second conditional random field.
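A minimal sketch of the tag library as a dictionary with running counts, flattened into a two-dimensional table. The function names and the CSV format are assumptions made for illustration; the patent does not name a storage format.

```python
import csv
import io


def add_label(library, text, grade):
    """Add a label to the library, or increment its count if already present."""
    entry = library.setdefault(text, {"text": text, "grade": grade, "count": 0})
    entry["count"] += 1


def to_table(library):
    """Flatten the dictionary into rows: a header plus one row per label."""
    rows = [["text", "grade", "count"]]
    rows += [[e["text"], e["grade"], e["count"]] for e in library.values()]
    return rows


def save_table(library, stream):
    """Write the two-dimensional table to any text stream as CSV."""
    csv.writer(stream).writerows(to_table(library))


library = {}
for text, grade in [("广东", "province"), ("佛山", "city"), ("广东", "province")]:
    add_label(library, text, grade)

buffer = io.StringIO()
save_table(library, buffer)
```

After the three additions, "广东" has a count of 2 and "佛山" a count of 1, and the buffer holds the same data as comma-separated rows.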
Preferably, the second address text is split character by character; each split character is combined with the character that follows it, and the combination is matched against the tag library to judge whether a text label for the combination exists; if so, the combination is retained; if not, the combination is not retained;
after a combination is retained, it is combined with the next character to form a new combination, which is matched against the tag library to judge whether a text label for the new combination exists; if so, the new combination is retained and combination continues with the next character; if not, the new combination is not retained;
and so on, until the split characters can no longer be combined.
The second conditional random field is used to segment text that contains no administrative keywords, for example "Foshan NH District" written as "Foshan NH". The states of the second conditional random field therefore depend on the labels extracted by the first conditional random field. For such address text, the tag library is used as follows: the text is split character by character, each character is combined with the characters that follow it to form a new sub-text that jumps into a new state, the likelihood of each state jump is counted from the tag library, and the most probable path is taken as the best segmentation. Finally, the administrative label corresponding to each state along the best path is looked up.
Character-by-character splitting: for example, "Guangdong Foshan CD Third Road EF Primary School" is split into "Guang", "Dong", "Fo", and so on.
Initializing the combined address text: for the split Chinese characters, "Guang" and "Dong" are merged first; if "Guangdong" exists in the tag library, the combination is kept. Merging continues to form "Guangdong Fo", which does not exist in the tag library, so that combination is skipped. Merging continues in this way until the complete address text has been traversed, and the retained combinations form the states of the second conditional random field.
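The character-by-character combination step can be sketched as follows. The function name, the toy tag library, and the Chinese strings are assumptions (reconstructions of the fictional example); a combination stops growing as soon as the extended string is missing from the tag library, which is one reading of the description.

```python
def candidate_combinations(text, tag_library):
    """Grow combinations character by character, keeping those in the library."""
    chars = list(text)
    kept = []
    for start in range(len(chars) - 1):
        combo = chars[start]
        for next_char in chars[start + 1:]:
            combo += next_char
            if combo not in tag_library:
                break  # the new combination is not retained; stop extending
            kept.append(combo)
    return kept


tag_library = {"广东", "佛山"}
states = candidate_combinations("广东佛山CD三路", tag_library)
```

Starting from "广", the combination "广东" is found and kept, "广东佛" is not found and is dropped, and the walk later keeps "佛山", so `states` holds those two combinations.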
Selecting the optimal segmentation: among all the candidate combinations above, the one that maximizes the product of the square root of its number of occurrences in the tag library and its string length is selected. For example, if "Guangdong Fo" occurs 10,000 times and "Guangdong" occurs 1,000,000 times, the optimal segmentation is still "Guangdong".
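As a worked check of this selection rule, the sketch below reads the score as the square root of the occurrence count multiplied by the string length (the square-root reading and the function names are assumptions, as are the Chinese strings standing in for "Guangdong Fo" and "Guangdong").

```python
import math


def split_score(text, occurrences):
    """Score a candidate: sqrt(number of occurrences) times string length."""
    return math.sqrt(occurrences) * len(text)


def best_split(occurrence_counts):
    """Return the candidate with the highest score."""
    return max(
        occurrence_counts,
        key=lambda text: split_score(text, occurrence_counts[text]),
    )


counts = {"广东佛": 10_000, "广东": 1_000_000}
winner = best_split(counts)
```

The three-character candidate scores 100 × 3 = 300 and the two-character candidate scores 1000 × 2 = 2000, so the shorter but far more frequent "广东" wins, in line with the example in the text.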
Word formation and iterative segmentation: once "Guangdong" is the best combination, any subsequent text containing the characters "Guangdong" is segmented with "Guangdong" as the default optimal granularity. For "Guangdong Foshan CD Third Road EF Primary School", after "Guangdong" has been split off, optimal segmentation selection is carried out on the characters that follow it, until no further combination is possible; the remaining characters form a new combination, which is added to the tag library.
Labeling newly segmented data: the text is segmented into "Guangdong", whose corresponding label is "province", so "Guangdong, province" becomes one of the labels of the text "Guangdong Foshan CD Third Road EF Primary School"; the other labels are obtained by analogy.
Preferably, after the input address text is segmented, weight statistics are computed over all text labels; each text label corresponds to a weight value, and the weight value is proportional to the importance of the text label.
Preferably, the similarity between each text label in the input address text and the text labels in the tag library is calculated, the similarities are averaged using the weight values as weights, and the tag-library text label with the largest resulting value is the most similar to the input address text.
As mentioned above, fuzzy matching is performed on the text based on the Levenshtein distance algorithm, and an approximate text is returned, that is, a text a person would recognize as similar. The method uses Levenshtein distance for fuzzy matching of addresses. First, weight statistics are computed for the extracted text labels in the tag library; the weighting scheme used is TF-IDF, which is prior art and is not described here. For example, for the address text "Foshan, Nanhai, G City, GL Road, No. 18", the preceding conditional random fields and TF-IDF give a weight of 0.15 for "Foshan", 0.18 for "Nanhai", 0.33 for "GL Road", and so on; the larger the weight, the more important the term.
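For reference, a minimal sketch of Levenshtein distance and a similarity ratio derived from it. The normalization by the longer string's length is an assumption; the patent does not state how distance is turned into a similarity rate. The Chinese strings are reconstructions of the fictional building labels.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


distance = levenshtein("BG花园X座607房", "BG花园Y座205房")
```

The two labels differ in three characters out of ten, so the distance is 3 and the similarity is about 0.7.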
Each time a new address text is received, it is first segmented: the conditional random fields are used to obtain all its administrative level labels, and TF-IDF is then used to calculate the weight of the text label at each administrative level. Next, the Levenshtein distance is calculated between each part of the input address text and the corresponding text labels in the tag library. Finally, the resulting values are combined by weighted average, and the address text with the largest value is the one most similar to the input address.
Example: suppose the address "Foshan City, Nanhai District, GL Road, BG Garden, Block X, Room 607" is not recorded in the tag library extracted above. The label "BG Garden, Block X, Room 607" then does not appear in the tag library, although labels such as "Foshan City", "Nanhai District", and "GL Road" can all be found there, having been extracted by the first and second conditional random fields from other texts of the same region, for instance from the address "Foshan City, Nanhai District, GL Road, BG Garden, Block Y, Room 205". After the address is segmented and weighted, "BG Garden, Block X, Room 607" has the highest weight, meaning it has the greatest influence on the address. Levenshtein fuzzy matching is carried out state by state, level by level, producing a string similarity for each state, and the result is obtained as the weighted average of all the similarities using the TF-IDF weights. For the two addresses "Foshan City, Nanhai District, GL Road, BG Garden, Block Y, Room 205" and "Foshan City, Nanhai District, GL Road, UU Upper House, Block X, Room 607", the products of string similarity and weight for "BG Garden, Block Y, Room 205" and "UU Upper House, Block X, Room 607", each compared against "BG Garden, Block X, Room 607", are compared to decide which is more similar. Extracting labels segmented by level makes similarity matching local to each level, which avoids errors caused by inconsistent writing habits among semantically similar addresses and also avoids excessive fuzzy matching, where two strings are highly similar merely because of their character arrangement but actually denote two different addresses, such as "Foshan City, LONGJ Road, LGTH Lido, Block K, Room 201" and "Foshan City, LVJ Road, LGTH Huafu, Block K, Room 201".
Note: the addresses are fictional and are used only for convenience of explanation.
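The final weighted comparison can be sketched as below. All numbers are hypothetical: the weights 0.15, 0.18, and 0.33 are borrowed from the earlier TF-IDF illustration, the "building" weight and level names are invented for the example, and the per-level similarities stand in for values a Levenshtein-based ratio might produce.

```python
def weighted_similarity(similarities, weights):
    """Weighted average of per-level similarities, using label weights."""
    total = sum(weights[level] for level in similarities)
    weighted = sum(weights[level] * s for level, s in similarities.items())
    return weighted / total


def most_similar(candidates, weights):
    """Return the candidate whose weighted similarity is largest."""
    return max(
        candidates,
        key=lambda name: weighted_similarity(candidates[name], weights),
    )


# Hypothetical label weights for the input address (larger = more important).
WEIGHTS = {"city": 0.15, "district": 0.18, "road": 0.33, "building": 0.34}

# Hypothetical per-level similarities of two library addresses to the input
# "..., GL Road, BG Garden, Block X, Room 607".
CANDIDATES = {
    "BG Garden, Block Y, Room 205": {
        "city": 1.0, "district": 1.0, "road": 1.0, "building": 0.7,
    },
    "UU Upper House, Block X, Room 607": {
        "city": 1.0, "district": 1.0, "road": 1.0, "building": 0.6,
    },
}

best = most_similar(CANDIDATES, WEIGHTS)
```

Because the two candidates agree on every level except the building label, the heavily weighted building similarity decides the outcome, and the "BG Garden" candidate scores higher (about 0.898 against 0.864).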
The technical principle of the present invention is described above in connection with the specific embodiments. The description is made for the purpose of illustrating the general principles of the invention and should not be taken in any way as limiting the scope of the invention. Other embodiments of the invention will be apparent to those skilled in the art from consideration of this specification without undue burden.

Claims (5)

1. An address information extraction and matching method for realizing entity positioning, characterized in that the method comprises the following specific steps:
constructing a first conditional random field applicable to the first address text, comprising:
determining the states of the first conditional random field according to the administrative level keywords;
passing the address text through state jumps according to the states of the first conditional random field;
dividing the address text into a plurality of sub-texts according to the state jumps;
for address text that is successfully segmented, adding text labels to the segmented sub-texts at the corresponding administrative levels;
constructing and storing a tag library according to the text labels;
constructing a second conditional random field applicable to the second address text, comprising:
adding new text labels to the second address text according to the text labels in the tag library;
performing fuzzy matching on the address text, comprising:
acquiring a weight value for each text label in the tag library;
and, according to the weight values, matching the address text in the tag library whose text labels are most similar to the input address text;
wherein, during state jumps, the high-level state corresponding to high-level address text jumps to the low-level state corresponding to low-level address text, and the jump is irreversible;
when a high-level state jumps to a low-level state, all the low-level states in the column containing that low-level state are passed through;
the states of the single lowest level may jump to each other;
in each jump, the address text is segmented using the administrative level keyword corresponding to the level state, and the segmented address text enters the next lower-level state to be segmented again;
the path with the largest number of jumps is selected and determined to be the optimal segmentation path, wherein jumps that cross level states are not counted in the number of jumps;
a dictionary is established, the text labels are added to the dictionary according to preset rules, and the dictionary is stored as a two-dimensional data table;
the second address text is split character by character; each split character is combined with the character that follows it, and the combination is matched against the tag library to judge whether a text label for the combination exists; if so, the combination is retained; if not, the combination is not retained;
after a combination is retained, it is combined with the next character to form a new combination, which is matched against the tag library to judge whether a text label for the new combination exists; if so, the new combination is retained and combination continues with the next character; if not, the new combination is not retained;
and so on, until the split characters can no longer be combined.
2. The method for extracting and matching address information for implementing entity positioning according to claim 1, wherein the method comprises the steps of:
Grading the first address text according to administrative level keywords of the first address text, wherein one address text of each level corresponds to one state of the first conditional random field;
Address text at the same level is arranged side by side, and low-level address text is arranged after high-level address text.
3. The method for extracting and matching address information for implementing entity positioning according to claim 1, wherein the method comprises the steps of:
And in the address text which is successfully segmented, taking the sub-text and administrative level vocabulary corresponding to each level state as text labels.
4. The method for extracting and matching address information for implementing entity positioning according to claim 1, wherein the method comprises the steps of:
After the input address text is segmented, all text labels are subjected to weight statistics, each text label corresponds to a weight value, and the weight value is in direct proportion to the importance of the text label.
5. The method for extracting and matching address information for implementing entity positioning according to claim 4, wherein:
calculating the similarity between each text label in the input address text and the text labels in the label library, and taking a weighted average of the similarities using the weight values, wherein the text label in the label library with the maximum value is the most similar to the input address text.
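The weighted matching of claims 4 and 5 can be illustrated with the sketch below. The claims do not specify a similarity measure or how labels are paired, so several assumptions are made here: similarity is taken from the standard-library `difflib.SequenceMatcher.ratio()`, labels are paired by a level key, and the weights, level names, and library entries are invented for the example.

```python
from difflib import SequenceMatcher


def label_similarity(a, b):
    """String similarity in [0, 1]; SequenceMatcher is an assumed stand-in."""
    return SequenceMatcher(None, a, b).ratio()


def weighted_score(input_labels, candidate_labels, weights):
    """Weighted average of per-level similarities between two labelled addresses.

    Both label arguments map a level name to a text label; `weights` maps a
    level name to its weight value. Levels missing from the candidate score 0.
    """
    total_weight = sum(weights.get(level, 0.0) for level in input_labels)
    if total_weight == 0:
        return 0.0
    score = 0.0
    for level, label in input_labels.items():
        candidate = candidate_labels.get(level)
        if candidate is not None:
            score += weights.get(level, 0.0) * label_similarity(label, candidate)
    return score / total_weight


def best_match(input_labels, library, weights):
    """Return the key of the library entry with the maximum weighted score."""
    return max(
        library,
        key=lambda name: weighted_score(input_labels, library[name], weights),
    )


if __name__ == "__main__":
    weights = {"city": 1.0, "district": 2.0}  # hypothetical weight values
    library = {
        "A": {"city": "广州市", "district": "天河区"},
        "B": {"city": "广州市", "district": "越秀区"},
    }
    query = {"city": "广州市", "district": "天河区"}
    print(best_match(query, library, weights))  # A
```

Dividing by the total weight keeps the score in [0, 1], which makes scores comparable across input addresses that carry different numbers of labels.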
CN202010590590.3A 2020-06-24 2020-06-24 Address information extraction and matching method for realizing entity positioning Active CN111753515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010590590.3A CN111753515B (en) 2020-06-24 2020-06-24 Address information extraction and matching method for realizing entity positioning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010590590.3A CN111753515B (en) 2020-06-24 2020-06-24 Address information extraction and matching method for realizing entity positioning

Publications (2)

Publication Number Publication Date
CN111753515A CN111753515A (en) 2020-10-09
CN111753515B true CN111753515B (en) 2024-07-02

Family

ID=72677161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010590590.3A Active CN111753515B (en) 2020-06-24 2020-06-24 Address information extraction and matching method for realizing entity positioning

Country Status (1)

Country Link
CN (1) CN111753515B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581252A (en) * 2020-12-03 2021-03-30 信用生活(广州)智能科技有限公司 Address fuzzy matching method and system fusing multidimensional similarity and rule set
CN112835899B (en) * 2021-01-29 2024-07-02 上海寻梦信息技术有限公司 Address library indexing method, address matching method and related equipment
CN113656531B (en) * 2021-08-12 2024-06-14 南方电网数字电网研究院有限公司 Power grid address structuring processing method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108628811A (en) * 2018-04-10 2018-10-09 北京京东尚科信息技术有限公司 Address text matching method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719128B (en) * 2009-12-31 2012-05-23 浙江工业大学 Fuzzy matching-based Chinese geo-code determination method
CN104537062A (en) * 2014-12-29 2015-04-22 北京牡丹电子集团有限责任公司数字电视技术中心 Address information extracting method and system
CN105005577A (en) * 2015-05-08 2015-10-28 裴克铭管理咨询(上海)有限公司 Address matching method
CN106709065B (en) * 2017-01-19 2020-08-04 国家电网公司 Address information standardization processing method and device
CN109033225A (en) * 2018-06-29 2018-12-18 福州大学 Chinese address identifying system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108628811A (en) * 2018-04-10 2018-10-09 北京京东尚科信息技术有限公司 Address text matching method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Chinese address parsing model based on Trie tree and finite state automaton" (基于Trie树和有限状态自动机的中文地址解析模型); 汪洋; 《计算机与现代化》 (No. 7); pp. 60-67 *

Also Published As

Publication number Publication date
CN111753515A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
CN111753515B (en) Address information extraction and matching method for realizing entity positioning
CN109145169B (en) Address matching method based on statistical word segmentation
US10839156B1 (en) Address normalization using deep learning and address feature vectors
CN101438283B (en) Demographic based classification for local word wheeling/WEB search
US7046827B2 (en) Adapting point geometry for storing address density
CN109145281B (en) Speech recognition method, apparatus and storage medium
CN102810118B (en) Variable-weight network K-nearest-neighbor search method
US9015200B2 (en) Map update scripts with tree edit operations
Hastings Automated conflation of digital gazetteer data
Su et al. Making sense of trajectory data: A partition-and-summarization approach
Zheng et al. Efficient clue-based route search on road networks
Sarawagi et al. Open-domain quantity queries on web tables: annotation, response, and consensus models
EP2836928B1 (en) Full text search using r-trees
CN110765753A (en) Method, system, computer device and storage medium for generating file
CN102737060A (en) Fuzzy search in geocoding application
CN112579921B (en) Track indexing and query method and system based on inverted sorting index and prefix tree
CN105069071A (en) Geographical position information extraction method for microblog data
CN107704524A (en) Subway station function mining method based on doc2vec
CN102591958B (en) Matching method and matching device of deterministic finite automation based on ternary content addressable memory (TCAM)
US8620947B2 (en) Full text search in navigation systems
EP2783308B1 (en) Full text search based on interwoven string tokens
US9110973B2 (en) Method and apparatus for processing a query
CN110532464A (en) Tourism recommendation method based on multi-context tourism modeling
CN117709331A (en) Global multi-language multi-task multi-mode address analysis middle station system and analysis method thereof
CN114513550B (en) Geographic position information processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant