CN108763215B - Address storage method and device based on address word segmentation and computer equipment - Google Patents
- Publication number
- CN108763215B (application CN201810539670.9A)
- Authority
- CN
- China
- Prior art keywords
- address
- name
- address name
- module
- cosine similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an address storage method, a device and computer equipment based on address word segmentation, wherein the method comprises the following steps: carrying out a first segmentation on each address according to a Conditional Random Field (CRF) to obtain the first address names of each address, wherein the address is an address of a geographic location; reading the first address names of each address one by one and comparing them in sequence with the address names in the nodes of a standard address tree; labeling and screening through a finite state machine; performing cosine similarity matching using the screened second address names; if the currently read first address name is the same as the address name of the current node, performing cosine similarity matching of the read first address name and the subsequent second address names against a pre-stored address; and if they match, storing the address in one set together with the pre-stored address. With this optimization, the efficiency of address storage and query is remarkably improved.
Description
Technical Field
The invention relates to the technical field of Chinese address word segmentation, in particular to an address storage method and device based on address word segmentation and computer equipment.
Background
The Chinese address word segmentation technology plays a critical role in many scenes, one critical problem of the Chinese address word segmentation at present is how to efficiently and accurately segment the address, the Chinese address has unique word segmentation characteristics, and compared with a common Chinese text, the word segmentation difficulty is higher. Meanwhile, when a new address is obtained, how to efficiently find the most similar address from the existing address data is also a difficulty.
Disclosure of Invention
Aiming at the technical problems that Chinese addresses are difficult to segment efficiently and accurately and that their word segmentation difficulty is high, the invention provides an efficient storage method based on address word segmentation.
In a first aspect, the present invention provides an address storage method based on address word segmentation, where the method includes:
carrying out first segmentation on each address according to a Conditional Random Field (CRF) to obtain a first address name of each address; wherein the address is an address of a geographic location;
sequentially reading the first address name of each address one by one, and sequentially comparing the first address name with the address names in the nodes in the standard address tree;
segmenting each address again according to the conditional random field CRF to obtain a second address name of each address, and labeling and screening the second address name by a finite state machine; cosine similarity matching is carried out by adopting the screened second address name;
if the read current first address name is the same as the address name of the current node, performing cosine similarity matching on the read first address name and a subsequent second address name with a pre-stored address;
if the match is the same, it is stored in a set with the pre-stored address.
Further, the standard address tree is generated by: constructing a standard address tree according to the address name of an administrative division and the level size of a corresponding administrative region; and the administrative areas adjacent to the levels are parent and child nodes in the standard address tree.
Further, the process of labeling and correcting by a finite state machine specifically includes:
marking the state of each second address name by keywords for each second address name after being segmented again;
inputting the state of each second address name into a finite state machine, and judging whether the address is a reasonable and effective address;
if the address is not a reasonable and effective address, the address is rejected.
Further, the process of performing cosine similarity matching with the pre-stored address specifically includes:
generating word frequency of each first address name and each second address name to obtain a word frequency vector by using the first address name and the second address name which are segmented again;
calculating a weight value of each second address name according to the word frequency vector;
obtaining a comparison vector according to the calculated weight value;
and calculating the similarity according to a cosine similarity calculation formula by adopting the comparison vector.
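Further, the cosine similarity calculation formula is as follows:
$$\text{similarity} = \frac{\sum_{i=1}^{n} \omega_{A,i}\,\omega_{B,i}}{\sqrt{\sum_{i=1}^{n} \omega_{A,i}^{2}}\;\sqrt{\sum_{i=1}^{n} \omega_{B,i}^{2}}}$$
in the formula: $\omega_{A,i}$ is an element of the first comparison vector $V_A$ of the first address, and $\omega_{B,i}$ is an element of the second comparison vector $V_B$ of the second address; the first address is the currently read address; the second address is the pre-stored address; the number of elements is n, and i = 1 to n.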
In a second aspect, the present invention further provides an address storage apparatus based on address word segmentation, including:
the segmentation module is used for segmenting each address for the first time according to the conditional random field CRF to obtain a first address name of each address; wherein the address is an address of a geographic location;
the comparison module is used for sequentially reading the first address name of each address one by one and comparing the first address name with the address names in the nodes in the standard address tree in sequence;
the labeling and screening module is used for segmenting each address again according to the conditional random field CRF to obtain a second address name of each address, and labeling and screening the addresses through a finite state machine; cosine similarity matching is carried out by adopting the screened second address name;
the matching module is used for matching the read first address name and the subsequent second address name with the prestored address according to the cosine similarity if the read first address name is the same as the address name of the current node;
and the storage module is used for storing the address and the prestored address in a set if the matching is the same.
Further, the standard address tree is generated by: constructing a standard address tree according to the address name of an administrative division and the level size of a corresponding administrative region; and the administrative areas adjacent to the levels are parent and child nodes in the standard address tree.
Further, the label screening module specifically includes:
the labeling module is used for labeling the state of each second address name after being segmented again through keywords;
the judging module is used for inputting the state of each second address name into a finite state machine and judging whether the address is a reasonable and effective address;
and the screening module is used for eliminating the address if the address is not a reasonable and effective address.
Further, the process of the matching module performing cosine similarity matching with the pre-stored address specifically includes:
a word frequency vector generation module, configured to generate a word frequency for each of the first address name and the second address name to obtain a word frequency vector, where the first address name and the second address name are obtained by re-segmenting;
the weight calculation module is used for calculating the weight value of each second address name according to the word frequency vector;
the comparison vector generation module is used for obtaining a comparison vector according to the calculated weight value;
and the cosine similarity calculation module is used for calculating the similarity according to a cosine similarity calculation formula by adopting the comparison vector.
In a third aspect, the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the address storage method when executing the computer program.
The invention has the beneficial effects that: the invention provides a system for address word segmentation and similar address query, and the address query efficiency is remarkably improved after the optimization of the system. Meanwhile, the invention can provide similar address authentication service for each financial institution.
Drawings
FIG. 1 is a schematic flowchart of an address storage method based on address word segmentation according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an example of a five-level administrative division of the present invention;
FIG. 3 is a schematic diagram of the logical relationship of the standard state machine of the present invention;
FIG. 4 is a schematic structural diagram of an address storage device based on address word segmentation according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
Fig. 1 is a schematic flowchart of an address storage method based on address word segmentation according to an embodiment of the present invention.
As shown in fig. 1, the method includes:
S1: carrying out first segmentation on each address according to a Conditional Random Field (CRF) to obtain a first address name of each address; wherein the address is an address of a geographic location;
S2: sequentially reading the first address name of each address one by one, and sequentially comparing the first address name with the address names in the nodes in the standard address tree;
S3: segmenting each address again according to the conditional random field CRF to obtain a second address name of each address, and labeling and screening the second address name by a finite state machine; cosine similarity matching is carried out by adopting the screened second address name;
S4: if the read current first address name is the same as the address name of the current node, performing cosine similarity matching on the read first address name and a subsequent second address name with a pre-stored address;
S5: if the match is the same, it is stored in a set with the pre-stored address.
In some demonstrative embodiments, the standard address tree may be generated by:
constructing a standard address tree according to the address name of an administrative division and the level size of a corresponding administrative region; and the administrative areas adjacent to the levels are parent and child nodes in the standard address tree.
Due to the particularity of the Chinese addresses, each address comprises administrative division information such as province, city, district, county, village and the like, and also comprises information such as road names, cell names, house numbers and the like. Meanwhile, the administrative division information of each address has uniqueness, namely each address can only belong to one administrative division and does not belong to a plurality of administrative areas at the same time. Thus, by constructing a standard address tree, comparison and storage with subsequent addresses may be facilitated.
The standard address tree is constructed as follows:
The addresses of all provinces, cities, districts, counties and villages in the country are collected and segmented manually. For example, the original address
Huanggang Village, Sunhe, Chaoyang District, Beijing City
becomes, after segmentation:
Beijing City / Chaoyang District / Sunhe / Huanggang Village
At present, five levels of administrative division information in China are as follows:
(first-level administrative district) provincial-level administrative district names: province, autonomous region, municipality directly under the central government and special administrative region;
(second-level administrative district) prefecture-level administrative district names: prefecture, league, autonomous prefecture, prefecture-level city;
(third-level administrative district) county-level administrative district names: county, autonomous county, banner, autonomous banner, county-level city, municipal district, forestry district and special district;
(fourth-level administrative district) township-level administrative district names: township, ethnic township, town, subdistrict, sumu, ethnic sumu, and district public office;
(fifth-level administrative district) village-level administrative district names: village, community, residents' committee and gacha.
The standard address tree contains all the administrative division information that can be found nationwide, divided according to level. The segmented addresses are stored in a tree data structure. The root node of each tree is a provincial-level administrative district, the child nodes of the root node are second-level administrative divisions, and below them are the third-level, fourth-level and fifth-level administrative divisions, as shown in FIG. 2.
The purpose of creating the standard address library is to correct addresses. The correction flow for one address is as follows:
A first segmentation is made by the CRF.
The segmented address sequence is then corrected against the standard trees. For example, take a segmented address:
Beijing City / Chaoyang District / Donghu Bay
First, all standard trees are traversed to check whether a tree with Beijing City as its root node exists. If so, the next terms of the address sequence are looked up under that tree until the lowest-level leaf nodes of the tree are reached or all terms of the address sequence have been looked up.
If an input term is not present among the root nodes, for example for the address:
Haidian District / Zhongguancun South Street
searching for Haidian District in the root nodes of all trees returns an empty result. In that case the search moves to the second-layer nodes under the root nodes. If a unique address tree whose second layer contains Haidian District is found, the search continues under that branch of the tree; otherwise the search ends and the address is marked as a non-standard address.
After the correction of the standard address tree, the first half of an address can get a standard split, and at this time, we only need to split the subsequent part of the address.
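The lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Node`, `build_forest` and `match_prefix` are hypothetical, the tree data is a tiny hand-made sample, and prepending the root name when a match is found on the second layer is an illustrative choice rather than something the text specifies.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """One administrative division in a standard address tree."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.children: Dict[str, "Node"] = {}


def build_forest(paths: List[List[str]]) -> Dict[str, Node]:
    """Build one tree per provincial-level root from segmented standard addresses."""
    roots: Dict[str, Node] = {}
    for path in paths:
        node = roots.setdefault(path[0], Node(path[0]))
        for name in path[1:]:
            node = node.children.setdefault(name, Node(name))
    return roots


def match_prefix(
    roots: Dict[str, Node], terms: List[str]
) -> Optional[Tuple[List[str], List[str]]]:
    """Return (standard prefix, remaining terms), or None for a non-standard address."""
    if not terms:
        return None
    if terms[0] in roots:
        node = roots[terms[0]]
        matched = [node.name]
    else:
        # Fall back to the second layer; accept only a unique hit.
        hits = [
            (root, child)
            for root in roots.values()
            for child in root.children.values()
            if child.name == terms[0]
        ]
        if len(hits) != 1:
            return None
        root, node = hits[0]
        matched = [root.name, node.name]  # assumption: fill in the missing root
    idx = 1
    # Walk down the tree while the next term is a child of the current node.
    while idx < len(terms) and terms[idx] in node.children:
        node = node.children[terms[idx]]
        matched.append(node.name)
        idx += 1
    return matched, terms[idx:]
```

Under these assumptions, `match_prefix` returns the standardized administrative prefix together with the trailing terms (road, residential community, house number) that still need the second segmentation.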
In some demonstrative embodiments, S1: carrying out first segmentation on each address according to a Conditional Random Field (CRF) to obtain a first address name of each address; wherein the address is an address of a geographic location.
The address segmentation process can be regarded as a sequence labeling problem. Conditional Random Fields (CRF) are introduced to handle this sequence labeling problem. CRFs were proposed by Lafferty et al. in 2001; they combine the characteristics of the maximum entropy model and the hidden Markov model, are undirected graph models, and perform well in word segmentation and other sequence labeling tasks.
When using a CRF for segmentation, a large amount of original address data is provided, the data is segmented manually, and BI tags are used to label each character. For example, for the address:
Xishiku Street, Xicheng District, Beijing City
the manually labeled tags are as follows (one syllable per Chinese character):
Bei/B Jing/I Shi/I Xi/B Cheng/I Qu/I Xi/B Shi/I Ku/I Da/I Jie/I
Where B marks the first character of a word and I marks a character inside a word.
A CRF model with high performance is trained by manually segmenting a large number of addresses. Here a large amount of manually labeled data is used.
The address segmentation is the basis of the whole system, and the work of query, matching and the like can be completed only by accurately segmenting the address. The system adopts a model based on a conditional random field as a frame of word segmentation. The CRF word segmentation process is as follows:
A batch of address data is selected so that the address forms are diverse and cover as many common ways of writing addresses as possible. These addresses are segmented manually, their states are labeled, and the batch is used as training data for the model.
Arabic numerals and letters appearing in the training data are replaced with '@'. Since the numbers and letters in an address appear as a whole in most cases, this substitution ensures that they are not split apart during segmentation. After the segmentation is completed, each '@' symbol is restored to the original input.
For example, a manually segmented address: Beijing City / Chaoyang District / Liqing Road / Mingtian Diyicheng / Block A / No. 109
After replacement, this becomes:
Beijing City / Chaoyang District / Liqing Road / Mingtian Diyicheng / Block @ / No. @
The data input format for CRF is a single word and corresponding label for each line of content, separated by spaces or tabs. Converting the data into input data for the CRF:
Bei B
Jing I
Shi I
Chao B
Yang I
Qu I
Li B
Qing I
Lu I
@ B
Zuo I
@ B
Hao I
Where B represents the beginning of a word and I represents the middle of a word. And segmenting the address according to the result marked by the BI, and segmenting the address into different short words.
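The preprocessing around the CRF can be sketched as follows. This is an illustrative sketch only: the helper names (`mask_runs`, `to_bi`, `from_bi`, `unmask`) are hypothetical, the Chinese characters in the usage note are a reconstruction of the translated example above, and the sketch assumes the input contains no literal '@'. It covers only the data conversion, not CRF training or decoding.

```python
import re
from typing import List, Tuple

PLACEHOLDER = "@"
_RUN = re.compile(r"[0-9A-Za-z]+")  # a run of digits/letters is treated as one unit


def mask_runs(text: str) -> Tuple[str, List[str]]:
    """Replace each run of digits/letters with '@' and remember the originals."""
    return _RUN.sub(PLACEHOLDER, text), _RUN.findall(text)


def to_bi(words: List[str]) -> List[Tuple[str, str]]:
    """Turn segmented words into (character, tag) pairs: B = first, I = inside."""
    return [(ch, "B" if i == 0 else "I") for w in words for i, ch in enumerate(w)]


def from_bi(pairs: List[Tuple[str, str]]) -> List[str]:
    """Rebuild words from (character, tag) pairs, e.g. from a tagger's output."""
    words: List[str] = []
    for ch, tag in pairs:
        if tag == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


def unmask(words: List[str], runs: List[str]) -> List[str]:
    """Put the original digit/letter runs back in place of '@', left to right."""
    remaining = iter(runs)
    restored = []
    for word in words:
        restored.append(
            "".join(next(remaining) if ch == PLACEHOLDER else ch for ch in word)
        )
    return restored
```

For instance, `mask_runs("北京市朝阳区立清路A座109号")` yields the masked string `北京市朝阳区立清路@座@号` plus the saved runs `["A", "109"]`; writing each pair from `to_bi` as `character<TAB>tag` on its own line gives the one-character-per-line layout shown above.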
In some demonstrative embodiments, S3: segmenting each address again according to the conditional random field CRF to obtain a second address name of each address, and labeling and screening the second address name by a finite state machine; and performing cosine similarity matching by using the screened second address name.
Preferably, the process of labeling and correcting by a finite-state machine specifically includes:
marking the state of each second address name by keywords for each second address name after being segmented again;
inputting the state of each second address name into a finite state machine, and judging whether the address is a reasonable and effective address;
if the address is not a reasonable and effective address, the address is rejected.
Finite state machines are mathematical models that represent finite states and the behavior of transitions and actions between these states. In the process of address segmentation, each segmented second address name can be marked with the state of each second address name through keywords, and a string of addresses can be converted into a sequence of finite states. And inputting the state sequence into a state machine, and judging whether the address is a reasonable and effective address or not through the preset state machine. At the same time, the address can be corrected based on the results of the state machine.
State definitions by address keyword:
State 1, provincial-level administrative unit:
1-A: municipality directly under the central government
1-B: province, autonomous region
State 2, prefecture-level administrative unit:
2-A: prefecture-level city
2-B: county-level city
2-C: league, prefecture, autonomous prefecture
State 3, county-level administrative unit:
3-A: ending with "district"
3-B: ending with "county"
3-C: ending with "banner"
State 4, township-level administrative unit:
4-A: ending with "subdistrict"
4-B: ending with "sumu"
4-C: ending with "town"
4-D: ending with "township"
State 5, village-level administrative unit:
5-A: ending with "community"
5-B: ending with "village" or "village committee"
State 6, development area:
6-A: ending with "economic zone" or "development zone"
6-B: ending with "industrial district" or "industrial park"
State 7, road name unit:
7-A: ending with "road"
7-B: ending with "street"
7-C: ending with "track"
7-D: ending with "Fang"
7-E: ending with "group"
7-G: ending with "block"
7-H: ending with "tail"
7-J: ending with "hutong"
7-I: ending with "lane"
State 9, residential community unit:
9-A: ending with "residential community"
9-B: ending with "bridge"
9-C: ending with "Yun"
9-D: ending with "square"
9-E: ending with "home"
9-F: ending with "garden"
9-G: ending with "li"
9-H: converted from state 3-A
9-I: completed by rules
9-J: ending with "mansion"
State 10:
10-A: ending with "number"
10-B: ending with "haydia"
State 11, house number information:
11-A: ending with "unit"
11-B: ending with "building"
11-C: ending with "house"
11-D: ending with "chamber"
11-E: ending with "seat"
11-F: ending with "building"
11-G: ending with "house"
11-H: ending with "layer"
11-I: ending with "team"
11-J: ending with "dong"
11-K: number + sign
11-L: pure number
11-M: ending with "dormitory"
11-O: ending with "gate"
11-P: ending with "apartment"
11-Q: ending with "zone"
11-R: ending with "horn"
State 12:
12-A: ending with "company"
12-B: ending with "Turn"
12-C: ending with "plant"
12-D: ending with "receive"
12-E: additional information in the address, merged by the program
12-F: ending with "part"
Labeling rules for terms without keywords:
Rule 1: for the first term without a keyword in a complete address, if the state before that position is between 1 and 7 and the state after that position is not between 1 and 7, the term is labeled 9-X.
Rule 2: if Rule 1 is not satisfied, the position of the current term is checked; if it is not the last term and the state after the current position is not an unknown state, the current state is that state + X.
Rule 3: if neither Rule 1 nor Rule 2 is satisfied, the current state is the previous state + X.
In this way an address can be converted into a sequence of states.
The role of the finite state machine is to identify whether an address sequence is valid and reasonable. For example, the address Beijing City (1-A) Chaoyang District (3-A) is an incomplete address and cannot be used as a valid address. In another address, Liaoning Province (1-B) Dalian City (2-A) Beijing City (1-A) Shahekou District (3-A) Shahekou Street (7-B), Dalian City and Beijing City appear together, which makes the address ambiguous, so it is an unreasonable address.
A finite state machine is introduced here to determine whether an address is reasonable and valid, by which each sequence of addresses is determined to be either a "correct address" or a "wrong address". An address, according to the state definition above, may generate a state sequence, and n may be set to the length of this state sequence. If n <2, directly judging as an error address, otherwise, performing the following judgment process:
S3.1: starting from the start state, judge according to the state jump flow chart (FIG. 3) whether a jump from the start state to the first state in the state sequence is possible. If the keyword of the corresponding state is found, the jump is considered possible and S3.2 is carried out; otherwise an error address is returned.
S3.2: let i range from 1 to n-1, where n is the length of the state sequence and n is a positive integer greater than or equal to 2. Starting from i = 1 and continuing to n-1, determine whether the jump relationship described in FIG. 3 exists between the i-th state and the (i+1)-th state in the state sequence; if yes, continue, otherwise return an error address. If all jumps up to the (n-1)-th state are correct, S3.3 is performed. For example, for the state sequence 1-A, 3-A, 7-A, 9-X, 10-A:
the 1st state jumps to the 2nd state, namely state 1-A jumps to state 3-A, which conforms to the jump relationship in FIG. 3;
the 2nd state jumps to the 3rd state, namely state 3-A jumps to state 7-A, which conforms to the jump relationship in FIG. 3;
the 3rd state jumps to the 4th state, namely state 7-A jumps to state 9-X, which conforms to the jump relationship in FIG. 3;
the 4th state jumps to the 5th state, namely state 9-X jumps to state 10-A, which conforms to the jump relationship in FIG. 3.
S3.3: judge whether a jump from the n-th state in the state sequence to the end state is possible; if so, return a correct address, otherwise return an error address. In this embodiment, the jump from state 10-A to the end state is possible, so the address "No. 222, Mingtian Diyicheng, Liqing Road, Chaoyang District, Beijing City" is considered a correct address.
The flow chart of state machine jumps is shown in FIG. 3, where a connecting line between two states indicates that the jump is allowed and the absence of a line indicates that it is not allowed:
referring to fig. 3, when the state sequence of an address can satisfy the determination process of fig. 3, a correct address is output, and when the condition is not satisfied, an incorrect address is output.
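Steps S3.1 to S3.3 can be sketched as a table-driven check. The transition table of FIG. 3 is not reproduced in this text, so the `ALLOWED` table below is a hypothetical stand-in chosen only to agree with the worked examples in this section; it also collapses sub-states such as 1-A and 1-B into their top-level state number, which is a simplification made for the sketch.

```python
from typing import Dict, List, Set

START, END = "START", "END"

# Hypothetical transition table (NOT the actual FIG. 3): keys are top-level
# state numbers, values are the states that may directly follow them.
ALLOWED: Dict[str, Set[str]] = {
    START: {"1", "2", "3"},
    "1": {"2", "3"},
    "2": {"3"},
    "3": {"4", "7", "9"},
    "4": {"5", "7", "9"},
    "5": {"7", "9", "10"},
    "7": {"9", "10"},
    "9": {"10", "11", END},
    "10": {"11", END},
    "11": {"11", END},
}


def group(state: str) -> str:
    """Map a sub-state such as '3-A' or '9-X' to its top-level number."""
    return state.split("-")[0]


def is_valid(states: List[str]) -> bool:
    """Return True if the state sequence is accepted by the state machine."""
    if len(states) < 2:  # sequences shorter than 2 are rejected outright
        return False
    path = [START] + [group(s) for s in states] + [END]
    # S3.1 checks START -> first, S3.2 checks each adjacent pair,
    # S3.3 checks last -> END; all three are covered by one pass.
    return all(nxt in ALLOWED.get(cur, set()) for cur, nxt in zip(path, path[1:]))
```

With this stand-in table, the sequence 1-A, 3-A, 7-A, 9-X, 10-A is accepted, while the incomplete sequence 1-A, 3-A and the ambiguous sequence 1-B, 2-A, 1-A, 3-A, 7-B are rejected, mirroring the examples in the text.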
In some demonstrative embodiments, S4: and if the read current first address name is the same as the address name of the current node, performing cosine similarity matching on the read first address name and the subsequent second address name with the prestored address.
Cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them. The cosine value lies in the range [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors; the closer it is to -1, the more opposite their directions; a value close to 0 means that the two vectors are nearly orthogonal.
Let vector A = (A1, A2, ..., An) and B = (B1, B2, ..., Bn); the cosine similarity between the two vectors is:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
The similarity between two addresses can be calculated by cosine similarity as follows:
S4.1: generating the word frequency of each first address name and each second address name, using the first address names and the re-segmented second address names, to obtain a word frequency vector.
Segmenting the addresses:
Address A: Beijing City / Chaoyang District / Nanhu / Zhongyuan / Area 1 / 110 / Building
Address B: Beijing City / Chaoyang District / Nanhu / Xiyuan / 180 / Building
Listing all address names, including first address names and second address names:
Beijing City, Chaoyang District, Nanhu, Zhongyuan, Area 1, 110, Building, Xiyuan, 180
Calculating the word frequency:
Address A: Beijing City 1, Chaoyang District 1, Nanhu 1, Zhongyuan 1, Area 1 1, 110 1, Building 1, Xiyuan 0, 180 0
Address B: Beijing City 1, Chaoyang District 1, Nanhu 1, Zhongyuan 0, Area 1 0, 110 0, Building 1, Xiyuan 1, 180 1
Writing out the word frequency vectors:
Address A: [1,1,1,1,1,1,1,0,0]
Address B: [1,1,1,0,0,0,1,1,1]
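The word frequency step can be sketched in a few lines. This is an illustrative sketch: the function name `term_frequency_vectors` is hypothetical, and the English tokens stand in for the segmented Chinese address names of the example.

```python
from collections import Counter
from typing import List, Tuple


def term_frequency_vectors(
    a: List[str], b: List[str]
) -> Tuple[List[str], List[int], List[int]]:
    """Build a shared vocabulary (first-seen order) and one count vector per address."""
    vocab = list(dict.fromkeys(a + b))  # de-duplicate while preserving order
    count_a, count_b = Counter(a), Counter(b)
    return vocab, [count_a[t] for t in vocab], [count_b[t] for t in vocab]


address_a = ["Beijing City", "Chaoyang District", "Nanhu", "Zhongyuan",
             "Area 1", "110", "Building"]
address_b = ["Beijing City", "Chaoyang District", "Nanhu", "Xiyuan",
             "180", "Building"]
vocab, tf_a, tf_b = term_frequency_vectors(address_a, address_b)
```

Because the vocabulary keeps first-seen order, `tf_a` and `tf_b` line up position by position with the listed address names.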
S4.2: and calculating the weight value of each second address name according to the word frequency vector.
Calculating the weight:
Addresses have a certain specificity: in address A and address B, for example, the information at the residential-community level and the information at the city level differ in importance, so different weights are assigned to words at different positions. In addition, since place names differ from place to place, the same word may have different weights in different cities. Based on these considerations, a weight dictionary is created for each region across the country. The weight ω of each word in the dictionary is calculated as a function of n,
where n is the number of times the word appears in all addresses of this region.
S4.3: and obtaining a comparison vector according to the calculated weight value.
During the similarity matching of two addresses, addresses whose first halves (up to and including the unit ending in "district") are not completely consistent are directly regarded as dissimilar. The first halves are therefore identical, and similarity only needs to be computed on the part of the address from the county-level administrative unit onward. From the calculated weights, new truncated comparison vectors V_A and V_B can be generated:
V_A = [ω1, ω2, ω3, ω4, ω5, 0, 0]
V_B = [ω1, 0, 0, 0, ω5, ω6, ω7]
For V_A and V_B, if an address name appears in the original sequence, the corresponding position holds the weight ωi of that address name; otherwise it is 0.
S4.4: Calculate the similarity from the comparison vectors using the cosine similarity formula.
Similarity calculation:
$$\cos\theta = \frac{\sum_{i=1}^{n} \omega_{Ai}\,\omega_{Bi}}{\sqrt{\sum_{i=1}^{n} \omega_{Ai}^{2}}\;\sqrt{\sum_{i=1}^{n} \omega_{Bi}^{2}}}$$
In the formula: ω_Ai is an element of the first comparison vector V_A of the first address, and ω_Bi is an element of the second comparison vector V_B of the second address; the first address is the currently read address, and the second address is the pre-stored address; the number of elements is n, and i = 1 to n.
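A minimal Python sketch of the standard cosine similarity computation over two comparison vectors is given below. The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions; the input vectors simply use a weight of 1 for every word so that the arithmetic is easy to follow.

```python
import math


def cosine_similarity(v_a, v_b):
    """Cosine of the angle between two equal-length comparison vectors."""
    if len(v_a) != len(v_b):
        raise ValueError("comparison vectors must have the same length")
    dot = sum(a * b for a, b in zip(v_a, v_b))
    norm_a = math.sqrt(sum(a * a for a in v_a))
    norm_b = math.sqrt(sum(b * b for b in v_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an address with no weighted words matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Unit weights for illustration only.
v_a = [1, 1, 1, 1, 1, 0, 0]
v_b = [1, 0, 0, 0, 1, 1, 1]
print(round(cosine_similarity(v_a, v_b), 4))  # 0.4472
```

Here the dot product is 2 and the norms are √5 and 2, giving 2 / (2√5) ≈ 0.4472.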
Address segmentation using a conditional random field (CRF) is prior art and is not described further here.
As shown in fig. 3, the present invention also provides an address storage device based on address word segmentation, comprising:
the segmentation module 100 is configured to segment each address for the first time according to the conditional random field CRF to obtain a first address name of each address; wherein the address is an address of a geographic location;
a comparing module 200, configured to sequentially read the first address name of each address one by one, and sequentially compare the first address name with the address names in the nodes in the standard address tree;
the labeling and screening module 300 is configured to segment each address again according to the conditional random field CRF to obtain a second address name of each address, label and screen the second address name by using a finite state machine; cosine similarity matching is carried out by adopting the screened second address name;
a matching module 400, configured to perform cosine similarity matching on the read first address name and a subsequent second address name with a pre-stored address if the read current first address name is the same as the address name of the current node;
a storage module 500, configured to store the address in the same set as the pre-stored address if the two are matched as the same.
Preferably, the standard address tree is generated by: constructing a standard address tree according to the address name of an administrative division and the level size of a corresponding administrative region; and the administrative areas adjacent to the levels are parent and child nodes in the standard address tree.
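One possible way to represent such a standard address tree, and to compare first address names against it level by level, is sketched below. The class `AddressNode`, the functions `build_address_tree` and `match_prefix`, and the sample administrative divisions are illustrative assumptions rather than structures defined in this application.

```python
class AddressNode:
    """One administrative region in the standard address tree."""

    def __init__(self, name, level):
        self.name = name
        self.level = level      # 0 = root, 1 = highest administrative level
        self.children = {}      # child name -> AddressNode


def build_address_tree(division_paths):
    """Build a tree from paths ordered from highest to lowest level."""
    root = AddressNode("ROOT", 0)
    for path in division_paths:
        node = root
        for depth, name in enumerate(path, start=1):
            # Regions at adjacent levels become parent and child nodes.
            if name not in node.children:
                node.children[name] = AddressNode(name, depth)
            node = node.children[name]
    return root


def match_prefix(root, first_address_names):
    """Count how many leading address names match nodes along one tree path."""
    node, matched = root, 0
    for name in first_address_names:
        child = node.children.get(name)
        if child is None:
            break
        node, matched = child, matched + 1
    return matched


tree = build_address_tree([
    ["Beijing City", "Chaoyang district"],
    ["Beijing City", "Haidian district"],
])
print(match_prefix(tree, ["Beijing City", "Chaoyang district"]))  # 2
print(match_prefix(tree, ["Beijing City", "Unknown district"]))   # 1
```

Storing children in a dictionary keyed by name keeps each level's comparison a constant-time lookup.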
In some illustrative embodiments, the labeling and screening module 300 specifically includes:
the labeling module is used for labeling the state of each second address name after being segmented again through keywords;
the judging module is used for inputting the state of each second address name into a finite state machine and judging whether the address is a reasonable and effective address;
and the screening module is used for eliminating the address if the address is not a reasonable and effective address.
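The labeling, judging, and screening sub-modules can be illustrated with a small finite state machine. Everything concrete in this sketch is hypothetical: the keyword table, the state names, the transition table, and the accepting states are invented for illustration, since the actual states and rules are not listed in this passage.

```python
# Hypothetical keyword -> state table used to label each second address name.
KEYWORD_STATES = {
    "road": "ROAD",
    "garden": "COMMUNITY",
    "building": "BUILDING",
    "unit": "UNIT",
    "room": "ROOM",
}

# Hypothetical transition rules: which state may follow which.
TRANSITIONS = {
    "START": {"ROAD", "COMMUNITY"},
    "ROAD": {"COMMUNITY", "BUILDING"},
    "COMMUNITY": {"BUILDING"},
    "BUILDING": {"UNIT", "ROOM"},
    "UNIT": {"ROOM"},
    "ROOM": set(),
}
ACCEPTING = {"COMMUNITY", "BUILDING", "UNIT", "ROOM"}


def label_state(address_name):
    """Label one address name by its trailing keyword, or None if unknown."""
    lowered = address_name.lower()
    for keyword, state in KEYWORD_STATES.items():
        if lowered.endswith(keyword):
            return state
    return None


def is_valid_address(second_address_names):
    """Run the labeled names through the state machine."""
    state = "START"
    for name in second_address_names:
        next_state = label_state(name)
        if next_state is None or next_state not in TRANSITIONS[state]:
            return False        # screened out: not a reasonable address
        state = next_state
    return state in ACCEPTING


print(is_valid_address(["West garden", "180 building"]))  # True
print(is_valid_address(["180 building", "West garden"]))  # False
```

An address that fails any transition is eliminated; one that ends in an accepting state goes on to the cosine similarity matching step.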
In some illustrative embodiments, the matching module 400, which performs cosine similarity matching against the pre-stored address, specifically includes:
a word frequency vector generation module, configured to generate a word frequency for each of the first address name and the second address name to obtain a word frequency vector, where the first address name and the second address name are obtained by re-segmenting;
the weight calculation module is used for calculating the weight value of each second address name according to the word frequency vector;
the comparison vector generation module is used for obtaining a comparison vector according to the calculated weight value;
and the cosine similarity calculation module is used for calculating the similarity according to a cosine similarity calculation formula by adopting the comparison vector.
The functions performed by each component of the device have been described in detail in the address query method based on address word segmentation in the above embodiment and are not repeated here.
The invention also provides computer equipment which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the address storage method when executing the computer program.
The reader should understand that, in this specification, descriptions referring to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (10)
1. An address storage method based on address word segmentation, characterized by comprising the following steps:
according to the conditional random field CRF, carrying out first segmentation on the first half part of each address to obtain a first address name of each address; wherein the address is an address of a geographic location;
sequentially reading the first address name of each address one by one, and sequentially comparing the first address name with the address names in the nodes in the standard address tree;
segmenting the subsequent part of each address again according to the conditional random field CRF to obtain a second address name of each address, and labeling and screening the second address name by using a finite state machine; cosine similarity matching is carried out by adopting the screened second address name;
if the read current first address name is the same as the address name of the current node, performing cosine similarity matching on the read first address name and the screened second address name of the subsequent part with a prestored address;
if they are matched as the same, the read first address name and the screened second address name are stored in a set with the pre-stored address.
2. The method of claim 1, wherein the standard address tree is generated by: constructing a standard address tree according to the address name of an administrative division and the level size of a corresponding administrative region; and the administrative areas adjacent to the levels are parent and child nodes in the standard address tree.
3. The method according to claim 1, wherein the labeling and screening process by the finite state machine specifically comprises:
marking the state of each second address name by keywords for each second address name after being segmented again;
inputting the state of each second address name into a finite state machine, and judging whether the address is a reasonable and effective address;
if the address is not a reasonable and effective address, the address is removed; and if the address meets the rules of all the states in the finite state machine, using the address in the subsequent steps.
4. The method according to claim 1, wherein the process of matching cosine similarity with the pre-stored address specifically comprises:
generating word frequency of each first address name and each second address name to obtain a word frequency vector by using the first address name and the second address name which are segmented again;
calculating a weight value of each second address name according to the word frequency vector;
obtaining a comparison vector according to the calculated weight value;
and calculating the similarity according to a cosine similarity calculation formula by adopting the comparison vector.
5. The method of claim 4, wherein the cosine similarity calculation formula is:
$$\cos\theta = \frac{\sum_{i=1}^{n} \omega_{Ai}\,\omega_{Bi}}{\sqrt{\sum_{i=1}^{n} \omega_{Ai}^{2}}\;\sqrt{\sum_{i=1}^{n} \omega_{Bi}^{2}}}$$
in the formula: ω_Ai is an element of the first comparison vector V_A of the first address, and ω_Bi is an element of the second comparison vector V_B of the second address; the first address is the currently read address, and the second address is the pre-stored address; the number of elements is n, and i = 1 to n.
6. An address storage apparatus based on address word segmentation, comprising:
the segmentation module is used for segmenting the first half part of each address for the first time according to the conditional random field CRF to obtain a first address name of each address; wherein the address is an address of a geographic location;
the comparison module is used for sequentially reading the first address name of each address one by one and comparing the first address name with the address names in the nodes in the standard address tree in sequence;
the labeling and screening module is used for segmenting the subsequent part of each address again according to the conditional random field CRF to obtain a second address name of each address, and labeling and screening the second address name through a finite state machine; cosine similarity matching is carried out by adopting the screened second address name;
the matching module is used for matching the cosine similarity of the read first address name and the screened second address name of the subsequent part with the prestored address if the read current first address name is the same as the address name of the current node;
and the storage module is used for storing the read first address name, the screened second address name and a prestored address in a set if the first address name, the screened second address name and the prestored address are matched with each other.
7. The apparatus of claim 6, wherein the standard address tree is generated by: constructing a standard address tree according to the address name of an administrative division and the level size of a corresponding administrative region; and the administrative areas adjacent to the levels are parent and child nodes in the standard address tree.
8. The apparatus of claim 6, wherein the labeling and screening module specifically comprises:
the labeling module is used for labeling the state of each second address name after being segmented again through keywords;
the judging module is used for inputting the state of each second address name into a finite state machine and judging whether the address is a reasonable and effective address;
the screening module is used for eliminating the address if the address is not a reasonable and effective address; and if the address meets the rules of all the states in the finite state machine, using the address in the subsequent steps.
9. The apparatus of claim 6, wherein the process of the matching module performing cosine similarity matching with the pre-stored address specifically comprises:
the word frequency vector generating module is used for generating the word frequency of each first address name and each second address name by using the first address name and the second address name which are segmented again to obtain a word frequency vector;
the weight calculation module is used for calculating the weight value of each second address name according to the word frequency vector;
the comparison vector generation module is used for obtaining a comparison vector according to the calculated weight value;
and the cosine similarity calculation module is used for calculating the similarity according to a cosine similarity calculation formula by adopting the comparison vector.
10. A computer arrangement comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the address storing method as claimed in any one of claims 1 to 5 are implemented by the processor when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810539670.9A CN108763215B (en) | 2018-05-30 | 2018-05-30 | Address storage method and device based on address word segmentation and computer equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810539670.9A CN108763215B (en) | 2018-05-30 | 2018-05-30 | Address storage method and device based on address word segmentation and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108763215A CN108763215A (en) | 2018-11-06 |
CN108763215B CN108763215B (en) | 2022-04-29 |
Family
ID=64004169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810539670.9A Expired - Fee Related CN108763215B (en) | 2018-05-30 | 2018-05-30 | Address storage method and device based on address word segmentation and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108763215B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109684440B (en) * | 2018-12-13 | 2023-02-28 | 北京惠盈金科技术有限公司 | Address similarity measurement method based on hierarchical annotation |
CN110442603B (en) * | 2019-07-03 | 2024-01-19 | 平安科技(深圳)有限公司 | Address matching method, device, computer equipment and storage medium |
CN110866083B (en) * | 2019-12-04 | 2023-11-07 | 国网浙江省电力有限公司 | Address auditing method for electric power standard structured address library |
CN111353309A (en) * | 2019-12-25 | 2020-06-30 | 北京合力亿捷科技股份有限公司 | Method and system for processing communication quality complaint address based on text analysis |
CN112256817A (en) * | 2020-11-05 | 2021-01-22 | 中国科学院深圳先进技术研究院 | Geocoding method, system, terminal and storage medium |
CN112256932B (en) * | 2020-12-22 | 2021-04-09 | 中博信息技术研究院有限公司 | Word segmentation method and device for address character string |
CN113761909B (en) * | 2021-01-18 | 2023-11-07 | 北京京东振世信息技术有限公司 | Address identification method and device |
CN113220670A (en) * | 2021-03-16 | 2021-08-06 | 航天精一(广东)信息科技有限公司 | Method and device for correcting address data |
CN115081449B (en) * | 2022-08-23 | 2022-11-04 | 北京睿企信息科技有限公司 | Address identification method and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103678708A (en) * | 2013-12-30 | 2014-03-26 | 小米科技有限责任公司 | Method and device for recognizing preset addresses |
CN103914544A (en) * | 2014-04-03 | 2014-07-09 | 浙江大学 | Method for quickly matching Chinese addresses in multi-level manner on basis of address feature words |
CN107577744A (en) * | 2017-08-28 | 2018-01-12 | 苏州科技大学 | Nonstandard Address automatic matching model, matching process and method for establishing model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102737060B (en) * | 2011-04-14 | 2017-09-12 | 商业对象软件有限公司 | Searching for generally in geocoding application |
Non-Patent Citations (1)
Title |
---|
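Geocoding method based on conditional random fields and spatial reasoning (基于条件随机场和空间推理的地理编码方法); Zhou Hai (周海); China Excellent Doctoral and Master's Theses Full-text Database (Master's), Basic Sciences Series; 2016-07-15 (No. 07); thesis pp. 25, 38, 43-44, 58-59, 63, 66 * |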
Also Published As
Publication number | Publication date |
---|---|
CN108763215A (en) | 2018-11-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108763215B (en) | Address storage method and device based on address word segmentation and computer equipment | |
CN111797182B (en) | Address code analysis method and system | |
CN106909611B (en) | Hotel automatic matching method based on text information extraction | |
CN107145577A (en) | Address standardization method, device, storage medium and computer | |
CN104679801B (en) | A kind of interest point search method and device | |
CN103324609A (en) | Text proofreading apparatus and text proofreading method | |
CN102169591B (en) | Line selecting method and drawing method of text note in drawing | |
CN107368471A (en) | The extracting method of place name address in a kind of web page text | |
CN106557574B (en) | Target address matching method and system based on tree structure | |
CN107577744A (en) | Nonstandard Address automatic matching model, matching process and method for establishing model | |
CN114780680A (en) | Retrieval and completion method and system based on place name and address database | |
CN106777118B (en) | A kind of quick abstracting method of geographical vocabulary based on fuzzy dictionary tree | |
CN112256821A (en) | Method, device, equipment and storage medium for complementing Chinese address | |
CN109165331A (en) | A kind of index establishing method and its querying method and device of English place name | |
CN104008205A (en) | Content routing inquiry method and system | |
CN115563409A (en) | Address administrative division identification method, device, equipment and medium | |
CN114936627A (en) | Improved segmentation inference address matching method | |
CN106407221B (en) | Address data retrieval method and device | |
CN104615782A (en) | Address matching method based on sliding window maximum matching algorithm | |
CN115455315B (en) | Address matching model training method based on comparison learning | |
CN103455964A (en) | Case clue analyzing system and method based on case information | |
CN116467410A (en) | Address matching method and device, electronic equipment and computer readable storage medium | |
CN115270774A (en) | Big data keyword dictionary construction method for semi-supervised learning | |
CN115146635A (en) | Address segmentation method based on domain knowledge enhancement | |
CN115062108A (en) | Method for obtaining standardized house address |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |
 | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220429 |