CN108763215B

CN108763215B - Address storage method and device based on address word segmentation and computer equipment

Info

Publication number: CN108763215B
Application number: CN201810539670.9A
Authority: CN
Inventors: 张斌; 李萱; 吴景壮; 夏冰
Original assignee: Intellicredit Inc
Current assignee: Intellicredit Inc
Priority date: 2018-05-30
Filing date: 2018-05-30
Publication date: 2022-04-29
Anticipated expiration: 2038-05-30
Also published as: CN108763215A

Abstract

The invention relates to an address storage method, a device and computer equipment based on address word segmentation, wherein the method comprises the following steps: performing a first segmentation on each address with a Conditional Random Field (CRF) to obtain the first address names of each address, wherein the address is an address of a geographic location; reading the first address names of each address one by one and comparing them in sequence with the address names in the nodes of a standard address tree; labeling and screening through a finite state machine; performing cosine similarity matching using the screened second address names; if the currently read first address name is the same as the address name of the current node, performing cosine similarity matching of the read first address name and the subsequent second address names against a pre-stored address; if they match, storing the address in a set with the pre-stored address. With this system, the efficiency of address storage and query is remarkably improved.

Description

Address storage method and device based on address word segmentation and computer equipment

Technical Field

The invention relates to the technical field of Chinese address word segmentation, in particular to an address storage method and device based on address word segmentation and computer equipment.

Background

Chinese address word segmentation plays a critical role in many scenarios. A key problem at present is how to segment addresses efficiently and accurately: Chinese addresses have their own segmentation characteristics and are harder to segment than ordinary Chinese text. In addition, when a new address is obtained, efficiently finding the most similar address in the existing address data is also difficult.

Disclosure of Invention

To address the technical problem that Chinese addresses are difficult to segment efficiently and accurately, the invention provides an efficient method for segmenting and storing addresses.

In a first aspect, the present invention provides an address storage method based on address word segmentation, where the method includes:

carrying out first segmentation on each address according to a Conditional Random Field (CRF) to obtain a first address name of each address; wherein the address is an address of a geographic location;

sequentially reading the first address name of each address one by one, and sequentially comparing the first address name with the address names in the nodes in the standard address tree;

segmenting each address again according to the conditional random field CRF to obtain a second address name of each address, and labeling and screening the second address name by a finite state machine; cosine similarity matching is carried out by adopting the screened second address name;

if the read current first address name is the same as the address name of the current node, performing cosine similarity matching on the read first address name and a subsequent second address name with a pre-stored address;

if the match is the same, it is stored in a set with the pre-stored address.

Further, the standard address tree is generated by: constructing a standard address tree according to the address name of an administrative division and the level size of a corresponding administrative region; and the administrative areas adjacent to the levels are parent and child nodes in the standard address tree.

Further, the process of labeling and correcting by a finite state machine specifically includes:

marking the state of each second address name by keywords for each second address name after being segmented again;

inputting the state of each second address name into a finite state machine, and judging whether the address is a reasonable and effective address;

if the address is not a reasonable and effective address, the address is rejected.

Further, the process of performing cosine similarity matching with the pre-stored address specifically includes:

generating, from the first address names and the re-segmented second address names, the word frequency of each first address name and each second address name to obtain a word frequency vector;

calculating a weight value of each second address name according to the word frequency vector;

obtaining a comparison vector according to the calculated weight value;

and calculating the similarity according to a cosine similarity calculation formula by adopting the comparison vector.

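Further, the cosine similarity calculation formula is as follows:

$$\mathrm{sim}(V_A, V_B) = \frac{\sum_{i=1}^{n} \omega_{A,i}\,\omega_{B,i}}{\sqrt{\sum_{i=1}^{n} \omega_{A,i}^{2}}\,\sqrt{\sum_{i=1}^{n} \omega_{B,i}^{2}}}$$

in the formula:

$\omega_{A,i}$ is an element of the first comparison vector $V_A$ of the first address, and $\omega_{B,i}$ is an element of the second comparison vector $V_B$ of the second address; the first address is the currently read address; the second address is the pre-stored address; the number of elements is n, and i = 1, ..., n.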

In a second aspect, the present invention further provides an address storage apparatus based on address word segmentation, including:

the segmentation module is used for segmenting each address for the first time according to the conditional random field CRF to obtain a first address name of each address; wherein the address is an address of a geographic location;

the comparison module is used for sequentially reading the first address name of each address one by one and comparing the first address name with the address names in the nodes in the standard address tree in sequence;

the labeling and screening module is used for segmenting each address again according to the conditional random field CRF to obtain a second address name of each address, and labeling and screening the addresses through a finite state machine; cosine similarity matching is carried out by adopting the screened second address name;

the matching module is used for matching the read first address name and the subsequent second address name with the prestored address according to the cosine similarity if the read first address name is the same as the address name of the current node;

and the storage module is used for storing the address and the prestored address in a set if the matching is the same.

Further, the label screening module specifically includes:

the labeling module is used for labeling the state of each second address name after being segmented again through keywords;

the judging module is used for inputting the state of each second address name into a finite state machine and judging whether the address is a reasonable and effective address;

and the screening module is used for eliminating the address if the address is not a reasonable and effective address.

Further, the process of the matching module performing cosine similarity matching with the pre-stored address specifically includes:

a word frequency vector generation module, configured to generate a word frequency for each of the first address name and the second address name to obtain a word frequency vector, where the first address name and the second address name are obtained by re-segmenting;

the weight calculation module is used for calculating the weight value of each second address name according to the word frequency vector;

the comparison vector generation module is used for obtaining a comparison vector according to the calculated weight value;

and the cosine similarity calculation module is used for calculating the similarity according to a cosine similarity calculation formula by adopting the comparison vector.

In a third aspect, the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the address storage method when executing the computer program.

The invention has the beneficial effects that: the invention provides a system for address word segmentation and similar address query, and the address query efficiency is remarkably improved after the optimization of the system. Meanwhile, the invention can provide similar address authentication service for each financial institution.

Drawings

Fig. 1 is a schematic flowchart of an address storage method based on address word segmentation according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an example of a five-level administrative division of the present invention;

FIG. 3 is a schematic diagram of the logical relationship of the standard state machine of the present invention;

Fig. 4 is a schematic structural diagram of an address storage device based on address word segmentation according to an embodiment of the present invention.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.

Fig. 1 is a schematic flowchart of an address storage method based on address word segmentation according to an embodiment of the present invention.

As shown in fig. 1, the method includes:

s1: carrying out first segmentation on each address according to a Conditional Random Field (CRF) to obtain a first address name of each address; wherein the address is an address of a geographic location;

s2: sequentially reading the first address name of each address one by one, and sequentially comparing the first address name with the address names in the nodes in the standard address tree;

s3: segmenting each address again according to the conditional random field CRF to obtain a second address name of each address, and labeling and screening the second address name by a finite state machine; cosine similarity matching is carried out by adopting the screened second address name;

s4: if the read current first address name is the same as the address name of the current node, performing cosine similarity matching on the read first address name and a subsequent second address name with a pre-stored address;

s5: if the match is the same, it is stored in a set with the pre-stored address.
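Steps S4 and S5 can be sketched as follows. This is a minimal illustration and not the patented implementation: the function names, the numeric threshold, the use of the first member of each set as the pre-stored address, and the creation of a new set when nothing matches are all assumptions, and `overlap` is only a stand-in for the cosine measure described later.

```python
from typing import Callable, List

Address = List[str]  # an address as its list of segmented address names


def store_address(new_address: Address,
                  groups: List[List[Address]],
                  similarity: Callable[[Address, Address], float],
                  threshold: float) -> int:
    """Add new_address to the set of the first pre-stored address it matches.

    Returns the index of the set that received the address. The threshold
    is an assumption: the source only says "if the match is the same".
    """
    for index, group in enumerate(groups):
        # The first member of each set acts as the pre-stored address.
        if similarity(new_address, group[0]) >= threshold:
            group.append(new_address)
            return index
    groups.append([new_address])  # assumed behavior: no match starts a new set
    return len(groups) - 1


def overlap(a: Address, b: Address) -> float:
    """Stand-in similarity for this demo only (shared names / all names)."""
    return len(set(a) & set(b)) / len(set(a) | set(b))


groups: List[List[Address]] = []
first = store_address(["Beijing City", "Chaoyang District", "Nanhu"], groups, overlap, 0.5)
second = store_address(["Beijing City", "Chaoyang District", "Xiyuan"], groups, overlap, 0.5)
third = store_address(["Liaoning Province", "Dalian City"], groups, overlap, 0.5)
```

Passing the similarity function as a parameter keeps the grouping logic independent of how the similarity is computed.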

In some demonstrative embodiments, the standard address tree may be generated by:

constructing a standard address tree according to the address name of an administrative division and the level size of a corresponding administrative region; and the administrative areas adjacent to the levels are parent and child nodes in the standard address tree.

Owing to the particular nature of Chinese addresses, each address contains administrative division information such as province, city, district, county and village, as well as information such as road names, residential community names and house numbers. The administrative division information of each address is unique: each address belongs to exactly one administrative division and cannot belong to several at the same time. Constructing a standard address tree therefore facilitates comparison with, and storage of, subsequent addresses.

The standard address tree is constructed as follows:

collecting the addresses of all provinces, cities, districts, counties and villages in the country and segmenting them manually, e.g. the original address

Beijing City Chaoyang District Sunhe Huanggang Village

Obtained after segmentation:

Beijing City / Chaoyang District / Sunhe / Huanggang Village

At present, five levels of administrative division information in China are as follows:

(first-level administrative district) provincial-level administrative district names: province, autonomous region, municipality directly under the central government and special administrative region;

(second-level administrative district) prefecture-level administrative district names: prefecture, league, autonomous prefecture and prefecture-level city;

(third-level administrative district) county-level administrative district names: county, autonomous county, banner, autonomous banner, county-level city, municipal district, forestry district and special district;

(fourth-level administrative district) township-level administrative district names: township, ethnic township, town, subdistrict, sumu, ethnic sumu and county-controlled district;

(fifth-level administrative district) village-level administrative district names: village, community, residential area and gacha;

The standard address tree contains all administrative division information that can be found nationwide, divided by level. The segmented addresses are stored in a tree data structure. The root node of each tree is a provincial-level administrative district, the child nodes of the root are second-level administrative divisions, and below them are the third-, fourth- and fifth-level administrative divisions, as shown in fig. 2.
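A minimal sketch of such a tree, assuming each manually segmented address is given as a list of names ordered from the highest administrative level downward. The class and function names are hypothetical, and the sample place names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AddressNode:
    name: str
    depth: int  # 1 for a root node, increasing toward the leaves
    children: Dict[str, "AddressNode"] = field(default_factory=dict)


def build_forest(segmented: List[List[str]]) -> Dict[str, AddressNode]:
    """Merge segmented addresses into one tree per top-level name."""
    forest: Dict[str, AddressNode] = {}
    for names in segmented:
        siblings = forest
        for depth, name in enumerate(names, start=1):
            # Reuse the node if this name already exists at this position.
            node = siblings.setdefault(name, AddressNode(name, depth))
            siblings = node.children
    return forest


forest = build_forest([
    ["Beijing City", "Chaoyang District", "Sunhe", "Huanggang Village"],
    ["Beijing City", "Haidian District"],
])
```

Keying the children by name makes the later per-term lookup a dictionary access at each level.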

The purpose of creating the standard address tree is to correct addresses. The correction flow for one address is as follows:

A first segmentation is made by the CRF.

The segmented address sequence is then corrected against the standard trees. For example, take a segmented address:

Beijing City / Chaoyang District / Donghuwan

First, all standard trees are traversed to check whether a tree with Beijing City as its root node exists. If so, the next terms of the address sequence are searched under that tree until the lowest-level leaf node of the tree is reached or all terms in the address sequence have been searched.

If the first input word is not a root node of any standard tree, e.g. for the address:

Haidian District / Zhongguancun South Street

searching the root nodes of all trees for Haidian District returns an empty result. The search then continues in the second-layer nodes under the root nodes: if a unique address tree whose second layer contains Haidian District is found, the search continues under that branch of the tree; otherwise the search ends and the address is marked as a non-standard address.
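The lookup just described, including the fallback to second-layer nodes, might look like the sketch below. The forest is represented as nested dictionaries for brevity; the function name and the sample place names are assumptions made for illustration.

```python
from typing import List, Optional


def match_prefix(forest: dict, terms: List[str]) -> Optional[List[str]]:
    """Return the leading terms found along one branch of a standard tree,
    or None when the address has to be marked as non-standard."""
    if not terms:
        return None
    first = terms[0]
    if first in forest:
        level = forest
    else:
        # Fallback: look for the first term among second-layer nodes and
        # accept it only if exactly one tree contains it.
        hits = [root for root, children in forest.items() if first in children]
        if len(hits) != 1:
            return None
        level = forest[hits[0]]
    matched: List[str] = []
    for term in terms:
        if term not in level:
            break  # reached the end of the standard part of the address
        matched.append(term)
        level = level[term]
    return matched


forest = {
    "Beijing City": {"Chaoyang District": {}, "Haidian District": {}},
    "Liaoning Province": {"Dalian City": {"Shahekou District": {}}},
}
from_root = match_prefix(forest, ["Beijing City", "Chaoyang District", "Donghuwan"])
from_second = match_prefix(forest, ["Haidian District", "Zhongguancun South Street"])
unknown = match_prefix(forest, ["Atlantis", "Main Street"])
```

The terms that are not matched are the subsequent part of the address that still needs to be segmented.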

After the correction of the standard address tree, the first half of an address can get a standard split, and at this time, we only need to split the subsequent part of the address.

In some demonstrative embodiments, S1: carrying out first segmentation on each address according to a Conditional Random Field (CRF) to obtain a first address name of each address; wherein the address is an address of a geographic location.

Address segmentation can be regarded as a sequence labeling problem. Conditional Random Fields (CRF), proposed by Lafferty et al. in 2001, are introduced to handle this sequence labeling problem. A CRF is an undirected graphical model that combines characteristics of the maximum entropy model and the hidden Markov model, and it performs well on sequence labeling tasks such as word segmentation.

To use a CRF for segmentation, a large amount of original address data is provided and segmented manually, and BI tags are used to label each character, for example for the address:

Beijing City Xicheng District Xishiku Street

The manually labeled tags of this address are as follows:

Bei/B jing/I City/I Xi/B cheng/I District/I Xi/B shi/I ku/I Street/I

Where B marks the first character of a word and I marks a subsequent character of the same word.

A high-performance CRF model is trained on a large number of such manually segmented and labeled addresses.

The address segmentation is the basis of the whole system, and the work of query, matching and the like can be completed only by accurately segmenting the address. The system adopts a model based on a conditional random field as a frame of word segmentation. The CRF word segmentation process is as follows:

A batch of address data is selected. The addresses should be as diverse in form as possible and cover the common ways of writing addresses. The addresses are manually segmented and labeled with states, and the batch is used as training data for the model.

Arabic numerals and letters appearing in the training data are replaced with '@'. Because numbers and letters in an address usually appear as a whole, this substitution ensures that they are not split apart during segmentation. After segmentation is completed, each '@' symbol is restored to the original input.

For example, a manually segmented address: Beijing City / Chaoyang District / Liqing Road / Tomorrow First City / Block A / No. 109

After replacement it becomes:

Beijing City / Chaoyang District / Liqing Road / Tomorrow First City / Block @ / No. @
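The substitution and its reversal can be sketched as follows, assuming each run of digits or Latin letters becomes a single '@' and that the input contains no literal '@'. The function names are hypothetical, and the sample string is an assumed Chinese rendering of part of the example address.

```python
import re
from typing import Iterator, List, Tuple

ALNUM_RUN = re.compile(r"[0-9A-Za-z]+")


def mask(address: str) -> Tuple[str, List[str]]:
    """Replace each run of digits/letters with '@' and remember the originals."""
    originals = ALNUM_RUN.findall(address)
    return ALNUM_RUN.sub("@", address), originals


def restore(segments: List[str], originals: List[str]) -> List[str]:
    """Put the remembered runs back, in order, after segmentation."""
    pending: Iterator[str] = iter(originals)
    return [re.sub("@", lambda _match: next(pending), s) for s in segments]


masked, originals = mask("立清路A座109号")
# Pretend the segmenter split the masked string into these three words.
restored = restore(["立清路", "@座", "@号"], originals)
```

Because the runs are restored strictly in order of appearance, the segmenter may split the masked text anywhere except inside an '@'.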

The data input format for CRF is a single word and corresponding label for each line of content, separated by spaces or tabs. Converting the data into input data for the CRF:

Bei B
jing I
City I
Chao B
yang I
District I
Li B
qing I
Road I
@ B
Block I
@ B
No. I

Where B represents the beginning of a word and I represents the middle of a word. And segmenting the address according to the result marked by the BI, and segmenting the address into different short words.
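Turning a BI-tagged character sequence back into words is a simple merge, sketched below. The function name is hypothetical, and the Chinese characters are an assumed rendering of the translated example above.

```python
from typing import List, Tuple


def bi_to_words(tagged: List[Tuple[str, str]]) -> List[str]:
    """Merge (character, tag) pairs into words: B opens a word, I extends it."""
    words: List[str] = []
    for char, tag in tagged:
        if tag == "B" or not words:
            words.append(char)
        else:
            words[-1] += char
    return words


tagged = [
    ("北", "B"), ("京", "I"), ("市", "I"),
    ("朝", "B"), ("阳", "I"), ("区", "I"),
    ("立", "B"), ("清", "I"), ("路", "I"),
    ("@", "B"), ("座", "I"),
    ("@", "B"), ("号", "I"),
]
words = bi_to_words(tagged)
```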

In some demonstrative embodiments, S3: segmenting each address again according to the conditional random field CRF to obtain a second address name of each address, and labeling and screening the second address name by a finite state machine; and performing cosine similarity matching by using the screened second address name.

Preferably, the process of labeling and correcting by a finite-state machine specifically includes:

Finite state machines are mathematical models that represent finite states and the behavior of transitions and actions between these states. In the process of address segmentation, each segmented second address name can be marked with the state of each second address name through keywords, and a string of addresses can be converted into a sequence of finite states. And inputting the state sequence into a state machine, and judging whether the address is a reasonable and effective address or not through the preset state machine. At the same time, the address can be corrected based on the results of the state machine.

State definitions by address keyword:

State 1, provincial-level administrative unit:
1-A: municipality directly under the central government
1-B: province, autonomous region

State 2, prefecture-level administrative unit:
2-A: prefecture-level city
2-B: county-level city
2-C: league, prefecture, autonomous prefecture

State 3, county-level administrative unit:
3-A: ending with "district"
3-B: ending with "county"
3-C: ending with "banner"

State 4, township-level administrative unit:
4-A: ending with "subdistrict"
4-B: ending with "sumu"
4-C: ending with "town"
4-D: ending with "township"

State 5, village-level administrative unit:
5-A: ending with "community"
5-B: ending with "village" or "village committee"

State 6, development area:
6-A: ending with "economic zone" or "development zone"
6-B: ending with "industrial district" or "industrial park"

State 7, road name unit:
7-A: ending with "road"
7-B: ending with "street"
7-C: ending with "track"
7-D: ending with "Fang"
7-E: ending with "group"
7-G: ending with "block"
7-H: ending with "tail"
7-J: ending with "hutong"
7-I: ending with "lane"

State 9, residential community unit:
9-A: ending with "residential community"
9-B: ending with "bridge"
9-C: ending with "Yun"
9-D: ending with "square"
9-E: ending with "home"
9-F: ending with "garden"
9-G: ending with "li"
9-H: obtained by state transition from 3-A
9-I: completed by rules
9-J: ending with "mansion"

State 10:
10-A: ending with "number"
10-B: with "haydia"

State 11, house number information:
11-A: ending with "unit"
11-B: ending with "building"
11-C: ending with "house"
11-D: ending with "chamber"
11-E: ending with "seat"
11-F: ending with "building"
11-G: ending with "house"
11-H: ending with "floor"
11-I: ending with "team"
11-J: ending with "dong"
11-K: number + sign
11-L: pure number
11-M: ending with "dormitory"
11-O: ending with "gate"
11-P: ending with "apartment"
11-Q: ending with "zone"
11-R: with a "horn"

State 12:
12-A: ending with "company"
12-B: ending with "Turn"
12-C: ending with "plant"
12-D: ending with "receive"
12-E: additional information in the address, merged by the program
12-F: ending with "part"
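Keyword-based state labeling amounts to a suffix lookup, as in the sketch below. It covers only a handful of the states listed above and works on English glosses of the keywords, which is an assumption made for readability; a real system would match the Chinese suffix characters. The names `SUFFIX_STATES`, `MUNICIPALITIES` and `label_state` are hypothetical.

```python
from typing import Optional

# English glosses of a few suffix keywords from the state list (assumed spellings).
SUFFIX_STATES = [
    ("Province", "1-B"),
    ("District", "3-A"),
    ("County", "3-B"),
    ("Town", "4-C"),
    ("Village", "5-B"),
    ("Road", "7-A"),
    ("Street", "7-B"),
]
# The four municipalities directly under the central government (state 1-A).
MUNICIPALITIES = {"Beijing City", "Shanghai City", "Tianjin City", "Chongqing City"}


def label_state(name: str) -> Optional[str]:
    """Return the state code for an address name, or None if no keyword matches."""
    if name in MUNICIPALITIES:
        return "1-A"
    for suffix, state in SUFFIX_STATES:
        if name.endswith(suffix):
            return state
    return None  # no keyword: the state has to be inferred by other rules


states = [label_state(n) for n in
          ["Beijing City", "Chaoyang District", "Liqing Road", "Tomorrow First City"]]
```

Names that come back as `None` are the ones without a keyword; their states are assigned from the neighboring states.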

Labeling approach for states without keywords:

Rule 1: for the first position without a keyword in a complete, unsegmented address, if the state before that position is between 1 and 7 and the state after it is not between 1 and 7, the state is labeled 9-X.

Rule 2: if rule 1 is not satisfied, the position of the current state is examined; if it is not the last position and the state after it is not an unknown state, the current state is labeled as that state + X.

Rule 3: if neither rule 1 nor rule 2 is satisfied, the current state is labeled as the previous state + X.

An address can thus be converted into a sequence of states.

The role of the finite state machine is to identify whether an address sequence is valid and reasonable. For example, the address Beijing City (1-A) Chaoyang District (3-A) is an incomplete address sequence and cannot be used as a valid address. As another example, in the address Liaoning Province (1-B) Dalian City (2-A) Beijing City (1-A) Shahezu city (3-A) Shahezu street (7-B), Dalian City and Beijing City appear together, which makes the address ambiguous, so it is an unreasonable address.

A finite state machine is introduced here to determine whether an address is reasonable and valid, by which each sequence of addresses is determined to be either a "correct address" or a "wrong address". An address, according to the state definition above, may generate a state sequence, and n may be set to the length of this state sequence. If n <2, directly judging as an error address, otherwise, performing the following judgment process:

S3.1: starting from the start state, judge according to the state transition diagram (fig. 3) whether a jump from the start state to the first state in the state sequence is possible. If a keyword of the corresponding state is found, the jump is considered possible and S3.2 is performed; otherwise an error address is returned.

S3.2: let i range from 1 to n-1, where n is the length of the state sequence and n is an integer greater than or equal to 2. Starting from i = 1 and continuing to n-1, determine whether the jump relationship shown in fig. 3 exists between the i-th state and the (i+1)-th state in the state sequence; if so, continue, otherwise return an error address. If all states up to the (n-1)-th pass this check, S3.3 is performed.

For example:

the 1 st state jumps to the 2 nd state, namely the state 1-A jumps to the state 3-A, and the jump relation in the figure 3 is met;

the 2 nd state jumps to the 3 rd state, namely the state 3-A jumps to the state 7-A, and the jump relation in the figure 3 is met;

the 3 rd state jumps to the 4 th state, namely the state 7-A jumps to the state 9-X, which accords with the jump relation in the figure 3;

the 4 th state jumping to the 5 th state is the state 9-X jumping to the state 10-a, which conforms to the jumping relationship in fig. 3.

S3.3: judge whether a jump from the n-th state in the state sequence to the end state is possible; if so, return a correct address, otherwise return an error address. In the embodiment, the address "Beijing City, Chaoyang District, Liqing Road, Tomorrow First City, No. 222" can jump from state 10-A to the end state and is therefore considered a correct address.

The state transition diagram of the state machine is shown in fig. 3, where a line connecting two states indicates that jumping between them is allowed, and the absence of a line indicates that it is not allowed:

referring to fig. 3, when the state sequence of an address can satisfy the determination process of fig. 3, a correct address is output, and when the condition is not satisfied, an incorrect address is output.
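Steps S3.1 to S3.3 can be sketched as a walk over an allowed-transition table. The full transition diagram of fig. 3 is not reproduced in the text, so the table below is hypothetical: the edges 1→3, 3→7, 7→9, 9→10 and 10→end come from the worked example, while start→1, 1→2 and 2→3 are assumptions. Collapsing sub-states such as 3-A to their major state is also a simplification.

```python
from typing import Dict, List, Set

# Hypothetical subset of the allowed jumps between major states.
ALLOWED: Dict[str, Set[str]] = {
    "start": {"1"},
    "1": {"2", "3"},
    "2": {"3"},
    "3": {"7"},
    "7": {"9"},
    "9": {"10"},
    "10": {"end"},
}


def major(state: str) -> str:
    """Map a sub-state such as '3-A' or '9-X' to its major state ('3', '9')."""
    return state.split("-")[0]


def is_valid_address(states: List[str]) -> bool:
    """Check start -> s1 -> ... -> sn -> end against the transition table."""
    if len(states) < 2:  # n < 2 is rejected outright
        return False
    path = ["start"] + [major(s) for s in states] + ["end"]
    return all(nxt in ALLOWED.get(cur, set()) for cur, nxt in zip(path, path[1:]))


complete = is_valid_address(["1-A", "3-A", "7-A", "9-X", "10-A"])
incomplete = is_valid_address(["1-A", "3-A"])
ambiguous = is_valid_address(["1-B", "2-A", "1-A", "3-A", "7-B"])
```

Adding the start and end markers to the path lets a single loop cover all three steps.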

In some demonstrative embodiments, S4: and if the read current first address name is the same as the address name of the current node, performing cosine similarity matching on the read first address name and the subsequent second address name with the prestored address.

Cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them. The cosine value lies in [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors; the closer it is to -1, the more opposite their directions; a value close to 0 means the two vectors are nearly orthogonal.

Let vectors $A = (A_1, A_2, \ldots, A_n)$ and $B = (B_1, B_2, \ldots, B_n)$; the cosine similarity between the two vectors is:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

The similarity between two addresses can be calculated with cosine similarity as follows:

S4.1: generate the word frequency of each first address name and each re-segmented second address name to obtain word frequency vectors.

Segmenting the addresses:

Address A: Beijing City / Chaoyang District / Nanhu / Zhongyuan / Area One / 110 / Building

Address B: Beijing City / Chaoyang District / Nanhu / Xiyuan / 180 / Building

List all address names, including first address names and second address names:

Beijing City, Chaoyang District, Nanhu, Zhongyuan, Area One, 110, Building, Xiyuan, 180

Calculating word frequency:

Address A: Beijing City 1, Chaoyang District 1, Nanhu 1, Zhongyuan 1, Area One 1, 110 1, Building 1, Xiyuan 0, 180 0

Address B: Beijing City 1, Chaoyang District 1, Nanhu 1, Zhongyuan 0, Area One 0, 110 0, Building 1, Xiyuan 1, 180 1

Writing out a word frequency vector:

address A: [1,1,1,1,1,1,1,0,0]

And address B: [1,1,1,0,0,0,1,1,1]
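The two word frequency vectors can be derived mechanically from the segmented addresses, as in this sketch. The function name and the English renderings of the address names are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def frequency_vectors(a: List[str], b: List[str]) -> Tuple[List[str], List[int], List[int]]:
    """Build a shared vocabulary (order of first appearance) and count each
    address name in both addresses."""
    vocabulary = list(dict.fromkeys(a + b))
    count_a, count_b = Counter(a), Counter(b)
    return (vocabulary,
            [count_a[name] for name in vocabulary],
            [count_b[name] for name in vocabulary])


address_a = ["Beijing City", "Chaoyang District", "Nanhu", "Zhongyuan", "Area One", "110", "Building"]
address_b = ["Beijing City", "Chaoyang District", "Nanhu", "Xiyuan", "180", "Building"]
vocabulary, vector_a, vector_b = frequency_vectors(address_a, address_b)
```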

S4.2: and calculating the weight value of each second address name according to the word frequency vector.

Calculating the weight:

Addresses have a particular structure: for example, in address A and address B, information at the residential-community level and information at the city level differ in importance, so words at different positions are given different weights. Moreover, because place names are not uniform across regions, the same word may carry different weights in different cities. Based on these considerations, a weight dictionary is created for each region nationwide. The weight of each word in the dictionary is denoted ω and is calculated as:

where n is the number of times the word appears in all addresses of this region.

S4.3: and obtaining a comparison vector according to the calculated weight value.

During similarity matching of two addresses, addresses whose first halves (up to and including the unit ending with "district") are not completely identical are directly regarded as dissimilar. Consequently, when the first halves are identical, similarity only needs to be computed on the part of the address after the county-level administrative unit. Combining the calculated weights yields the new truncated vectors to be compared, $V_A$ and $V_B$:

$V_A = [\omega_1, \omega_2, \omega_3, \omega_4, \omega_5, 0, 0]$

$V_B = [\omega_1, 0, 0, 0, \omega_5, \omega_6, \omega_7]$

For $V_A$ and $V_B$, if an address name appears in the original sequence of that address, the corresponding position holds the weight $\omega_i$ of that address name; otherwise it is 0.

S4.4: and calculating the similarity according to a cosine similarity calculation formula by adopting the comparison vector.

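Similarity calculation:

$$\mathrm{sim}(V_A, V_B) = \frac{\sum_{i=1}^{n} \omega_{A,i}\,\omega_{B,i}}{\sqrt{\sum_{i=1}^{n} \omega_{A,i}^{2}}\,\sqrt{\sum_{i=1}^{n} \omega_{B,i}^{2}}}$$

in the formula: $\omega_{A,i}$ is the i-th element of the comparison vector $V_A$ of the first address (the currently read address), $\omega_{B,i}$ is the i-th element of the comparison vector $V_B$ of the second address (the pre-stored address), and n is the number of elements, with i = 1, ..., n.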
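Steps S4.3 and S4.4 can be sketched together as follows. The weight values are hypothetical placeholders, since the weight formula is not reproduced in the text, and the function names are assumptions.

```python
import math
from typing import Dict, List


def comparison_vector(vocabulary: List[str], names: List[str],
                      weights: Dict[str, float]) -> List[float]:
    """Weight of each vocabulary entry if the address contains it, else 0."""
    present = set(names)
    return [weights[name] if name in present else 0.0 for name in vocabulary]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Vocabulary of the truncated part (after the district) of addresses A and B.
vocabulary = ["Nanhu", "Zhongyuan", "Area One", "110", "Building", "Xiyuan", "180"]
weights = {name: 1.0 for name in vocabulary}  # hypothetical weights
weights["110"] = weights["180"] = 2.0         # hypothetical: numbers count more

v_a = comparison_vector(vocabulary, ["Nanhu", "Zhongyuan", "Area One", "110", "Building"], weights)
v_b = comparison_vector(vocabulary, ["Nanhu", "Xiyuan", "180", "Building"], weights)
similarity = cosine(v_a, v_b)
```

With non-negative weights the result always lies in [0, 1], so a single threshold can decide whether two addresses count as matching.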

The address segmentation according to the conditional random field CRF in the present application is prior art, and is not described herein.

As shown in fig. 4, the present invention also provides an address storage device based on address word segmentation, comprising:

the segmentation module 100 is configured to segment each address for the first time according to the conditional random field CRF to obtain a first address name of each address; wherein the address is an address of a geographic location;

a comparing module 200, configured to sequentially read the first address name of each address one by one, and sequentially compare the first address name with the address names in the nodes in the standard address tree;

the labeling and screening module 300 is configured to segment each address again according to the conditional random field CRF to obtain a second address name of each address, label and screen the second address name by using a finite state machine; cosine similarity matching is carried out by adopting the screened second address name;

a matching module 400, configured to perform cosine similarity matching on the read first address name and a subsequent second address name with a pre-stored address if the read current first address name is the same as the address name of the current node;

a storage module 500, configured to store the address in a set with the pre-stored address if the match is the same.

Preferably, the standard address tree is generated by: constructing a standard address tree according to the address name of an administrative division and the level size of a corresponding administrative region; and the administrative areas adjacent to the levels are parent and child nodes in the standard address tree.

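In some illustrative embodiments, the labeling and screening module 300 specifically includes:

a labeling module, configured to label, by keywords, the state of each re-segmented second address name;

a judging module, configured to input the state of each second address name into a finite state machine and judge whether the address is a reasonable and valid address;

a screening module, configured to reject the address if it is not a reasonable and valid address.

In some illustrative embodiments, the matching module 400, for performing cosine similarity matching with the pre-stored address, specifically includes:

a word frequency vector generation module, configured to generate the word frequency of each first address name and each re-segmented second address name to obtain a word frequency vector;

a weight calculation module, configured to calculate the weight value of each second address name according to the word frequency vector;

a comparison vector generation module, configured to obtain a comparison vector according to the calculated weight values;

a cosine similarity calculation module, configured to calculate the similarity from the comparison vectors according to the cosine similarity calculation formula.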

The functions executed by each component of the device have been described in detail in the address storage method based on address word segmentation in the above embodiments and are not repeated here.

The invention also provides computer equipment which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the address storage method when executing the computer program.

The reader should understand that in the description of this specification, reference to the description of the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. An address storage method based on address word segmentation, characterized by comprising the following steps:

according to the conditional random field CRF, carrying out first segmentation on the first half part of each address to obtain a first address name of each address; wherein the address is an address of a geographic location;

segmenting the subsequent part of each address again according to the conditional random field CRF to obtain a second address name of each address, and labeling and screening the second address name by using a finite state machine; cosine similarity matching is carried out by adopting the screened second address name;

if the read current first address name is the same as the address name of the current node, performing cosine similarity matching on the read first address name and the screened second address name of the subsequent part with a prestored address;

if they match, the read first address name and the screened second address name are stored in the same set as the pre-stored address.

2. The method of claim 1, wherein the standard address tree is generated by: constructing the standard address tree according to the address names of administrative divisions and the levels of the corresponding administrative regions; wherein administrative regions at adjacent levels are parent and child nodes in the standard address tree.

3. The method according to claim 1, wherein the labeling and screening process by the finite state machine specifically comprises:

if the address is not a reasonable and effective address, the address is removed; and if the address meets the rules of all the states in the finite state machine, using the address in the subsequent steps.

4. The method according to claim 1, wherein the process of matching cosine similarity with the pre-stored address specifically comprises:

obtaining a comparison vector according to the calculated weight value;

5. The method of claim 4, wherein the cosine similarity calculation formula is:

$$\cos(V_A, V_B) = \frac{\sum_{i=1}^{n} \omega_{Ai}\,\omega_{Bi}}{\sqrt{\sum_{i=1}^{n} \omega_{Ai}^{2}}\;\sqrt{\sum_{i=1}^{n} \omega_{Bi}^{2}}}$$

in the formula:

$\omega_{Ai}$ is an element of the first comparison vector $V_A$ of the first address, and $\omega_{Bi}$ is an element of the second comparison vector $V_B$ of the second address; the first address is the currently read address, and the second address is the pre-stored address; the number of elements is $n$, and $i = 1, \dots, n$.

6. An address storage apparatus based on address word segmentation, comprising:

the segmentation module is used for segmenting the first half part of each address for the first time according to the conditional random field CRF to obtain a first address name of each address; wherein the address is an address of a geographic location;

the labeling and screening module is used for segmenting the subsequent part of each address again according to the conditional random field CRF to obtain a second address name of each address, and labeling and screening the second address name through a finite state machine; cosine similarity matching is carried out by adopting the screened second address name;

the matching module is used for matching the cosine similarity of the read first address name and the screened second address name of the subsequent part with the prestored address if the read current first address name is the same as the address name of the current node;

and the storage module is used for storing the read first address name, the screened second address name and a prestored address in a set if the first address name, the screened second address name and the prestored address are matched with each other.

7. The apparatus of claim 6, wherein the standard address tree is generated by: constructing the standard address tree according to the address names of administrative divisions and the levels of the corresponding administrative regions; wherein administrative regions at adjacent levels are parent and child nodes in the standard address tree.

8. The apparatus of claim 6, wherein the label screening module specifically comprises:

the screening module is used for eliminating the address if the address is not a reasonable and effective address; and if the address meets the rules of all the states in the finite state machine, using the address in the subsequent steps.

9. The apparatus of claim 6, wherein the process of the matching module performing cosine similarity matching with the pre-stored address specifically comprises:

the word frequency vector generating module is used for generating the word frequency of each first address name and each second address name by using the first address name and the second address name which are segmented again to obtain a word frequency vector;

10. Computer equipment comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the address storage method as claimed in any one of claims 1 to 5 when executing the computer program.