CN111753515B

CN111753515B - Address information extraction and matching method for realizing entity positioning

Info

Publication number: CN111753515B
Application number: CN202010590590.3A
Authority: CN
Inventors: 曾伟英; 霍智杰; 霍凯亮
Original assignee: Guangdong Kejie Communication Information Technology Co ltd
Current assignee: Guangdong Kejie Communication Information Technology Co ltd
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2024-07-02
Anticipated expiration: 2040-06-24
Also published as: CN111753515A

Abstract

An address information extraction and matching method for realizing entity positioning comprises: constructing a first conditional random field and determining its states according to administrative level keywords; performing state jumps on the address text according to the states of the first conditional random field; dividing the address text into a plurality of sub-texts according to the state jumps; adding text labels to the divided sub-texts at the corresponding administrative levels; constructing and storing a tag library according to the text labels; constructing a second conditional random field, which comprises adding new text labels to the second address text according to the text labels in the tag library; and performing fuzzy matching on the address text, which comprises acquiring a weight value for each text label in the tag library and, according to the weight values, matching the address text in the tag library whose text labels are most similar to the input address text. The method solves the problem that data from two different raw sources cannot be directly associated because address information was entered with different writing habits.

Description

Address information extraction and matching method for realizing entity positioning

Technical Field

The invention relates to the technical field of text matching, in particular to an address information extraction and matching method for realizing entity positioning.

Background

Geographic information is currently among the most commonly used public information resources; it is closely tied to people's daily lives and is a basic resource for government administration. A text address is a geographical location described in text, such as "Beijing City, Chaoyang District, Aster Road". However, when mining data that contains address text, one often finds that most of the address information in the raw data was not recorded in a standardized form, which creates a bottleneck for correlation analysis over massive volumes of address text.

Disclosure of Invention

To address the shortcomings described in the background, the invention provides an address information extraction and matching method for realizing entity positioning. The method extracts labels from massive volumes of address text in a way that matches conventional understanding, makes it easy to associate data that must be linked by address, and solves the problem that data from two different raw sources cannot be directly associated because address information was entered with different writing habits.

To achieve the purpose, the invention adopts the following technical scheme:

The address information extraction and matching method for realizing entity positioning operates on first address text, which contains administrative level keywords, and second address text, which does not, and comprises the following specific steps:

constructing a first conditional random field applicable to the first address text, comprising:

determining the states of the first conditional random field according to the administrative level keywords;

performing state jumps on the address text according to the states of the first conditional random field;

dividing the address text into a plurality of sub-texts according to the state jumps;

for address text that has been successfully segmented, adding text labels to the segmented sub-texts at the corresponding administrative levels;

constructing and storing a tag library according to the text labels;

constructing a second conditional random field applicable to the second address text, comprising:

adding new text labels to the second address text according to the text labels in the tag library;

performing fuzzy matching on the address text, comprising:

acquiring a weight value for each text label in the tag library;

and, according to the weight values, matching the address text in the tag library whose text labels are most similar to the input address text.

Preferably, the first address text is graded according to its administrative level keywords, and the address text of each level corresponds to one state of the first conditional random field;

address text at the same level is arranged side by side, and lower-level address text is arranged after higher-level address text.

Preferably, during state jumps, the high-level state corresponding to high-level address text jumps to the low-level state corresponding to low-level address text, and the jump is irreversible;

when a high-level state jumps to a low-level state, all low-level states in the column containing that low-level state are passed through;

states at the single lowest level may jump to one another.

Preferably, in each jump, the address text is segmented using the administrative level keyword corresponding to the level state, and the segmented address text enters the next lower-level state for re-segmentation;

the path with the largest number of hops is selected and determined as the optimal segmentation path, wherein a jump that skips over a level state is not counted in the number of hops.

Preferably, in the address text which is successfully segmented, the sub-text and the administrative vocabulary corresponding to each level state are used as text labels.

Preferably, a dictionary is established, text labels are added to the dictionary according to preset rules, and the dictionary is stored as a two-dimensional data table.

Preferably, the second address text is split word by word, each split word is combined with the next split word, and the combination is matched against the tag library to judge whether a text label for the combination exists; if so, the combination is reserved; if not, the combination is not reserved;

after a combination is reserved, it is combined with the next word to form a new combination, which is again matched against the tag library to judge whether a text label for the new combination exists; if so, the new combination is reserved and combining continues with the next word; if not, the new combination is not reserved;

and so on, until none of the split words can be combined any further.

Preferably, after the input address text is segmented, all text labels are subjected to weight statistics, each text label corresponds to a weight value, and the weight value is in direct proportion to the importance of the text label.

Preferably, the similarity between each text label of the input address text and the text labels in the tag library is calculated, the similarities are averaged using the weight values, and the text label in the tag library with the largest resulting value is the most similar to the input address text.

Advantageous effects

The method extracts labels from massive volumes of address text in a way that matches conventional understanding, makes it easy to associate data that must be linked by address, and solves the problem that data from two different raw sources cannot be directly associated because address information was entered with different writing habits.

Drawings

FIG. 1 is a diagram of a model structure of one embodiment of the present invention;

FIG. 2 is a flow chart of one embodiment of the present invention;

FIG. 3 is a conditional random field state jump diagram of one embodiment of the present invention.

Detailed Description

The technical scheme of the invention is further described below by the specific embodiments with reference to the accompanying drawings.

In the description of the present invention, it should be understood that the terms "upper," "lower," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate or are based on the orientation or positional relationship shown in the drawings, merely to facilitate description of the present invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.

The invention relates to an address information extraction and matching method for realizing entity positioning, which comprises a first address text containing administrative level keywords and a second address text not containing administrative level keywords, as shown in fig. 1 and 2, and comprises the following specific steps:

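constructing a first conditional random field applicable to the first address text, comprising:

determining the states of the first conditional random field according to the administrative level keywords;

performing state jumps on the address text according to the states of the first conditional random field;

dividing the address text into a plurality of sub-texts according to the state jumps;

for address text that has been successfully segmented, adding text labels to the segmented sub-texts at the corresponding administrative levels;

constructing and storing a tag library according to the text labels;

constructing a second conditional random field applicable to the second address text, comprising:

adding new text labels to the second address text according to the text labels in the tag library;

performing fuzzy matching on the address text, comprising:

acquiring a weight value for each text label in the tag library;

and, according to the weight values, matching the address text in the tag library whose text labels are most similar to the input address text.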

The method has two main parts: the first is the construction of conditional random fields for segmenting text and extracting information; the second is the construction of a tag library from the extracted text. The conditional random fields are further divided into a first conditional random field and a second conditional random field, which segment two different types of text: first address text, which contains administrative level keywords, and second address text, which does not. An example of first address text is "Guangdong Province, Foshan City, AB District, CD Third Road, EF Primary School"; the corresponding second address text is "Guangdong Foshan CD Third Road EF Primary School". The first conditional random field is used to segment a large volume of text and build the tag library, and the second conditional random field is built on the basis of that tag library. When new text is input, it is segmented and its labels are extracted using the two conditional random fields, fuzzy matching is performed on it using the Levenshtein distance algorithm, and an approximate text is returned, i.e. a text that a person would recognize as similar. The method addresses a common problem in everyday data mining: addresses entered in different data sets are written inconsistently, so address texts that refer to the same address semantically cannot be associated because of differing writing habits.

As shown in FIG. 3, the states of the first conditional random field are determined as follows. In China, every address can be classified by administrative level, such as autonomous region, special administrative region, province, city, district, county, town, and street office. States are grouped by administrative level keyword: "district" and "county" are at the same level and are therefore arranged side by side, while "street office" and "town" are at the same level as each other but below "district" and "county", so they are arranged side by side after "district" and "county". The other levels form their respective states in the same way, and the first conditional random field is built on these states. In practice, states can be added or removed according to actual requirements, and every first address text to be extracted passes through the first conditional random field.

State jumps of the first conditional random field: jumps must proceed from front to back, that is, from high-level address text to low-level address text, and a jump from a higher level to a lower level must pass through every possible state of the next column. For example, the "district" column can only jump to the "street" column, so a jump from "district" must pass through both "street office" and "town", and a jump from "street" must pass through both "residents' committee" and "village". Only the lowest-level state, "number", has a loop.

In the method, each administrative level is regarded as a state, and jumps can only go from a high level to a low level and are irreversible. Each jump segments the address text using the administrative level keyword of the corresponding state, and the resulting sub-text enters the next state for further segmentation. If segmentation fails at a state, for example because the text omits the "district" information and goes directly from "city" to "street office", that state is skipped and the skipped state is not counted. After segmentation by the first conditional random field, the path with the largest number of hops is selected as the optimal segmentation path.

For example, the address text "Guangdong Province, Foshan City, AB District, CD Third Road, EF Primary School" passes not only through the state jump path "province", "city", "district", "road", but also through the path "autonomous region", "city", "county", and so on. On the latter path the text cannot be segmented at the "autonomous region" state, so that state is passed over and the text enters the "city" state directly. The path through "autonomous region" therefore necessarily has fewer successful segmentations than the path through "province", and cannot be the optimal path.
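The path selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: whitespace-separated English tokens stand in for the Chinese text, the `LEVELS` table and its keywords are assumptions chosen to mirror the example, and the loop at the lowest-level state is not modeled.

```python
from itertools import product

# Hypothetical state columns, ordered from highest to lowest level.
# Keywords in the same column are same-level alternatives.
LEVELS = [
    ("province", ["Autonomous-Region", "Province"]),
    ("city", ["City"]),
    ("district", ["District", "County"]),
    ("street", ["Street-Office", "Town"]),
    ("road", ["Road"]),
]

def segment_along(path, tokens):
    """Split tokens using one keyword per level, from high level to low.

    A level whose keyword is absent is skipped and adds no hop.
    """
    labels, hops, start = [], 0, 0
    for (level, _), keyword in zip(LEVELS, path):
        if keyword in tokens[start:]:
            end = tokens.index(keyword, start)
            labels.append((level, " ".join(tokens[start:end]), keyword))
            start = end + 1
            hops += 1
    if start < len(tokens):
        labels.append(("remainder", " ".join(tokens[start:]), ""))
    return hops, labels

def best_segmentation(address):
    """Try every path (one keyword per column) and keep the one with most hops."""
    tokens = address.split()
    paths = product(*(keywords for _, keywords in LEVELS))
    return max((segment_along(p, tokens) for p in paths), key=lambda r: r[0])

hops, labels = best_segmentation(
    "Guangdong Province Foshan City AB District CD Third Road EF Primary School"
)
```

Here the path that starts at "Province" scores four hops, while any path that starts at "Autonomous-Region" skips the first state and scores three, so the former is selected.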

For address text that has been successfully segmented, text labels can be added at the corresponding administrative levels. For example, for the address text "Guangdong Province, Foshan City, AB District, CD Third Road, EF Primary School", the segment at the "province" level is "Guangdong", the segment at the "city" level is "Foshan", and so on.

Building the tag library: create an empty dictionary (a data structure) and add the extracted label texts to it as key-value pairs. For example, for "Guangdong Province, Foshan City, AB District, CD Third Road, EF Primary School", the segmented sub-text "Guangdong" corresponds to {"text": "Guangdong", "grade": "province", "count": "+1"}, where "+1" means that the count is incremented by one from its existing value. The remaining labels are handled in the same way.

Expanding the tag library: when the label text "Guangdong" appears again in later text, its count is incremented by one under the existing key; if a label text has not appeared before, a new key-value pair is added.

Storing the tag library: the updated dictionary is periodically stored as a two-dimensional data table, which allows efficient access to its contents, in preparation for building the second conditional random field.
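A minimal sketch of building, expanding, and storing the tag library is given below. The key layout (a `(text, grade)` tuple), the helper names `add_label` and `to_table`, and the choice of CSV as the two-dimensional table format are all assumptions made for illustration; the source only specifies a dictionary with "text", "grade", and "count" fields.

```python
import csv
import io

def add_label(library, text, grade):
    """Increment the count for (text, grade), creating the entry if absent."""
    entry = library.setdefault(
        (text, grade), {"text": text, "grade": grade, "count": 0}
    )
    entry["count"] += 1
    return entry

def to_table(library):
    """Serialise the dictionary as a two-dimensional table (CSV text)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["text", "grade", "count"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(library.values())
    return buffer.getvalue()

library = {}
add_label(library, "Guangdong", "province")   # new key-value pair
add_label(library, "Foshan", "city")          # new key-value pair
add_label(library, "Guangdong", "province")   # existing key: count + 1
table = to_table(library)
```

After these three calls the "Guangdong" entry has a count of 2, and `table` holds one header row followed by one row per label.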

The second conditional random field is used to segment text that contains no administrative keywords, for example "Foshan NH District" written as "Foshan NH". The states of the second conditional random field therefore depend on the labels extracted by the first conditional random field. Using the tag library, such address text is split word by word (that is, character by character), each character is combined with the characters that follow it to form a new sub-text and enter a new state, the likelihood of each state jump is counted from the tag library, and the most probable path is taken as the best segmentation. Finally, the administrative label corresponding to each state on the best path is looked up.

Word-by-word splitting: for example, "Guangdong Foshan CD Third Road EF Primary School" is split into the individual characters "Guang", "dong", "Fo", and so on.

Initializing combined address text: the split characters are merged in order. "Guang" and "dong" are merged first; if "Guangdong" exists in the tag library, the combination is kept and merging continues to form "Guangdong-Fo". That combination clearly does not exist in the tag library, so it is skipped, and merging continues with the following characters until the complete address text has been traversed. The combinations that are kept form the states of the second conditional random field.
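The splitting and combining step can be sketched as below. Romanized syllables stand in for individual Chinese characters, the tag library is reduced to a set of known combinations, and the `max_len` cap and the decision to keep extending after a miss are assumptions; the source does not state when extension stops.

```python
def candidate_combinations(chars, start, library, max_len=4):
    """Grow a combination from chars[start] one character at a time,
    keeping every combination that already exists in the tag library."""
    kept = []
    last = min(start + max_len, len(chars))
    for end in range(start + 2, last + 1):
        combo = tuple(chars[start:end])
        if combo in library:
            kept.append(combo)   # the combination is reserved
        # otherwise it is not reserved, and extension continues
    return kept

# Each list element stands in for one character of the address text.
chars = ["Guang", "dong", "Fo", "shan", "C", "D"]
library = {("Guang", "dong"), ("Fo", "shan")}

from_start = candidate_combinations(chars, 0, library)
from_third = candidate_combinations(chars, 2, library)
```

Starting at the first character, only "Guang" + "dong" survives; "Guang" + "dong" + "Fo" is tried and discarded because it is not in the library.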

Selecting the optimal segmentation: among all of the combinations above, the one with the largest product of the square root of its occurrence count in the tag library and its string length is selected. For example, if "Guangdong-Fo" (three characters in the original text) occurs 10,000 times and "Guangdong" (two characters) occurs 1,000,000 times, the products are 100 × 3 = 300 and 1,000 × 2 = 2,000, so the optimal segmentation is still "Guangdong".
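The selection rule can be checked with a few lines of code. The function name `score` is hypothetical, and tuples of romanized syllables stand in for character strings so that `len` gives the character count of the original text.

```python
import math

def score(combo, count):
    """Square root of the occurrence count times the combination length."""
    return math.sqrt(count) * len(combo)

# Occurrence counts taken from the example in the text.
candidates = {
    ("Guang", "dong", "Fo"): 10_000,
    ("Guang", "dong"): 1_000_000,
}
best = max(candidates, key=lambda combo: score(combo, candidates[combo]))
```

The square root dampens the count so that a longer combination can still win when it is reasonably frequent, while a rare over-long combination such as the three-character one here loses.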

Word formation and iterative segmentation: once "Guangdong" has been chosen as the best combination, any later text containing the characters of "Guangdong" is segmented with "Guangdong" as the default granularity. For "Guangdong Foshan CD Third Road EF Primary School", after "Guangdong" is split off, optimal segmentation selection is repeated on the characters that follow until no further combination is possible; the remaining characters form a new combination, which is added to the tag library.

Labeling newly segmented data: the text is segmented to give "Guangdong", whose label in the tag library is "province", so "Guangdong, province" becomes one of the labels of the text "Guangdong Foshan CD Third Road EF Primary School"; the other labels are obtained in the same way.
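One possible reading of the iteration and labeling steps is a greedy loop, sketched below. The greedy strategy, the `max_len` cap, the library layout, and the counts in the example are assumptions for illustration; the source describes the best path in probabilistic terms and does not prescribe this exact procedure.

```python
import math

def segment(chars, library, max_len=4):
    """Greedy split: at each position take the best-scoring library entry.

    Characters that match nothing are grouped into a new combination,
    which is added to the library without an administrative label.
    """
    result, pending, i = [], [], 0

    def flush():
        if pending:
            combo = tuple(pending)
            entry = library.setdefault(combo, {"grade": None, "count": 0})
            entry["count"] += 1
            result.append((combo, entry["grade"]))
            pending.clear()

    while i < len(chars):
        last = min(i + max_len, len(chars))
        options = [tuple(chars[i:j]) for j in range(i + 2, last + 1)]
        options = [combo for combo in options if combo in library]
        if options:
            flush()
            best = max(
                options,
                key=lambda c: math.sqrt(library[c]["count"]) * len(c),
            )
            result.append((best, library[best]["grade"]))
            i += len(best)
        else:
            pending.append(chars[i])
            i += 1
    flush()
    return result

# Hypothetical library contents; each list element stands in for one character.
library = {
    ("Guang", "dong"): {"grade": "province", "count": 1_000_000},
    ("Fo", "shan"): {"grade": "city", "count": 10_000},
}
labels = segment(["Guang", "dong", "Fo", "shan", "C", "D"], library)
```

In this run "Guangdong" is labeled "province", "Foshan" is labeled "city", and the leftover characters "C", "D" become a new unlabeled library entry.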

As mentioned above, fuzzy matching is performed on the text using the Levenshtein distance algorithm, and an approximate text is returned, i.e. a text that a person would recognize as similar. The method uses Levenshtein distance for fuzzy matching of addresses. First, weight statistics are computed for the extracted text labels in the tag library; the weighting scheme used is TF-IDF, which is prior art and is not described here. For example, for the address text "Foshan Nanhai GL Road No. 18 G City", the conditional random fields and TF-IDF described above give a weight of 0.15 for "Foshan", 0.18 for "Nanhai", 0.33 for "GL Road", and so on; the larger the weight, the more important the label.
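Because the source leaves the TF-IDF variant unspecified, the following is only one common formulation, with smoothed inverse document frequency. The corpus, the label names other than those in the example, and the function name `tfidf_weights` are hypothetical, and the sketch does not attempt to reproduce the weights quoted above.

```python
import math
from collections import Counter

def tfidf_weights(doc_labels, corpus):
    """Weight each label of one address by term frequency times
    smoothed inverse document frequency over all labeled addresses."""
    term_counts = Counter(doc_labels)
    n_docs = len(corpus)
    weights = {}
    for label, freq in term_counts.items():
        doc_freq = sum(1 for labels in corpus if label in labels)
        idf = math.log((1 + n_docs) / (1 + doc_freq))
        weights[label] = (freq / len(doc_labels)) * idf
    return weights

# Hypothetical corpus: each address is the list of its extracted labels.
corpus = [
    ["Foshan", "Nanhai", "GL Road"],
    ["Foshan", "Nanhai", "XX Road"],
    ["Foshan", "QQ District", "YY Road"],
]
weights = tfidf_weights(corpus[0], corpus)
```

Labels shared by every address, such as the city, receive the lowest weight, and the rarest label, here the road, receives the highest, which matches the ordering in the example above.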

Each time a new address text is received, it is first segmented using the conditional random fields to obtain the labels at every administrative level, and TF-IDF is then used to compute the weight of the text label at each level. Next, the Levenshtein distance is computed between each text label of the input address text and the corresponding text labels in the tag library. Finally, the resulting similarities are averaged using the weights, and the address text with the largest value is the one most similar to the input address.

Example: suppose the address "Foshan City, Nanhai District, GL Road, BG Garden X607" is not recorded in the tag library extracted above. The label "BG Garden X607" then does not appear in the tag library, although labels such as "Foshan City", "Nanhai District", and "GL Road" can be obtained from the library because they were extracted from other texts of the same region by the first and second conditional random fields; the library may, for instance, contain labels extracted from the address "Foshan City, Nanhai District, GL Road, BG Garden Y205". After the address is segmented and weighted, "BG Garden X607" has the highest weight, meaning that it has the greatest influence on the address. Levenshtein distance fuzzy matching is then carried out state by state, level by level, producing a string similarity ratio for each state, and the final result is the weighted average of these similarity ratios using the TF-IDF weights. For the two candidate addresses "Foshan City, Nanhai District, GL Road, BG Garden Y205" and "Foshan City, Nanhai District, GL Road, UU Upper House X607", the products of string similarity ratio and weight for "BG Garden Y205" and "UU Upper House X607" against "BG Garden X607" are compared to determine which candidate is more similar. Matching on labels segmented by level effectively avoids matches based only on local similarity and avoids errors caused by inconsistent writing habits and similar semantics. It also avoids over-fuzzy matching, in which two strings are extremely similar only because their characters are arranged alike, although they are in fact two different addresses, such as "Foshan City, LONGJ Road, LGTH Lido, Block K, Room 201" and "Foshan City, LVJ Road, LGTH Huafu, Block K, Room 201".

Note: the addresses above are fictional and are used for ease of explanation.
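The level-by-level comparison in the example can be sketched as follows. The normalisation of edit distance into a similarity ratio, the level names, and the weight values are assumptions for illustration; the source states only that per-state similarity ratios are combined with the TF-IDF weights.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Turn the distance into a ratio between 0 and 1 (an assumed normalisation)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def weighted_similarity(query, candidate, weights):
    """Weighted average of per-level similarity ratios."""
    total = sum(weights.values())
    return sum(
        weight * similarity(query.get(level, ""), candidate.get(level, ""))
        for level, weight in weights.items()
    ) / total

# Hypothetical weights; the building-level label is given the highest weight.
weights = {"city": 0.15, "district": 0.18, "road": 0.33, "building": 0.5}
shared = {"city": "Foshan", "district": "Nanhai", "road": "GL"}
query = {**shared, "building": "BG Garden X607"}
candidate_a = {**shared, "building": "BG Garden Y205"}
candidate_b = {**shared, "building": "UU Upper House X607"}

score_a = weighted_similarity(query, candidate_a, weights)
score_b = weighted_similarity(query, candidate_b, weights)
```

With these inputs `score_a` exceeds `score_b`: the three upper levels match exactly in both candidates, so the decision rests on the heavily weighted building label, where "BG Garden Y205" is three substitutions away from the query and "UU Upper House X607" is further.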

The technical principle of the present invention has been described above in connection with specific embodiments. The description is intended to illustrate the general principles of the invention and should not be construed in any way as limiting its scope. Other embodiments of the invention will occur to those skilled in the art from this specification without inventive effort.

Claims

1. An address information extraction and matching method for realizing entity positioning is characterized in that: the method comprises the following specific steps of:

Constructing and storing a tag library according to the text tags;

acquiring a weight value of a text label of a label library;

Matching the address text corresponding to the text label which is most similar to the input address text in the label library according to the weight value;

in the state jump process, the high-level state corresponding to the high-level address text jumps to the low-level state corresponding to the low-level address text, and the jump is irreversible;

The states of the single lowest level can mutually jump;

in each jump, segmenting the address text using the administrative level keyword corresponding to the level state, the segmented address text entering the next lower-level state for re-segmentation;

selecting the path with the largest number of hops and determining it as the optimal segmentation path, wherein a jump that skips over a level state is not counted in the number of hops;

establishing a dictionary, adding the text labels to the dictionary according to preset rules, and storing the dictionary as a two-dimensional data table;

splitting the second address text word by word, combining each split word with the next split word, matching the combination against the tag library, and judging whether a text label for the combination exists; if so, the combination is reserved; if not, the combination is not reserved;

And so on until all split words can no longer be combined.

2. The method for extracting and matching address information for implementing entity positioning according to claim 1, wherein the method comprises the steps of:

Grading the first address text according to administrative level keywords of the first address text, wherein one address text of each level corresponds to one state of the first conditional random field;

3. The method for extracting and matching address information for implementing entity positioning according to claim 1, wherein the method comprises the steps of:

And in the address text which is successfully segmented, taking the sub-text and administrative level vocabulary corresponding to each level state as text labels.

4. The method for extracting and matching address information for implementing entity positioning according to claim 1, wherein the method comprises the steps of:

After the input address text is segmented, all text labels are subjected to weight statistics, each text label corresponds to a weight value, and the weight value is in direct proportion to the importance of the text label.

5. The method for extracting and matching address information for locating entities according to claim 4, wherein:

And calculating the similarity between each text label in the input address text and the text labels in the label library, and carrying out weighted average on the similarity and the weight value, wherein the text label in the label library with the maximum value is most similar to the input address text.