CN112597362A - Address matching method and system based on big data - Google Patents

Address matching method and system based on big data

Info

Publication number
CN112597362A
CN112597362A
Authority
CN
China
Prior art keywords
address
street
machine learning
layer
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011418486.2A
Other languages
Chinese (zh)
Inventor
黄瑜丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin Agricultural Science and Technology College
Original Assignee
Jilin Agricultural Science and Technology College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin Agricultural Science and Technology College filed Critical Jilin Agricultural Science and Technology College
Priority to CN202011418486.2A
Publication of CN112597362A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an address matching method and system based on big data. The method comprises: constructing a large database composed of address locators and a street network; inputting an unstructured address; normalizing and standardizing the unstructured address; performing address matching with a big-data matching algorithm; locating the address against the address reference; and outputting the address coordinates and the address object. The invention accounts for the difference between regular and irregular street segments and uses additional parameters, such as the distance and angle to the street segment and the offsets from the segment ends, to improve the accuracy of the computed position. Compared with traditional rule-based address parsing, the neural network offers clear advantages: it depends on patterns rather than data, requires no data at run time, is fault-tolerant, achieves high accuracy on benchmark data sets, and delivers high throughput in big-data scenarios.

Description

Address matching method and system based on big data
Technical Field
The present application relates to the field of big data, and in particular, to an address matching method and system based on big data.
Background
Address matching is the process of establishing a correspondence between a textual address description and its spatial location coordinates. An address matching service looks up matching objects for an address according to specific steps: first, the address is standardized; then the server searches the address matching reference data for potential locations; each candidate location is assigned a score based on its proximity to the address, and the candidate with the highest score is finally matched. The input is not limited to postal addresses; different types of descriptive data are also supported. For different types of descriptive input data, matching is usually performed with simple operations on database elements, and when the amount of data is small, a quick match is typically possible. With the development of the internet and big data, higher demands are placed on the latency and accuracy of address matching, and existing address matching methods cannot meet them. In addition, address normalization is typically based on predetermined rules; however, errors and semantic ambiguities in the input may require processing multiple candidate addresses or address inputs, leading to inaccurate identification.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides an address matching method based on big data and a system thereof.
The invention relates to an address matching method based on big data, which comprises the following steps:
S1, constructing a big database consisting of an address locator and a street network;
S2, inputting an unstructured address;
S3, normalizing and standardizing the unstructured address;
S4, performing address matching based on the big data matching algorithm;
S5, positioning based on the address reference;
S6, outputting the address coordinates and the address object.
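The six steps above can be sketched as a minimal pipeline. All function bodies below are illustrative placeholder assumptions (a toy locator table, a lowercase/whitespace normalizer, an exact-lookup matcher), not the patented implementation.

```python
# Minimal sketch of the S1-S6 pipeline; every function body is an
# illustrative placeholder, not the patented algorithm.

def build_database():
    # S1: address locators (point coordinates) plus a street network
    locators = {"1 Main St": (10.0, 20.0)}
    streets = {"Main St": [{"min": 1, "max": 99, "type": "regular"}]}
    return locators, streets

def normalize(raw_address):
    # S3: collapse case and whitespace as a stand-in for full standardization
    return " ".join(raw_address.lower().split())

def match(address, locators):
    # S4: exact lookup as a stand-in for the big-data matching algorithm
    for key, coord in locators.items():
        if key.lower() == address:
            return key, coord
    return None, None

def geocode(raw_address):
    locators, _streets = build_database()  # S1
    address = normalize(raw_address)       # S2-S3
    obj, coord = match(address, locators)  # S4-S5
    return coord, obj                      # S6: coordinates and address object
```

For example, `geocode("  1 Main  St ")` normalizes the input and returns the stored coordinates together with the matched address object.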
Preferably, in step S4, the big data based matching algorithm further includes the following steps:
S41, verifying the address type;
S42, if the address is a street group, returning the centroid of the street group;
S43, if the address is a street, searching the street segments for the segment matching the address code;
S44, verifying whether the address code is within the range between the segment's minimum and maximum values;
S45, if it is within that range, obtaining the street segment;
S46, if there are multiple matching street segments, selecting one street segment according to the weight;
S47, after selecting a street segment, judging the type of the street segment;
S48, if the street segment is irregular, returning the centroid of the street segment;
S49, if the street segment is regular, performing interpolation according to the parity of the address number.
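The S41-S49 decision flow can be sketched as follows. The segment record fields (`min`, `max`, `weight`, `kind`, `start`, `end`), the centroid fallback, and the linear interpolation are illustrative assumptions about data not specified in the text.

```python
# Sketch of the S41-S49 matching flow; record layout and helpers are assumed.

def locate(address, street_groups, segments):
    # S41: verify the address type
    if address["type"] == "street_group":
        # S42: return the centroid of the street group
        return street_groups[address["name"]]["centroid"]
    # S43: collect the segments belonging to this street
    candidates = [s for s in segments if s["street"] == address["street"]]
    # S44-S45: keep segments whose numeric range contains the address number
    in_range = [s for s in candidates
                if s["min"] <= address["number"] <= s["max"]]
    if not in_range:
        return None
    # S46: if several segments qualify, pick the one with the highest weight
    seg = max(in_range, key=lambda s: s["weight"])
    # S47-S48: irregular segments fall back to their centroid
    if seg["kind"] == "irregular":
        return seg["centroid"]
    # S49: regular segments interpolate linearly along the segment
    t = (address["number"] - seg["min"]) / (seg["max"] - seg["min"])
    (x0, y0), (x1, y1) = seg["start"], seg["end"]
    return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
```

A house number halfway through a segment's range thus lands halfway along the segment geometry.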
Preferably, in step S46, selecting a street segment according to the weight further includes the following steps:
S461, selecting the address locators related to the street;
S462, generating street segments and calculating their numerical ranges;
S463, calculating the relative position of each address locator with respect to its corresponding street segment;
S464, calculating the percentages of address locators for each street segment;
S465, calculating the percentage of street segments with all odd numbers on one side and all even numbers on the other side;
S466, calculating the percentage of address locators belonging to street segments that comply with the parity condition;
S467, calculating the percentage of address locators that belong to one and only one street segment interval.
Preferably, in step S464, calculating the percentages of address locators for each street segment includes:
pOL: the percentage of address locators with an odd number located on the left side;
pOR: the percentage of address locators with an odd number located on the right side;
pEL: the percentage of address locators with an even number located on the left side;
pER: the percentage of address locators with an even number located on the right side.
Preferably, for a segment S, the probability that the odd addresses are on the left and the even addresses are on the right is:
P(OL&ER) = pOL × pER (1)
P(OR&EL) = pOR × pEL (2)
In summary, a street segment that complies with the parity condition satisfies:
P(OL&ER) = 1 OR P(OR&EL) = 1 (3)
Among these address locators, those that comply with the numerical-range precondition are the address locators that belong to one and only one range.
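The statistics of S464-S465 and formulas (1)-(3) can be sketched as follows. Reading pOL/pOR (pEL/pER) as the fractions of odd-numbered (even-numbered) locators on each side is one consistent interpretation of formula (3), since then pOL × pER = 1 exactly when all odd numbers are on the left and all even numbers on the right; the `(number, side)` locator encoding is an assumption.

```python
# Sketch of the parity percentages (S464) and the compliance test (1)-(3).
# Locators are (house_number, side) pairs; the interpretation of pOL..pER
# as conditional fractions is an assumption.

def parity_percentages(locators):
    odd = [side for num, side in locators if num % 2 == 1]
    even = [side for num, side in locators if num % 2 == 0]
    pOL = odd.count("L") / len(odd) if odd else 0.0
    pOR = odd.count("R") / len(odd) if odd else 0.0
    pEL = even.count("L") / len(even) if even else 0.0
    pER = even.count("R") / len(even) if even else 0.0
    return pOL, pOR, pEL, pER

def complies_with_parity(locators):
    # Formula (3): odd numbers all on one side, even numbers all on the other
    pOL, pOR, pEL, pER = parity_percentages(locators)
    return pOL * pER == 1.0 or pOR * pEL == 1.0
```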
Preferably, in step S3, normalizing and standardizing the unstructured address includes: a standardization stage, a pre-machine learning stage, a machine learning stage and a post-machine learning stage;
wherein the normalization stage is configured to process redundant representations in the address, comprising:
(1) Variant handling: large variations in address labels are mapped to a common representation, which reduces the feature space of the problem and speeds processing by reducing redundancy. For example, "street" may be expressed by semantically equivalent terms such as "road", "avenue", and "boulevard". The normalization phase converts these multiple representations into a common representation.
(2) Noise removal: noise in the data is reduced by deleting portions with little information content, for example redundant punctuation marks such as "..." and "!".
(3) Automatic label correction: during the training phase, normalization corrects errors in the label data by applying partial rules. For example, "Hubei" (a province) rarely occurs in the city column; the normalization phase detects this and corrects the error in the training data.
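Variant handling and noise removal can be sketched together as below. The variant table and the punctuation set are illustrative assumptions; the patent does not specify them.

```python
# Sketch of the normalization stage: (2) noise removal, then (1) variant
# mapping to a common representation. The table and regex are assumptions.

import re

VARIANTS = {"avenue": "street", "road": "street", "boulevard": "street"}

def normalize_address(text):
    # (2) Noise removal: strip redundant punctuation such as "..." and "!"
    text = re.sub(r"[.!?]+", " ", text.lower())
    # (1) Variant handling: map equivalent terms to a common representation
    words = [VARIANTS.get(w, w) for w in text.split()]
    return " ".join(words)
```

For example, `normalize_address("12 Oak Avenue!!")` yields `"12 oak street"`.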
Wherein the pre-machine learning phase makes the address structure similar to the structure used in training, including:
(1) Pre-tagging: at run time, an address may contain fields on which the machine learning model was not trained. The pre-machine-learning stage detects these fields, processes them with regular expressions, and passes the information to the machine learning stage so that no false results are produced.
For example, if an address contains a field "postal number" on which the machine learning model was not trained, regular expressions are used to identify the postal number, label it with the corresponding field, and filter it out before it reaches the machine learning classifier.
(2) Order correction: the fields in an address may appear in different positions. For example, the zip code may sometimes be located in the middle of the address rather than at the end. The pre-machine-learning stage detects such deviations and uses partial rules to make the address structure similar to the one the machine learning model was trained on.
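Both pre-machine-learning behaviors can be sketched in a few lines. The 6-digit zip pattern, the field name, and moving the code to the end are illustrative assumptions about what "pre-tagging" and "order correction" look like concretely.

```python
# Sketch of the pre-ML stage: (1) pre-tag an untrained field with a regex,
# (2) move a mid-address zip code to the end. Patterns are assumptions.

import re

ZIP_RE = re.compile(r"\b\d{6}\b")  # assumed 6-digit postal codes

def pre_ml(text):
    # (1) Pre-tagging: pull out fields the model was not trained on
    pretagged = {m: "zip" for m in ZIP_RE.findall(text)}
    # (2) Order correction: move the zip code to the last position,
    # matching the structure seen during training
    rest = " ".join(ZIP_RE.sub("", text).split())
    return (rest + " " + " ".join(pretagged)).strip(), pretagged
```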
The machine learning stage classifies the address into its corresponding tokens based on the output of pre-machine learning. This is token-level classification. Each token passes through normalization, pre-machine learning, feature extraction, feature encoding, classification, and finally post-machine learning. Each token's prediction is also used for the prediction of the next token. The machine learning stage uses a neural network to classify each address token into the address component it belongs to. For each address, a prediction is obtained for each token, and similar predictions are then merged to form the components they represent.
The post-machine-learning stage automatically corrects potential errors in the predictions of the machine learning stage. It takes the machine learning predictions as input and validates them against a set of rules. For example, it ensures that a predicted zip code has the expected number of digits. When the classifier incorrectly predicts another number as a zip code and the post-machine-learning component validates it against the regular expression, the mismatch is found and the error is detected.
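The post-machine-learning check can be sketched as a rule table of regular expressions applied to each predicted label. The rule set and the `{text: label}` prediction format are illustrative assumptions.

```python
# Sketch of the post-ML stage: validate predicted labels against rules and
# flag mismatches for correction. Rule set and data shape are assumptions.

import re

RULES = {"zip": re.compile(r"^\d{6}$")}  # assumed 6-digit zip codes

def post_ml(predictions):
    # predictions: {component_text: predicted_label}
    errors = []
    for text, label in predictions.items():
        rule = RULES.get(label)
        if rule and not rule.match(text):
            errors.append((text, label))  # mismatch: error detected
    return errors
```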
Preferably, the machine learning adopts a fully-connected neural network with a single hidden layer, and the learning process adopts stochastic gradient descent with the backpropagation algorithm; the input layer of the neural network corresponds to the number of input features, and the output layer corresponds to the total number of classes to predict.
Suppose n_L equals the number of layers, and L_i corresponds to layer i of the neural network. The neural network parameters are expressed as:
(W, b) = ((W^(1), b^(1)), (W^(2), b^(2)), ...),
where W represents the weight matrices and b the corresponding bias vectors. Here, W^(l)_ij represents the weight associated with the connection between unit j in layer l and unit i in layer l+1, and a^(l)_i represents the activation value of unit i in layer l. The learning phase comprises two stages, forward propagation and backward propagation, wherein:
In forward propagation, for a particular layer, the weighted sum z of the inputs x is calculated as:
z = ∑ W_i x_i + b (4)
Let f(z) denote the activation function of a given layer. The output of the activation function computed at layer l is provided to layer l+1. This process continues to the last layer, as shown by the following equations:
a^(1) = x (5)
z^(2) = W^(1) a^(1) + b^(1) (6)
a^(2) = f(z^(2)) (7)
z^(3) = W^(2) a^(2) + b^(2) (8)
h_{W,b}(x) = a^(3) = f(z^(3)) (9)
where h_{W,b}(x) is the predicted output.
In backward propagation, the output of the last layer is computed and compared with the actual value y, and the difference between the predicted and the actual value is taken as the error term δ. Let J be the cost function minimized using stochastic gradient descent; the gradient of J computed for a particular layer helps to optimize the weight matrix, as shown by the following equations:
δ^(3) = -(y - a^(3)) ∘ f'(z^(3)) (10)
δ^(2) = ((W^(2))^T δ^(3)) ∘ f'(z^(2)) (11)
∂J/∂W^(l) = δ^(l+1) (a^(l))^T (12)
∂J/∂b^(l) = δ^(l+1) (13)
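A scalar 1-1-1 instance of the update rules above can be sketched as follows, assuming a sigmoid activation f with f'(z) = f(z)(1 - f(z)) and a squared-error cost J = ½(y - a^(3))²; the network size, learning rate, and cost function are illustrative assumptions.

```python
# One backpropagation step for a 1-1-1 network, following the error terms
# and gradients of equations (10)-(13). All hyperparameters are assumptions.

import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
    # forward pass (scalars keep the sketch short)
    z2 = W1 * x + b1
    a2 = sig(z2)
    z3 = W2 * a2 + b2
    a3 = sig(z3)
    # (10): output error term, with f'(z) = f(z)(1 - f(z)) for the sigmoid
    d3 = -(y - a3) * a3 * (1 - a3)
    # (11): hidden error term propagated back through W2
    d2 = W2 * d3 * a2 * (1 - a2)
    # (12)-(13): gradient-descent updates for weights and biases
    W2 -= lr * d3 * a2
    b2 -= lr * d3
    W1 -= lr * d2 * x
    b1 -= lr * d2
    return W1, b1, W2, b2, a3
```

Iterating the step drives the prediction toward the target value.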
Preferably, the method further comprises a feature extraction and encoding step, wherein:
Feature extraction extracts the features whose information the neural network model uses to predict components. The feature set consists of context features (previous label, previous phrase, next phrase), syntactic features (number of digits, suffix), positional features (index of the current token, relative index of the current token), and list-based features (e.g., street names).
Feature encoding uses binary encoding of the categorical features based on phrase frequency. A vocabulary/dictionary (the set of the most common phrases in the training data) is formed based on frequency of occurrence in the training data, and each phrase is checked for membership. Each phrase present in the dictionary is converted into a binary vector whose length equals the number of tokens in the dictionary: the vector consists of all zeros except for a "1" at the position of that particular phrase. All out-of-vocabulary words are encoded into one shared representation. As a result of the feature encoding process, the matrix produced for a batch of training samples is sparse.
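The frequency-based vocabulary and the one-hot encoding it implies can be sketched as below; the vocabulary size and the shared all-zeros out-of-vocabulary code are assumptions consistent with the description.

```python
# Sketch of the frequency-based dictionary and one-hot feature encoding.
# Vocabulary size and out-of-vocabulary handling are assumptions.

from collections import Counter

def build_vocab(phrases, size):
    # keep the `size` most frequent phrases from the training data
    return [p for p, _ in Counter(phrases).most_common(size)]

def encode(phrase, vocab):
    # binary vector as long as the vocabulary: all zeros except a single 1
    # at the phrase's position; out-of-vocabulary phrases share one code
    vec = [0] * len(vocab)
    if phrase in vocab:
        vec[vocab.index(phrase)] = 1
    return vec
```

Encoding a batch of such vectors yields the sparse matrix the text describes.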
Preferably, a training phase is also included, wherein: the training data is profiled to obtain the frequent words that are consumed later in the feature extraction step, forming a dictionary/vocabulary. The data is then divided into batches, features are extracted from each batch and encoded, and the batches are trained; the training accuracy is measured and visualized to decide whether training should continue.
To process performance evaluation efficiently, the model may emit information or diagnostics for each batch as it is trained. This information includes the validation accuracy, test accuracy, log loss, and other performance indicators of the model. After each batch of training, an accuracy index is obtained. In neural network architectures, there are typically many parameters that need to be tuned. Furthermore, not all training data is useful. To obtain the optimal portion of the training data and the optimal hyperparameters, the framework compiles a diagnosis for each batch of data and plots a real-time curve describing how the accuracy fluctuates as training progresses. This provides early feedback, and the model can easily be adjusted accordingly to obtain the best parameters and data set. As the number of batches or samples increases, the training, test, and validation accuracies improve. However, at some point the training accuracy continues to improve while the validation accuracy decreases; at that point training is stopped and the parameters are adjusted.
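The stop rule described above, halting when validation accuracy falls from its best value while training accuracy keeps rising, can be sketched as follows. The `patience` parameter and the per-batch accuracy lists are illustrative assumptions.

```python
# Sketch of the early-stopping rule: stop when validation accuracy drops
# while training accuracy keeps improving. `patience` is an assumption.

def stop_batch(train_acc, val_acc, patience=1):
    # returns the index of the first batch where training should stop,
    # or None if training should continue
    best, since_best = -1.0, 0
    for i, v in enumerate(val_acc):
        if v > best:
            best, since_best = v, 0
        else:
            since_best += 1
            if since_best >= patience and train_acc[i] > train_acc[i - 1]:
                return i
    return None
```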
Preferably, the address matching parameters include:
Min_E and Max_E: the minimum and maximum even address values;
q1_E and q2_E: the offsets at the two ends of the minimum and maximum values on the even-numbered side of the segment;
d_E and α_E: the distance and angle to the even-numbered side of the segment;
Min_O and Max_O: the minimum and maximum odd address values;
q1_O and q2_O: the offsets at the two ends of the minimum and maximum values on the odd-numbered side of the segment;
d_O and α_O: the distance and angle to the odd-numbered side of the segment.
preferably, the range between the minimum and maximum values includes the odd and even minimum and maximum values.
In addition, the address matching system based on the big data runs the address matching method based on the big data.
In the case of linear referencing, the street segment matching the address description is first determined based on the address range in the street segment attributes. The left/right side of the address is obtained from the parity type of each side of the street segment and the parity of the number in the address, and the exact position on the corresponding side is calculated proportionally. The difference between regular and irregular street segments is taken into account, and additional parameters, such as the distance and angle to the street segment and the offsets from the segment ends, are used to improve the accuracy of the calculated position. Compared with traditional rule-based address parsing, the neural network offers clear advantages: it depends on patterns rather than data, requires no data at run time, is fault-tolerant, achieves high accuracy on benchmark data sets, and delivers high throughput in big-data scenarios.
Drawings
FIG. 1 is a flow chart of a big data based address matching method of the present invention.
FIG. 2 is a flow chart of a big data based address matching method of the present invention.
FIG. 3 is a flow chart of a big data based address matching method of the present invention.
FIG. 4 is a flow chart of a big data based address matching method of the present invention.
FIG. 5 is a flow chart of a big data based address matching method of the present invention.
FIG. 6 is a flow chart of a big data based address matching method of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
As shown in fig. 1 to 6, the address matching method based on big data of the present invention includes the following steps:
S1, constructing a big database consisting of an address locator and a street network;
S2, inputting an unstructured address;
S3, normalizing and standardizing the unstructured address;
S4, performing address matching based on the big data matching algorithm;
S5, positioning based on the address reference;
S6, outputting the address coordinates and the address object.
Preferably, in step S4, the big data based matching algorithm further includes the following steps:
s41, verifying the address type;
s42, if the address street group, returning the barycentric position of the street group;
s43, if the address is a street, searching the street segments for the street segment matched with the code of the address;
s44, verifying whether the address code of the group of segments is in the range between the minimum value and the maximum value;
s45, if the distance is within the range between the minimum value and the maximum value, obtaining a street segment;
s46, if there are multiple street segments, selecting one street segment according to the weight;
s47, after selecting a street segment, judging the type of the street segment,
s48, if the street segment is irregular, returning to the center of gravity of the street segment;
and S49, if the street segment is a rule, performing interpolation according to the parity check of the address number.
Preferably, in step S46, the selecting a street segment according to the weight further includes the following steps:
s461, selecting an address locator related to a street;
s462, generating street segments and calculating a numerical range;
s463, calculating the relative position of each address locator to the corresponding street segment;
s464, calculating the percentage of address locators for each address segment;
s465, calculating the percentage of street segments with all odd numbers on one side and even numbers on the other side;
s466, calculating the percentage of unaddressed locators belonging to the street segment that comply with the parity condition;
s467, the percentage of address locators that belong to only one street segment interval is calculated.
Preferably, the step S464 of calculating the percentage of address locators for each address segment includes:
pOL: the address locator has an odd number and percentage on the left;
pOR: the address locator has an odd number and a percentage on the right;
pEL: the percentage of address locators with even numbers and on the left;
pER: the address locator has an even number and a percentage to the right.
Preferably, for segment S, the probability that the odd address is to the left and the even address is to the right is:
P(OL&ER)=pOL×pER (1)
P(OR&EL)=pOR×pEL (2)
in summary, a street portion that complies with the parity check condition has:
p (OL & ER) ═ 1 OR P (OR & EL) ═ 1 (3)
Among these address locators, subject to the number range precondition is an address locator belonging to one and only one interval.
Preferably, in step S3, normalizing and standardizing the unstructured address includes: a standardization stage, a pre-machine learning stage, a machine learning stage and a post-machine learning stage;
wherein the normalization stage is configured to process redundant representations in the address, comprising:
(1) change treatment: large changes in address labels are mapped to a common representation to reduce problem feature space and to speed processing by reducing redundancy. For example, "street" may be expressed by the same semantic terms, such as "road, avenue, road, and road. The normalization phase converts these multiple representations into a common representation.
(2) Noise removal: noise in the data is reduced by deleting portions of lesser information content. For example, redundant punctuation marks, such as "…" and "! ".
(3) Automatic label correction: during the training phase, normalization corrects errors in the label data by applying partial rules. For example, "Hubei" rarely occurs in city columns, and the normalization phase detects this and corrects errors in the data during the training phase.
Wherein the pre-machine learning phase makes the address structure similar to the structure used in training, including:
(1) pre-marking: at runtime, there may be some fields in an address that the machine learning model does not train. Pre-machine learning detects these fields and processes them using regular expressions and passes the information to the machine learning stage so that no false results are produced.
For example, if an address contains a field "postal number" that is not trained on a machine learning model, regular expressions are used to identify the postal number, label it according to the corresponding field, and filter it out before reaching the machine learning classifier.
(2) And (4) sequential correction: the fields in the address may be located in different locations. For example, in some cases, the zip code may be located in the middle portion of the address, rather than the last portion. Pre-machine learning detects such deviations and uses partial rules to make the address structure similar to that trained by machine learning models.
And the machine learning stage is used for classifying the addresses into corresponding identifications according to the processing result of the pre-machine learning. This is the classification of the identification level. Each identity is passed to normalization, pre-machine learning, feature extraction, feature encoding, classification, and finally to post-machine learning. Each identified prediction is also used for the next identified prediction. The machine learning phase uses a neural network to classify the address identification into the address component to which it belongs. For each address, a prediction for each identification is obtained, and then similar predictions are combined to form the components that they represent.
Wherein the post-machine learning phase is used to automatically correct potential errors of the predictive machine learning phase. This phase takes as input a prediction of machine learning and validates it according to a set of rules. For example, it is ensured that the predicted zip code has a significant number of digits. When the classifier incorrectly predicts other numbers as zip codes, and the post-machine learning component validates the regular expressions, a mismatch is found here and an error is detected.
Preferably, the machine learning adopts a fully-connected neural network with a single hidden layer, and the learning process adopts random gradient descent with a back propagation algorithm; the input layer of the neural network corresponds to the number of input features; the output layer corresponds to the total number of classes to predict;
suppose nLEqual to the number of layers, LiCorresponding to layer i of the neural network. The neural network parameters are expressed as:
(W,b)=((W1,b1),(W2,b2)...),
where W represents the weight matrix and b corresponds to the bias matrix. Here, WL ijRepresenting the weight associated with the connection between cell j in layer L and cell i in layer L + 1. a isL iRepresenting the activation value of cell i in the L layer. The learning phase comprises two phases of forward propagation and backward propagation, wherein:
in forward propagation, the weighted sum (x) of the inputs, for a particular layer, is calculated as z:
z=(∑Wixi+b) (4)
let f (z) denote the activation function for a given layer. The output of the activation function computed at layer L is provided to layer L + 1. This process continues to the last layer as shown by the following equation:
a(1)=x (5)
z(2)=W(1)x+b(1) (6)
a(2)=f(z(2)) (7)
z(3)=W(2)x+b(2) (8)
hw,b(x)=a(3)=f(z(3)) (9)
whereinhW,b(x) Is the prediction output.
In back propagation, the output of the last layer is calculated, compared to the actual value (y), and the difference between the predicted and actual values (y) is calculated as the error term δ.
Let J be the cost function minimized using random gradient descent,
Figure BDA0002821113140000101
is the calculated gradient of a particular layer, helps to optimize the weight matrix, as shown by the following equation:
δ(3)=-(y-a(3).f′(z(3))) (10)
δ(2)=-(W(2)T).δ(3).f′(z(2)) (11)
Figure BDA0002821113140000102
Figure BDA0002821113140000103
preferably, the method further comprises the following steps: a feature extraction and encoding step, wherein:
feature extraction involves extracting features that convey information that neural network models use to predict components. The feature set consists of some context (previous label, next phrase, previous phrase), some grammar (number of bits, suffix), some location features (index of current identification, relative index of current identification) and some list-based features (e.g., street name).
Feature coding uses binary coding of the classification features based on phrase frequency. A vocabulary/dictionary (the most common set of phrases in the training data) is formed based on the "frequency" of occurrence in the training data, and each phrase is checked for the presence. Each phrase present in the dictionary is converted into a binary vector equal to the number of tokens in the dictionary. In each phrase vocabulary, the vector consists of all zeros, but there is a "1" at a particular position in a particular phrase. All non-lexical words are encoded into the same representation. The resulting matrix is sparse for a batch of training samples as a result of the feature encoding process.
Preferably, a training phase is also included, wherein: the training data is profiled to obtain frequent words that are further consumed in the feature extraction step, forming a dictionary/vocabulary. The data is then divided into batches from which features are extracted, encoded, and the batches are then trained and the accuracy of the training is measured and visualized to determine if training is to continue.
To efficiently process performance evaluations, in each batch, the model may issue some information or diagnostics as it is trained. This information includes the verification accuracy, test accuracy, log loss, and other performance indicators of the model. After each batch of training, an accuracy index is obtained. In neural network architectures, there are typically many parameters that need to be adjusted. Furthermore, not all training data is useful. To obtain the optimal portion of the training data and the optimal hyperparameters, the framework compiles a diagnosis for each batch of data and plots a real-time curve describing how the accuracy fluctuates as the training progresses. This provides an early feedback and model that can be easily adjusted accordingly to obtain the best parameters and data set. With the increase of the number of batch or indirect samples, the accuracy of training, testing and verification is improved. However, at some point, the accuracy of the sequence continues to improve while the verification decreases, at which point the training is stopped and the parameters are adjusted.
Preferably, the address matching parameters include:
Min_E and Max_E: the minimum and maximum even address numbers;
q1_E and q2_E: the offsets at the minimum-value and maximum-value ends of the even-numbered segment side;
d_E and α_E: the distance and angle to the even-numbered segment side;
Min_O and Max_O: the minimum and maximum odd address numbers;
q1_O and q2_O: the offsets at the minimum-value and maximum-value ends of the odd-numbered segment side;
d_O and α_O: the distance and angle to the odd-numbered segment side.
Preferably, the range between the minimum and maximum values covers both the odd and the even minimum and maximum values.
According to the address matching method and system based on big data, under linear referencing the street segment matching an address description list is determined from the address range in the street segment's attributes. Once the street segment is determined, the side (left/right) of the address is obtained from the parity type of each side of the segment and the parity of the number in the address, and the exact position on that side is calculated proportionally. The difference between regular and irregular street segments is taken into account, and parameters such as the distance and angle to the segment and the offsets at the segment ends are used, improving the accuracy of the position calculation. Compared with traditional rule-based address resolution, the neural network has clear advantages: it relies on patterns rather than raw data, is independent of the data at runtime, is fault-tolerant, achieves high accuracy on benchmark data sets, and offers high throughput in big-data scenarios.
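The proportional position calculation on the matched segment side can be illustrated as plain linear interpolation. This is a simplified sketch that omits the end offsets q1/q2 and the side distance/angle d/α mentioned above; the function name and arguments are assumptions:

```python
def interpolate_position(addr_num, seg_min, seg_max, start, end):
    """Proportionally place a house number along a regular street segment
    under linear referencing. `start`/`end` are (x, y) endpoints of the
    segment whose side range [seg_min, seg_max] matches the number's parity."""
    if seg_max == seg_min:
        t = 0.5  # degenerate range: fall back to the segment midpoint
    else:
        t = (addr_num - seg_min) / (seg_max - seg_min)
    x = start[0] + t * (end[0] - start[0])
    y = start[1] + t * (end[1] - start[1])
    return (x, y)

# House number 150 on a side numbered 100-200 of a segment from (0,0) to (100,0):
print(interpolate_position(150, 100, 200, (0, 0), (100, 0)))  # (50.0, 0.0)
```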
Finally, the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications and substitutions fall within the protection scope of the present invention.

Claims (7)

1. An address matching method based on big data, characterized by comprising the following steps:
s1, constructing a big database consisting of an address locator and a street network;
s2, inputting an unstructured address;
s3, normalizing and standardizing the unstructured address;
s4, performing address matching based on the big data matching algorithm;
s5, positioning based on the address reference;
s6, outputting the address coordinates and the address object;
in step S4, the big data based matching algorithm includes the following steps:
s41, verifying the address type;
s42, if the address street group, returning the barycentric position of the street group;
s43, if the address is a street, searching the street segments for the street segment matched with the code of the address;
s44, verifying whether the address code of the street segment is in the range between the minimum value and the maximum value;
s45, if the distance is within the range between the minimum value and the maximum value, obtaining a street segment;
s46, if there are multiple street segments, selecting one street segment according to the weight;
s47, after selecting a street segment, judging the type of the street segment,
s48, if the street segment is irregular, returning to the center of gravity of the street segment;
s49, if the street segment is a rule, carrying out interpolation according to the parity check of the address number;
in step S46, the selecting a street segment according to the weight includes the following steps:
s461, selecting an address locator related to a street;
s462, generating street segments and calculating a numerical range;
s463, calculating the relative position of each address locator to the corresponding street segment;
s464, calculating the percentage of address locators for each address segment;
s465, calculating the percentage of street segments with all odd numbers on one side and even numbers on the other side;
s466, calculating the percentage of unaddressed locators belonging to the street segment that comply with the parity condition;
s467, calculating the percentage of address locators belonging to only one street segment interval;
the step S464 of calculating the percentage of address locators for each address segment includes:
pOL: the percentage of address locators with an odd number and located on the left;
pOR: the percentage of address locators with an odd number and located on the right;
pEL: the percentage of address locators with an even number and located on the left;
pER: the percentage of address locators with an even number and located on the right;
for street segment S, the probability of an odd address to the left and an even address to the right is:
P(OL&ER)=pOL×pER (1)
P(OR&EL)=pOR×pEL (2)
in summary, a street segment that complies with the parity check condition has:
P(OL&ER) = 1 or P(OR&EL) = 1 (3)
Among these address locators, those subject to the number-range precondition are the address locators belonging to one and only one interval.
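Equations (1)-(3) can be exercised in a short sketch. One plausible reading of the percentages — chosen so that a fully compliant segment reaches a product of 1, as equation (3) requires — is the fraction of odd-numbered locators on each side and the fraction of even-numbered locators on each side. The data layout (a list of `(house_number, side)` pairs) is an assumption:

```python
def parity_weight(locators):
    """Return (P(OL&ER), P(OR&EL)) for one street segment, per equations
    (1)-(2). pOL/pOR: fraction of odd-numbered locators on the left/right;
    pEL/pER: fraction of even-numbered locators on the left/right."""
    odds = [side for num, side in locators if num % 2 == 1]
    evens = [side for num, side in locators if num % 2 == 0]
    pOL = odds.count('L') / len(odds) if odds else 0.0
    pOR = odds.count('R') / len(odds) if odds else 0.0
    pEL = evens.count('L') / len(evens) if evens else 0.0
    pER = evens.count('R') / len(evens) if evens else 0.0
    return pOL * pER, pOR * pEL

# All odd numbers on the left, all even on the right: P(OL&ER) = 1,
# so the segment complies with the parity check condition (equation 3).
locs = [(101, 'L'), (103, 'L'), (102, 'R'), (104, 'R')]
print(parity_weight(locs))  # (1.0, 0.0)
```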
2. The big-data-based address matching method as claimed in claim 1, wherein in step S3, normalizing and standardizing the unstructured address comprises: a standardization stage, a pre-machine learning stage, a machine learning stage and a post-machine learning stage;
wherein the normalization stage is configured to process redundant representations in the address, comprising:
(1) variation handling: mapping the many variations in address labels to a common representation, reducing the feature space of the problem and speeding up processing by reducing redundancy;
(2) noise removal: reducing noise in the data by deleting portions with lower information content;
(3) automatic label correction: in the training phase, normalization corrects errors in the label data by applying partial rules;
wherein the pre-machine-learning stage is used to make the address structure similar to that used in training, comprising:
(1) pre-tagging: at runtime, an address may contain fields on which the machine learning model was not trained; pre-machine learning detects these fields, processes them with regular expressions, and passes the information to the machine learning stage;
(2) order correction: fields in an address may appear in different positions; pre-machine learning detects such deviations and applies rules to make the address structure similar to that on which the machine learning model was trained;
the machine learning stage is used for classifying the address into its corresponding identifiers according to the result of the pre-machine-learning processing; each identifier passes through normalization, pre-machine learning, feature extraction, feature coding, and classification, and is finally passed to post-machine learning; the prediction for each identifier is used in the prediction of the next identifier; the machine learning stage uses a neural network to classify each address identifier into the address component to which it belongs; for each address, a prediction is obtained for each identifier, and similar predictions are combined to form the components they represent;
wherein the post-machine-learning stage is used to automatically correct potential prediction errors of the machine learning stage; this stage takes the machine learning predictions as input and verifies them against a set of rules; for example, when the classifier incorrectly predicts some other number as a zip code, the post-machine-learning component's regular-expression validation finds a mismatch and the error is detected.
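The post-machine-learning check described above — validating a predicted label against a rule — can be sketched with a regular expression. The label name, the function, and the six-digit postal-code pattern (common for Chinese zip codes) are illustrative assumptions, not the patent's rule set:

```python
import re

# Rule for the "zipcode" label: exactly six digits.
ZIP_RE = re.compile(r"\d{6}")

def post_ml_validate(token, predicted_label):
    """Return the predicted label if it passes the rule check for that
    label; otherwise flag a mismatch so the error can be corrected."""
    if predicted_label == "zipcode" and not ZIP_RE.fullmatch(token):
        return "mismatch"  # classifier error detected
    return predicted_label

print(post_ml_validate("130118", "zipcode"))  # "zipcode": passes the rule
print(post_ml_validate("13", "zipcode"))      # "mismatch": too short
```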
3. The address matching method based on big data as claimed in claims 1-2, wherein the machine learning employs a fully connected neural network with a single hidden layer, and the learning process employs stochastic gradient descent with the back-propagation algorithm; the input layer of the neural network corresponds to the number of input features; the output layer corresponds to the total number of classes to predict;
suppose n_l equals the number of layers and L_i denotes the i-th layer of the neural network; the neural network parameters are expressed as:
(W, b) = ((W(1), b(1)), (W(2), b(2)), …),
where W represents a weight matrix and b the corresponding bias matrix; here, W(l)_ij represents the weight associated with the connection between unit j in layer l and unit i in layer l+1, and a(l)_i represents the activation value of unit i in layer l; the learning phase comprises two phases, forward propagation and backward propagation, wherein:
in forward propagation, for a particular layer, the weighted sum z of the inputs x is calculated as:
z = ∑ W_i x_i + b (4)
let f(z) represent the activation function for a given layer; the output of the activation function calculated at layer l is provided to layer l+1; this process continues to the last layer, as shown by the following equations:
a(1)=x (5)
z(2)=W(1)x+b(1) (6)
a(2)=f(z(2)) (7)
z(3) = W(2)a(2) + b(2) (8)
h_W,b(x) = a(3) = f(z(3)) (9)
where h_W,b(x) is the predicted output;
in back propagation, the output of the last layer is computed and compared with the actual value y, and the difference between the predicted value and y is computed as an error term δ;
let J be the cost function minimized using stochastic gradient descent; the gradient of J calculated for a particular layer is used to optimize the weight matrix, as shown in the following equations, where ⊙ denotes the element-wise product:
δ(3) = -(y - a(3)) ⊙ f′(z(3)) (10)
δ(2) = ((W(2))^T δ(3)) ⊙ f′(z(2)) (11)
∇_W(l) J = δ(l+1) (a(l))^T (12)
∇_b(l) J = δ(l+1) (13)
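Equations (4)-(13) can be collected into one NumPy sketch of a single forward/backward pass through the single-hidden-layer network. The column-vector convention, the sigmoid activation, and the layer sizes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward_backward(x, y, W1, b1, W2, b2):
    """One forward and backward pass of a single-hidden-layer network."""
    # Forward: a(1)=x, z(2)=W(1)a(1)+b(1), a(2)=f(z(2)), z(3)=W(2)a(2)+b(2)
    a1 = x
    z2 = W1 @ a1 + b1
    a2 = sigmoid(z2)
    z3 = W2 @ a2 + b2
    a3 = sigmoid(z3)  # h_W,b(x), the predicted output
    # Backward: error terms (10)-(11), gradients (12)-(13)
    d3 = -(y - a3) * sigmoid_prime(z3)
    d2 = (W2.T @ d3) * sigmoid_prime(z2)
    grad_W1 = d2 @ a1.T
    grad_b1 = d2
    grad_W2 = d3 @ a2.T
    grad_b2 = d3
    return a3, (grad_W1, grad_b1, grad_W2, grad_b2)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1))          # 4 input features
y = np.array([[1.0]])                    # 1 output class
W1, b1 = rng.standard_normal((3, 4)), np.zeros((3, 1))  # hidden layer of 3 units
W2, b2 = rng.standard_normal((1, 3)), np.zeros((1, 1))
out, grads = forward_backward(x, y, W1, b1, W2, b2)
print(out.shape, grads[0].shape)  # (1, 1) (3, 4)
```

A stochastic gradient descent step would then subtract a learning rate times each gradient from the corresponding parameter matrix.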
4. The address matching method based on big data as claimed in claims 1-3, further comprising: a feature extraction and encoding step, wherein:
feature extraction includes extracting features that convey information used by the neural network model to predict the component; the feature set consists of context, syntax, location features, and list-based features;
the feature coding uses binary coding of the classification features based on phrase frequencies; a vocabulary/dictionary is formed from the most frequent phrases in the training data, and each phrase is checked for membership; each phrase present in the dictionary is converted into a binary vector whose length equals the number of tokens in the dictionary, consisting of all zeros except for a "1" at the position corresponding to that phrase; all out-of-vocabulary words are encoded into the same representation; as a result of the feature encoding process, the resulting matrix for a batch of training samples is sparse.
5. The address matching method based on big data as claimed in claims 1-4, further comprising a training phase, wherein: the training data is profiled to obtain the frequent words that are later consumed in the feature extraction step, forming the dictionary/vocabulary; the data is then divided into batches from which features are extracted and encoded; the batches are then used for training, and training accuracy is measured and visualized to determine whether training should continue.
6. The address matching method based on big data as claimed in claims 1-5, wherein the address matching parameters include: Min_E and Max_E: the minimum and maximum even address numbers; q1_E and q2_E: the offsets at the minimum-value and maximum-value ends of the even-numbered segment side; d_E and α_E: the distance and angle to the even-numbered segment side; Min_O and Max_O: the minimum and maximum odd address numbers; q1_O and q2_O: the offsets at the minimum-value and maximum-value ends of the odd-numbered segment side; d_O and α_O: the distance and angle to the odd-numbered segment side; the range between the minimum and maximum values covers both the odd and the even minimum and maximum values.
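The per-side parameter set in this claim can be represented as a simple data structure. This is a sketch; the class and field names are paraphrases of the claim's symbols, not part of the patent:

```python
from dataclasses import dataclass

@dataclass
class SideParams:
    """Matching parameters for one side of a street segment: the address
    range (Min/Max), the end offsets (q1/q2), and the distance and angle
    (d/alpha) placing the address relative to the segment side."""
    min_addr: int   # Min_E or Min_O
    max_addr: int   # Max_E or Max_O
    q1: float       # offset at the minimum-value end
    q2: float       # offset at the maximum-value end
    d: float        # distance to the segment side
    alpha: float    # angle to the segment side

# One even side and one odd side of the same segment.
even = SideParams(min_addr=100, max_addr=200, q1=5.0, q2=5.0, d=10.0, alpha=90.0)
odd = SideParams(min_addr=101, max_addr=199, q1=5.0, q2=5.0, d=10.0, alpha=90.0)
print(even.min_addr, odd.min_addr)  # 100 101
```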
7. An address matching system based on big data, characterized in that the address matching system runs the address matching method based on big data according to claims 1-6.
CN202011418486.2A 2020-12-07 2020-12-07 Address matching method and system based on big data Pending CN112597362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011418486.2A CN112597362A (en) 2020-12-07 2020-12-07 Address matching method and system based on big data


Publications (1)

Publication Number Publication Date
CN112597362A true CN112597362A (en) 2021-04-02

Family

ID=75188637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011418486.2A Pending CN112597362A (en) 2020-12-07 2020-12-07 Address matching method and system based on big data

Country Status (1)

Country Link
CN (1) CN112597362A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113568845A (en) * 2021-07-29 2021-10-29 北京大学 Memory address mapping method based on reinforcement learning
CN113568845B (en) * 2021-07-29 2023-07-25 北京大学 Memory address mapping method based on reinforcement learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210402