CN112597362A - Address matching method and system based on big data - Google Patents

Address matching method and system based on big data

Info

Publication number
CN112597362A
CN112597362A
Authority
CN
China
Prior art keywords
address
street
machine learning
layer
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011418486.2A
Other languages
Chinese (zh)
Inventor
黄瑜丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin Agricultural Science and Technology College
Original Assignee
Jilin Agricultural Science and Technology College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin Agricultural Science and Technology College filed Critical Jilin Agricultural Science and Technology College
Priority to CN202011418486.2A
Publication of CN112597362A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an address matching method and system based on big data. The method comprises: constructing a large database composed of address locators and a street network; inputting an unstructured address; normalizing and standardizing the unstructured address; performing address matching with a big-data matching algorithm; locating the address against the address reference; and outputting the address coordinates and the address object. The invention accounts for the difference between regular and irregular street segments and uses additional parameters, such as the distance and angle to the street segment and the offsets from the segment ends, to improve the accuracy of the computed position. Compared with traditional rule-based address parsing, the neural network offers clear advantages: it depends on patterns rather than data, requires no data at run time, is fault-tolerant, achieves high accuracy on benchmark data sets, and delivers high throughput in big-data scenarios.

Description

Address matching method and system based on big data
Technical Field
The present application relates to the field of big data, and in particular, to an address matching method and system based on big data.
Background
Address matching is the process of establishing a correspondence between a textual address description and its spatial location coordinates. An address matching service looks up matching objects for an address according to specific steps: first, the address is standardized; then the server searches the address matching reference data for potential locations; each candidate location is assigned a score based on its proximity to the address, and the candidate with the highest score is finally matched. The input is not limited to postal addresses; different types of descriptive data are also supported. For different types of descriptive input data, matching is usually performed with simple operations on database elements, and when the amount of data is small, a quick match is typically possible. With the development of the internet and big data, higher demands are placed on the latency and accuracy of address matching, and existing address matching methods cannot meet them. In addition, address normalization is typically based on predetermined rules; however, errors and semantic ambiguities in the input may require processing multiple candidate addresses or address inputs, leading to inaccurate identification.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides an address matching method based on big data and a system thereof.
The invention relates to an address matching method based on big data, which comprises the following steps:
S1, constructing a big database consisting of an address locator and a street network;
S2, inputting an unstructured address;
S3, normalizing and standardizing the unstructured address;
S4, performing address matching based on the big data matching algorithm;
S5, positioning based on the address reference;
S6, outputting the address coordinates and the address object.
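The six steps above can be sketched as a minimal pipeline. All function bodies below are illustrative placeholder assumptions (a toy locator table, a lowercase/whitespace normalizer, an exact-lookup matcher), not the patented implementation.

```python
# Minimal sketch of the S1-S6 pipeline; every function body is an
# illustrative placeholder, not the patented algorithm.

def build_database():
    # S1: address locators (point coordinates) plus a street network
    locators = {"1 Main St": (10.0, 20.0)}
    streets = {"Main St": [{"min": 1, "max": 99, "type": "regular"}]}
    return locators, streets

def normalize(raw_address):
    # S3: collapse case and whitespace as a stand-in for full standardization
    return " ".join(raw_address.lower().split())

def match(address, locators):
    # S4: exact lookup as a stand-in for the big-data matching algorithm
    for key, coord in locators.items():
        if key.lower() == address:
            return key, coord
    return None, None

def geocode(raw_address):
    locators, _streets = build_database()  # S1
    address = normalize(raw_address)       # S2-S3
    obj, coord = match(address, locators)  # S4-S5
    return coord, obj                      # S6: coordinates and address object
```

For example, `geocode("  1 Main  St ")` normalizes the input and returns the stored coordinates together with the matched address object.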
Preferably, in step S4, the big data based matching algorithm further includes the following steps:
S41, verifying the address type;
S42, if the address is a street group, returning the centroid of the street group;
S43, if the address is a street, searching the street segments for the segment matching the address code;
S44, verifying whether the address code is within the range between the segment's minimum and maximum values;
S45, if it is within that range, obtaining the street segment;
S46, if there are multiple matching street segments, selecting one street segment according to the weight;
S47, after selecting a street segment, judging the type of the street segment;
S48, if the street segment is irregular, returning the centroid of the street segment;
S49, if the street segment is regular, performing interpolation according to the parity of the address number.
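The S41-S49 decision flow can be sketched as follows. The segment record fields (`min`, `max`, `weight`, `kind`, `start`, `end`), the centroid fallback, and the linear interpolation are illustrative assumptions about data not specified in the text.

```python
# Sketch of the S41-S49 matching flow; record layout and helpers are assumed.

def locate(address, street_groups, segments):
    # S41: verify the address type
    if address["type"] == "street_group":
        # S42: return the centroid of the street group
        return street_groups[address["name"]]["centroid"]
    # S43: collect the segments belonging to this street
    candidates = [s for s in segments if s["street"] == address["street"]]
    # S44-S45: keep segments whose numeric range contains the address number
    in_range = [s for s in candidates
                if s["min"] <= address["number"] <= s["max"]]
    if not in_range:
        return None
    # S46: if several segments qualify, pick the one with the highest weight
    seg = max(in_range, key=lambda s: s["weight"])
    # S47-S48: irregular segments fall back to their centroid
    if seg["kind"] == "irregular":
        return seg["centroid"]
    # S49: regular segments interpolate linearly along the segment
    t = (address["number"] - seg["min"]) / (seg["max"] - seg["min"])
    (x0, y0), (x1, y1) = seg["start"], seg["end"]
    return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
```

A house number halfway through a segment's range thus lands halfway along the segment geometry.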
Preferably, in step S46, selecting a street segment according to the weight further includes the following steps:
S461, selecting the address locators related to the street;
S462, generating street segments and calculating their numerical ranges;
S463, calculating the relative position of each address locator with respect to its corresponding street segment;
S464, calculating the percentages of address locators for each street segment;
S465, calculating the percentage of street segments with all odd numbers on one side and all even numbers on the other side;
S466, calculating the percentage of address locators belonging to street segments that comply with the parity condition;
S467, calculating the percentage of address locators that belong to one and only one street segment interval.
Preferably, in step S464, calculating the percentages of address locators for each street segment includes:
pOL: the percentage of address locators with an odd number located on the left side;
pOR: the percentage of address locators with an odd number located on the right side;
pEL: the percentage of address locators with an even number located on the left side;
pER: the percentage of address locators with an even number located on the right side.
Preferably, for a segment S, the probability that the odd addresses are on the left and the even addresses are on the right is:
P(OL&ER) = pOL × pER (1)
P(OR&EL) = pOR × pEL (2)
In summary, a street segment that complies with the parity condition satisfies:
P(OL&ER) = 1 OR P(OR&EL) = 1 (3)
Among these address locators, those that comply with the numerical-range precondition are the address locators that belong to one and only one range.
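The statistics of S464-S465 and formulas (1)-(3) can be sketched as follows. Reading pOL/pOR (pEL/pER) as the fractions of odd-numbered (even-numbered) locators on each side is one consistent interpretation of formula (3), since then pOL × pER = 1 exactly when all odd numbers are on the left and all even numbers on the right; the `(number, side)` locator encoding is an assumption.

```python
# Sketch of the parity percentages (S464) and the compliance test (1)-(3).
# Locators are (house_number, side) pairs; the interpretation of pOL..pER
# as conditional fractions is an assumption.

def parity_percentages(locators):
    odd = [side for num, side in locators if num % 2 == 1]
    even = [side for num, side in locators if num % 2 == 0]
    pOL = odd.count("L") / len(odd) if odd else 0.0
    pOR = odd.count("R") / len(odd) if odd else 0.0
    pEL = even.count("L") / len(even) if even else 0.0
    pER = even.count("R") / len(even) if even else 0.0
    return pOL, pOR, pEL, pER

def complies_with_parity(locators):
    # Formula (3): odd numbers all on one side, even numbers all on the other
    pOL, pOR, pEL, pER = parity_percentages(locators)
    return pOL * pER == 1.0 or pOR * pEL == 1.0
```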
Preferably, in step S3, normalizing and standardizing the unstructured address includes: a standardization stage, a pre-machine learning stage, a machine learning stage and a post-machine learning stage;
wherein the normalization stage is configured to process redundant representations in the address, comprising:
(1) Variant handling: large variations in address labels are mapped to a common representation, which reduces the feature space of the problem and speeds processing by reducing redundancy. For example, "street" may be expressed by semantically equivalent terms such as "road", "avenue", and "boulevard". The normalization phase converts these multiple representations into a common representation.
(2) Noise removal: noise in the data is reduced by deleting portions with little information content, for example redundant punctuation marks such as "..." and "!".
(3) Automatic label correction: during the training phase, normalization corrects errors in the label data by applying partial rules. For example, "Hubei" (a province) rarely occurs in the city column; the normalization phase detects this and corrects the error in the training data.
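Variant handling and noise removal can be sketched together as below. The variant table and the punctuation set are illustrative assumptions; the patent does not specify them.

```python
# Sketch of the normalization stage: (2) noise removal, then (1) variant
# mapping to a common representation. The table and regex are assumptions.

import re

VARIANTS = {"avenue": "street", "road": "street", "boulevard": "street"}

def normalize_address(text):
    # (2) Noise removal: strip redundant punctuation such as "..." and "!"
    text = re.sub(r"[.!?]+", " ", text.lower())
    # (1) Variant handling: map equivalent terms to a common representation
    words = [VARIANTS.get(w, w) for w in text.split()]
    return " ".join(words)
```

For example, `normalize_address("12 Oak Avenue!!")` yields `"12 oak street"`.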
Wherein the pre-machine learning phase makes the address structure similar to the structure used in training, including:
(1) Pre-tagging: at run time, an address may contain fields on which the machine learning model was not trained. The pre-machine-learning stage detects these fields, processes them with regular expressions, and passes the information to the machine learning stage so that no false results are produced.
For example, if an address contains a field "postal number" on which the machine learning model was not trained, regular expressions are used to identify the postal number, label it with the corresponding field, and filter it out before it reaches the machine learning classifier.
(2) Order correction: the fields in an address may appear in different positions. For example, the zip code may sometimes be located in the middle of the address rather than at the end. The pre-machine-learning stage detects such deviations and uses partial rules to make the address structure similar to the one the machine learning model was trained on.
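Both pre-machine-learning behaviors can be sketched in a few lines. The 6-digit zip pattern, the field name, and moving the code to the end are illustrative assumptions about what "pre-tagging" and "order correction" look like concretely.

```python
# Sketch of the pre-ML stage: (1) pre-tag an untrained field with a regex,
# (2) move a mid-address zip code to the end. Patterns are assumptions.

import re

ZIP_RE = re.compile(r"\b\d{6}\b")  # assumed 6-digit postal codes

def pre_ml(text):
    # (1) Pre-tagging: pull out fields the model was not trained on
    pretagged = {m: "zip" for m in ZIP_RE.findall(text)}
    # (2) Order correction: move the zip code to the last position,
    # matching the structure seen during training
    rest = " ".join(ZIP_RE.sub("", text).split())
    return (rest + " " + " ".join(pretagged)).strip(), pretagged
```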
The machine learning stage classifies the address into its corresponding tokens based on the output of pre-machine learning. This is token-level classification. Each token passes through normalization, pre-machine learning, feature extraction, feature encoding, classification, and finally post-machine learning. Each token's prediction is also used for the prediction of the next token. The machine learning stage uses a neural network to classify each address token into the address component it belongs to. For each address, a prediction is obtained for each token, and similar predictions are then merged to form the components they represent.
The post-machine-learning stage automatically corrects potential errors in the predictions of the machine learning stage. It takes the machine learning predictions as input and validates them against a set of rules. For example, it ensures that a predicted zip code has the expected number of digits. When the classifier incorrectly predicts another number as a zip code and the post-machine-learning component validates it against the regular expression, the mismatch is found and the error is detected.
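The post-machine-learning check can be sketched as a rule table of regular expressions applied to each predicted label. The rule set and the `{text: label}` prediction format are illustrative assumptions.

```python
# Sketch of the post-ML stage: validate predicted labels against rules and
# flag mismatches for correction. Rule set and data shape are assumptions.

import re

RULES = {"zip": re.compile(r"^\d{6}$")}  # assumed 6-digit zip codes

def post_ml(predictions):
    # predictions: {component_text: predicted_label}
    errors = []
    for text, label in predictions.items():
        rule = RULES.get(label)
        if rule and not rule.match(text):
            errors.append((text, label))  # mismatch: error detected
    return errors
```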
Preferably, the machine learning adopts a fully-connected neural network with a single hidden layer, and the learning process adopts stochastic gradient descent with the backpropagation algorithm; the input layer of the neural network corresponds to the number of input features, and the output layer corresponds to the total number of classes to predict.
Suppose n_L equals the number of layers, and L_i corresponds to layer i of the neural network. The neural network parameters are expressed as:
(W, b) = ((W^(1), b^(1)), (W^(2), b^(2)), ...),
where W represents the weight matrices and b the corresponding bias vectors. Here, W^(l)_ij represents the weight associated with the connection between unit j in layer l and unit i in layer l+1, and a^(l)_i represents the activation value of unit i in layer l. The learning phase comprises two stages, forward propagation and backward propagation, wherein:
In forward propagation, for a particular layer, the weighted sum z of the inputs x is calculated as:
z = ∑ W_i x_i + b (4)
Let f(z) denote the activation function of a given layer. The output of the activation function computed at layer l is provided to layer l+1. This process continues to the last layer, as shown by the following equations:
a^(1) = x (5)
z^(2) = W^(1) a^(1) + b^(1) (6)
a^(2) = f(z^(2)) (7)
z^(3) = W^(2) a^(2) + b^(2) (8)
h_{W,b}(x) = a^(3) = f(z^(3)) (9)
where h_{W,b}(x) is the predicted output.
In backward propagation, the output of the last layer is computed and compared with the actual value y, and the difference between the predicted and the actual value is taken as the error term δ. Let J be the cost function minimized using stochastic gradient descent; the gradient of J computed for a particular layer helps to optimize the weight matrix, as shown by the following equations:
δ^(3) = -(y - a^(3)) ∘ f'(z^(3)) (10)
δ^(2) = ((W^(2))^T δ^(3)) ∘ f'(z^(2)) (11)
∂J/∂W^(l) = δ^(l+1) (a^(l))^T (12)
∂J/∂b^(l) = δ^(l+1) (13)
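A scalar 1-1-1 instance of the update rules above can be sketched as follows, assuming a sigmoid activation f with f'(z) = f(z)(1 - f(z)) and a squared-error cost J = ½(y - a^(3))²; the network size, learning rate, and cost function are illustrative assumptions.

```python
# One backpropagation step for a 1-1-1 network, following the error terms
# and gradients of equations (10)-(13). All hyperparameters are assumptions.

import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
    # forward pass (scalars keep the sketch short)
    z2 = W1 * x + b1
    a2 = sig(z2)
    z3 = W2 * a2 + b2
    a3 = sig(z3)
    # (10): output error term, with f'(z) = f(z)(1 - f(z)) for the sigmoid
    d3 = -(y - a3) * a3 * (1 - a3)
    # (11): hidden error term propagated back through W2
    d2 = W2 * d3 * a2 * (1 - a2)
    # (12)-(13): gradient-descent updates for weights and biases
    W2 -= lr * d3 * a2
    b2 -= lr * d3
    W1 -= lr * d2 * x
    b1 -= lr * d2
    return W1, b1, W2, b2, a3
```

Iterating the step drives the prediction toward the target value.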
Preferably, the method further comprises a feature extraction and encoding step, wherein:
Feature extraction extracts the features whose information the neural network model uses to predict components. The feature set consists of context features (previous label, previous phrase, next phrase), syntactic features (number of digits, suffix), positional features (index of the current token, relative index of the current token), and list-based features (e.g., street names).
Feature encoding uses binary encoding of the categorical features based on phrase frequency. A vocabulary/dictionary (the set of the most common phrases in the training data) is formed based on frequency of occurrence in the training data, and each phrase is checked for membership. Each phrase present in the dictionary is converted into a binary vector whose length equals the number of tokens in the dictionary: the vector consists of all zeros except for a "1" at the position of that particular phrase. All out-of-vocabulary words are encoded into one shared representation. As a result of the feature encoding process, the matrix produced for a batch of training samples is sparse.
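The frequency-based vocabulary and the one-hot encoding it implies can be sketched as below; the vocabulary size and the shared all-zeros out-of-vocabulary code are assumptions consistent with the description.

```python
# Sketch of the frequency-based dictionary and one-hot feature encoding.
# Vocabulary size and out-of-vocabulary handling are assumptions.

from collections import Counter

def build_vocab(phrases, size):
    # keep the `size` most frequent phrases from the training data
    return [p for p, _ in Counter(phrases).most_common(size)]

def encode(phrase, vocab):
    # binary vector as long as the vocabulary: all zeros except a single 1
    # at the phrase's position; out-of-vocabulary phrases share one code
    vec = [0] * len(vocab)
    if phrase in vocab:
        vec[vocab.index(phrase)] = 1
    return vec
```

Encoding a batch of such vectors yields the sparse matrix the text describes.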
Preferably, a training phase is also included, wherein: the training data is profiled to obtain the frequent words that are consumed later in the feature extraction step, forming a dictionary/vocabulary. The data is then divided into batches, features are extracted from each batch and encoded, and the batches are trained; the training accuracy is measured and visualized to decide whether training should continue.
To process performance evaluation efficiently, the model may emit information or diagnostics for each batch as it is trained. This information includes the validation accuracy, test accuracy, log loss, and other performance indicators of the model. After each batch of training, an accuracy index is obtained. In neural network architectures, there are typically many parameters that need to be tuned. Furthermore, not all training data is useful. To obtain the optimal portion of the training data and the optimal hyperparameters, the framework compiles a diagnosis for each batch of data and plots a real-time curve describing how the accuracy fluctuates as training progresses. This provides early feedback, and the model can easily be adjusted accordingly to obtain the best parameters and data set. As the number of batches or samples increases, the training, test, and validation accuracies improve. However, at some point the training accuracy continues to improve while the validation accuracy decreases; at that point training is stopped and the parameters are adjusted.
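The stop rule described above, halting when validation accuracy falls from its best value while training accuracy keeps rising, can be sketched as follows. The `patience` parameter and the per-batch accuracy lists are illustrative assumptions.

```python
# Sketch of the early-stopping rule: stop when validation accuracy drops
# while training accuracy keeps improving. `patience` is an assumption.

def stop_batch(train_acc, val_acc, patience=1):
    # returns the index of the first batch where training should stop,
    # or None if training should continue
    best, since_best = -1.0, 0
    for i, v in enumerate(val_acc):
        if v > best:
            best, since_best = v, 0
        else:
            since_best += 1
            if since_best >= patience and train_acc[i] > train_acc[i - 1]:
                return i
    return None
```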
Preferably, the address matching parameters include:
Min_E and Max_E: the minimum and maximum even address values;
q1_E and q2_E: the offsets at the two ends of the minimum and maximum values on the even-numbered side of the segment;
d_E and α_E: the distance and angle to the even-numbered side of the segment;
Min_O and Max_O: the minimum and maximum odd address values;
q1_O and q2_O: the offsets at the two ends of the minimum and maximum values on the odd-numbered side of the segment;
d_O and α_O: the distance and angle to the odd-numbered side of the segment.
preferably, the range between the minimum and maximum values includes the odd and even minimum and maximum values.
In addition, the address matching system based on the big data runs the address matching method based on the big data.
In the case of linear referencing, the street segment matching the address description is first determined based on the address range in the street segment attributes. The left/right side of the address is obtained from the parity type of each side of the street segment and the parity of the number in the address, and the exact position on the corresponding side is calculated proportionally. The difference between regular and irregular street segments is taken into account, and additional parameters, such as the distance and angle to the street segment and the offsets from the segment ends, are used to improve the accuracy of the calculated position. Compared with traditional rule-based address parsing, the neural network offers clear advantages: it depends on patterns rather than data, requires no data at run time, is fault-tolerant, achieves high accuracy on benchmark data sets, and delivers high throughput in big-data scenarios.
Drawings
FIG. 1 is a flow chart of a big data based address matching method of the present invention.
FIG. 2 is a flow chart of a big data based address matching method of the present invention.
FIG. 3 is a flow chart of a big data based address matching method of the present invention.
FIG. 4 is a flow chart of a big data based address matching method of the present invention.
FIG. 5 is a flow chart of a big data based address matching method of the present invention.
FIG. 6 is a flow chart of a big data based address matching method of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
As shown in fig. 1 to 6, the address matching method based on big data of the present invention includes the following steps:
S1, constructing a big database consisting of an address locator and a street network;
S2, inputting an unstructured address;
S3, normalizing and standardizing the unstructured address;
S4, performing address matching based on the big data matching algorithm;
S5, positioning based on the address reference;
S6, outputting the address coordinates and the address object.
Preferably, in step S4, the big data based matching algorithm further includes the following steps:
s41, verifying the address type;
s42, if the address street group, returning the barycentric position of the street group;
s43, if the address is a street, searching the street segments for the street segment matched with the code of the address;
s44, verifying whether the address code of the group of segments is in the range between the minimum value and the maximum value;
s45, if the distance is within the range between the minimum value and the maximum value, obtaining a street segment;
s46, if there are multiple street segments, selecting one street segment according to the weight;
s47, after selecting a street segment, judging the type of the street segment,
s48, if the street segment is irregular, returning to the center of gravity of the street segment;
and S49, if the street segment is a rule, performing interpolation according to the parity check of the address number.
Preferably, in step S46, the selecting a street segment according to the weight further includes the following steps:
s461, selecting an address locator related to a street;
s462, generating street segments and calculating a numerical range;
s463, calculating the relative position of each address locator to the corresponding street segment;
s464, calculating the percentage of address locators for each address segment;
s465, calculating the percentage of street segments with all odd numbers on one side and even numbers on the other side;
s466, calculating the percentage of unaddressed locators belonging to the street segment that comply with the parity condition;
s467, the percentage of address locators that belong to only one street segment interval is calculated.
Preferably, the step S464 of calculating the percentage of address locators for each address segment includes:
pOL: the address locator has an odd number and percentage on the left;
pOR: the address locator has an odd number and a percentage on the right;
pEL: the percentage of address locators with even numbers and on the left;
pER: the address locator has an even number and a percentage to the right.
Preferably, for segment S, the probability that the odd address is to the left and the even address is to the right is:
P(OL&ER)=pOL×pER (1)
P(OR&EL)=pOR×pEL (2)
in summary, a street portion that complies with the parity check condition has:
p (OL & ER) ═ 1 OR P (OR & EL) ═ 1 (3)
Among these address locators, subject to the number range precondition is an address locator belonging to one and only one interval.
Preferably, in step S3, normalizing and standardizing the unstructured address includes: a standardization stage, a pre-machine learning stage, a machine learning stage and a post-machine learning stage;
wherein the normalization stage is configured to process redundant representations in the address, comprising:
(1) change treatment: large changes in address labels are mapped to a common representation to reduce problem feature space and to speed processing by reducing redundancy. For example, "street" may be expressed by the same semantic terms, such as "road, avenue, road, and road. The normalization phase converts these multiple representations into a common representation.
(2) Noise removal: noise in the data is reduced by deleting portions of lesser information content. For example, redundant punctuation marks, such as "…" and "! ".
(3) Automatic label correction: during the training phase, normalization corrects errors in the label data by applying partial rules. For example, "Hubei" rarely occurs in city columns, and the normalization phase detects this and corrects errors in the data during the training phase.
Wherein the pre-machine learning phase makes the address structure similar to the structure used in training, including:
(1) pre-marking: at runtime, there may be some fields in an address that the machine learning model does not train. Pre-machine learning detects these fields and processes them using regular expressions and passes the information to the machine learning stage so that no false results are produced.
For example, if an address contains a field "postal number" that is not trained on a machine learning model, regular expressions are used to identify the postal number, label it according to the corresponding field, and filter it out before reaching the machine learning classifier.
(2) And (4) sequential correction: the fields in the address may be located in different locations. For example, in some cases, the zip code may be located in the middle portion of the address, rather than the last portion. Pre-machine learning detects such deviations and uses partial rules to make the address structure similar to that trained by machine learning models.
And the machine learning stage is used for classifying the addresses into corresponding identifications according to the processing result of the pre-machine learning. This is the classification of the identification level. Each identity is passed to normalization, pre-machine learning, feature extraction, feature encoding, classification, and finally to post-machine learning. Each identified prediction is also used for the next identified prediction. The machine learning phase uses a neural network to classify the address identification into the address component to which it belongs. For each address, a prediction for each identification is obtained, and then similar predictions are combined to form the components that they represent.
Wherein the post-machine learning phase is used to automatically correct potential errors of the predictive machine learning phase. This phase takes as input a prediction of machine learning and validates it according to a set of rules. For example, it is ensured that the predicted zip code has a significant number of digits. When the classifier incorrectly predicts other numbers as zip codes, and the post-machine learning component validates the regular expressions, a mismatch is found here and an error is detected.
Preferably, the machine learning adopts a fully-connected neural network with a single hidden layer, and the learning process adopts random gradient descent with a back propagation algorithm; the input layer of the neural network corresponds to the number of input features; the output layer corresponds to the total number of classes to predict;
suppose nLEqual to the number of layers, LiCorresponding to layer i of the neural network. The neural network parameters are expressed as:
(W,b)=((W1,b1),(W2,b2)...),
where W represents the weight matrix and b corresponds to the bias matrix. Here, WL ijRepresenting the weight associated with the connection between cell j in layer L and cell i in layer L + 1. a isL iRepresenting the activation value of cell i in the L layer. The learning phase comprises two phases of forward propagation and backward propagation, wherein:
in forward propagation, the weighted sum (x) of the inputs, for a particular layer, is calculated as z:
z=(∑Wixi+b) (4)
let f (z) denote the activation function for a given layer. The output of the activation function computed at layer L is provided to layer L + 1. This process continues to the last layer as shown by the following equation:
a(1)=x (5)
z(2)=W(1)x+b(1) (6)
a(2)=f(z(2)) (7)
z(3)=W(2)x+b(2) (8)
hw,b(x)=a(3)=f(z(3)) (9)
whereinhW,b(x) Is the prediction output.
In back propagation, the output of the last layer is calculated, compared to the actual value (y), and the difference between the predicted and actual values (y) is calculated as the error term δ.
Let J be the cost function minimized using random gradient descent,
Figure BDA0002821113140000101
is the calculated gradient of a particular layer, helps to optimize the weight matrix, as shown by the following equation:
δ(3)=-(y-a(3).f′(z(3))) (10)
δ(2)=-(W(2)T).δ(3).f′(z(2)) (11)
Figure BDA0002821113140000102
Figure BDA0002821113140000103
preferably, the method further comprises the following steps: a feature extraction and encoding step, wherein:
feature extraction involves extracting features that convey information that neural network models use to predict components. The feature set consists of some context (previous label, next phrase, previous phrase), some grammar (number of bits, suffix), some location features (index of current identification, relative index of current identification) and some list-based features (e.g., street name).
Feature coding uses binary coding of the classification features based on phrase frequency. A vocabulary/dictionary (the most common set of phrases in the training data) is formed based on the "frequency" of occurrence in the training data, and each phrase is checked for the presence. Each phrase present in the dictionary is converted into a binary vector equal to the number of tokens in the dictionary. In each phrase vocabulary, the vector consists of all zeros, but there is a "1" at a particular position in a particular phrase. All non-lexical words are encoded into the same representation. The resulting matrix is sparse for a batch of training samples as a result of the feature encoding process.
Preferably, a training phase is also included, wherein: the training data is profiled to obtain frequent words that are further consumed in the feature extraction step, forming a dictionary/vocabulary. The data is then divided into batches from which features are extracted, encoded, and the batches are then trained and the accuracy of the training is measured and visualized to determine if training is to continue.
To efficiently process performance evaluations, in each batch, the model may issue some information or diagnostics as it is trained. This information includes the verification accuracy, test accuracy, log loss, and other performance indicators of the model. After each batch of training, an accuracy index is obtained. In neural network architectures, there are typically many parameters that need to be adjusted. Furthermore, not all training data is useful. To obtain the optimal portion of the training data and the optimal hyperparameters, the framework compiles a diagnosis for each batch of data and plots a real-time curve describing how the accuracy fluctuates as the training progresses. This provides an early feedback and model that can be easily adjusted accordingly to obtain the best parameters and data set. With the increase of the number of batch or indirect samples, the accuracy of training, testing and verification is improved. However, at some point, the accuracy of the sequence continues to improve while the verification decreases, at which point the training is stopped and the parameters are adjusted.
Preferably, the address matching parameters include:
Min_E and Max_E: the minimum and maximum even address numbers;
q1_E and q2_E: the offsets at the minimum-value and maximum-value ends of the even-numbered segment side;
d_E and α_E: the distance and angle to the even-numbered segment side;
Min_O and Max_O: the minimum and maximum odd address numbers;
q1_O and q2_O: the offsets at the minimum-value and maximum-value ends of the odd-numbered segment side;
d_O and α_O: the distance and angle to the odd-numbered segment side.
Preferably, the range between the minimum and maximum values covers both the odd and the even minimum and maximum values.
According to the address matching method and system based on big data, under linear referencing the street segment matching an address description list is determined from the address range in the street segment's attributes. Once the street segment is determined, the side (left/right) of the address is obtained from the parity type of each side of the segment and the parity of the number in the address, and the exact position on that side is calculated proportionally. The difference between regular and irregular street segments is taken into account, and parameters such as the distance and angle to the segment and the offsets at the segment ends are used, improving the accuracy of the position calculation. Compared with traditional rule-based address resolution, the neural network has clear advantages: it relies on patterns rather than raw data, is independent of the data at runtime, is fault-tolerant, achieves high accuracy on benchmark data sets, and offers high throughput in big-data scenarios.
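The proportional position calculation on the matched segment side can be illustrated as plain linear interpolation. This is a simplified sketch that omits the end offsets q1/q2 and the side distance/angle d/α mentioned above; the function name and arguments are assumptions:

```python
def interpolate_position(addr_num, seg_min, seg_max, start, end):
    """Proportionally place a house number along a regular street segment
    under linear referencing. `start`/`end` are (x, y) endpoints of the
    segment whose side range [seg_min, seg_max] matches the number's parity."""
    if seg_max == seg_min:
        t = 0.5  # degenerate range: fall back to the segment midpoint
    else:
        t = (addr_num - seg_min) / (seg_max - seg_min)
    x = start[0] + t * (end[0] - start[0])
    y = start[1] + t * (end[1] - start[1])
    return (x, y)

# House number 150 on a side numbered 100-200 of a segment from (0,0) to (100,0):
print(interpolate_position(150, 100, 200, (0, 0), (100, 0)))  # (50.0, 0.0)
```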
Finally, the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications and substitutions fall within the protection scope of the present invention.

Claims (7)

1. An address matching method based on big data, characterized by comprising the following steps:
s1, constructing a big database consisting of an address locator and a street network;
s2, inputting an unstructured address;
s3, normalizing and standardizing the unstructured address;
s4, performing address matching based on the big data matching algorithm;
s5, positioning based on the address reference;
s6, outputting the address coordinates and the address object;
in step S4, the big data based matching algorithm includes the following steps:
s41, verifying the address type;
s42, if the address street group, returning the barycentric position of the street group;
s43, if the address is a street, searching the street segments for the street segment matched with the code of the address;
s44, verifying whether the address code of the street segment is in the range between the minimum value and the maximum value;
s45, if the distance is within the range between the minimum value and the maximum value, obtaining a street segment;
s46, if there are multiple street segments, selecting one street segment according to the weight;
s47, after selecting a street segment, judging the type of the street segment,
s48, if the street segment is irregular, returning to the center of gravity of the street segment;
s49, if the street segment is a rule, carrying out interpolation according to the parity check of the address number;
in step S46, the selecting a street segment according to the weight includes the following steps:
s461, selecting an address locator related to a street;
s462, generating street segments and calculating a numerical range;
s463, calculating the relative position of each address locator to the corresponding street segment;
s464, calculating the percentage of address locators for each address segment;
s465, calculating the percentage of street segments with all odd numbers on one side and even numbers on the other side;
s466, calculating the percentage of unaddressed locators belonging to the street segment that comply with the parity condition;
s467, calculating the percentage of address locators belonging to only one street segment interval;
the step S464 of calculating the percentage of address locators for each address segment includes:
pOL: the percentage of address locators with an odd number and located on the left;
pOR: the percentage of address locators with an odd number and located on the right;
pEL: the percentage of address locators with an even number and located on the left;
pER: the percentage of address locators with an even number and located on the right;
for street segment S, the probability of an odd address to the left and an even address to the right is:
P(OL&ER)=pOL×pER (1)
P(OR&EL)=pOR×pEL (2)
in summary, a street segment that complies with the parity check condition has:
P(OL&ER) = 1 or P(OR&EL) = 1 (3)
Among these address locators, those subject to the number-range precondition are the address locators belonging to one and only one interval.
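Equations (1)-(3) can be exercised in a short sketch. One plausible reading of the percentages — chosen so that a fully compliant segment reaches a product of 1, as equation (3) requires — is the fraction of odd-numbered locators on each side and the fraction of even-numbered locators on each side. The data layout (a list of `(house_number, side)` pairs) is an assumption:

```python
def parity_weight(locators):
    """Return (P(OL&ER), P(OR&EL)) for one street segment, per equations
    (1)-(2). pOL/pOR: fraction of odd-numbered locators on the left/right;
    pEL/pER: fraction of even-numbered locators on the left/right."""
    odds = [side for num, side in locators if num % 2 == 1]
    evens = [side for num, side in locators if num % 2 == 0]
    pOL = odds.count('L') / len(odds) if odds else 0.0
    pOR = odds.count('R') / len(odds) if odds else 0.0
    pEL = evens.count('L') / len(evens) if evens else 0.0
    pER = evens.count('R') / len(evens) if evens else 0.0
    return pOL * pER, pOR * pEL

# All odd numbers on the left, all even on the right: P(OL&ER) = 1,
# so the segment complies with the parity check condition (equation 3).
locs = [(101, 'L'), (103, 'L'), (102, 'R'), (104, 'R')]
print(parity_weight(locs))  # (1.0, 0.0)
```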
2. The big-data-based address matching method as claimed in claim 1, wherein in step S3, normalizing and standardizing the unstructured address comprises: a standardization stage, a pre-machine learning stage, a machine learning stage and a post-machine learning stage;
wherein the normalization stage is configured to process redundant representations in the address, comprising:
(1) variation handling: mapping the many variations in address labels to a common representation, reducing the feature space of the problem and speeding up processing by reducing redundancy;
(2) noise removal: reducing noise in the data by deleting portions with lower information content;
(3) automatic label correction: in the training phase, normalization corrects errors in the label data by applying partial rules;
wherein the pre-machine-learning stage is used to make the address structure similar to that used in training, comprising:
(1) pre-tagging: at runtime, an address may contain fields on which the machine learning model was not trained; pre-machine learning detects these fields, processes them with regular expressions, and passes the information to the machine learning stage;
(2) order correction: fields in an address may appear in different positions; pre-machine learning detects such deviations and applies rules to make the address structure similar to that on which the machine learning model was trained;
the machine learning stage is used for classifying the address into its corresponding identifiers according to the result of the pre-machine-learning processing; each identifier passes through normalization, pre-machine learning, feature extraction, feature coding, and classification, and is finally passed to post-machine learning; the prediction for each identifier is used in the prediction of the next identifier; the machine learning stage uses a neural network to classify each address identifier into the address component to which it belongs; for each address, a prediction is obtained for each identifier, and similar predictions are combined to form the components they represent;
wherein the post-machine-learning stage is used to automatically correct potential prediction errors of the machine learning stage; this stage takes the machine learning predictions as input and verifies them against a set of rules; for example, when the classifier incorrectly predicts some other number as a zip code, the post-machine-learning component's regular-expression validation finds a mismatch and the error is detected.
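The post-machine-learning check described above — validating a predicted label against a rule — can be sketched with a regular expression. The label name, the function, and the six-digit postal-code pattern (common for Chinese zip codes) are illustrative assumptions, not the patent's rule set:

```python
import re

# Rule for the "zipcode" label: exactly six digits.
ZIP_RE = re.compile(r"\d{6}")

def post_ml_validate(token, predicted_label):
    """Return the predicted label if it passes the rule check for that
    label; otherwise flag a mismatch so the error can be corrected."""
    if predicted_label == "zipcode" and not ZIP_RE.fullmatch(token):
        return "mismatch"  # classifier error detected
    return predicted_label

print(post_ml_validate("130118", "zipcode"))  # "zipcode": passes the rule
print(post_ml_validate("13", "zipcode"))      # "mismatch": too short
```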
3. The address matching method based on big data as claimed in claims 1-2, wherein the machine learning employs a fully connected neural network with a single hidden layer, and the learning process employs stochastic gradient descent with the back-propagation algorithm; the input layer of the neural network corresponds to the number of input features; the output layer corresponds to the total number of classes to predict;
suppose n_l equals the number of layers and L_i denotes the i-th layer of the neural network; the neural network parameters are expressed as:
(W, b) = ((W(1), b(1)), (W(2), b(2)), …),
where W represents a weight matrix and b the corresponding bias matrix; here, W(l)_ij represents the weight associated with the connection between unit j in layer l and unit i in layer l+1, and a(l)_i represents the activation value of unit i in layer l; the learning phase comprises two phases, forward propagation and backward propagation, wherein:
in forward propagation, for a particular layer, the weighted sum z of the inputs x is calculated as:
z = ∑ W_i x_i + b (4)
let f(z) represent the activation function for a given layer; the output of the activation function calculated at layer l is provided to layer l+1; this process continues to the last layer, as shown by the following equations:
a(1)=x (5)
z(2)=W(1)x+b(1) (6)
a(2)=f(z(2)) (7)
z(3) = W(2)a(2) + b(2) (8)
h_W,b(x) = a(3) = f(z(3)) (9)
where h_W,b(x) is the predicted output;
in back propagation, the output of the last layer is computed and compared with the actual value y, and the difference between the predicted value and y is computed as an error term δ;
let J be the cost function minimized using stochastic gradient descent; the gradient of J calculated for a particular layer is used to optimize the weight matrix, as shown in the following equations, where ⊙ denotes the element-wise product:
δ(3) = -(y - a(3)) ⊙ f′(z(3)) (10)
δ(2) = ((W(2))^T δ(3)) ⊙ f′(z(2)) (11)
∇_W(l) J = δ(l+1) (a(l))^T (12)
∇_b(l) J = δ(l+1) (13)
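Equations (4)-(13) can be collected into one NumPy sketch of a single forward/backward pass through the single-hidden-layer network. The column-vector convention, the sigmoid activation, and the layer sizes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward_backward(x, y, W1, b1, W2, b2):
    """One forward and backward pass of a single-hidden-layer network."""
    # Forward: a(1)=x, z(2)=W(1)a(1)+b(1), a(2)=f(z(2)), z(3)=W(2)a(2)+b(2)
    a1 = x
    z2 = W1 @ a1 + b1
    a2 = sigmoid(z2)
    z3 = W2 @ a2 + b2
    a3 = sigmoid(z3)  # h_W,b(x), the predicted output
    # Backward: error terms (10)-(11), gradients (12)-(13)
    d3 = -(y - a3) * sigmoid_prime(z3)
    d2 = (W2.T @ d3) * sigmoid_prime(z2)
    grad_W1 = d2 @ a1.T
    grad_b1 = d2
    grad_W2 = d3 @ a2.T
    grad_b2 = d3
    return a3, (grad_W1, grad_b1, grad_W2, grad_b2)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1))          # 4 input features
y = np.array([[1.0]])                    # 1 output class
W1, b1 = rng.standard_normal((3, 4)), np.zeros((3, 1))  # hidden layer of 3 units
W2, b2 = rng.standard_normal((1, 3)), np.zeros((1, 1))
out, grads = forward_backward(x, y, W1, b1, W2, b2)
print(out.shape, grads[0].shape)  # (1, 1) (3, 4)
```

A stochastic gradient descent step would then subtract a learning rate times each gradient from the corresponding parameter matrix.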
4. The address matching method based on big data as claimed in claims 1-3, further comprising: a feature extraction and encoding step, wherein:
feature extraction includes extracting features that convey information used by the neural network model to predict the component; the feature set consists of context, syntax, location features, and list-based features;
the feature coding uses binary coding of the classification features based on phrase frequencies; a vocabulary/dictionary is formed from the most frequent phrases in the training data, and each phrase is checked for membership; each phrase present in the dictionary is converted into a binary vector whose length equals the number of tokens in the dictionary, consisting of all zeros except for a "1" at the position corresponding to that phrase; all out-of-vocabulary words are encoded into the same representation; as a result of the feature encoding process, the resulting matrix for a batch of training samples is sparse.
5. The address matching method based on big data as claimed in claims 1-4, further comprising a training phase, wherein: the training data is profiled to obtain the frequent words that are later consumed in the feature extraction step, forming the dictionary/vocabulary; the data is then divided into batches from which features are extracted and encoded; the batches are then used for training, and training accuracy is measured and visualized to determine whether training should continue.
6. The address matching method based on big data as claimed in claims 1-5, wherein the address matching parameters include: Min_E and Max_E: the minimum and maximum even address numbers; q1_E and q2_E: the offsets at the minimum-value and maximum-value ends of the even-numbered segment side; d_E and α_E: the distance and angle to the even-numbered segment side; Min_O and Max_O: the minimum and maximum odd address numbers; q1_O and q2_O: the offsets at the minimum-value and maximum-value ends of the odd-numbered segment side; d_O and α_O: the distance and angle to the odd-numbered segment side; the range between the minimum and maximum values covers both the odd and the even minimum and maximum values.
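The per-side parameter set in this claim can be represented as a simple data structure. This is a sketch; the class and field names are paraphrases of the claim's symbols, not part of the patent:

```python
from dataclasses import dataclass

@dataclass
class SideParams:
    """Matching parameters for one side of a street segment: the address
    range (Min/Max), the end offsets (q1/q2), and the distance and angle
    (d/alpha) placing the address relative to the segment side."""
    min_addr: int   # Min_E or Min_O
    max_addr: int   # Max_E or Max_O
    q1: float       # offset at the minimum-value end
    q2: float       # offset at the maximum-value end
    d: float        # distance to the segment side
    alpha: float    # angle to the segment side

# One even side and one odd side of the same segment.
even = SideParams(min_addr=100, max_addr=200, q1=5.0, q2=5.0, d=10.0, alpha=90.0)
odd = SideParams(min_addr=101, max_addr=199, q1=5.0, q2=5.0, d=10.0, alpha=90.0)
print(even.min_addr, odd.min_addr)  # 100 101
```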
7. An address matching system based on big data, characterized in that the address matching system runs the address matching method based on big data according to claims 1-6.
CN202011418486.2A 2020-12-07 2020-12-07 Address matching method and system based on big data Pending CN112597362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011418486.2A CN112597362A (en) 2020-12-07 2020-12-07 Address matching method and system based on big data


Publications (1)

Publication Number Publication Date
CN112597362A true CN112597362A (en) 2021-04-02

Family

ID=75188637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011418486.2A Pending CN112597362A (en) 2020-12-07 2020-12-07 Address matching method and system based on big data

Country Status (1)

Country Link
CN (1) CN112597362A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113568845A (en) * 2021-07-29 2021-10-29 北京大学 Memory address mapping method based on reinforcement learning
CN113568845B (en) * 2021-07-29 2023-07-25 北京大学 Memory address mapping method based on reinforcement learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210402