CN111209362A

CN111209362A - Address data analysis method based on deep learning

Info

Publication number: CN111209362A
Application number: CN202010011871.9A
Authority: CN
Inventors: 张磊; 陶虹; 张旭方
Original assignee: Suzhou Chengfang Information Technology Co ltd
Current assignee: Suzhou Chengfang Information Technology Co ltd
Priority date: 2020-01-07
Filing date: 2020-01-07
Publication date: 2020-05-29

Abstract

The invention relates to an address data analysis method based on deep learning. According to the address resolution requirements, address data are mapped to the corresponding key parcel information and labeled in multiple dimensions, so that the labeled key parcel information carries address name content text with different types of labels. Word segmentation is performed on the multi-dimensionally labeled address name content text to generate address training data, and a BiLSTM-CNN-CRF model is constructed for training. Starting from problems encountered in real place-name address resolution services, the invention builds an abstract model of address resolution with multi-dimensional data labeling, removes the need for the cumbersome separate steps of word segmentation, matching, and recognition, and realizes an end-to-end fused processing mode.

Description

Address data analysis method based on deep learning

Technical Field

The invention belongs to the technical field of geographical name address resolution, and particularly relates to an address data resolution method based on deep learning.

Background

In today's information age, every department in a city stores a large amount of address-related geographical location information. Most of this data is non-spatial, so it cannot be shared between industries through a geographic information system. The spatialization of urban address information is therefore an important component of digital city construction.

Geocoding is a method for spatializing urban address information. It converts address information described in text into geographic coordinates and, through coding and address matching, determines the position on an electronic map of the geographic entity that corresponds to the address data. Geocoding turns large amounts of social and economic data into spatial information with coordinates, which enables faster and more effective spatial analysis and supports government decision-making.

Natural language processing (NLP) is a technology that enables computers to understand human language, and word segmentation is one of its basic tasks. In NLP algorithms commonly used internationally, deep syntactic and semantic analysis usually takes words as the basic unit, so word segmentation is usually the first task. Building a model in the NLP domain often requires modelers to master certain linguistic knowledge in order to extract appropriate features. Deep learning has excellent generalization capability and can extract features from data without supervision, learning contextual features from the training data; what the experimenter needs to do is design the structure of the neural network and provide high-quality training data. Geocoding enables fast address query and matching and the spatialization of social and economic data, and establishing unified database management allows data to be shared across all departments and industries in a city. Existing address word segmentation models need to be improved so that word segmentation accuracy is greatly increased. By constructing an address resolution algorithm based on deep learning, the invention improves the resolution success rate for two types of fuzzy addresses: incomplete addresses and ambiguous addresses.

Disclosure of Invention

The technical problem is as follows: traditional place-name address resolution relies on full database retrieval and matching (word segmentation, matching, recognition), which is slow and has a low success rate. To address these problems, the invention provides an address data analysis method based on deep learning. Starting from problems encountered in real place-name address resolution services, the invention builds an abstract model of address resolution with multi-dimensional data labeling, removes the need for the cumbersome separate steps of word segmentation, matching, and recognition, and realizes an end-to-end fused processing mode.

The invention models address resolution as the process of extracting key parcel information from address data, and further abstracts this extraction as a multi-class classification problem over parcel information. When the deep learning model for address resolution is built, the address data are labeled in multiple dimensions according to the address resolution requirements, so that the labeled address data carry different label contents. Specifically, administrative divisions, roads, plots, house numbers, buildings, rooms, and interference information in the address data are labeled as separate categories; importantly, incomplete and ambiguous addresses are labeled in the same way. The trained model can identify the corresponding parcel information in an address and automatically eliminate interference and useless information, which greatly improves the accuracy and speed of resolution.

The technical scheme is as follows: the invention discloses an address data analysis method based on deep learning, which comprises the following steps:

mapping the address data to corresponding key parcel information according to address resolution requirements for carrying out multi-dimensional data marking, wherein the marked key parcel information data have different types of label address name content texts; and performing word segmentation processing on the multi-dimension labeled address name content text to generate address training data.

The address information is split and labeled to obtain sequence word segment text, which serves as the training data. Through word embedding, each word is assigned a word vector as the representation of the address text, and the computer reads in the training data. A threshold is set on the length of Chinese addresses, and address data exceeding the threshold are filtered out. In building the deep learning model, the first step is labeling the address information, which is the most time-consuming work in the whole model training process; the labeled training data are represented as address text through word embedding so that the computer can read and understand the input. Second, the represented data are learned by a model consisting of BiLSTM, CNN, and CRF layers. Finally, the model outputs its learning result, and the key information in the address is extracted according to the labeling result.

For example, 'the region rocchy yi fenchy 1 house 109' is labeled 'O O A1 A2 C1 C2 F1 F2 E1 E2 E2', where O denotes useless information, the span from C1 to C2 is xx information, F1 to F2 is xx information, and E1 to E2 is xx information. The address is then resolved by extracting segments according to the labeled result.
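As a minimal sketch of how segments might be extracted from a labeled result, the following Python groups consecutive characters whose labels share the same category letter and drops characters labeled O. The function name `extract_segments`, the placeholder characters, and the grouping rule are illustrative assumptions; the source does not specify the decoding procedure or the meaning of each category letter.

```python
from itertools import groupby

def extract_segments(chars, labels):
    """Group consecutive characters whose labels share a category letter.

    Assumption: a label is a category letter followed by a position digit
    (for example C1, C2), and "O" marks useless characters to be dropped.
    """
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    segments = []
    pairs = zip(chars, labels)
    # The grouping key is the first character of the label (the category).
    for category, group in groupby(pairs, key=lambda pair: pair[1][0]):
        if category == "O":
            continue  # discard interference characters
        text = "".join(ch for ch, _ in group)
        segments.append((category, text))
    return segments

# Hypothetical 11-character input paired with the label sequence above.
chars = list("xxabcdefghi")
labels = ["O", "O", "A1", "A2", "C1", "C2", "F1", "F2", "E1", "E2", "E2"]
segments = extract_segments(chars, labels)
print(segments)  # [('A', 'ab'), ('C', 'cd'), ('F', 'ef'), ('E', 'ghi')]
```

One limitation of this simplified rule is that two adjacent segments of the same category would be merged; a fuller decoder would also use the position digit to detect where a new segment starts.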

A BiLSTM-CNN-CRF model is constructed for training. The training data are arranged in sequence, the structural relevance of word segments is determined through word vectors and part-of-speech features, and the tensor features formed by concatenating the word vectors and part-of-speech features are output. Word embedding is mainly used to overcome two difficulties: uneven text length and incorporating word-to-word relations into the model. In short, each word is assigned a word vector that represents a point in space, and words with similar meanings have word vectors that are close together. Operations on words can thus be converted into operations on vectors, which are called tensors in deep learning. The tensor of a text implicitly carries the combined meaning of its words; it can be regarded as the feature engineering of the text and lays the foundation for text analysis with machine learning and deep learning.

The address training data are arranged in sequence, the structural relevance of word segments is determined through word embedding, and the corresponding word vectors are output. Word embedding is mainly used to overcome two difficulties: uneven text length and incorporating word-to-word relations into the model. In short, each word is given a reasonable vector representation that corresponds to a point in space, and words with similar meanings have word vectors that are close together. Operations on words can thus be converted into operations on vectors, which are called tensors in deep learning. The tensor of a text implicitly carries the combined meaning of its words; it can be regarded as a preprocessing step for the text and provides the basis for text analysis with machine learning and deep learning.

Through a BiLSTM model and a CNN model, the word vectors are fused with context information in forward order and reverse order to obtain a state vector. The state vector is fed into the BiLSTM model again for training and then passed to a CRF model, which automatically extracts sequence rules and, after completing correction, outputs the key address sequence information. For sequence labeling tasks (Chinese word segmentation (CWS), part-of-speech (POS) tagging, named entity recognition (NER), and so on), the current mainstream deep learning framework is BiLSTM + CRF. A BiLSTM combines two LSTMs that learn in opposite directions (one in sentence order, the other in reverse sentence order). In theory it can capture both front-to-back and back-to-front dependencies in the address information; put simply, key information is grasped better once the context is known, so the BiLSTM model is better suited to labeling the current word.

During model training, the influence of model complexity on the loss function is adjusted to prevent overfitting. The learning rate is halved every 5 rounds so that the model trains better and the optimal address key information extraction model is obtained. For example, the Dropout layer and the EarlyStopping callback in Keras are used to prevent overfitting, and the LearningRateScheduler callback in Keras is used to adjust the learning rate, reducing it by half every 5 epochs during training.
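The halving schedule can be sketched as a plain Python function. The initial learning rate of 0.001 and the function name `halved_every_five` are assumptions, since the source gives no starting value, and the sketch assumes the halving is cumulative (the rate is halved again at each 5-epoch boundary).

```python
def halved_every_five(epoch, initial_lr=0.001, drop_every=5, factor=0.5):
    """Return the learning rate for a zero-based epoch index.

    Assumption: the rate is multiplied by `factor` once for every
    `drop_every` completed epochs, so the reduction accumulates.
    """
    return initial_lr * (factor ** (epoch // drop_every))

for epoch in (0, 4, 5, 9, 10):
    print(epoch, halved_every_five(epoch))
```

In Keras, a function of this kind could be wrapped in a callable that accepts the epoch index and the current rate and then handed to the LearningRateScheduler callback; that wiring is omitted here so the sketch runs without any deep learning library installed.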

Representing words as tensors solves the problem of uneven text length. If every word has a corresponding word vector, then for a text of length N the input tensor is obtained simply by selecting the vectors of the N words and arranging them in the order in which the words appear in the text, where every word vector has the same dimensionality. In addition, words by themselves cannot form features, whereas a tensor is a quantified abstraction computed layer by layer through a multi-layer neural network. Likewise, because text is composed of words, the features of a text can be composed from the tensors of its words.
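To illustrate how texts of different lengths map to tensors with a common vector dimensionality, here is a small sketch. The random vectors stand in for learned embeddings, and the names `build_embedding_table` and `text_to_tensor`, the four-token vocabulary, and the dimensionality of 4 are all assumptions made for the example.

```python
import random

def build_embedding_table(vocabulary, dim, seed=0):
    """Assign each token a fixed-size vector (random here, learned in practice)."""
    rng = random.Random(seed)
    return {token: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for token in vocabulary}

def text_to_tensor(tokens, table):
    """Stack the vectors of the tokens in text order, giving an N x dim tensor."""
    return [table[token] for token in tokens]

vocabulary = ["a", "b", "c", "d"]
table = build_embedding_table(vocabulary, dim=4)

short_text = text_to_tensor(["a", "b"], table)
long_text = text_to_tensor(["d", "c", "b", "a", "a"], table)

# The number of rows follows the text length; the row width is always dim.
print(len(short_text), len(short_text[0]))  # 2 4
print(len(long_text), len(long_text[0]))    # 5 4
```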

Advantageous effects: the invention provides an address data analysis method based on deep learning that combines abstract modeling of address resolution, multi-dimensional data labeling, and word embedding, and thereby solves the problem of uneven text length. In addition, words by themselves cannot form features, whereas a tensor is a quantified abstraction computed layer by layer through a multi-layer neural network.

Experimental data show that, with sufficient training samples, the accuracy of the method on the test set reaches 0.9997. The accuracy is high because the address word segmentation data, after threshold screening and repeated training, follow simple rules. Although input addresses may be incomplete or ambiguous, the model can still extract information from them effectively. For example, when the model is applied to "Suzhou industrial park" and "Suzhou public park", each is extracted as a whole, which guarantees the accuracy of the information extracted from the address.

To improve the matching success rate for two kinds of fuzzy addresses, incomplete addresses and ambiguous addresses, the invention constructs a Chinese word segmentation model based on word embedding, a bidirectional long short-term memory network (BiLSTM), a one-dimensional convolutional neural network (CNN), and a conditional random field (CRF). The model first labels the address information and applies a length threshold to filter out address data. Words are represented as tensors, the state tensor is fed into the BiLSTM model a second time for further training and then passed to the CRF model for automatic correction, and the key address sequence information is output, thereby improving word segmentation accuracy.

Drawings

FIG. 1 is a block diagram of the overall process of the present invention.

Detailed Description

In order that the technical objects and features of the present invention can be more clearly understood, the present invention will be described in detail with reference to specific embodiments.

As shown in fig. 1, the present invention discloses an address data parsing method based on deep learning, which includes:

mapping the address data to corresponding key parcel information according to address resolution requirements for carrying out multi-dimensional data marking, wherein the marked key parcel information data have different types of label address name content texts;

performing word segmentation processing on the multi-dimensional labeled address name content text to generate address training data. The address information is split and labeled to obtain sequence word segment text, which serves as the training data; through word embedding, each word is assigned a word vector to represent the address text, so that the computer can read the training data. A threshold is set on the length of Chinese addresses, and address data exceeding the threshold are filtered out.

Constructing a BiLSTM-CNN-CRF model for training. Address resolution is modeled as the process of extracting key parcel information from address data, and this extraction is further abstracted as a multi-class classification problem over parcel information. When the deep learning model for address resolution is built, the address data are labeled in multiple dimensions according to the address resolution requirements, so that the labeled address data carry different label contents. Specifically, administrative divisions, roads, plots, house numbers, buildings, rooms, and interference information in the address data are labeled as separate categories; importantly, incomplete and ambiguous addresses are labeled in the same way. The trained model can identify the corresponding parcel information in an address and automatically eliminate interference and useless information, which greatly improves the accuracy and speed of resolution.

Through a BiLSTM model and a CNN model, the word vectors are fused with context information in forward order and reverse order to obtain a state vector. The state vector is fed into the BiLSTM model again for training and then passed to a CRF model, which automatically extracts sequence rules and, after completing correction, outputs the key address sequence information.

During model training, the influence of model complexity on the loss function is adjusted to prevent overfitting. The learning rate is halved every 5 rounds during training, so that the model trains better and the optimal address key information extraction model is obtained.

Suppose the input sentence consists of 32 words and each word is represented by a 128-dimensional word vector; the model input then has shape (32, 128). After the BiLSTM, the hidden representation is T1 with shape (32, 128), where 128 is the output dimension of the BiLSTM in the model. If no CRF layer is used, a fully connected layer can be added at the end of the model to perform 13-class classification, and the label with the highest probability is taken as the predicted label. With a large amount of labeled data and continuous iterative optimization, the model can learn to extract key address information well.
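The variant without a CRF layer can be sketched as a per-position softmax followed by an argmax over the 13 classes. The score values below are hypothetical and the helper names `softmax` and `predict_labels` are assumptions; in the described model the scores would come from the fully connected layer.

```python
import math

NUM_CLASSES = 13

def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable form)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_labels(score_rows):
    """For each position, return the index of the most probable class."""
    predictions = []
    for row in score_rows:
        probs = softmax(row)
        predictions.append(max(range(len(probs)), key=probs.__getitem__))
    return predictions

# Hypothetical scores for a 3-position sequence (a real input would have 32 rows).
rows = [[0.0] * NUM_CLASSES for _ in range(3)]
rows[0][2] = 4.0
rows[1][7] = 3.0
rows[2][12] = 5.0
predicted = predict_labels(rows)
print(predicted)  # [2, 7, 12]
```

Each position is decided independently here, which is exactly the property that motivates adding a CRF layer on top.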

In theory, the powerful nonlinear fitting capability of neural networks allows a good model to be learned. However, the model above uses context only to predict each label independently; it does not model the dependencies between labels. In a sequence labeling task, the label L_t at the current position is potentially related to the label L_{t-1} at the previous position and the label L_{t+1} at the next position. For example, suppose "clock/B1 garden/B2 way/B21/D1/D2" is mislabeled as "clock/B1 garden/E2 way/B21/D1/D2". According to the labeling rules for information extraction, a B1 label can only be followed by B2, so the model should make use of this kind of context between labels. Researchers in natural language processing have therefore proposed appending a CRF layer to the model to learn the optimal label sequence over the whole sequence. Adding the CRF layer reduces avoidable labeling errors such as: 1. B1 is followed by a label other than B2; 2. E2 appears at the first position. In short, these are sequences that cannot occur in correctly labeled data. Adding a CRF layer to the BiLSTM model avoids such unrealistic results and effectively improves the accuracy of the model.
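The effect of label dependencies can be sketched with Viterbi decoding under hard transition rules. This is a simplification: a trained CRF learns real-valued transition scores, whereas the sketch below only allows or forbids transitions. The four-label set, the allowed transitions, the start labels, and the emission scores are all hypothetical, chosen so that independent per-position decoding produces the forbidden B1 followed by E2.

```python
NEG_INF = float("-inf")

def greedy_decode(emissions):
    """Pick the best label at each position independently."""
    return [max(scores, key=scores.get) for scores in emissions]

def constrained_viterbi(emissions, labels, allowed_start, allowed_next):
    """Find the highest-scoring label path that uses only allowed transitions."""
    best = [{lab: emissions[0][lab] if lab in allowed_start else NEG_INF
             for lab in labels}]
    back = []
    for scores in emissions[1:]:
        prev, current, pointers = best[-1], {}, {}
        for lab in labels:
            # Consider only predecessors that may be followed by this label.
            candidates = [(prev[p], p) for p in labels if lab in allowed_next[p]]
            top_score, top_prev = max(candidates, default=(NEG_INF, None))
            current[lab] = top_score + scores[lab]
            pointers[lab] = top_prev
        best.append(current)
        back.append(pointers)
    last = max(labels, key=best[-1].get)
    if best[-1][last] == NEG_INF:
        raise ValueError("no label sequence satisfies the constraints")
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical label set and rules; B1 may only be followed by B2.
labels = ["B1", "B2", "E1", "E2"]
allowed_start = {"B1", "E1"}
allowed_next = {"B1": {"B2"}, "B2": {"B2", "E1"},
                "E1": {"E2"}, "E2": {"E2", "B1"}}
emissions = [
    {"B1": 2.0, "B2": 0.0, "E1": 0.0, "E2": 0.0},
    {"B1": 0.0, "B2": 1.0, "E1": 0.0, "E2": 1.5},  # E2 narrowly beats B2
    {"B1": 0.0, "B2": 2.0, "E1": 0.0, "E2": 0.0},
]
print(greedy_decode(emissions))  # ['B1', 'E2', 'B2']
print(constrained_viterbi(emissions, labels, allowed_start, allowed_next))
# ['B1', 'B2', 'B2']
```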

At this point, the BiLSTM-CRF model has been described in outline. In the address key information extraction task, the label of the current word is essentially related only to the few words before and after it. When learning longer sentences, a BiLSTM may discard some important information because of limited model capacity, so a CNN layer is added to the model to extract the local features of the current word.

Let the sentence input have shape (32, 100). An equal-length convolution yields T2 with shape (32, 50), where 50 is the number of convolution kernels. The 50-dimensional vector corresponding to the current word contains its local context information. T1 and T2 are concatenated to give T3 with shape (32, 178); T3 passes through a fully connected layer to give T4 with shape (32, 13); and T4 is input to the CRF layer, which computes the final optimal sequence.
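The shape bookkeeping (128 + 50 = 178, then 178 projected to 13) can be checked with a short sketch that uses nested lists in place of real tensors. The zero-filled values and the helper names are assumptions, and the bias term of the fully connected layer is omitted for brevity.

```python
def zeros(rows, cols):
    """Create a rows x cols matrix of zeros as nested lists."""
    return [[0.0] * cols for _ in range(rows)]

def shape(matrix):
    return (len(matrix), len(matrix[0]))

def concat_features(a, b):
    """Concatenate two matrices along the feature axis, row by row."""
    if len(a) != len(b):
        raise ValueError("both inputs must have the same sequence length")
    return [row_a + row_b for row_a, row_b in zip(a, b)]

def dense(matrix, weights):
    """Multiply (seq_len x in_dim) by (in_dim x out_dim); bias omitted."""
    columns = list(zip(*weights))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns]
            for row in matrix]

t1 = zeros(32, 128)                # BiLSTM output
t2 = zeros(32, 50)                 # CNN output, one value per kernel
t3 = concat_features(t1, t2)       # concatenated features
t4 = dense(t3, zeros(178, 13))     # per-position scores for 13 classes

print(shape(t1), shape(t2), shape(t3), shape(t4))
# (32, 128) (32, 50) (32, 178) (32, 13)
```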

In machine learning and deep learning, time spent on data processing is indispensable, because the quality of data preparation directly affects the result of the model; this data preprocessing is often referred to as feature engineering. The data processing procedure for the model is described below.

Chinese addresses are almost always shorter than 32 characters, so address data longer than 32 characters are deleted; among the 1.75 million labeled records, only 8 addresses exceed 32 characters. Addresses shorter than 32 characters are padded at the end with a single category, and useless information inside an address is represented by that same category. In total, the address information is labeled with 13 categories, numbered 0 to 12, and the category numbers are one-hot encoded so that the label data take the form required by the model. The raw address data are processed with a bag-of-words model. To handle characters in new addresses that were not seen before, every character absent from the bag of words is mapped to a single number not used in the bag of words, which avoids errors when test data are encoded. This completes the data preprocessing.
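The preprocessing steps above can be sketched as follows. The numeric choices are assumptions made for illustration: category 0 is taken as the shared padding and useless-information category, character id 0 is used for input padding, and id 1 is reserved for unseen characters. The source fixes only the length limit of 32 and the 13 categories.

```python
MAX_LEN = 32
NUM_CLASSES = 13
PAD_LABEL = 0  # assumption: index of the padding / useless-information category
PAD_ID = 0     # assumption: character id used to pad inputs
UNK_ID = 1     # assumption: single id shared by all unseen characters

def filter_by_length(samples):
    """Drop samples whose address is longer than MAX_LEN characters."""
    return [(addr, labs) for addr, labs in samples if len(addr) <= MAX_LEN]

def build_vocab(addresses):
    """Map each distinct character to an id, reserving 0 and 1."""
    vocab = {}
    for address in addresses:
        for ch in address:
            vocab.setdefault(ch, len(vocab) + 2)
    return vocab

def encode_address(address, vocab):
    """Convert characters to ids, map unseen ones to UNK_ID, pad to MAX_LEN."""
    ids = [vocab.get(ch, UNK_ID) for ch in address]
    return ids + [PAD_ID] * (MAX_LEN - len(ids))

def one_hot(label):
    return [1 if i == label else 0 for i in range(NUM_CLASSES)]

def encode_labels(labels):
    """Pad the label sequence to MAX_LEN and one-hot encode each label."""
    padded = labels + [PAD_LABEL] * (MAX_LEN - len(labels))
    return [one_hot(label) for label in padded]

# Hypothetical samples: one short address and one that exceeds the limit.
samples = [("abc", [1, 2, 3]), ("x" * 40, [0] * 40)]
kept = filter_by_length(samples)
vocab = build_vocab(addr for addr, _ in kept)
encoded = encode_address("abz", vocab)   # "z" was never seen
targets = encode_labels(kept[0][1])

print(len(kept), encoded[:4], len(encoded))
print(len(targets), len(targets[0]), targets[0].index(1))
```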

The biggest problem in training a deep learning model is overfitting. So that the network can be trained stably on the provided data, several measures are used during training to prevent overfitting. In addition, the learning rate is halved every 5 rounds during training, so that the model trains better and the optimal address key information extraction model is obtained.

The model accuracy is obtained by evaluating the model on the training set and the test set. The accuracy on the test set reaches 0.9997, probably because address data follow simple rules. Although input addresses may be incomplete or ambiguous, the model can still extract information from them effectively. For example, when the model is applied to "Suzhou industrial park" and "Suzhou public park", each is extracted as a whole, which guarantees the accuracy of the information extracted from the address.

For a trained model, prediction errors may still occur on complex address data. Incorrect word segmentation results can be corrected by retraining, which improves the practicality and accuracy of the model.

Claims

1. An address data parsing method based on deep learning is characterized by comprising the following steps:

performing word segmentation processing on the multi-dimensional labeled address name content text to generate address training data;

constructing a BiLSTM-CNN-CRF model for training;

arranging the address training data in sequence, determining word segment structure relevance through word embedding, and outputting corresponding word vectors;

the word vectors are fused with context information in forward order and reverse order through a BiLSTM model and a CNN model to obtain a state vector; the state vector is fed into the BiLSTM model again for training and then passed to a CRF model; and the CRF model automatically extracts sequence rules and, after completing correction, outputs the key address sequence information;

during model training, adjusting the influence of the complexity of the model on a loss function to prevent overfitting of the model; wherein, the learning rate of the training is adjusted to be half of the original learning rate every 5 rounds in the training process.