WO2022134592A1 - Address information resolution method, apparatus and device, and storage medium - Google Patents

Address information resolution method, apparatus and device, and storage medium

Info

Publication number
WO2022134592A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
character
administrative division
training data
model training
Prior art date
Application number
PCT/CN2021/109698
Other languages
French (fr)
Chinese (zh)
Inventor
赵焕丽
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2022134592A1 publication Critical patent/WO2022134592A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/279 Recognition of textual entities
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to an address information resolution method, apparatus, device and storage medium.
  • a standard Chinese address should contain complete administrative divisions and be expressed in the order of administrative divisions (province/city/county/township/village), roads, streets, grades, buildings, and households.
  • such an address is parseable, so that it can be made to correspond accurately to the geographic location it denotes.
  • the main purpose of this application is to solve the technical problem that existing address parsing algorithms rely on address canonicality, feature words, and address dictionaries, and therefore parse non-standard Chinese addresses with low accuracy.
  • a first aspect of the present application provides an address information parsing method, including: crawling original address data from a preset data source with a web crawler tool; screening out, from the original address data, the address expression data whose character length falls within a preset length interval, and annotating the address expression data to obtain model training data; training an address parsing model from the model training data and a preset neural network; acquiring the to-be-recognized address text uploaded by the user, inputting it into the address parsing model, and obtaining the administrative-division label of each character in the to-be-recognized address text; and converting the to-be-recognized address text into standard address text according to the administrative-division label of each character.
  • a second aspect of the present application provides an address information parsing apparatus, including: a data crawling module for crawling original address data from a preset data source with a web crawler tool; a screening module for screening out, from the original address data, the address expression data whose character length falls within a preset length interval, and annotating the address expression data to obtain model training data; a model training module for training an address parsing model from the model training data and a preset neural network; a model input module for acquiring the to-be-recognized address text uploaded by the user, inputting it into the address parsing model, and obtaining the administrative-division label of each character in the to-be-recognized address text; and a standard conversion module configured to convert the to-be-recognized address text into standard address text according to the administrative-division label of each character.
  • a third aspect of the present application provides an address information parsing device, comprising a memory and at least one processor interconnected by a line, wherein instructions are stored in the memory; the at least one processor invokes the instructions in the memory so that the device executes the steps of the address information parsing method: crawling original address data from a preset data source with a web crawler tool; screening out, from the original address data, the address expression data whose character length falls within a preset length interval, and annotating it to obtain model training data; training an address parsing model from the model training data and a preset neural network; acquiring the to-be-recognized address text uploaded by the user, inputting it into the address parsing model, and obtaining the administrative-division label of each character; and converting the to-be-recognized address text into standard address text according to those labels.
  • a fourth aspect of the present application provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to perform the steps of the address information parsing method: crawling original address data from a preset data source with a web crawler tool; screening out, from the original address data, the address expression data whose character length falls within a preset length interval, and annotating it to obtain model training data; training an address parsing model from the model training data and a preset neural network; acquiring the to-be-recognized address text uploaded by the user, inputting it into the address parsing model, and obtaining the administrative-division label of each character; and converting the to-be-recognized address text into standard address text according to those labels.
  • in the technical solution provided in this application, original address data is crawled from a preset data source with a web crawler tool; the address expression data whose character length falls within the preset length interval is screened out of the original address data and annotated to obtain model training data; an address parsing model is trained from the model training data and a preset neural network; the to-be-recognized address text uploaded by the user is acquired and input into the address parsing model, yielding the administrative-division label of each character; and according to those labels, the to-be-recognized address text is converted into standard address text.
  • in this way, the computer can extract the semantic features of the whole address and take into account the administrative-division assignments of the preceding and following characters, thereby realizing multi-level administrative-division parsing of non-standardized addresses.
  • this scheme does not rely on address canonicality, feature words, or address dictionaries, so it can handle diverse non-canonical expressions.
  • the deep-model-based method can also learn naming and segmentation rules from existing data and apply them at inference time, which improves the parsing of non-standard Chinese addresses so that such address information can be used directly by computers for location services.
  • this application also relates to blockchain technology, and the original address data can be stored in the blockchain.
  • FIG. 1 is a schematic diagram of a first embodiment of a method for resolving address information in an embodiment of the present application
  • FIG. 2 is a schematic diagram of a second embodiment of a method for resolving address information in an embodiment of the present application
  • FIG. 3 is a schematic diagram of a third embodiment of a method for resolving address information in an embodiment of the present application
  • FIG. 4 is a schematic diagram of a fourth embodiment of a method for resolving address information in an embodiment of the present application
  • FIG. 5 is a schematic diagram of an embodiment of an address information parsing apparatus in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another embodiment of an address information parsing apparatus in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an embodiment of an address information parsing device in an embodiment of the present application.
  • the present application provides an address information parsing method, which solves the technical problem that existing address parsing algorithms rely on address canonicality, feature words, and address dictionaries, and therefore parse non-standard Chinese addresses with low accuracy.
  • the flowchart of the address information parsing method provided by an embodiment of the present application specifically includes the following steps:
  • the execution body of the present application may be an address information parsing device, or may be a terminal or a server, which is not specifically limited here.
  • the embodiments of the present application are described taking a server as the execution body by way of example.
  • the above-mentioned original address data can be stored in a node of a blockchain.
  • the preset data source may be official information websites or public address databases; the address data crawled from these sources serves as the original address data. Most of this original address data consists of Chinese addresses, which may contain irregularities that deviate from the standard administrative divisions.
  • for example, the administrative-division feature word "District" is omitted in "Xuhui Kaibin Road", and part of the administrative division ("Xuhui District") is omitted in "Shanghai Kaibin Road".
  • the levels of administrative-division information may also be jumbled, and a word such as "district" appearing in "Wumei Kindergarten" causes the non-administrative-division part of the address to share a name with an administrative division.
  • the first screening step mainly judges whether the characters in the original address data are valid UTF-8 encoded characters; the non-UTF-8 encoded characters among them, such as emoticons, are sorted out and removed to obtain clean original address data.
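The character-screening step above can be sketched as follows. The allowed character set here (CJK ideographs, ASCII letters and digits, and a few address separators) is an illustrative assumption, not the patent's exact rule:

```python
import re

# Characters plausibly part of a Chinese address: CJK ideographs,
# ASCII letters/digits, and a few separators. Everything else
# (emoji, symbols, control characters) is stripped.
_DISALLOWED = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\-#()（）]")

def clean_address(text):
    """Remove characters outside the allowed set (illustrative filter)."""
    return _DISALLOWED.sub("", text)
```

In practice the allowed set would be tuned to the data source; the point is that non-address symbols such as emoji are dropped before annotation.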
  • the preset length interval is related to specific application scenarios, and is generally set to be between 7 and 20.
  • the range of the interval can be appropriately adjusted.
  • the character-length interval is a configurable parameter and has no effect on the subsequent model training process, so only the configuration needs to be modified for different application scenarios.
  • the model requires a maximum of 128 characters.
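The length screening described above amounts to a simple configurable filter; the 7–20 default mirrors the interval mentioned in the text, and the helper name is our own:

```python
# Length bounds are configuration, not model constants; 7-20 is the
# interval typically used above, while the model itself accepts at
# most 128 characters per address.
MIN_LEN, MAX_LEN = 7, 20

def filter_by_length(addresses, lo=MIN_LEN, hi=MAX_LEN):
    """Keep only address strings whose character count lies in [lo, hi]."""
    return [a for a in addresses if lo <= len(a) <= hi]
```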
  • the labeling is mainly performed manually. The labels are administrative-division levels, ten in total: "province", "city", "district/county", "township", "street", "road", "house number", "village", "building name", and "other". Among them, "province" includes provinces, municipalities, autonomous regions, and special administrative regions; "city" includes prefecture-level cities, regions, autonomous prefectures, and leagues; "district/county" includes municipal districts, county-level cities, counties, banners, special zones, and forest areas; "township" includes towns, townships, ethnic townships, sumu, ethnic sumu, county-administered districts, and district offices; "street", like "township", belongs to the township-level administrative divisions; "road" includes roads, streets, and alleys; the remaining labels coincide with their standard names.
  • each character in the address expression data is marked by manual annotation.
  • for example, the characters of a province name are each marked "province" and those of a city name "city", so a province-city prefix is marked "province, province, city, city".
  • the model training data can be organized into the following format: "Guangdong Province/Province Shenzhen/City Bao'an District/District Xixiang Street/Street Nanchang Second New Village/Village X Lane/Road X/House Number", where each character in the model training data has a corresponding label.
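A per-character view of that annotation format can be produced with a small helper. The `segment/label` token format mirrors the example above; the function name and the Chinese sample are our own illustration:

```python
def parse_annotation(annotated):
    """Expand space-separated 'segment/label' tokens into per-character
    (character, label) pairs, e.g. '广东省/省 深圳市/市' ->
    [('广','省'), ('东','省'), ('省','省'), ('深','市'), ...]."""
    pairs = []
    for token in annotated.split():
        segment, label = token.rsplit("/", 1)
        pairs.extend((ch, label) for ch in segment)
    return pairs
```

This is the shape the sequence-labeling model consumes: one label per character, not one label per segment.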
  • the preset neural network is a Bi-LSTM-CRF neural network
  • the Bi-LSTM-CRF includes three layers of neural networks: an Embedding layer, a Bi-LSTM layer, and a CRF layer, where the Embedding layer is the embedding layer.
  • through the Embedding layer, each character in the input model training data can be mapped into a vector in a low-dimensional space.
  • the word vector is a distributed representation of each character in the text, conveying semantics to the computer through low-dimensional vectors.
  • the Bi-LSTM layer is a bidirectional long short-term memory network layer.
  • the bidirectional long short-term memory network includes two modules, a forward LSTM and a backward LSTM, which can capture long-range contextual dependencies and context-entity features, obtain richer spatiotemporal correlations between entities, and suppress from both directions the influence of noise such as interfering entities on the neural network model, greatly assisting the mining of long-term dependencies. A conditional random field (CRF) is a discriminative probabilistic model, a type of random field, often used to label or analyze sequence data such as natural-language characters or biological sequences.
  • the conditional random field is an undirected graphical model: the vertices in the graph represent random variables, and the edges between vertices represent dependencies between those variables.
  • the distribution modeled is the conditional probability of the random variable Y given the observed random variable X.
  • the graph layout of a conditional random field can in principle be arbitrary, but the commonly used layout is the linear chain, whether in training, inference, or decoding.
  • the advantage of Bi-LSTM is that it can remember context information, which greatly facilitates the mining of long-term dependencies and helps semantic understanding. Used directly for labeling tasks, however, it has a problem: Bi-LSTM is a time-series model, so its output is made only for the current character, yielding a locally optimal solution.
  • conditional random fields, on the other hand, place high demands on feature templates.
  • Bi-LSTM can obtain context information but lacks a global decoding model, while the conditional random field can generate the globally optimal solution but needs context information. This application therefore combines the Bi-LSTM and conditional random field models to build one complete model with complementary advantages.
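How the CRF layer turns per-character scores into a globally optimal label sequence can be illustrated with a tiny Viterbi decode. The label names, emission scores, and transition weights below are toy values; this is a sketch of the decoding idea, not the patent's implementation:

```python
def viterbi_decode(emissions, transitions, labels):
    """Find the label sequence maximizing the sum of per-position
    emission scores plus pairwise transition scores (CRF decoding).
    emissions: one {label: score} dict per character position;
    transitions: {(prev_label, next_label): score}, defaulting to 0."""
    prev = {lab: emissions[0].get(lab, 0.0) for lab in labels}
    backpointers = []
    for emit in emissions[1:]:
        cur, ptr = {}, {}
        for lab in labels:
            best = max(labels,
                       key=lambda p: prev[p] + transitions.get((p, lab), 0.0))
            cur[lab] = (prev[best] + transitions.get((best, lab), 0.0)
                        + emit.get(lab, 0.0))
            ptr[lab] = best
        prev = cur
        backpointers.append(ptr)
    # Backtrack from the best final label.
    last = max(labels, key=lambda l: prev[l])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Because transition scores reward plausible label orders (e.g. "city" after "province"), the decoder can prefer a sequence whose individual emissions are not all locally best, which is exactly what the per-character Bi-LSTM output alone cannot do.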
  • the address parsing model can then be used to parse different to-be-recognized addresses input by the user: after input, the model labels each character with a division level such as "province", "city", "district/county", "township", "village", or "other".
  • characters with the same label are spliced to obtain the name of each labeled administrative division. For example, if the two characters "Chong" and "Qing" are both labeled "province", they are spliced to obtain "Chongqing", and so on for the subsequent characters. After "Chongqing" is determined to be a "province", it is matched among the 34 provincial-level administrative regions to determine whether it is a province, an autonomous region, a municipality directly under the Central Government, or a special administrative region. As a municipality, the character "City" is appended after "Chongqing", and matching proceeds among the 40 districts and counties under Chongqing. By analogy, the to-be-recognized address text "No. 1 Community of Tangfang Village, Tangfang Town, Wuxi, Chongqing" can be parsed into the standard address text "No. 1 Community of Tangfang Village, Tangfang Town, Wuxi County, Chongqing City".
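The final normalization step, appending the omitted suffix after matching against the division table, might look like the sketch below. The table here is a tiny illustrative subset (the four municipalities), not a full administrative-division database:

```python
# Toy subset of the provincial-level division table; a real system
# would match against all 34 provincial-level regions.
MUNICIPALITIES = {"北京", "天津", "上海", "重庆"}

def complete_province(name):
    """Append the omitted suffix for a spliced province-level name,
    e.g. '重庆' -> '重庆市' for a municipality."""
    if name in MUNICIPALITIES:
        return name + "市"
    return name
```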
  • in the technical solution provided in this application, original address data is crawled from a preset data source with a web crawler tool; the address expression data whose character length falls within the preset length interval is screened out of the original address data and annotated to obtain model training data; an address parsing model is trained from the model training data and a preset neural network; the to-be-recognized address text uploaded by the user is acquired and input into the address parsing model, yielding the administrative-division label of each character; and according to those labels, the to-be-recognized address text is converted into standard address text.
  • in this way, the computer can extract the semantic features of the whole address and take into account the administrative-division assignments of the preceding and following characters, thereby realizing multi-level administrative-division parsing of non-standardized addresses.
  • this scheme does not rely on address canonicality, feature words, or address dictionaries, so it can handle diverse non-canonical expressions.
  • the deep-model-based method can also learn naming and segmentation rules from existing data and apply them at inference time, which improves the parsing of non-standard Chinese addresses so that such address information can be used directly by computers for location services.
  • this application also relates to blockchain technology, and the original address data can be stored in the blockchain.
  • the second embodiment of the address information parsing method in the embodiment of the present application includes:
  • Steps 201-202 in this embodiment are similar to steps 101-102 in the first embodiment, and are not repeated here.
  • in converting each character of the model training data into a word vector, each character is first converted into a one-hot vector, because the Embedding layer is a fully connected layer that takes the one-hot vector as input, with the number of middle-layer nodes equal to the word-vector dimension.
  • the one-hot vector is then converted into a low-dimensional dense word vector through a pre-trained vector matrix, which alleviates the lexical-gap and curse-of-dimensionality problems.
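The equivalence between this fully connected layer and a simple row lookup can be seen in a few lines; the dimensions and values are illustrative:

```python
def one_hot(index, size):
    """Build a one-hot row vector of the given size."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def embed(onehot, matrix):
    """Multiply a one-hot row vector by the embedding matrix: the
    fully connected layer simply selects one matrix row, i.e. the
    dense low-dimensional word vector of that character."""
    dim = len(matrix[0])
    return [sum(onehot[i] * matrix[i][d] for i in range(len(matrix)))
            for d in range(dim)]
```

Real implementations skip the multiplication and index the row directly, but the learned parameters are the same matrix.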
  • the encoding process in the Bi-LSTM layer includes: the Bi-LSTM layer automatically extracts sentence features, taking the character-embedding sequence (x_1, x_2, x_3, ..., x_n) of a sentence as the input of each time step of the Bi-LSTM; then the hidden-state sequence (h→_1, h→_2, h→_3, ..., h→_n) output by the forward LSTM and the hidden states (h←_1, h←_2, h←_3, ..., h←_n) output by the backward LSTM at each position are spliced position by position to obtain the complete hidden output sequence.
  • the Bi-LSTM layer outputs a score for each possible label of each character, and finally the label with the highest score is selected as that character's label.
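The position-wise splicing of forward and backward hidden states amounts to concatenation at each time step; the vectors below are tiny stand-ins for real hidden states:

```python
def splice_hidden(forward, backward):
    """Concatenate the forward and backward LSTM hidden states at each
    position to form the complete hidden output sequence."""
    return [f + b for f, b in zip(forward, backward)]
```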
  • the administrative-division labels of the characters in the model training data are connected to obtain the administrative-division sequence. For example, the conditional random field layer's per-character prediction for the model training data "Shanghai, Shanghai Jing'an Kerry Center" yields the administrative-division sequence "province province city city city building building building building building building building building building".
  • the conditional random field layer may predict the labels of some characters in the model training data incorrectly.
  • for example, if the administrative-division sequence obtained from the character labels is "province province city city province province building building building building building building building building building", there are two segments with the identical label "province". Clearly, one address cannot contain two separate segments with the same administrative-division label, so the characters in the segment at the later position need to be re-predicted.
  • Steps 212-213 in this embodiment are similar to steps 104-105 in the first embodiment, and are not repeated here.
  • this embodiment describes in detail the process of obtaining an address resolution model by training according to the model training data and the preset neural network.
  • converting each character in the model training data into a word vector: the model training data is input into the embedding layer of the neural network, converting each character into a word vector; the word vectors serve as the input of each time step of the bidirectional long short-term memory network layer to obtain the latent output sequence of the model training data; the latent output sequence is input into the conditional random field layer of the neural network to predict each character's label, which is compared against the original labels and iterated to obtain the final pre-trained address parsing model.
  • this embodiment adds, to the step of inputting the latent output sequence into the conditional random field layer of the neural network to predict each character's label, a post-processing of those labels: the administrative-division sequence of the model training data is obtained from the per-character labels; it is determined whether the sequence contains at least two administrative-division segments of the same label type, a segment being a run of consecutive identical labels; if so, the positions of the same-typed segments in the sequence are compared, and the labels of the segment in the later position are re-predicted.
  • the third embodiment of the address information parsing method in the embodiment of the present application includes:
  • the word vectors are used as the input of each time step of the bidirectional long short-term memory network layer in the neural network to obtain the latent output sequence of the model training data;
  • the output of the CRF layer needs post-processing, which includes splicing the characters of adjacent identical administrative-division labels and detecting erroneous predictions. For example, in a predicted sequence such as "province province city city city province province building building building", a "province" segment appears after the "city" segment; the conditional random field layer's prediction of those characters' labels is therefore wrong, and they must be predicted again.
  • a word vector is a distributed representation of each character in the text, conveying semantics to the computer through low-dimensional vectors in space.
  • after the model training data has passed through the Embedding layer of the neural network and been output in the form of word vectors, the word vectors are input to the Bi-LSTM layer.
  • the Bi-LSTM neural network is suited to sequence-labeling tasks. It performs the same operation on each word vector in the input sequence; the operation here is a matrix multiplication that linearly maps a high-dimensional vector (such as 300 dimensions) to a low-dimensional one (such as 128 dimensions). Each dimension of the result represents a feature, so this operation can remove useless features.
  • Each step of operation depends on the calculation result of the previous step, and encodes the features of the context at the same time.
  • the specific implementation of this encoding is that the operation result of the previous step (here, the feature) is used as the input of the next step. Suppose, for the address "Shanghai", the feature h_{t-1} has been extracted for the character "Shang"; with f the function used in the operation, h_t = f(h_{t-1}, x_t) is the feature finally extracted for the character "Hai". Thus when the current step's feature is extracted, the previous step's feature also participates in the operation, which is what "encoding the preceding features" means; a similar operation is done for the following features.
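The recurrence h_t = f(h_{t-1}, x_t) can be unrolled generically as below; `f` is left abstract, and the list-accumulating `f` in the test is purely illustrative, chosen to make the context encoding visible:

```python
def unroll(chars, f, h0):
    """Apply the recurrence h_t = f(h_{t-1}, x_t) along the character
    sequence, collecting the feature extracted at every step."""
    h, features = h0, []
    for x in chars:
        h = f(h, x)          # the previous step's feature feeds this step
        features.append(h)
    return features
```

With `f = lambda h, x: h + [x]`, the feature at each step is simply the list of characters seen so far, which makes explicit how earlier features participate in later ones.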
  • the features encoded with context are output as the complete set of features extracted for each character.
  • however, the features output by Bi-LSTM do not consider the influence of the previous step's label on the current step's label.
  • for example, if the current character is "Wu" and the preceding two characters "Chongqing" form a city name, then "Wuxi" is with high probability the name of a district, county, or town. Therefore a CRF (conditional random field) layer is spliced onto the output layer of the Bi-LSTM, so that the Bi-LSTM output sequence becomes the observation sequence of the CRF layer; the CRF then computes the probabilistically optimal solution for the entire sequence, taking into account the interactions between sequence labels.
  • the output tag sequence of the CRF corresponds to each character of the input address, respectively.
  • this embodiment adds the process of detecting wrongly predicted character labels at the conditional random field layer: the administrative-division sequence of the model training data is obtained; according to the arrangement order of the labels in the sequence, it is determined whether the sequence contains an error; if so, the administrative-division labels of the affected characters are re-predicted.
  • in this way, errors in the conditional random field layer's label predictions for the training data can be corrected, improving the efficiency of model training.
  • the fourth embodiment of the address information parsing method in the embodiment of the present application includes:
  • Steps 401-404 in this embodiment are similar to steps 101-104 in the first embodiment, and are not repeated here.
  • a character buffer area, initially empty, is set up, and the characters of the labeled to-be-recognized address text are stored into it in textual order. For example, for "No. 1 Community of Tangfang Village, Tangfang Town, Wuxi, Chongqing": first "Chong" is put into the buffer, and it is judged whether "Chong" and "Qing" carry the same administrative-division label; since both are labeled "province", "Qing" is stored into the buffer as well. It is then judged whether "Qing" and "Wu" carry the same label; they differ, so the two characters "Chong" and "Qing" are taken out of the buffer and spliced to obtain "Chongqing".
  • this embodiment adds a process of splicing characters marked by consecutive identical administrative divisions in the address text to be recognized.
  • each character in the to-be-recognized address text is processed in turn: the first character is stored into the character buffer area and its administrative-division label determined; it is judged whether its label equals that of the second character; if so, the second character is also stored into the buffer; if not, the buffered characters are output, the buffer is cleared, and processing continues with the next character; the characters output from the buffer with the same administrative-division label are spliced together.
  • splicing the characters with consecutive identical administrative-division labels in this way facilitates the subsequent conversion of the to-be-recognized address text into the standard address text.
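The buffer-based splicing of this embodiment can be sketched as follows; the function name is our own:

```python
def splice_by_label(chars, labels):
    """Walk the characters in order, buffering runs with the same
    administrative-division label; when the label changes, flush the
    buffer as one spliced segment."""
    segments, buf, current = [], [], None
    for ch, lab in zip(chars, labels):
        if buf and lab != current:
            segments.append(("".join(buf), current))
            buf = []
        buf.append(ch)
        current = lab
    if buf:
        segments.append(("".join(buf), current))
    return segments
```

Applied to the "Chongqing Wuxi..." example above, the first flush happens when the label changes from "province" at "Wu", emitting the spliced segment "Chongqing".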
  • an embodiment of the address information parsing apparatus in the embodiment of the present application includes:
  • a data crawling module 501 is used to crawl original address data from a preset data source by using a web crawler tool
  • a screening module 502 configured to screen out address expression data whose character length is within a preset length interval from the original address data, and annotate the address expression data to obtain model training data;
  • a model training module 503, configured to train an address parsing model according to the model training data and a preset neural network;
  • a model input module 504 configured to acquire the address text to be recognized uploaded by the user, input the address text to be recognized into the address resolution model, and obtain the administrative division labeling of each character in the address text to be recognized;
  • the standard conversion module 505 is configured to convert the to-be-recognized address text into standard address text according to the administrative division labeling of each character in the to-be-recognized address text.
  • the above-mentioned address text to be recognized can be stored in a node of a blockchain.
  • the address information parsing apparatus runs the address information parsing method, and the method includes: crawling original address data from a preset data source by using a web crawler tool; screening out, from the original address data, address expression data whose character length is within a preset length interval, and annotating the address expression data to obtain model training data; training an address parsing model according to the model training data and a preset neural network; acquiring the address text to be recognized uploaded by the user, inputting it into the address parsing model, and obtaining the administrative division label of each character in the address text to be recognized; and converting the address text to be recognized into standard address text according to the administrative division label of each character.
  • with this method, the computer can extract the semantic features of the entire address and take into account the administrative division labels of preceding and following characters, achieving multi-level administrative division parsing of non-standardized addresses.
  • this scheme does not rely on address canonicality, feature characters, or address dictionaries, so it can handle diverse non-standard expressions.
  • the deep-model-based method can also learn naming and segmentation patterns from existing data and apply them during model inference, which improves the parsing of non-standard Chinese addresses so that such Chinese address information can be used directly by computers for location services.
  • this application also relates to blockchain technology, and the original address data can be stored in the blockchain.
  • the second embodiment of the address information parsing apparatus in the embodiment of the present application includes:
  • a data crawling module 501, configured to crawl original address data from a preset data source by using a web crawler tool;
  • a screening module 502 configured to screen out address expression data whose character length is within a preset length interval from the original address data, and annotate the address expression data to obtain model training data;
  • a model training module 503, configured to train an address parsing model according to the model training data and a preset neural network;
  • a model input module 504 configured to acquire the address text to be recognized uploaded by the user, input the address text to be recognized into the address resolution model, and obtain the administrative division labeling of each character in the address text to be recognized;
  • the standard conversion module 505 is configured to convert the to-be-recognized address text into standard address text according to the administrative division labeling of each character in the to-be-recognized address text.
  • model training module 503 includes:
  • a vector conversion unit 5031, configured to input the model training data into the embedding layer of the neural network and convert each character in the model training data into a word vector;
  • a sequence unit 5032, configured to feed the word vectors into the bidirectional long short-term memory (Bi-LSTM) network layer of the neural network, one per time step, to obtain the hidden output sequence of the model training data;
  • a label prediction unit 5033, configured to input the hidden output sequence into the conditional random field (CRF) layer of the neural network, predict the label of each character in the model training data, compare the predictions with the original labels of the model training data, and iterate to obtain the final trained address parsing model.
  • the vector conversion unit 5031 is specifically used for:
  • the one-hot encoded vectors of the model training data are converted into low-dimensional dense word vectors through a pre-trained vector matrix.
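Because a one-hot vector contains a single 1, multiplying it by the pre-trained vector matrix is equivalent to selecting one row of that matrix. A minimal pure-Python sketch (the toy matrix and dimensions are illustrative, not trained values):

```python
def one_hot_to_dense(char_index, embedding_matrix):
    """Explicit one-hot-times-matrix product: maps a character index
    to its low-dimensional dense word vector."""
    vocab_size = len(embedding_matrix)
    dim = len(embedding_matrix[0])
    one_hot = [1.0 if i == char_index else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * embedding_matrix[i][d] for i in range(vocab_size))
            for d in range(dim)]

def embed(char_index, embedding_matrix):
    """Equivalent row lookup, which is how embedding layers work in practice."""
    return embedding_matrix[char_index]
```

The row lookup avoids materializing the sparse high-dimensional one-hot vector, which is why embedding layers are implemented as table lookups.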
  • sequence unit 5032 is specifically used for:
  • the word vectors are fed into the bidirectional long short-term memory network layer of the neural network, one per time step, to obtain the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM;
  • the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are spliced to obtain the complete hidden output sequence.
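The splicing of the two hidden sequences amounts to concatenating, at each character position, the forward LSTM's hidden state with the backward LSTM's hidden state for the same position. A sketch with toy vectors (the hidden states here are illustrative lists, not real LSTM outputs):

```python
def concat_hidden_sequences(forward_states, backward_states):
    """Position-wise concatenation of forward and backward hidden states.
    backward_states are assumed to already be re-aligned to the original
    character order (i.e. reversed back after the backward pass)."""
    assert len(forward_states) == len(backward_states)
    return [f + b for f, b in zip(forward_states, backward_states)]
```

Each position of the resulting sequence therefore carries information about both the preceding and the following context of that character.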
  • the model training module further includes a first re-measurement unit 5034, and the first re-measurement unit 5034 is specifically used for:
  • the model training module further includes a second re-measurement unit 5035, and the second re-measurement unit 5035 is specifically used for:
  • the address information parsing device further includes a character connection module 506, and the character connection module 506 is specifically used for:
  • this embodiment describes in detail the specific functions of each module and the unit structure of some modules.
  • with this method, the computer can extract the semantic features of the entire address and take into account the administrative division labels of preceding and following characters, achieving multi-level administrative division parsing of non-standardized addresses.
  • compared with existing address parsing algorithms, it does not depend on address canonicality, feature characters, or address dictionaries, so it can handle diverse non-standard expressions.
  • the deep-model-based method can also learn naming and segmentation patterns from existing data and apply them during model inference, which improves the parsing of non-standard Chinese addresses so that such Chinese address information can be used directly by computers for location services.
  • FIG. 7 is a schematic structural diagram of an address information parsing device provided by an embodiment of the present application.
  • the address information parsing device 700 may vary greatly in configuration or performance, and may include one or more central processing units (CPUs) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing application programs 733 or data 732.
  • the memory 720 and the storage medium 730 may be short-term storage or persistent storage.
  • the program stored in the storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the address information parsing device 700 .
  • the processor 710 may be configured to communicate with the storage medium 730, and execute a series of instruction operations in the storage medium 730 on the address information parsing device 700 to implement the steps of the above address information parsing method.
  • the address information parsing device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input/output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, and FreeBSD.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium may also be a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when run on a computer, the instructions cause the computer to execute the steps of the address information parsing method.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • in essence, the technical solutions of the present application, or the parts thereof that contribute to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media capable of storing program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Machine Translation (AREA)

Abstract

An address information resolution method, apparatus and device, and a storage medium, which are used for converting, into standard address text, address text to be identified that is uploaded by a user, and relate to the field of artificial intelligence. The method comprises: crawling original address data from a preset data source by using a web crawler tool (101); selecting, from the original address data, address representation data with a character length within a preset length interval, and labeling same to obtain model training data (102); according to the model training data and a preset neural network, performing training to obtain an address resolution model (103); acquiring address text to be identified that is uploaded by a user, and inputting the address text to be identified into the address resolution model, so as to obtain an administrative division label of each character in the address text to be identified (104); and converting the address text to be identified into standard address text according to the administrative division label of each character in the address text to be identified (105). In addition, the present invention further relates to blockchain technology, and the address text to be identified can be stored in a blockchain.

Description

Address Information Parsing Method, Apparatus, Device, and Storage Medium
This application claims priority to the Chinese patent application No. 202011544487.1, titled "Address Information Parsing Method, Apparatus, Device and Storage Medium", filed with the China Patent Office on December 23, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an address information parsing method, apparatus, device, and storage medium.
Background
Location-based services are used more and more widely in people's lives, and there is a growing demand for finding the geographic coordinates of a textual address expression quickly and accurately. A standard Chinese address should contain complete administrative divisions and be expressed in the order of administrative division (province/city/county/township/village), road and street, house number, building, and room; its feature characters are distinct, and it can be parsed by Chinese address segmentation algorithms, so it can be accurately matched to the geographic location it describes.

However, the inventors realized that non-standardized expressions of Chinese addresses make the location semantics vague or ambiguous, which prevents computers from directly understanding the geographic location described by the address information, so that such Chinese address information cannot be used directly by computers for location services. Existing address parsing algorithms (Chinese address element segmentation, thesaurus matching, feature character segmentation, etc.) rely on address canonicality, feature characters, and address dictionaries, and cannot handle non-standard Chinese addresses well, so such Chinese address information cannot be used directly by computers for location services.
SUMMARY OF THE INVENTION
The main purpose of the present application is to solve the technical problem that existing address parsing algorithms rely on address canonicality, feature characters, and address dictionaries, resulting in low accuracy when parsing non-standard Chinese addresses.
To achieve the above purpose, a first aspect of the present application provides an address information parsing method, including: crawling original address data from a preset data source by using a web crawler tool; screening out, from the original address data, address expression data whose character length is within a preset length interval, and annotating the address expression data to obtain model training data; training an address parsing model according to the model training data and a preset neural network; acquiring the address text to be recognized uploaded by a user, inputting the address text to be recognized into the address parsing model, and obtaining the administrative division label of each character in the address text to be recognized; and converting the address text to be recognized into standard address text according to the administrative division label of each character in the address text to be recognized.

A second aspect of the present application provides an address information parsing apparatus, including: a data crawling module, configured to crawl original address data from a preset data source by using a web crawler tool; a screening module, configured to screen out, from the original address data, address expression data whose character length is within a preset length interval, and to annotate the address expression data to obtain model training data; a model training module, configured to train an address parsing model according to the model training data and a preset neural network; a model input module, configured to acquire the address text to be recognized uploaded by a user, input it into the address parsing model, and obtain the administrative division label of each character in the address text to be recognized; and a standard conversion module, configured to convert the address text to be recognized into standard address text according to the administrative division label of each character.

A third aspect of the present application provides an address information parsing device, including a memory and at least one processor, where instructions are stored in the memory, and the memory and the at least one processor are interconnected by a line; the at least one processor invokes the instructions in the memory so that the address information parsing device executes the following steps of the address information parsing method: crawling original address data from a preset data source by using a web crawler tool; screening out, from the original address data, address expression data whose character length is within a preset length interval, and annotating the address expression data to obtain model training data; training an address parsing model according to the model training data and a preset neural network; acquiring the address text to be recognized uploaded by a user, inputting it into the address parsing model, and obtaining the administrative division label of each character in the address text to be recognized; and converting the address text to be recognized into standard address text according to the administrative division labels.

A fourth aspect of the present application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the following steps of the address information parsing method: crawling original address data from a preset data source by using a web crawler tool; screening out, from the original address data, address expression data whose character length is within a preset length interval, and annotating the address expression data to obtain model training data; training an address parsing model according to the model training data and a preset neural network; acquiring the address text to be recognized uploaded by a user, inputting it into the address parsing model, and obtaining the administrative division label of each character in the address text to be recognized; and converting the address text to be recognized into standard address text according to the administrative division labels.
In the technical solution provided by the present application, original address data is crawled from a preset data source by using a web crawler tool; address expression data whose character length is within a preset length interval is screened out from the original address data and annotated to obtain model training data; an address parsing model is trained according to the model training data and a preset neural network; the address text to be recognized uploaded by a user is acquired and input into the address parsing model to obtain the administrative division label of each character in the address text to be recognized; and the address text to be recognized is converted into standard address text according to the administrative division labels. With this method, the computer can extract the semantic features of the entire address and take into account the administrative division labels of preceding and following characters, achieving multi-level administrative division parsing of non-standardized addresses. Compared with existing address parsing algorithms, this scheme does not rely on address canonicality, feature characters, or address dictionaries, so it can handle diverse non-standard expressions.
The deep-model-based method can also learn naming and segmentation patterns from existing data and apply them during model inference, which improves the parsing of non-standard Chinese addresses so that such Chinese address information can be used directly by computers for location services. In addition, the present application also relates to blockchain technology, and the original address data can be stored in a blockchain.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a first embodiment of the address information parsing method in an embodiment of the present application;

FIG. 2 is a schematic diagram of a second embodiment of the address information parsing method in an embodiment of the present application;

FIG. 3 is a schematic diagram of a third embodiment of the address information parsing method in an embodiment of the present application;

FIG. 4 is a schematic diagram of a fourth embodiment of the address information parsing method in an embodiment of the present application;

FIG. 5 is a schematic diagram of an embodiment of the address information parsing apparatus in an embodiment of the present application;

FIG. 6 is a schematic diagram of another embodiment of the address information parsing apparatus in an embodiment of the present application;

FIG. 7 is a schematic diagram of an embodiment of the address information parsing device in an embodiment of the present application.
Detailed Description
The present application provides an address information parsing method, which solves the technical problem that existing address parsing algorithms rely on address canonicality, feature characters, and address dictionaries, resulting in low accuracy when parsing non-standard Chinese addresses.

In order to enable those skilled in the art to better understand the solutions of the present application, the embodiments of the present application are described below with reference to the accompanying drawings.

The terms "first", "second", "third", "fourth", etc. (if any) in the description, claims, and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used may be interchanged under appropriate circumstances, so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.
Referring to FIG. 1, the flowchart of the address information parsing method provided by an embodiment of the present application specifically includes the following.

It can be understood that the execution body of the present application may be an address information parsing apparatus, a terminal, or a server, which is not specifically limited here. The embodiments of the present application take a server as the execution body for description.

It should be emphasized that, to ensure the privacy and security of the data, the above original address data may be stored in a node of a blockchain.

In this embodiment, the preset data source may be official information websites or published address databases, from which the address data is crawled as original address data. Most of these original address data are Chinese addresses, which may be non-standard and inconsistent with the standard administrative divisions. For example, "Xuhui Kaibin Road" omits the administrative division feature character "District"; "Shanghai Kaibin Road" omits the intermediate administrative division "Xuhui District", leaving the administrative division hierarchy disordered; and the character "Qu" ("district") in "Qumei Kindergarten" causes the non-administrative-division part of the address to share a name with an administrative division.

In this embodiment, after millions of original address records are crawled from the data sources, a first screening step is performed, mainly by judging whether the characters in the original address data are UTF-8 encoded characters; non-UTF-8 characters, such as emoticons, are deleted to obtain standard original address data.
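One possible cleanup pass for this screening step can be sketched as follows. The exact filtering rule is an assumption: here the Unicode category "So" (Symbol, other) is used as a proxy for emoticons, and characters that cannot survive a UTF-8 round trip (lone surrogates) are also dropped.

```python
import unicodedata

def clean_address(text):
    """Drop characters that cannot be UTF-8 encoded (lone surrogates)
    and symbol characters such as emoji, keeping ordinary Chinese
    characters, letters, digits, and punctuation."""
    kept = []
    for ch in text:
        try:
            ch.encode("utf-8")                 # rejects lone surrogates
        except UnicodeEncodeError:
            continue
        if unicodedata.category(ch) == "So":   # "Symbol, other": emoji etc.
            continue
        kept.append(ch)
    return "".join(kept)
```

A stricter whitelist (e.g. keeping only CJK ranges, ASCII, and digits) would be an alternative design, at the cost of discarding rarer but legitimate address characters.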
102. Screen out, from the original address data, address expression data whose character length is within the preset length interval, and annotate the address expression data to obtain model training data.

In this embodiment, the preset length interval depends on the specific application scenario and is generally set between 7 and 20. For application scenarios requiring more detailed and complete addresses, the interval can be adjusted accordingly. Technically, the character length is a configurable parameter that has no effect on the subsequent model training process, so it only needs to be reconfigured for different application scenarios; the model generally requires addresses of at most 128 characters.
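The length screening is a simple configurable filter. The bounds below (7–20, with a 128-character model cap) follow the numbers given in the text; the function and variable names are illustrative.

```python
MIN_LEN, MAX_LEN = 7, 20    # preset length interval (scenario-dependent)
MODEL_MAX = 128             # maximum length the model accepts

def within_length_interval(address, lo=MIN_LEN, hi=MAX_LEN):
    """True if the address's character length falls in the configured interval,
    never exceeding the model's hard cap."""
    return lo <= len(address) <= min(hi, MODEL_MAX)

def screen(addresses):
    """Keep only address expression data within the preset length interval."""
    return [a for a in addresses if within_length_interval(a)]
```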
In this embodiment, the annotation is mainly performed manually, and the labels are mainly administrative divisions at 10 levels: "province", "city", "district/county", "township", "street", "road", "house number", "village", "building name", and "other". Here, "province" includes provinces, municipalities, autonomous regions, and special administrative regions; "city" includes prefecture-level cities, regions, autonomous prefectures, and leagues; "district/county" includes municipal districts, county-level cities, counties, banners, special zones, and forest areas; "township" includes towns, townships, ethnic townships, sumu, ethnic sumu, county-administered districts, and district offices; "street", like "township", belongs to the township-level administrative divisions; "road" includes roads, streets, and alleys; the other labels are the same as the standard names.

In this embodiment, manual annotation labels every character in the address expression data. For example, for "广东省深圳市" ("Guangdong Province, Shenzhen City"), each character can be labeled "province province province city city city". The model training data can be organized in the following format: "Guangdong Province/province Shenzhen City/city Bao'an District/district Xixiang Street/street Nanchang Second New Village/village X Lane/road No. X/house number", so that every character in the model training data has a corresponding label.
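The "segment/label" annotation format described above can be expanded into the per-character labels the model actually trains on. A sketch, using the Chinese example from the text (省 = province, 市 = city); the delimiter convention is assumed from the format shown:

```python
def expand_annotation(annotated):
    """Turn 'segment/label segment/label ...' annotated text into
    parallel lists of characters and per-character labels."""
    chars, labels = [], []
    for token in annotated.split():
        segment, label = token.rsplit("/", 1)   # split on the LAST '/'
        for ch in segment:
            chars.append(ch)
            labels.append(label)
    return chars, labels
```

Using `rsplit` guards against a "/" appearing inside a segment; every character of a segment inherits that segment's administrative division label.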
103. Train an address parsing model according to the model training data and the preset neural network.
In this embodiment, the preset neural network is a Bi-LSTM-CRF neural network. The Bi-LSTM-CRF comprises three layers: an Embedding layer, a Bi-LSTM layer, and a CRF layer. The Embedding layer maps each character of the input model training data to a vector in a low-dimensional space; such character vectors are distributed representations of the characters in the text and convey semantics to the computer through low-dimensional vectors in that space. The Bi-LSTM layer is a bidirectional long short-term memory network layer comprising two groups of modules, a forward LSTM and a backward LSTM; it can capture long-range contextual dependencies, extract entity features from both the preceding and following text, obtain more spatio-temporal correlations between entities, and suppress, from both directions, the influence of noise such as interfering entities on the neural network model, which greatly assists the mining of long-term dependencies. A conditional random field (CRF) is a discriminative probabilistic model, a type of random field, commonly used to label or analyze sequence data such as natural-language text or biological sequences. A CRF is an undirected graphical model: the vertices represent random variables, and the edges represent dependencies between them; the distribution of the label variable Y is a conditional probability given the observation variable X. In principle, the graph layout of a CRF can be chosen arbitrarily, but the commonly used layout is the linear-chain architecture, for which efficient algorithms exist for training, inference, and decoding. The advantage of the Bi-LSTM is that it remembers contextual information, which greatly assists the mining of long-term dependencies and helps semantic understanding; however, if it is used directly for the labeling task, a problem arises: as a sequential model, its output is made per character and is therefore only a locally optimal solution. A CRF, on the other hand, places high demands on its feature templates: only comprehensive templates let the model learn enough contextual information, and in practice template coverage is often incomplete. The Bi-LSTM can capture contextual information but needs a decoding model, while the CRF can produce a globally optimal solution but needs contextual features. Therefore, the present application combines the Bi-LSTM and the CRF to build a complete model in which the two complement each other.
104. Obtaining the address text to be recognized uploaded by the user, and inputting the address text to be recognized into the address parsing model, to obtain the administrative division label of each character in the address text to be recognized;
In this embodiment, after the address parsing model is obtained, it can be used to parse and recognize the various address texts to be recognized that are input by users. For example, when a user inputs "重庆巫溪塘坊镇塘坊村一社", the model labels each character in turn as "省 省 区县 区县 乡镇 乡镇 乡镇 村 村 村 其他" (province, province, district/county, district/county, township, township, township, village, village, village, other).
105. Converting the address text to be recognized into standard address text according to the administrative division labels of the characters in the address text to be recognized;
In this embodiment, characters bearing the same label are spliced together to obtain the name of each labeled administrative division. For example, since the characters "重" and "庆" are both labeled "省", the two are spliced to obtain "重庆", and subsequent characters are handled in the same way. After "重庆" is determined to be a "省"-level name, it is matched among the 34 provincial-level administrative regions to determine whether it is a province, an autonomous region, a municipality directly under the central government, or a special administrative region. Since Chongqing is a municipality, the character "市" is appended after "重庆", and matching then proceeds among the 40 districts and counties under Chongqing, and so on. In this way, the address text to be recognized, "重庆巫溪塘坊镇塘坊村一社", is parsed and recognized as the standard address text "重庆市巫溪县塘坊镇塘坊村一社".
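A minimal sketch of the splicing-and-normalization step described above. The tag names, the `MUNICIPALITIES` set, and the suffix rule are illustrative assumptions; a real implementation would match against the full tables of the 34 provincial-level regions and their districts and counties:

```python
from itertools import groupby

# Illustrative only: a tiny stand-in for the real provincial-level lookup tables.
MUNICIPALITIES = {"重庆", "北京", "上海", "天津"}  # 直辖市 (municipalities)

def splice(chars_tags):
    """Group consecutive characters that share a tag into (name, tag) spans."""
    spans = []
    for tag, grp in groupby(chars_tags, key=lambda ct: ct[1]):
        spans.append(("".join(ch for ch, _ in grp), tag))
    return spans

def normalize(spans):
    """Append the '市' suffix for municipalities; a real system would also
    normalize districts/counties, townships, etc."""
    out = []
    for name, tag in spans:
        if tag == "省" and name in MUNICIPALITIES and not name.endswith("市"):
            name += "市"  # e.g. 重庆 -> 重庆市
        out.append(name)
    return "".join(out)

tags = [("重", "省"), ("庆", "省"), ("巫", "区县"), ("溪", "区县")]
# splice(tags) -> [("重庆", "省"), ("巫溪", "区县")]
```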
In this embodiment, original address data is crawled from a preset data source using a web crawler tool; address expression data whose character length lies within a preset length interval is filtered out of the original address data and labeled to obtain model training data; an address parsing model is trained from the model training data and a preset neural network; the address text to be recognized uploaded by the user is obtained and input into the address parsing model to obtain the administrative division label of each character in the address text to be recognized; and the address text to be recognized is converted into standard address text according to the administrative division labels of its characters. With this method, the computer can extract semantic features of the entire address and take into account the division results of the preceding and following characters, thereby achieving multi-level administrative division parsing of non-standardized addresses. Compared with existing address parsing algorithms, this scheme does not depend on address regularity, feature words, or an address dictionary, and can therefore handle diverse non-standard expressions. The deep-model-based method can also learn naming and segmentation regularities from existing data and apply them during model inference, improving the parsing of non-standard Chinese addresses so that such Chinese address information can be used directly by computers for location-based services. In addition, the present application relates to blockchain technology, and the original address data may be stored in a blockchain.
Referring to FIG. 2, a second embodiment of the address information parsing method in the embodiments of the present application includes:
201. Crawling original address data from a preset data source using a web crawler tool;
202. Filtering out, from the original address data, address expression data whose character length is within a preset length interval, and labeling the address expression data to obtain model training data;
Steps 201-202 of this embodiment are similar to steps 101-102 of the first embodiment and are not repeated here.
203. Converting each character in the model training data into a one-hot vector;
204. Converting the one-hot vectors of the model training data into low-dimensional dense character vectors through a pre-trained vector matrix;
In this embodiment, the one-hot code is a one-hot vector. In the process of converting each character of the model training data into a character vector, each character must first be converted into a one-hot vector, because the Embedding layer is a fully connected layer that takes the one-hot vector as input and whose number of intermediate nodes equals the character-vector dimension. The one-hot vector is converted into a low-dimensional dense character vector through the pre-trained vector matrix, which solves the problems of the lexical gap and the curse of dimensionality.
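The equivalence between a one-hot input to a fully connected layer and a row lookup in the vector matrix can be checked with a small sketch (dimensions and values are illustrative):

```python
import numpy as np

# Multiplying a one-hot vector by the pre-trained vector matrix is equivalent
# to selecting one row of it, which is why the Embedding layer yields a
# low-dimensional dense character vector.
vocab_size, dim = 5, 3
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, dim))  # stand-in for the pre-trained matrix

one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0                 # one-hot code of the 3rd character

dense = one_hot @ E              # dense character vector
assert np.allclose(dense, E[2])  # identical to looking up row 2
```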
205. Using the character vectors as the input of each time step of the bidirectional long short-term memory network layer of the neural network, to obtain the hidden state sequence output by the forward long short-term memory network and the hidden state sequence output by the backward long short-term memory network;
206. Splicing the hidden state sequence output by the forward long short-term memory network with the hidden state sequence output by the backward long short-term memory network, to obtain the complete hidden output sequence;
In this embodiment, the encoding performed by the Bi-LSTM layer includes the following: the Bi-LSTM layer automatically extracts sentence features, taking the character-embedding sequence (x_1, x_2, x_3, …, x_n) of a sentence as the input of the Bi-LSTM at each time step; the hidden state sequence (h→_1, h→_2, h→_3, …, h→_n) output by the forward LSTM is then spliced, position by position, with the hidden states (h←_1, h←_2, h←_3, …, h←_n) output by the backward LSTM at each position, giving the complete hidden output sequence h_t = [h→_t; h←_t]. The output of the Bi-LSTM layer is a score for every candidate label of each character, and finally the label with the highest score is selected as the label of that character.
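The position-wise splicing of the two hidden state sequences can be sketched as follows (shapes and values are illustrative stand-ins for real LSTM outputs):

```python
import numpy as np

# Position-wise concatenation of forward and backward hidden states, as
# described for the Bi-LSTM layer. Shapes: n positions, `hidden` units each.
n, hidden = 4, 2
fwd = np.arange(n * hidden).reshape(n, hidden)        # (h1->, ..., hn->)
bwd = np.arange(n * hidden).reshape(n, hidden) + 100  # (h1<-, ..., hn<-)

H = np.concatenate([fwd, bwd], axis=1)  # complete hidden output sequence
# H[t] = [ht-> ; ht<-], so each position carries context from both directions
```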
207. Inputting the hidden output sequence into the conditional random field layer of the neural network, to predict the label of each character in the model training data;
208. Obtaining the administrative division sequence of the model training data according to the labels of the characters in the model training data;
In this embodiment, after the conditional random field layer has predicted the label of each character in the model training data, the administrative division labels of the characters are concatenated to obtain the administrative division sequence. For example, after the conditional random field layer labels each character of the model training data "上海省上海市上海静安嘉里中心", the administrative division sequence "省 省 省 市 市 市 建筑 建筑 建筑 建筑 建筑 建筑 建筑 建筑" is obtained.
209. Determining whether at least two administrative division label segments of the same label type appear in the administrative division sequence, where an administrative division label segment is a segment formed by consecutive identical administrative division labels;
210. If so, comparing the positions, within the administrative division sequence, of the administrative division label segments of the same label type, and re-predicting the administrative division labels in the later-positioned segment among the segments of the same label type;
In this embodiment, the conditional random field layer may mispredict the label of a character in the model training data. For example, labeling each character of the model training data "上海省上海市上海静安嘉里中心" may yield the administrative division sequence "省 省 省 市 市 市 省 省 建筑 建筑 建筑 建筑 建筑 建筑", in which two segments of the same label type appear: "省 省 省" and "省 省". Obviously, two separated segments bearing the same administrative division label cannot occur within one address, so the characters in the later-positioned segment must be re-predicted.
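A sketch of the duplicate-segment check described above, under the assumption that the labels are given as a simple list of tag strings (the function name is hypothetical):

```python
from itertools import groupby

def repeated_segment_types(tags):
    """Return the division types that appear in two or more separated segments,
    e.g. the two '省' segments in '省 省 省 市 市 市 省 省 ...'."""
    seg_types = [tag for tag, _ in groupby(tags)]  # one entry per segment
    seen, repeats = set(), set()
    for t in seg_types:
        if t in seen:
            repeats.add(t)  # this type already formed an earlier segment
        seen.add(t)
    return repeats

seq = ["省", "省", "省", "市", "市", "市", "省", "省", "建筑"]
# repeated_segment_types(seq) -> {"省"}: the later "省 省" segment is re-predicted
```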
211. Comparing the labels predicted by the conditional random field layer for the characters of the model training data with the original labels of the model training data, and iterating, to obtain the final pre-trained address parsing model;
212. Obtaining the address text to be recognized uploaded by the user, and inputting the address text to be recognized into the address parsing model, to obtain the administrative division label of each character in the address text to be recognized;
213. Converting the address text to be recognized into standard address text according to the administrative division labels of the characters in the address text to be recognized.
Steps 212-213 of this embodiment are similar to steps 104-105 of the first embodiment and are not repeated here.
On the basis of the previous embodiment, this embodiment describes in detail the process of training the address parsing model from the model training data and the preset neural network: the model training data is input into the embedding layer of the neural network, and each character of the model training data is converted into a character vector; the character vectors are used as the input of each time step of the bidirectional long short-term memory network layer of the neural network, to obtain the hidden output sequence of the model training data; and the hidden output sequence is input into the conditional random field layer of the neural network, the label of each character in the model training data is predicted and compared with the original labels of the model training data, and the final pre-trained address parsing model is obtained by iteration. This embodiment also adds a post-processing step after the conditional random field layer predicts the labels: the administrative division sequence of the model training data is obtained from the labels of its characters; it is determined whether at least two administrative division label segments of the same label type appear in the sequence, an administrative division label segment being a segment formed by consecutive identical administrative division labels; if so, the positions of the same-type segments within the sequence are compared, and the administrative division labels in the later-positioned segment are re-predicted.
Referring to FIG. 3, a third embodiment of the address information parsing method in the embodiments of the present application includes:
301. Crawling original address data from a preset data source using a web crawler tool;
302. Filtering out, from the original address data, address expression data whose character length is within a preset length interval, and labeling the address expression data to obtain model training data;
303. Inputting the model training data into the embedding layer of the neural network, and converting each character in the model training data into a character vector;
304. Using the character vectors as the input of each time step of the bidirectional long short-term memory network layer of the neural network, to obtain the hidden output sequence of the model training data;
305. Inputting the hidden output sequence into the conditional random field layer of the neural network, to predict the label of each character in the model training data;
306. Obtaining the administrative division sequence of the model training data according to the labels of the characters in the model training data;
307. Determining, according to the order in which the administrative division labels are arranged in the administrative division sequence, whether the administrative division sequence contains an error;
308. If so, re-predicting the administrative division labels of the characters in the administrative division sequence;
In this embodiment, the output of the CRF layer requires post-processing, which includes splicing the characters bearing adjacent identical administrative division labels, and erroneous predictions may occur. For example, for "上海省上海市上海静安嘉里中心", each character might be labeled "省 省 省 市 市 市 省 省 建筑 建筑 建筑 建筑 建筑 建筑". According to the order of administrative divisions in a normal address, "省" should precede "市"; but in this label sequence "省" appears after "市", so the labels predicted by the conditional random field layer for the characters of the model training data contain an error and must be re-predicted.
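The ordering check can be sketched as follows; the rank table is a simplified assumption, not the application's full administrative hierarchy:

```python
# Coarse-to-fine ranks for division labels (illustrative simplification).
RANK = {"省": 0, "市": 1, "区县": 2, "乡镇": 3, "村": 4, "建筑": 5}

def order_error(tags):
    """Return True if a coarser label (e.g. 省) appears after a finer one
    (e.g. 市), violating the normal order of an address."""
    ranked = [RANK[t] for t in tags if t in RANK]
    return any(a > b for a, b in zip(ranked, ranked[1:]))

bad = ["省", "省", "市", "市", "省", "省", "建筑"]
# order_error(bad) -> True: "省" reappears after "市", so re-prediction is needed
```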
309. Comparing the labels predicted by the conditional random field layer for the characters of the model training data with the original labels of the model training data, and iterating, to obtain the final pre-trained address parsing model;
In this embodiment, each character of the model training data must first be converted into a one-hot vector, and the one-hot vector is then converted into character-vector form; a character vector is a distributed representation of a character in the text that conveys semantics to the computer through a low-dimensional vector in space. After the model training data has been passed through the Embedding layer of the neural network and output in character-vector form, the character vectors are input into the Bi-LSTM layer. The Bi-LSTM neural network is suited to sequence labeling tasks: it performs the same operation on every character vector in the input sequence. The operation here is matrix multiplication, which linearly maps a high-dimensional matrix (e.g. 300 dimensions) to a low-dimensional one (e.g. 128 dimensions); each dimension of the matrix represents a feature, so this operation can remove useless features. Each step of the computation depends on the result of the previous step while encoding contextual features; concretely, the result of the previous step (here, a feature) serves as part of the input of the next step. For example, if the previous step extracts the feature h_{t-1} for the character "上", the next step extracts the feature for the character "海" as f(x_t, h_{t-1}) = h_t, where x_t is the feature of the character "海" itself, f is the function used in the computation, and h_t is the feature finally extracted for "海". Thus, when the features of the current step are extracted, the features of the previous step also take part in the computation, which is what is meant by "encoding the preceding context"; a similar operation handles the following context. The features encoded with context are output as the full set of features extracted for each character. The output of the Bi-LSTM layer is then used as the input of the CRF layer. The features output by the Bi-LSTM do not consider the influence of the previous label on the current label: for example, if the current character is "巫" and the two preceding characters "重庆" form a city name, then "巫溪" is very likely a district/county name or a township name. Therefore a CRF layer (conditional random field) is appended to the output layer of the Bi-LSTM, so that the output sequence of the Bi-LSTM becomes the observation sequence of the CRF layer; the CRF then computes the probabilistically optimal solution for the entire sequence, taking into account the interactions between the sequence labels. The output label sequence of the CRF corresponds character by character to the input address.
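The global decoding that the CRF layer performs can be illustrated with a standard Viterbi sketch over illustrative emission and transition scores (a generic CRF decoding example, not the application's exact implementation):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Find the label sequence maximizing the total emission + transition score.
    emissions: (n, k) per-character label scores (as output by the Bi-LSTM);
    transitions: (k, k) score of moving from label i to label j."""
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)   # best previous label for each label
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two labels (0 = "省", 1 = "市"); transitions strongly discourage 市 -> 省,
# so the globally optimal path differs from greedy per-character choices.
em = np.array([[2.0, 0.0], [1.1, 1.0], [0.0, 2.0]])
tr = np.array([[0.5, 0.0], [-5.0, 0.5]])
# viterbi(em, tr) -> [0, 0, 1]
```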
310. Obtaining the address text to be recognized uploaded by the user, and inputting the address text to be recognized into the address parsing model, to obtain the administrative division label of each character in the address text to be recognized;
311. Converting the address text to be recognized into standard address text according to the administrative division labels of the characters in the address text to be recognized.
On the basis of the preceding embodiments, this embodiment adds a process of error checking for the labels that the conditional random field layer predicts for the characters of the model training data: the administrative division sequence of the model training data is obtained from the labels of its characters; according to the order in which the administrative division labels are arranged in the sequence, it is determined whether the sequence contains an error; and if so, the administrative division labels of the characters in the sequence are re-predicted. With this method, errors in the labels predicted by the conditional random field layer for the characters of the model training data can be corrected, improving the efficiency of model training.
Referring to FIG. 4, a fourth embodiment of the address information parsing method in the embodiments of the present application includes:
401. Crawling original address data from a preset data source using a web crawler tool;
402. Filtering out, from the original address data, address expression data whose character length is within a preset length interval, and labeling the address expression data to obtain model training data;
403. Training an address parsing model from the model training data and a preset neural network;
404. Obtaining the address text to be recognized uploaded by the user, and inputting the address text to be recognized into the address parsing model, to obtain the administrative division label of each character in the address text to be recognized;
Steps 401-404 of this embodiment are similar to steps 101-104 of the first embodiment and are not repeated here.
405. Creating an initially empty character buffer, and processing each character of the address text to be recognized in the character order of that text;
406. Storing the first character of the address text to be recognized in the character buffer, and determining the administrative division label of the first character;
407. Determining whether the administrative division label of the first character is the same as the administrative division label of the second character;
408. If they are the same, storing the second character in the character buffer;
409. If they are not the same, outputting the first character, emptying the character buffer, and proceeding to process the next character;
410. Splicing the characters output from the character buffer that bear the same administrative division label;
In this embodiment, an initially empty character buffer is provided, and the labeled characters of the address text to be recognized are stored in the buffer in the order of the text itself. Take "重庆巫溪塘坊镇塘坊村一社" as an example: "重" is first placed in the buffer, and it is determined whether "重" and "庆" bear the same administrative division label. Since "重" and "庆" are both labeled "省", "庆" is stored in the buffer, and it is then determined whether "庆" and "巫" bear the same label. "巫" is labeled "区县", which differs from "庆", so the two characters "重" and "庆" are taken out of the buffer and spliced into "重庆". By processing each character in this way, "重庆巫溪塘坊镇塘坊村一社" is divided into "重庆", "巫溪", "塘坊镇", "塘坊村" and "一社", a division that facilitates the subsequent conversion of the address text to be recognized into standard address text.
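The buffer procedure described above can be sketched directly (tag values are illustrative):

```python
# Buffer-based splicing: characters bearing the same administrative-division
# label are accumulated in an initially empty buffer, which is flushed
# whenever the label changes.
def splice_with_buffer(chars_tags):
    pieces, buffer, current = [], [], None
    for ch, tag in chars_tags:
        if current is None or tag == current:
            buffer.append(ch)               # same label: keep accumulating
        else:
            pieces.append("".join(buffer))  # label changed: flush the buffer
            buffer = [ch]
        current = tag
    if buffer:
        pieces.append("".join(buffer))      # flush the final segment
    return pieces

tags = [("重", "省"), ("庆", "省"), ("巫", "区县"), ("溪", "区县"),
        ("塘", "乡镇"), ("坊", "乡镇"), ("镇", "乡镇")]
# splice_with_buffer(tags) -> ["重庆", "巫溪", "塘坊镇"]
```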
411. Converting the address text to be recognized into standard address text according to the administrative division labels of the characters in the address text to be recognized.
On the basis of the preceding embodiments, this embodiment adds a process of splicing consecutive characters of the address text to be recognized that bear the same administrative division label: an initially empty character buffer is created, and each character of the address text to be recognized is processed in the character order of the text; the first character of the address text is stored in the buffer and its administrative division label is determined; it is determined whether the label of the first character is the same as that of the second character; if so, the second character is stored in the buffer; if not, the first character is output, the buffer is emptied, and the next character is processed; finally, the characters output from the buffer that bear the same administrative division label are spliced. Splicing consecutive identically labeled characters in this way facilitates the subsequent conversion of the address text to be recognized into standard address text.
The address information parsing method in the embodiments of the present application has been described above; the address information parsing apparatus in the embodiments of the present application is described below. Referring to FIG. 5, one embodiment of the address information parsing apparatus in the embodiments of the present application includes:
a data crawling module 501, configured to crawl original address data from a preset data source using a web crawler tool;
a filtering module 502, configured to filter out, from the original address data, address expression data whose character length is within a preset length interval, and to label the address expression data to obtain model training data;
a model training module 503, configured to train an address parsing model from the model training data and a preset neural network;
a model input module 504, configured to obtain the address text to be recognized uploaded by the user, and to input the address text to be recognized into the address parsing model, obtaining the administrative division label of each character in the address text to be recognized;
a standard conversion module 505, configured to convert the address text to be recognized into standard address text according to the administrative division labels of the characters in the address text to be recognized.
It should be emphasized that, to protect the privacy and security of the data, the above address text to be recognized may be stored in a node of a blockchain.
本申请实施例中,所述地址信息解析装置运行上述地址信息解析方法,所述地址信息解析方法包括:利用网页爬虫工具从预设的数据源中爬取原始地址数据;从所述原始地址数据中筛选出字符长度在预设长度区间内的地址表述数据,并对所述地址表述数据进行标注,得到模型训练数据;根据所述模型训练数据和预设的神经网络,训练得到地址解析模型;获取用户上传的待识别地址文本,并将所述待识别地址文本输入至所述地址解析模型中,获得所述待识别地址文本中各字符的行政区划标注;根据所述待识别地址文本中各字符的行政区划标注,将所述待识别地址文本转化为标准地址文本。通过本方法,可使计算机抽取整个地址的语义特征,并考虑前后字符行政区划的划分结果,实现非规范化地址的多级行政区划解析。相比现有的地址解析算法,此方案不依赖于地址规范性、特征字以及地址词典,因此可处理多样化的非规范表达。基于深度模型的方法还可学习到已有数据中的命名与切分规律,并应用于模型推断,可提升非规范的中文地址解析效果,使得这样的中文地址信息能够被计算机直接用于位置服务。此外,本申请还涉及区块链技术,原始地址数据可存储于区块链中。In the embodiment of the present application, the address information parsing apparatus runs the address information parsing method, and the address information parsing method includes: using a web crawler tool to crawl original address data from a preset data source; Screening out address expression data whose character length is within a preset length interval, and marking the address expression data to obtain model training data; and training to obtain an address parsing model according to the model training data and a preset neural network; Obtain the address text to be recognized uploaded by the user, input the address text to be recognized into the address resolution model, and obtain the administrative division labeling of each character in the address text to be recognized; The administrative division labeling of the characters converts the address text to be recognized into standard address text. Through the method, the computer can extract the semantic features of the entire address, and consider the division results of the administrative divisions of the characters before and after, so as to realize the multi-level administrative division analysis of the non-standardized address. Compared with the existing address resolution algorithms, this scheme does not rely on address canonicality, feature words and address dictionaries, so it can handle diverse non-canonical expressions. 
A deep-model-based method can also learn the naming and segmentation patterns in existing data and apply them during model inference, improving the parsing of non-standard Chinese addresses so that such Chinese address information can be used directly by computers for location services. In addition, this application also relates to blockchain technology, and the original address data can be stored in a blockchain.
请参阅图6,本申请实施例中地址信息解析装置的第二个实施例包括:Referring to FIG. 6, the second embodiment of the address information parsing apparatus in the embodiment of the present application includes:
数据爬取模块501,用于利用网页爬虫工具从预设的数据源中爬取原始地址数据;A data crawling module 501 is used to crawl original address data from a preset data source by using a web crawler tool;
筛选模块502,用于从所述原始地址数据中筛选出字符长度在预设长度区间内的地址表述数据,并对所述地址表述数据进行标注,得到模型训练数据;A screening module 502, configured to screen out address expression data whose character length is within a preset length interval from the original address data, and annotate the address expression data to obtain model training data;
模型训练模块503,用于根据所述模型训练数据和预设的神经网络,训练得到地址解析模型;A model training module 503, configured to obtain an address parsing model by training according to the model training data and a preset neural network;
模型输入模块504,用于获取用户上传的待识别地址文本,并将所述待识别地址文本输入至所述地址解析模型中,获得所述待识别地址文本中各字符的行政区划标注;A model input module 504, configured to acquire the address text to be recognized uploaded by the user, input the address text to be recognized into the address resolution model, and obtain the administrative division labeling of each character in the address text to be recognized;
标准转化模块505,用于根据所述待识别地址文本中各字符的行政区划标注,将所述待识别地址文本转化为标准地址文本。The standard conversion module 505 is configured to convert the to-be-recognized address text into standard address text according to the administrative division labeling of each character in the to-be-recognized address text.
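Among the modules above, the length filter applied by the screening module 502 can be sketched minimally as follows; the interval bounds `min_len` and `max_len` are assumed example values, not figures taken from the application:

```python
def screen_addresses(raw_addresses, min_len=6, max_len=50):
    """Keep only address strings whose character length lies within
    the preset length interval [min_len, max_len]."""
    return [addr for addr in raw_addresses
            if min_len <= len(addr) <= max_len]

# Too-short and too-long entries are dropped; the rest become
# annotation candidates (model training data after labeling).
raw = ["广东省深圳市南山区科技园", "深圳", "x" * 200]
print(screen_addresses(raw))  # ['广东省深圳市南山区科技园']
```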
其中,所述模型训练模块503包括:Wherein, the model training module 503 includes:
向量转化单元5031,用于将所述模型训练数据输入至所述神经网络中的嵌入层中,将所述模型训练数据中的每个字符转化为字向量; Vector conversion unit 5031, for inputting the model training data into the embedding layer in the neural network, and converting each character in the model training data into a word vector;
序列单元5032,用于将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入,得到所述模型训练数据的隐输出序列;The sequence unit 5032 is used to input the word vector as the input of each time step of the bidirectional long short-term memory network layer in the neural network to obtain the implicit output sequence of the model training data;
标注预测单元5033，用于将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注，并与所述模型训练数据原有的标注进行比对和迭代，得到最终预训练的地址解析模型。The label prediction unit 5033 is configured to input the latent output sequence into the conditional random field layer of the neural network, predict the label of each character in the model training data, compare the predictions against the original labels of the model training data, and iterate, to obtain the final pre-trained address parsing model.
可选的,所述向量转化单元5031具体用于:Optionally, the vector conversion unit 5031 is specifically used for:
将所述模型训练数据中的每个字符转化为独热码向量；converting each character in the model training data into a one-hot vector;
将所述模型训练数据的独热码向量通过预训练好的向量矩阵转化为低维稠密的字向量。The one-hot code vector of the model training data is converted into a low-dimensional dense word vector through a pre-trained vector matrix.
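The two steps above (one-hot encoding, then projection through a pre-trained vector matrix) can be sketched as follows with toy sizes; since each row is one-hot, the matrix product reduces to looking up one row of the matrix per character, which is how embedding layers are implemented in practice:

```python
def chars_to_dense(char_ids, vocab_size, embed_matrix):
    """One-hot encode each character index, then project the one-hot
    vectors through a (pre-trained) vector matrix to obtain
    low-dimensional dense word vectors."""
    dense = []
    for idx in char_ids:
        one_hot = [1.0 if j == idx else 0.0 for j in range(vocab_size)]
        row = [sum(one_hot[j] * embed_matrix[j][d] for j in range(vocab_size))
               for d in range(len(embed_matrix[0]))]
        dense.append(row)
    return dense

# Toy 5-character vocabulary with 2-dimensional embeddings.
embed_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
print(chars_to_dense([2, 0], 5, embed_matrix))  # [[0.5, 0.6], [0.1, 0.2]]
```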
可选的,所述序列单元5032具体用于:Optionally, the sequence unit 5032 is specifically used for:
将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入得到正向长短期记忆网络输出的隐状态序列和反向长短期记忆网络输出的隐状态序列；Using the word vectors as the input at each time step of the bidirectional long short-term memory network layer in the neural network to obtain the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM;
将所述正向长短期记忆网络输出的隐状态序列和所述反向长短期记忆网络输出的隐状态序列进行拼接，得到完整的隐输出序列。The hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are concatenated to obtain the complete hidden output sequence.
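The splicing step above can be sketched schematically (the numbers below are stand-ins, not real LSTM outputs): at each time step, the forward hidden state is concatenated with the backward hidden state after the backward sequence has been re-aligned to the original character order:

```python
def splice_hidden_sequences(h_forward, h_backward):
    """Concatenate the forward-LSTM and backward-LSTM hidden states
    per time step to form the complete latent output sequence."""
    assert len(h_forward) == len(h_backward)
    return [hf + hb for hf, hb in zip(h_forward, h_backward)]

# Stand-in 2-dimensional hidden states for a 3-character input.
h_fwd = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
h_bwd = [[0.9, 0.8], [0.7, 0.6], [0.5, 0.4]]
print(splice_hidden_sequences(h_fwd, h_bwd))
# [[0.1, 0.2, 0.9, 0.8], [0.3, 0.4, 0.7, 0.6], [0.5, 0.6, 0.5, 0.4]]
```

Each spliced vector thus carries context from both directions of the address text.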
可选的,所述模型训练模块还包括第一重测单元5034,所述第一重测单元5034具体用于:Optionally, the model training module further includes a first re-measurement unit 5034, and the first re-measurement unit 5034 is specifically used for:
根据所述模型训练数据中各字符的标注,获得所述模型训练数据的行政区划序列;Obtain the administrative division sequence of the model training data according to the labeling of each character in the model training data;
判断所述行政区划序列中,是否出现标注类型相同的至少两段行政区划标注片段,其中,所述行政区划片段为连续相同的行政区划标注构成的片段;Judging whether there are at least two administrative division annotation fragments with the same annotation type in the sequence of administrative divisions, wherein the administrative division fragments are fragments composed of consecutive and identical administrative division annotations;
若是，则比较标注类型相同的行政区划标注片段在所述行政区划序列中的位置，并对标注类型相同的行政区划标注片段中位置靠后的行政区划标注片段中的行政区划标注进行重新预测。If so, compare the positions, within the administrative division sequence, of the administrative division annotation segments that share the same annotation type, and re-predict the administrative division labels in the later-positioned segment among those segments.
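One way to implement this duplicate-segment check is sketched below; the label names and the convention of flagging the later run for re-prediction are assumptions made for illustration. The per-character labels are first grouped into contiguous runs, then any run whose division type already occurred earlier is returned:

```python
from itertools import groupby

def segments_to_repredict(labels):
    """Group per-character division labels into contiguous segments and
    return (label, start, end) for any later segment whose type already
    appeared earlier, i.e. candidates for re-prediction."""
    segments, pos = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, pos, pos + length))  # [start, end)
        pos += length
    seen, flagged = set(), []
    for label, start, end in segments:
        if label in seen:
            flagged.append((label, start, end))
        seen.add(label)
    return flagged

# Two "PROV" runs: the later one (starting at position 4) is flagged.
print(segments_to_repredict(["PROV", "PROV", "CITY", "CITY", "PROV"]))
# [('PROV', 4, 5)]
```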
可选的,所述模型训练模块还包括第二重测单元5035,所述第二重测单元5035具体用于:Optionally, the model training module further includes a second re-measurement unit 5035, and the second re-measurement unit 5035 is specifically used for:
根据所述模型训练数据中各字符的标注,获得所述模型训练数据的行政区划序列;Obtain the administrative division sequence of the model training data according to the labeling of each character in the model training data;
根据所述行政区划序列中行政区划标注的排列顺序,判断所述行政区划序列是否存在错误;According to the arrangement order of the administrative division labels in the administrative division sequence, determine whether there is an error in the administrative division sequence;
若是,则对所述行政区划序列中字符的行政区划标注进行重新预测。If so, re-predict the administrative division labels of the characters in the administrative division sequence.
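A simple version of this order check can be written by ranking each division type in an assumed province→city→district→street hierarchy and flagging any jump back up the hierarchy; the label names and ranks are illustrative only:

```python
# Assumed hierarchy ranks; the label names are illustrative.
LEVEL = {"PROV": 0, "CITY": 1, "DIST": 2, "TOWN": 3}

def has_order_error(segment_labels):
    """Return True if the division segments ever jump back up the
    hierarchy (e.g. a province segment after a city segment), in
    which case the labels should be re-predicted."""
    ranks = [LEVEL[label] for label in segment_labels]
    return any(a > b for a, b in zip(ranks, ranks[1:]))

print(has_order_error(["PROV", "CITY", "DIST"]))  # False
print(has_order_error(["CITY", "PROV", "DIST"]))  # True
```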
其中,所述地址信息解析装置还包括字符连接模块506,所述字符连接模块506具体用于:Wherein, the address information parsing device further includes a character connection module 506, and the character connection module 506 is specifically used for:
建立初始为空的字符缓存区,按照所述待识别地址文本的字符顺序处理所述待识别地址文本中的每个字符;establishing an initially empty character buffer area, and processing each character in the address text to be recognized according to the character sequence of the address text to be recognized;
将所述待识别地址文本的第一字符存入所述字符缓存区,并确定所述第一字符的行政区划标注;storing the first character of the address text to be recognized into the character buffer area, and determining the administrative division label of the first character;
判断所述第一字符的行政区划标注与第二字符的行政区划标注是否相同;Determine whether the administrative division labeling of the first character is the same as the administrative division labeling of the second character;
若相同,则将所述第二字符存入所述字符缓存区;If the same, the second character is stored in the character buffer;
若不相同,则将所述第一字符输出,并清空所述字符缓存区,并进行下一字符的处理;If not, output the first character, clear the character buffer, and process the next character;
将所述字符缓存区输出的相同行政区划标注的字符拼接。The characters marked with the same administrative division output from the character buffer are spliced together.
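The buffer procedure described for the character connection module 506 can be sketched as follows (label names are illustrative): characters are walked in order, accumulated while their division label stays the same, and the buffer is flushed as one spliced piece whenever the label changes, plus once at the end:

```python
def splice_by_division(chars, labels):
    """Accumulate consecutive characters that share a division label in
    an initially empty buffer; flush the buffer as one piece whenever
    the label changes, and once more at the end."""
    assert len(chars) == len(labels)
    pieces, buffer, current = [], [], None
    for ch, lab in zip(chars, labels):
        if current is not None and lab != current:
            pieces.append(("".join(buffer), current))
            buffer = []
        buffer.append(ch)
        current = lab
    if buffer:
        pieces.append(("".join(buffer), current))
    return pieces

chars  = list("广东省深圳市")
labels = ["PROV", "PROV", "PROV", "CITY", "CITY", "CITY"]
print(splice_by_division(chars, labels))
# [('广东省', 'PROV'), ('深圳市', 'CITY')]
```

The spliced pieces, ordered by division level, can then be joined into the standard address text.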
本实施例在上一实施例的基础上，详细描述了各个模块的具体功能以及部分模块的单元构成，通过本装置，可使计算机抽取整个地址的语义特征，并考虑前后字符行政区划的划分结果，实现非规范化地址的多级行政区划解析。相比现有的地址解析算法，不依赖于地址规范性、特征字以及地址词典，因此可处理多样化的非规范表达。基于深度模型的方法还可学习到已有数据中的命名与切分规律，并应用于模型推断，可提升非规范的中文地址解析效果，使得这样的中文地址信息能够被计算机直接用于位置服务。On the basis of the previous embodiment, this embodiment describes in detail the specific functions of each module and the unit composition of some modules. With this apparatus, a computer can extract the semantic features of the entire address and take into account the administrative division results of the preceding and following characters, thereby realizing multi-level administrative division parsing of non-standardized addresses. Unlike existing address parsing algorithms, it does not depend on address regularity, feature words, or an address dictionary, and can therefore handle diverse non-standard expressions. The deep-model-based method can also learn the naming and segmentation patterns in existing data and apply them during model inference, improving the parsing of non-standard Chinese addresses so that such Chinese address information can be used directly by computers for location services.
上面图5和图6从模块化功能实体的角度对本申请实施例中的地址信息解析装置进行详细描述，下面从硬件处理的角度对本申请实施例中地址信息解析设备进行详细描述。FIG. 5 and FIG. 6 above describe the address information parsing apparatus in the embodiments of the present application in detail from the perspective of modular functional entities; the following describes the address information parsing device in the embodiments of the present application in detail from the perspective of hardware processing.
图7是本申请实施例提供的一种地址信息解析设备的结构示意图，该地址信息解析设备700可因配置或性能不同而产生比较大的差异，可以包括一个或一个以上处理器(central processing units,CPU)710(例如，一个或一个以上处理器)和存储器720，一个或一个以上存储应用程序733或数据732的存储介质730(例如一个或一个以上海量存储设备)。其中，存储器720和存储介质730可以是短暂存储或持久存储。存储在存储介质730的程序可以包括一个或一个以上模块(图示没标出)，每个模块可以包括对地址信息解析设备700中的一系列指令操作。更进一步地，处理器710可以设置为与存储介质730通信，在地址信息解析设备700上执行存储介质730中的一系列指令操作，以实现上述地址信息解析方法的步骤。FIG. 7 is a schematic structural diagram of an address information parsing device provided by an embodiment of the present application. The address information parsing device 700 may vary greatly in configuration or performance, and may include one or more central processing units (CPUs) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing application programs 733 or data 732. The memory 720 and the storage media 730 may provide transient or persistent storage. A program stored on a storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the address information parsing device 700. Further, the processor 710 may be configured to communicate with the storage medium 730 and execute, on the address information parsing device 700, the series of instruction operations in the storage medium 730 to implement the steps of the above address information parsing method.
地址信息解析设备700还可以包括一个或一个以上电源740，一个或一个以上有线或无线网络接口750，一个或一个以上输入输出接口760，和/或，一个或一个以上操作系统731，例如Windows Server,Mac OS X,Unix,Linux,FreeBSD等等。本领域技术人员可以理解，图7示出的地址信息解析设备结构并不构成对本申请提供的地址信息解析设备的限定，可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置。The address information parsing device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input/output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on. Those skilled in the art will understand that the structure shown in FIG. 7 does not limit the address information parsing device provided by this application, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain)，本质上是一个去中心化的数据库，是一串使用密码学方法相关联产生的数据块，每一个数据块中包含了一批次网络交易的信息，用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
本申请还提供一种计算机可读存储介质,该计算机可读存储介质可以为非易失性计算机可读存储介质,该计算机可读存储介质也可以为易失性计算机可读存储介质,所述计算机可读存储介质中存储有指令,当所述指令在计算机上运行时,使得计算机执行所述地址信息解析方法的步骤。The present application also provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium may also be a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer, make the computer execute the steps of the address information parsing method.
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统或装置、单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the above-described system, device, and unit may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
以上所述，以上实施例仅用以说明本申请的技术方案，而非对其限制；尽管参照前述实施例对本申请进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or equivalently replace some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. 一种地址信息解析方法,其中,所述地址信息解析方法包括:An address information parsing method, wherein the address information parsing method comprises:
    利用网页爬虫工具从预设的数据源中爬取原始地址数据;Use web crawler tools to crawl original address data from preset data sources;
    从所述原始地址数据中筛选出字符长度在预设长度区间内的地址表述数据,并对所述地址表述数据进行标注,得到模型训练数据;Filter out address expression data whose character length is within a preset length interval from the original address data, and annotate the address expression data to obtain model training data;
    根据所述模型训练数据和预设的神经网络,训练得到地址解析模型;According to the model training data and the preset neural network, an address resolution model is obtained by training;
    获取用户上传的待识别地址文本,并将所述待识别地址文本输入至所述地址解析模型中,获得所述待识别地址文本中各字符的行政区划标注;Obtaining the address text to be recognized uploaded by the user, inputting the address text to be recognized into the address resolution model, and obtaining the administrative division labels of each character in the address text to be recognized;
    根据所述待识别地址文本中各字符的行政区划标注,将所述待识别地址文本转化为标准地址文本。The to-be-recognized address text is converted into a standard address text according to the administrative division labeling of each character in the to-be-recognized address text.
  2. 根据权利要求1所述的地址信息解析方法,其中,所述根据所述模型训练数据和预设的神经网络,训练得到地址解析模型包括:The address information parsing method according to claim 1, wherein the obtaining an address parsing model by training according to the model training data and a preset neural network comprises:
    将所述模型训练数据输入至所述神经网络中的嵌入层中,将所述模型训练数据中的每个字符转化为字向量;The model training data is input into the embedding layer in the neural network, and each character in the model training data is converted into a word vector;
    将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入,得到所述模型训练数据的隐输出序列;The word vector input is used as the input of each time step of the bidirectional long short-term memory network layer in the neural network to obtain the latent output sequence of the model training data;
    将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注，并与所述模型训练数据原有的标注进行比对和迭代，得到最终预训练的地址解析模型。Input the latent output sequence into the conditional random field layer of the neural network, predict the label of each character in the model training data, compare the predictions against the original labels of the model training data, and iterate, to obtain the final pre-trained address parsing model.
  3. 根据权利要求2所述的地址信息解析方法，其中，所述将所述模型训练数据输入至所述神经网络中的嵌入层中，将所述模型训练数据中的每个字符转化为字向量包括：The address information parsing method according to claim 2, wherein inputting the model training data into the embedding layer of the neural network and converting each character in the model training data into a word vector comprises:
    将所述模型训练数据中的每个字符转化为独热码向量；converting each character in the model training data into a one-hot vector;
    将所述模型训练数据的独热码向量通过预训练好的向量矩阵转化为低维稠密的字向量。The one-hot code vector of the model training data is converted into a low-dimensional dense word vector through a pre-trained vector matrix.
  4. 根据权利要求3所述的地址信息解析方法，其中，所述将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入，得到所述模型训练数据的隐输出序列包括：The address information parsing method according to claim 3, wherein using the word vectors as the input at each time step of the bidirectional long short-term memory network layer in the neural network to obtain the latent output sequence of the model training data comprises:
    将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入得到正向长短期记忆网络输出的隐状态序列和反向长短期记忆网络输出的隐状态序列；Using the word vectors as the input at each time step of the bidirectional long short-term memory network layer in the neural network to obtain the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM;
    将所述正向长短期记忆网络输出的隐状态序列和所述反向长短期记忆网络输出的隐状态序列进行拼接，得到完整的隐输出序列。The hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are concatenated to obtain the complete hidden output sequence.
  5. 根据权利要求4所述的地址信息解析方法，其中，在所述将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注之后，还包括：The address information parsing method according to claim 4, wherein after inputting the latent output sequence into the conditional random field layer of the neural network and predicting the label of each character in the model training data, the method further comprises:
    根据所述模型训练数据中各字符的标注,获得所述模型训练数据的行政区划序列;Obtain the administrative division sequence of the model training data according to the labeling of each character in the model training data;
    判断所述行政区划序列中,是否出现标注类型相同的至少两段行政区划标注片段,其中,所述行政区划片段为连续相同的行政区划标注构成的片段;Judging whether there are at least two administrative division annotation fragments with the same annotation type in the sequence of administrative divisions, wherein the administrative division fragments are fragments composed of consecutive and identical administrative division annotations;
    若是，则比较标注类型相同的行政区划标注片段在所述行政区划序列中的位置，并对标注类型相同的行政区划标注片段中位置靠后的行政区划标注片段中的行政区划标注进行重新预测。If so, compare the positions, within the administrative division sequence, of the administrative division annotation segments that share the same annotation type, and re-predict the administrative division labels in the later-positioned segment among those segments.
  6. 根据权利要求4所述的地址信息解析方法，其中，在所述将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注之后，还包括：The address information parsing method according to claim 4, wherein after inputting the latent output sequence into the conditional random field layer of the neural network and predicting the label of each character in the model training data, the method further comprises:
    根据所述模型训练数据中各字符的标注,获得所述模型训练数据的行政区划序列;Obtain the administrative division sequence of the model training data according to the labeling of each character in the model training data;
    根据所述行政区划序列中行政区划标注的排列顺序,判断所述行政区划序列是否存在错误;According to the arrangement order of the administrative division labels in the administrative division sequence, determine whether there is an error in the administrative division sequence;
    若是,则对所述行政区划序列中字符的行政区划标注进行重新预测。If so, re-predict the administrative division labels of the characters in the administrative division sequence.
  7. 根据权利要求1-6中任一项所述的地址信息解析方法，其中，在所述获取用户上传的待识别地址文本，并将所述待识别地址文本输入至所述地址解析模型中，获得所述待识别地址文本中各字符的行政区划标注之后，还包括：The address information parsing method according to any one of claims 1-6, wherein after acquiring the address text to be recognized uploaded by the user, inputting the address text to be recognized into the address parsing model, and obtaining the administrative division label of each character in the address text to be recognized, the method further comprises:
    建立初始为空的字符缓存区,按照所述待识别地址文本的字符顺序处理所述待识别地址文本中的每个字符;establishing an initially empty character buffer area, and processing each character in the address text to be recognized according to the character sequence of the address text to be recognized;
    将所述待识别地址文本的第一字符存入所述字符缓存区,并确定所述第一字符的行政区划标注;storing the first character of the address text to be recognized into the character buffer area, and determining the administrative division label of the first character;
    判断所述第一字符的行政区划标注与第二字符的行政区划标注是否相同;Determine whether the administrative division labeling of the first character is the same as the administrative division labeling of the second character;
    若相同,则将所述第二字符存入所述字符缓存区;If the same, the second character is stored in the character buffer;
    若不相同,则将所述第一字符输出,并清空所述字符缓存区,并进行下一字符的处理;If not, output the first character, clear the character buffer, and process the next character;
    将所述字符缓存区输出的相同行政区划标注的字符拼接。The characters marked with the same administrative division output from the character buffer are spliced together.
  8. 一种地址信息解析设备，其中，包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机可读指令，所述处理器执行所述计算机可读指令时实现如下步骤：An address information parsing device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    利用网页爬虫工具从预设的数据源中爬取原始地址数据;Use web crawler tools to crawl original address data from preset data sources;
    从所述原始地址数据中筛选出字符长度在预设长度区间内的地址表述数据,并对所述地址表述数据进行标注,得到模型训练数据;Filter out address expression data whose character length is within a preset length interval from the original address data, and annotate the address expression data to obtain model training data;
    根据所述模型训练数据和预设的神经网络,训练得到地址解析模型;According to the model training data and the preset neural network, an address resolution model is obtained by training;
    获取用户上传的待识别地址文本,并将所述待识别地址文本输入至所述地址解析模型中,获得所述待识别地址文本中各字符的行政区划标注;Obtaining the address text to be recognized uploaded by the user, inputting the address text to be recognized into the address resolution model, and obtaining the administrative division labels of each character in the address text to be recognized;
    根据所述待识别地址文本中各字符的行政区划标注,将所述待识别地址文本转化为标准地址文本。The to-be-recognized address text is converted into a standard address text according to the administrative division labeling of each character in the to-be-recognized address text.
  9. 根据权利要求8所述的地址信息解析设备,其中,所述根据所述模型训练数据和预设的神经网络,训练得到地址解析模型的步骤时,包括:The address information parsing device according to claim 8, wherein the step of obtaining an address parsing model by training according to the model training data and a preset neural network comprises:
    将所述模型训练数据输入至所述神经网络中的嵌入层中,将所述模型训练数据中的每个字符转化为字向量;The model training data is input into the embedding layer in the neural network, and each character in the model training data is converted into a word vector;
    将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入,得到所述模型训练数据的隐输出序列;The word vector input is used as the input of each time step of the bidirectional long short-term memory network layer in the neural network to obtain the latent output sequence of the model training data;
    将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注，并与所述模型训练数据原有的标注进行比对和迭代，得到最终预训练的地址解析模型。Input the latent output sequence into the conditional random field layer of the neural network, predict the label of each character in the model training data, compare the predictions against the original labels of the model training data, and iterate, to obtain the final pre-trained address parsing model.
  10. 根据权利要求9所述的地址信息解析设备，其中，所述处理器执行所述将所述模型训练数据输入至所述神经网络中的嵌入层中，将所述模型训练数据中的每个字符转化为字向量的步骤时，包括：The address information parsing device according to claim 9, wherein when the processor performs the step of inputting the model training data into the embedding layer of the neural network and converting each character in the model training data into a word vector, the step comprises:
    将所述模型训练数据中的每个字符转化为独热码向量；converting each character in the model training data into a one-hot vector;
    将所述模型训练数据的独热码向量通过预训练好的向量矩阵转化为低维稠密的字向量。The one-hot code vector of the model training data is converted into a low-dimensional dense word vector through a pre-trained vector matrix.
  11. 根据权利要求10所述的地址信息解析设备，其中，所述处理器执行所述将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入，得到所述模型训练数据的隐输出序列的步骤时，包括：The address information parsing device according to claim 10, wherein when the processor performs the step of using the word vectors as the input at each time step of the bidirectional long short-term memory network layer in the neural network to obtain the latent output sequence of the model training data, the step comprises:
    将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入得到正向长短期记忆网络输出的隐状态序列和反向长短期记忆网络输出的隐状态序列；Using the word vectors as the input at each time step of the bidirectional long short-term memory network layer in the neural network to obtain the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM;
    将所述正向长短期记忆网络输出的隐状态序列和所述反向长短期记忆网络输出的隐状态序列进行拼接，得到完整的隐输出序列。The hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are concatenated to obtain the complete hidden output sequence.
  12. 根据权利要求11所述的地址信息解析设备，其中，所述处理器执行所述将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注的步骤之后，还包括：The address information parsing device according to claim 11, wherein after the processor performs the step of inputting the latent output sequence into the conditional random field layer of the neural network and predicting the label of each character in the model training data, the processor further performs:
    根据所述模型训练数据中各字符的标注,获得所述模型训练数据的行政区划序列;Obtain the administrative division sequence of the model training data according to the labeling of each character in the model training data;
    判断所述行政区划序列中,是否出现标注类型相同的至少两段行政区划标注片段,其中,所述行政区划片段为连续相同的行政区划标注构成的片段;Judging whether there are at least two administrative division annotation fragments with the same annotation type in the sequence of administrative divisions, wherein the administrative division fragments are fragments composed of consecutive and identical administrative division annotations;
    若是，则比较标注类型相同的行政区划标注片段在所述行政区划序列中的位置，并对标注类型相同的行政区划标注片段中位置靠后的行政区划标注片段中的行政区划标注进行重新预测。If so, compare the positions, within the administrative division sequence, of the administrative division annotation segments that share the same annotation type, and re-predict the administrative division labels in the later-positioned segment among those segments.
  13. 根据权利要求11所述的地址信息解析设备，其中，所述处理器执行所述将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注的步骤之后，还包括：The address information parsing device according to claim 11, wherein after the processor performs the step of inputting the latent output sequence into the conditional random field layer of the neural network and predicting the label of each character in the model training data, the processor further performs:
    根据所述模型训练数据中各字符的标注,获得所述模型训练数据的行政区划序列;Obtain the administrative division sequence of the model training data according to the labeling of each character in the model training data;
    根据所述行政区划序列中行政区划标注的排列顺序,判断所述行政区划序列是否存在错误;According to the arrangement order of the administrative division labels in the administrative division sequence, determine whether there is an error in the administrative division sequence;
    若是,则对所述行政区划序列中字符的行政区划标注进行重新预测。If so, re-predict the administrative division labels of the characters in the administrative division sequence.
  14. 根据权利要求8-13中任一项所述的地址信息解析设备，其中，所述处理器执行所述获取用户上传的待识别地址文本，并将所述待识别地址文本输入至所述地址解析模型中，获得所述待识别地址文本中各字符的行政区划标注之后，还包括：The address information parsing device according to any one of claims 8-13, wherein after the processor performs the steps of acquiring the address text to be recognized uploaded by the user, inputting the address text to be recognized into the address parsing model, and obtaining the administrative division label of each character in the address text to be recognized, the processor further performs:
    建立初始为空的字符缓存区,按照所述待识别地址文本的字符顺序处理所述待识别地址文本中的每个字符;establishing an initially empty character buffer area, and processing each character in the address text to be recognized according to the character sequence of the address text to be recognized;
    将所述待识别地址文本的第一字符存入所述字符缓存区,并确定所述第一字符的行政区划标注;storing the first character of the address text to be recognized into the character buffer area, and determining the administrative division label of the first character;
    判断所述第一字符的行政区划标注与第二字符的行政区划标注是否相同;Determine whether the administrative division labeling of the first character is the same as the administrative division labeling of the second character;
    若相同,则将所述第二字符存入所述字符缓存区;If the same, the second character is stored in the character buffer;
    若不相同,则将所述第一字符输出,并清空所述字符缓存区,并进行下一字符的处理;If not, output the first character, clear the character buffer, and process the next character;
    将所述字符缓存区输出的相同行政区划标注的字符拼接。The characters marked with the same administrative division output from the character buffer are spliced together.
  15. A computer-readable storage medium, wherein computer instructions are stored in the computer-readable storage medium, and when the computer instructions are run on a computer, the computer is caused to perform the following steps:
    crawling raw address data from a preset data source by using a web crawler tool;
    filtering out, from the raw address data, address expression data whose character length falls within a preset length interval, and annotating the address expression data to obtain model training data;
    training an address resolution model according to the model training data and a preset neural network;
    acquiring address text to be recognized uploaded by a user, inputting the address text to be recognized into the address resolution model, and obtaining the administrative division label of each character in the address text to be recognized;
    converting the address text to be recognized into standard address text according to the administrative division label of each character in the address text to be recognized.
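The length-filtering step in the claim can be sketched as follows; the concrete interval bounds are assumed values, since the claim leaves the preset interval unspecified.

```python
def filter_by_length(raw_records, min_len=5, max_len=100):
    """Keep only address strings whose character count lies within the
    preset length interval [min_len, max_len] (bounds are assumptions)."""
    return [r for r in raw_records if min_len <= len(r) <= max_len]

# Records that are too short (likely fragments) or too long (likely noise)
# are dropped before annotation.
kept = filter_by_length(["abc", "x" * 10, "y" * 200])
```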
  16. The computer-readable storage medium according to claim 15, wherein, when the computer instructions are run on a computer, the step of training an address resolution model according to the model training data and the preset neural network comprises:
    inputting the model training data into an embedding layer of the neural network, and converting each character in the model training data into a character vector;
    feeding the character vectors as the input at each time step of a bidirectional long short-term memory layer of the neural network, to obtain a hidden output sequence of the model training data;
    inputting the hidden output sequence into a conditional random field layer of the neural network, predicting the label of each character in the model training data, and comparing the predictions with the original labels of the model training data and iterating, to obtain the final pre-trained address resolution model.
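A minimal PyTorch sketch of the claimed embedding → BiLSTM pipeline follows. For brevity a linear scoring layer stands in for the conditional random field layer, and all sizes (vocabulary, embedding dimension, hidden size, label count) are assumed values, not figures from the application.

```python
import torch
import torch.nn as nn

class AddressTagger(nn.Module):
    """Sketch of the claimed pipeline: embedding -> BiLSTM -> per-character
    label scores. A real implementation would decode the scores with a CRF
    layer; here a linear layer produces emission scores only."""
    def __init__(self, vocab_size=3000, embed_dim=64, hidden=128, n_labels=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # embedding layer
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                              bidirectional=True)              # BiLSTM layer
        self.scorer = nn.Linear(2 * hidden, n_labels)          # stand-in for CRF

    def forward(self, char_ids):
        vectors = self.embed(char_ids)        # characters -> character vectors
        hidden_seq, _ = self.bilstm(vectors)  # hidden output sequence
        return self.scorer(hidden_seq)        # per-character label scores

model = AddressTagger()
scores = model(torch.randint(0, 3000, (2, 10)))  # batch of 2 texts, 10 chars each
```

Training would then compare `scores` against the annotated labels (via the CRF loss in the claimed design) and iterate until convergence.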
  17. The computer-readable storage medium according to claim 16, wherein, when the computer instructions are run on a computer, the step of inputting the model training data into the embedding layer of the neural network and converting each character in the model training data into a character vector comprises:
    converting each character in the model training data into a one-hot vector;
    converting the one-hot vectors of the model training data into low-dimensional dense character vectors through a pre-trained vector matrix.
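The one-hot-to-dense conversion is, mathematically, a row lookup in the pre-trained vector matrix: multiplying a one-hot vector by the matrix selects exactly one row. A NumPy sketch, with a randomly initialized matrix standing in for the pre-trained one:

```python
import numpy as np

def one_hot(index, vocab_size):
    """One-hot encode a character id."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

# Stand-in for the pre-trained embedding matrix: vocab_size x embed_dim.
vocab_size, embed_dim = 3000, 64
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((vocab_size, embed_dim))

# one-hot (3000,) @ matrix (3000, 64) -> dense character vector (64,),
# identical to simply indexing row 42 of the matrix.
dense = one_hot(42, vocab_size) @ embedding_matrix
```

This equivalence is why embedding layers are implemented as table lookups rather than actual matrix multiplications.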
  18. The computer-readable storage medium according to claim 17, wherein, when the computer instructions are run on a computer, the step of feeding the character vectors as the input at each time step of the bidirectional long short-term memory layer of the neural network to obtain the hidden output sequence of the model training data comprises:
    feeding the character vectors as the input at each time step of the bidirectional long short-term memory layer of the neural network, to obtain the hidden state sequence output by the forward long short-term memory network and the hidden state sequence output by the backward long short-term memory network;
    splicing the hidden state sequence output by the forward long short-term memory network with the hidden state sequence output by the backward long short-term memory network, to obtain the complete hidden output sequence.
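The splicing step is a per-time-step concatenation of the two hidden state sequences; a NumPy sketch with assumed sequence length and hidden size:

```python
import numpy as np

seq_len, hidden = 10, 128
# Stand-ins for the hidden state sequences produced by the forward
# and backward LSTMs over the same 10-character input.
h_forward = np.random.rand(seq_len, hidden)
h_backward = np.random.rand(seq_len, hidden)

# Splice per time step: each character's complete hidden output combines
# left-to-right and right-to-left context, doubling the feature width.
h_complete = np.concatenate([h_forward, h_backward], axis=-1)
```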
  19. The computer-readable storage medium according to claim 18, wherein, when the computer instructions are run on a computer, after the step of inputting the hidden output sequence into the conditional random field layer of the neural network and predicting the label of each character in the model training data, the computer further performs:
    obtaining the administrative division sequence of the model training data according to the label of each character in the model training data;
    determining whether at least two administrative division label segments of the same label type appear in the administrative division sequence, wherein an administrative division label segment is a segment composed of consecutive identical administrative division labels;
    if so, comparing the positions, in the administrative division sequence, of the administrative division label segments of the same label type, and re-predicting the administrative division labels in the later-positioned segment among the segments of the same label type.
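Detecting duplicate segments of the same label type can be sketched as below; the label strings are illustrative, and the re-prediction itself (which is model-specific) is not shown — the function only reports the later-positioned candidates.

```python
from itertools import groupby

def repeated_type_segments(division_sequence):
    """Collapse the administrative-division sequence into (label, start, end)
    segments of consecutive identical labels, then return the segments whose
    label type already appeared earlier -- the re-prediction candidates."""
    segments, pos = [], 0
    for label, run in groupby(division_sequence):
        length = len(list(run))
        segments.append((label, pos, pos + length))
        pos += length
    seen = set()
    duplicates = []   # later-positioned segments of an already-seen type
    for seg in segments:
        if seg[0] in seen:
            duplicates.append(seg)
        else:
            seen.add(seg[0])
    return duplicates

# e.g. a sequence labelled province, province, city, city, province has a
# second "province" segment at positions 4..5, flagged for re-prediction.
```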
  20. An address information resolution apparatus, wherein the address information resolution apparatus comprises:
    a data crawling module, configured to crawl raw address data from a preset data source by using a web crawler tool;
    a filtering module, configured to filter out, from the raw address data, address expression data whose character length falls within a preset length interval, and annotate the address expression data to obtain model training data;
    a model training module, configured to train an address resolution model according to the model training data and a preset neural network;
    a model input module, configured to acquire address text to be recognized uploaded by a user, input the address text to be recognized into the address resolution model, and obtain the administrative division label of each character in the address text to be recognized;
    a standard conversion module, configured to convert the address text to be recognized into standard address text according to the administrative division label of each character in the address text to be recognized.
PCT/CN2021/109698 2020-12-23 2021-07-30 Address information resolution method, apparatus and device, and storage medium WO2022134592A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011544487.1A CN112612940A (en) 2020-12-23 2020-12-23 Address information analysis method, device, equipment and storage medium
CN202011544487.1 2020-12-23

Publications (1)

Publication Number Publication Date
WO2022134592A1 true WO2022134592A1 (en) 2022-06-30

Family

ID=75244917

Country Status (2)

Country Link
CN (1) CN112612940A (en)
WO (1) WO2022134592A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541472A (en) * 2023-03-22 2023-08-04 麦博(上海)健康科技有限公司 Knowledge graph construction method in medical field

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112612940A (en) * 2020-12-23 2021-04-06 深圳壹账通智能科技有限公司 Address information analysis method, device, equipment and storage medium
CN113255352A (en) * 2021-05-12 2021-08-13 北京易华录信息技术股份有限公司 Street information determination method and device and computer equipment
CN113449528B (en) * 2021-08-30 2021-11-30 企查查科技有限公司 Address element extraction method and device, computer equipment and storage medium
CN114035872A (en) * 2021-10-27 2022-02-11 北京闪送科技有限公司 Method for rapidly improving receiving and dispatching information through automatic identification and helping user
CN114091454A (en) * 2021-11-29 2022-02-25 重庆市地理信息和遥感应用中心 Method for extracting place name information and positioning space in internet text
CN114218957B (en) * 2022-02-22 2022-11-18 阿里巴巴(中国)有限公司 Method, device, equipment and storage medium for determining administrative division transition information
CN114861658B (en) * 2022-05-24 2023-07-25 北京百度网讯科技有限公司 Address information analysis method and device, equipment and medium
CN115410158B (en) * 2022-09-13 2023-06-30 北京交通大学 Landmark extraction method based on monitoring camera
CN116522943A (en) * 2023-05-11 2023-08-01 北京微聚智汇科技有限公司 Address element extraction method and device, storage medium and computer equipment
CN116955855B (en) * 2023-09-14 2023-11-24 南京擎天科技有限公司 Low-cost cross-region address resolution model construction method and system
CN117457135B (en) * 2023-12-22 2024-04-09 四川互慧软件有限公司 Address data management method and cyclic neural network model construction method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008140117A (en) * 2006-12-01 2008-06-19 National Institute Of Information & Communication Technology Apparatus for segmenting chinese character sequence to chinese word sequence
JP2010238043A (en) * 2009-03-31 2010-10-21 Mitsubishi Electric Corp Text analysis learning device
CN108763212A (en) * 2018-05-23 2018-11-06 北京神州泰岳软件股份有限公司 A kind of address information extraction method and device
CN110377686A (en) * 2019-07-04 2019-10-25 浙江大学 A kind of address information Feature Extraction Method based on deep neural network model
CN110688449A (en) * 2019-09-20 2020-01-14 京东数字科技控股有限公司 Address text processing method, device, equipment and medium based on deep learning
CN111125365A (en) * 2019-12-24 2020-05-08 京东数字科技控股有限公司 Address data labeling method and device, electronic equipment and storage medium
CN111209362A (en) * 2020-01-07 2020-05-29 苏州城方信息技术有限公司 Address data analysis method based on deep learning
CN112612940A (en) * 2020-12-23 2021-04-06 深圳壹账通智能科技有限公司 Address information analysis method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112612940A (en) 2021-04-06

Similar Documents

Publication Publication Date Title
WO2022134592A1 (en) Address information resolution method, apparatus and device, and storage medium
CN109885824B (en) Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium
Zhang et al. Semantic NLP-based information extraction from construction regulatory documents for automated compliance checking
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN112329467B (en) Address recognition method and device, electronic equipment and storage medium
CN111104802B (en) Method for extracting address information text and related equipment
CN112597296B (en) Abstract generation method based on plan mechanism and knowledge graph guidance
CN112035511A (en) Target data searching method based on medical knowledge graph and related equipment
CN116204674B (en) Image description method based on visual concept word association structural modeling
CN110990520A (en) Address coding method and device, electronic equipment and storage medium
CN115658837A (en) Address data processing method and device, electronic equipment and storage medium
Yin et al. Pinpointing locational focus in microblogs
CN113312498B (en) Text information extraction method for embedding knowledge graph by undirected graph
CN114091454A (en) Method for extracting place name information and positioning space in internet text
CN112069824B (en) Region identification method, device and medium based on context probability and citation
CN111859984B (en) Intention mining method, device, equipment and storage medium
CN116386895B (en) Epidemic public opinion entity identification method and device based on heterogeneous graph neural network
CN117010398A (en) Address entity identification method based on multi-layer knowledge perception
CN112417812B (en) Address standardization method and system and electronic equipment
CN113157866B (en) Data analysis method, device, computer equipment and storage medium
CN113468881B (en) Address standardization method and device
CN116186241A (en) Event element extraction method and device based on semantic analysis and prompt learning, electronic equipment and storage medium
CN115048510A (en) Criminal name prediction method based on hierarchical legal knowledge and double-graph joint representation learning
Qiu et al. Integrating NLP and Ontology Matching into a Unified System for Automated Information Extraction from Geological Hazard Reports
CN113190596B (en) Method and device for mixing and matching place name and address

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908597

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31.10.2023)