CN112612940A - Address information analysis method, device, equipment and storage medium

Address information analysis method, device, equipment and storage medium

Info

Publication number
CN112612940A
Authority
CN
China
Prior art keywords
address
character
administrative division
model training
recognized
Prior art date
Legal status
Pending
Application number
CN202011544487.1A
Other languages
Chinese (zh)
Inventor
赵焕丽
徐国强
Current Assignee
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202011544487.1A
Publication of CN112612940A
Priority to PCT/CN2021/109698 (published as WO2022134592A1)
Legal status: Pending

Classifications

    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G06F40/205 Parsing (natural language analysis)
    • G06F40/242 Dictionaries (lexical tools)
    • G06F40/279 Recognition of textual entities
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods (neural networks)
    • G06N5/041 Abduction (inference or reasoning models)

Abstract

The invention relates to the field of artificial intelligence and discloses an address information analysis method, an address information analysis device, address information analysis equipment and a storage medium, used for converting an address text to be recognized, uploaded by a user, into a standard address text. The method comprises the following steps: crawling original address data from a preset data source with a web crawler tool; screening out, from the original address data, address expression data whose character length falls within a preset length range, and labeling it to obtain model training data; training an address information analysis model from the model training data and a preset neural network; acquiring the address text to be recognized uploaded by the user, inputting it into the address information analysis model, and obtaining the administrative division label of each character in the address text to be recognized; and converting the address text to be recognized into a standard address text according to the administrative division label of each character. The invention further relates to blockchain technology, and the address text to be recognized may be stored in a blockchain.

Description

Address information analysis method, device, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to an address information parsing method, apparatus, device, and storage medium.
Background
Location-based services are increasingly widely used in daily life, and there is a growing need to quickly and accurately find geographic coordinates from textual address expressions. A standard Chinese address contains a complete administrative division and is expressed in the order administrative division (province/city/county/township), street, house number, building and room; its characteristic words are obvious, so it can be parsed by a Chinese address segmentation algorithm and accurately mapped to the geographical location of the address.
However, non-normalized expressions of Chinese addresses cause ambiguity or vagueness in the location semantics, preventing a computer from directly understanding the geographical location described by such address information, so these Chinese addresses cannot be used directly for location services. Existing address parsing algorithms (Chinese address element segmentation, thesaurus matching, characteristic-word segmentation and the like) depend on address normalization, characteristic words and an address dictionary, and cannot handle non-standard Chinese addresses well, so such Chinese address information cannot be directly used by a computer for location services.
Disclosure of Invention
The invention mainly aims to solve the technical problem that the accuracy of resolving non-standard Chinese addresses is low because existing address resolution algorithms depend on address normalization, characteristic words and an address dictionary.
The invention provides an address information analysis method in a first aspect, which comprises the following steps:
crawling original address data from a preset data source by utilizing a webpage crawler tool;
screening address expression data with the character length within a preset length range from the original address data, and labeling the address expression data to obtain model training data;
training to obtain an address resolution model according to the model training data and a preset neural network;
acquiring an address text to be recognized uploaded by a user, inputting the address text to be recognized into the address resolution model, and acquiring administrative division labels of characters in the address text to be recognized;
and converting the address text to be recognized into a standard address text according to the administrative division label of each character in the address text to be recognized.
Optionally, in a first implementation manner of the first aspect of the present invention, the training, according to the model training data and a preset neural network, to obtain an address resolution model includes:
inputting the model training data into an embedding layer in the neural network, and converting each character in the model training data into a word vector;
taking the word vectors as the input of each time step of a bidirectional long-short term memory network layer in the neural network to obtain a hidden output sequence of the model training data;
and inputting the hidden output sequence into a conditional random field layer in the neural network, predicting the label of each character in the model training data, and comparing and iterating the label with the original label of the model training data to obtain a final pre-trained address resolution model.
Optionally, in a second implementation manner of the first aspect of the present invention, the inputting the model training data into an embedding layer in the neural network, and the converting each character in the model training data into a word vector includes:
converting each character in the model training data into a one-hot code vector;
and converting the one-hot code vector of the model training data into a low-dimensional dense word vector through a vector matrix which is pre-trained.
Optionally, in a third implementation manner of the first aspect of the present invention, the obtaining the hidden output sequence of the model training data by taking the word vectors as the input of each time step of a bidirectional long-short term memory network layer in the neural network includes:
inputting the word vector as the input of each time step of a bidirectional long and short term memory network layer in the neural network to obtain a hidden state sequence output by a forward long and short term memory network and a hidden state sequence output by a reverse long and short term memory network;
and splicing the hidden state sequence output by the forward long and short term memory network and the hidden state sequence output by the reverse long and short term memory network to obtain a complete hidden output sequence.
Optionally, in a fourth implementation manner of the first aspect of the present invention, after the inputting the hidden output sequence to a conditional random field layer in the neural network and predicting a label of each character in the model training data, the method further includes:
acquiring an administrative division sequence of the model training data according to the labels of the characters in the model training data;
judging whether at least two sections of administrative division marking segments with the same marking type appear in the administrative division sequence, wherein the administrative division segments are segments formed by continuous identical administrative division marks;
if so, comparing the positions of the administrative division marking segments with the same marking type in the administrative division sequence, and predicting the administrative division marks in the administrative division marking segments with the same marking type, which are positioned later in the administrative division marking segments.
Optionally, in a fifth implementation manner of the first aspect of the present invention, after the inputting the hidden output sequence to a conditional random field layer in the neural network and predicting a label of each character in the model training data, the method further includes:
acquiring an administrative division sequence of the model training data according to the labels of the characters in the model training data;
judging whether the administrative division sequence has errors or not according to the arrangement sequence of the administrative division labels in the administrative division sequence;
if yes, performing re-prediction on the administrative division labels of the characters in the administrative division sequence.
Optionally, in a sixth implementation manner of the first aspect of the present invention, after the obtaining an address text to be recognized uploaded by a user, inputting the address text to be recognized into the address resolution model, and obtaining an administrative division label of each character in the address text to be recognized, the method further includes:
establishing an initially empty character cache region, and processing each character in the address text to be recognized according to the character sequence of the address text to be recognized;
storing a first character of the address text to be recognized in the character cache region, and determining an administrative division label of the first character;
judging whether the administrative division label of the first character is the same as the administrative division label of the second character or not;
if the administrative division labels are the same, storing the second character into the character cache region;
if not, outputting the first character, clearing the character cache region, and processing the next character;
and splicing the characters marked by the same administrative regions output by the character cache region.
A second aspect of the present invention provides an address information analyzing apparatus, including:
the data crawling module is used for crawling original address data from a preset data source by utilizing a webpage crawler tool;
the screening module is used for screening address expression data with the character length within a preset length interval from the original address data and marking the address expression data to obtain model training data;
the model training module is used for training to obtain an address resolution model according to the model training data and a preset neural network;
the model input module is used for acquiring an address text to be recognized uploaded by a user, inputting the address text to be recognized into the address resolution model and acquiring administrative division labels of characters in the address text to be recognized;
and the standard conversion module is used for converting the address text to be recognized into a standard address text according to the administrative division label of each character in the address text to be recognized.
Optionally, in a first implementation manner of the second aspect of the present invention, the model training module includes:
the vector conversion unit is used for inputting the model training data into an embedding layer in the neural network and converting each character in the model training data into a word vector;
the sequence unit is used for inputting the word vector as the input of each time step of a bidirectional long-short term memory network layer in the neural network to obtain a hidden output sequence of the model training data;
and the label prediction unit is used for inputting the hidden output sequence into a conditional random field layer in the neural network, predicting labels of all characters in the model training data, and comparing and iterating the labels with original labels of the model training data to obtain a final pre-trained address analysis model.
Optionally, in a second implementation manner of the second aspect of the present invention, the vector conversion unit is specifically configured to:
converting each character in the model training data into a one-hot code vector;
and converting the one-hot code vector of the model training data into a low-dimensional dense word vector through a vector matrix which is pre-trained.
Optionally, in a third implementation manner of the second aspect of the present invention, the sequence unit is specifically configured to:
inputting the word vector as the input of each time step of a bidirectional long and short term memory network layer in the neural network to obtain a hidden state sequence output by a forward long and short term memory network and a hidden state sequence output by a reverse long and short term memory network;
and splicing the hidden state sequence output by the forward long and short term memory network and the hidden state sequence output by the reverse long and short term memory network to obtain a complete hidden output sequence.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the model training module further includes a first retesting unit, where the first retesting unit is specifically configured to:
acquiring an administrative division sequence of the model training data according to the labels of the characters in the model training data;
judging whether at least two sections of administrative division marking segments with the same marking type appear in the administrative division sequence, wherein the administrative division segments are segments formed by continuous identical administrative division marks;
if so, comparing the positions of the administrative division marking segments with the same marking type in the administrative division sequence, and predicting the administrative division marks in the administrative division marking segments with the same marking type, which are positioned later in the administrative division marking segments.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the model training module further includes a second retesting unit, where the second retesting unit is specifically configured to:
acquiring an administrative division sequence of the model training data according to the labels of the characters in the model training data;
judging whether the administrative division sequence has errors or not according to the arrangement sequence of the administrative division labels in the administrative division sequence;
if yes, performing re-prediction on the administrative division labels of the characters in the administrative division sequence.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the address information analysis apparatus further includes a character connection module, where the character connection module is specifically configured to:
establishing an initially empty character cache region, and processing each character in the address text to be recognized according to the character sequence of the address text to be recognized;
storing a first character of the address text to be recognized in the character cache region, and determining an administrative division label of the first character;
judging whether the administrative division label of the first character is the same as the administrative division label of the second character or not;
if the administrative division labels are the same, storing the second character into the character cache region;
if not, outputting the first character, clearing the character cache region, and processing the next character;
and splicing the characters marked by the same administrative regions output by the character cache region.
A third aspect of the present invention provides an address information analyzing apparatus, including: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line; the at least one processor calls the instructions in the memory to cause the address information resolution device to execute the steps of the address information resolution method.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the above-described address information resolution method.
According to the technical scheme provided by the invention, original address data are crawled from a preset data source with a web crawler tool; address expression data whose character length falls within a preset length range are screened out from the original address data and labeled to obtain model training data; an address resolution model is trained from the model training data and a preset neural network; an address text to be recognized uploaded by a user is acquired and input into the address resolution model to obtain the administrative division label of each character in the address text to be recognized; and the address text to be recognized is converted into a standard address text according to the administrative division label of each character. In this way, the computer can extract the semantic features of the whole address, and multi-level administrative division parsing of non-normalized addresses is achieved by taking the division results of the preceding and following characters into account. Compared with existing address resolution algorithms, the scheme does not depend on address normalization, characteristic words or an address dictionary, so diversified non-standard expressions can be handled. The deep-model-based method can also learn the naming and segmentation rules in existing data and apply them at inference time, which improves the parsing of non-standard Chinese addresses so that Chinese address information can be used directly by a computer for location services. In addition, the invention relates to blockchain technology, and the original address data may be stored in a blockchain.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of the address information parsing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a second embodiment of the address information parsing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a third embodiment of the address information parsing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a fourth embodiment of the address information parsing method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of the address information parsing apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of another embodiment of the address information parsing apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an embodiment of the address information parsing device according to an embodiment of the present invention.
Detailed Description
According to the technical scheme provided by the invention, original address data are crawled from a preset data source with a web crawler tool; address expression data whose character length falls within a preset length range are screened out from the original address data and labeled to obtain model training data; an address resolution model is trained from the model training data and a preset neural network; an address text to be recognized uploaded by a user is acquired and input into the address resolution model to obtain the administrative division label of each character in the address text to be recognized; and the address text to be recognized is converted into a standard address text according to the administrative division label of each character. In this way, the computer can extract the semantic features of the whole address, and multi-level administrative division parsing of non-normalized addresses is achieved by taking the division results of the preceding and following characters into account. Compared with existing address resolution algorithms, the scheme does not depend on address normalization, characteristic words or an address dictionary, so diversified non-standard expressions can be handled. The deep-model-based method can also learn the naming and segmentation rules in existing data and apply them at inference time, which improves the parsing of non-standard Chinese addresses so that Chinese address information can be used directly by a computer for location services. In addition, the invention relates to blockchain technology, and the original address data may be stored in a blockchain.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a detailed flow of the embodiment of the present invention is described below, and referring to fig. 1, a first embodiment of an address information parsing method according to the embodiment of the present invention includes:
101. crawling original address data from a preset data source by utilizing a webpage crawler tool;
it should be understood that the execution subject of the present invention may be an address information analysis device, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
It is emphasized that the original address data may be stored in a node of a blockchain in order to ensure privacy and security of the data.
In this embodiment, the preset data source may be official information websites or a published address library, and the address data in the data source are crawled as original address data. The original address data mostly consist of Chinese addresses, which may be non-standard and differ from the standard administrative division format. For example, the characteristic word "district" of an administrative division may be omitted (as in "xu hui binlu"), part of an administrative division such as "Xuhui district" may be omitted from the middle of an address (as in "kehai kohami lu"), the levels of the administrative divisions may be out of order, or a word such as "district" occurring in a non-administrative part of the address (as in "district american kindergarten") may cause the non-administrative part and an administrative division of the address to share the same name.
In this embodiment, after millions of pieces of original address data are crawled from the data source, a first screening pass is performed: it is judged whether the characters of the original address data are valid UTF-8 characters, and non-UTF-8 characters such as emoticons are deleted from the original address data, yielding the cleaned original address data.
102. Screening address expression data with the character length within a preset length range from the original address data, and labeling the address expression data to obtain model training data;
in this embodiment, the preset length interval is generally set to 7 to 20 characters according to the specific application scenario, and for scenarios that require detailed, complete addresses the interval may be enlarged appropriately. Technically, the character length is a configurable parameter that has no influence on the subsequent model training process, so it only needs to be reconfigured for a different application scenario; in general the model requires no more than 128 characters.
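As an illustration only, this screening could be sketched as follows; the helper functions and the exact character filter are assumptions and not part of the patent:

```python
# Minimal sketch of the screening in steps 101-102, assuming Python and the example
# bounds from this embodiment (7-20 characters, configurable): drop non-text
# characters such as emoticons, then keep records whose length falls in the interval.
import unicodedata

MIN_LEN, MAX_LEN = 7, 20          # configurable per application scenario
MODEL_MAX_LEN = 128               # the model is assumed to need at most 128 characters

def clean_text(text: str) -> str:
    # Remove symbol and control characters (emoticons fall into these categories).
    return "".join(ch for ch in text if unicodedata.category(ch)[0] not in ("S", "C"))

def screen_addresses(raw_records):
    cleaned = (clean_text(t) for t in raw_records)
    return [t for t in cleaned if MIN_LEN <= len(t) <= min(MAX_LEN, MODEL_MAX_LEN)]
```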
In this embodiment, labeling is mainly performed manually, and the labels are administrative division levels, of which there are 10: "province", "city", "county", "town", "street", "road", "house number", "village", "building name" and "other". Here "province" includes provinces, municipalities directly under the central government, autonomous regions and special administrative regions; "city" includes prefecture-level cities, prefectures, autonomous prefectures and leagues; "county" includes municipal districts, county-level cities, counties, banners, special districts and forest districts; "town" includes towns, townships, ethnic townships, sumu, ethnic sumu, subdistricts and district public offices; streets and towns both belong to township-level divisions; for "road", the labels of roads, streets and lanes follow the standard names.
In this embodiment, manual labeling assigns a label to each character of the address expression data. For example, for "Guangdong Shenzhen", the characters may be labeled "province", "province", "city", "city" respectively. The model training data may be arranged in a format such as "Guangdong Province/province, Shenzhen City/city, Bao'an District/county, Xixiang Street/street, Nanchang Second New Village/village, Lane X/road, No. X/house number", so that every character in the model training data has a corresponding label.
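A hedged illustration of what one labeled training sample could look like; the English tag strings and the reconstructed place names are assumptions used only to show the per-character format:

```python
# One training sample: every character of the address expression is paired with
# an administrative-division tag drawn from the ten levels listed above.
sample = [
    ("广", "province"), ("东", "province"), ("省", "province"),
    ("深", "city"), ("圳", "city"), ("市", "city"),
    ("宝", "county"), ("安", "county"), ("区", "county"),
    # ... remaining characters labeled "street", "village", "road", "house number", etc.
]
```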
103. Training according to the model training data and a preset neural network to obtain an address resolution model;
in this embodiment, the preset neural network is a Bi-LSTM-CRF neural network, which comprises three layers: an Embedding layer, a Bi-LSTM layer and a CRF layer.

The Embedding layer maps each character of the input model training data to a vector in a low-dimensional space. The word vector is a distributed representation of each character of the text, and semantics are conveyed to the computer through these low-dimensional vectors.

The Bi-LSTM layer is a bidirectional long short-term memory network layer consisting of two modules, a forward LSTM and a backward LSTM. It can capture long-range dependencies in the context, extract the features of preceding and following entities, obtain the spatio-temporal correlations between more entities, and suppress, from both directions, the influence of noise such as interfering entities on the model, which greatly assists in mining long-term dependencies.

A conditional random field (CRF) is a discriminative probabilistic model, a kind of random field commonly used for labeling or analyzing sequence data such as natural language text or biological sequences. A conditional random field is an undirected graphical model: the vertices represent random variables and the edges between vertices represent dependencies between them; the distribution of the random variable Y is a conditional probability given the observed random variable X. In principle the graph structure of a conditional random field can be arbitrary, but the common layout is a chain, which admits efficient algorithms for training, inference and decoding.

The advantage of Bi-LSTM is that it remembers context information, which greatly helps in mining long-term dependencies and understanding semantics; but if Bi-LSTM alone is used for the labeling task there is a problem: being a sequence model, its output targets only the current character and is therefore a locally optimal solution. A conditional random field, in turn, places high requirements on feature templates: the model can only learn contextual information that the templates cover, and incomplete coverage is common. Bi-LSTM can obtain context information but needs a decoding model, while a conditional random field can produce a globally optimal solution but needs context information; the invention therefore combines the two to construct a complete model in which their advantages complement each other.
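A minimal PyTorch sketch of the three-layer Embedding + Bi-LSTM + CRF structure described above; the layer sizes are illustrative, and the CRF layer is taken from the third-party pytorch-crf package, which is an assumption rather than something specified by the patent:

```python
import torch.nn as nn
from torchcrf import CRF   # third-party pytorch-crf package (assumption)

class AddressTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=300, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)                # Embedding layer
        self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2, bidirectional=True,
                              batch_first=True)                           # Bi-LSTM layer
        self.emissions = nn.Linear(hidden_dim, num_tags)                  # per-character tag scores
        self.crf = CRF(num_tags, batch_first=True)                        # CRF layer

    def loss(self, char_ids, tag_ids, mask):
        h, _ = self.bilstm(self.embedding(char_ids))
        return -self.crf(self.emissions(h), tag_ids, mask=mask)           # negative log-likelihood

    def decode(self, char_ids, mask):
        h, _ = self.bilstm(self.embedding(char_ids))
        return self.crf.decode(self.emissions(h), mask=mask)              # best label sequence
```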
104. Acquiring an address text to be recognized uploaded by a user, inputting the address text to be recognized into an address resolution model, and acquiring administrative division labels of characters in the address text to be recognized;
in this embodiment, after the address resolution model is obtained, it can be used to parse and recognize the different addresses to be recognized entered by users. For example, if a user inputs "Chongqing wuxi pond house town pond house village society", the model labels each character of the input with its administrative division, assigning each character a label such as "province", "county", "town", "village" or "other".
105. Converting the address text to be recognized into a standard address text according to the administrative division label of each character in the address text to be recognized;
in this embodiment, characters with the same label are spliced to obtain the name of each labeled administrative division. For example, if the two characters "chong" and "qing" are both labeled "province", they are spliced into "Chongqing", and the subsequent characters are handled analogously. After "Chongqing" is determined to be at the "province" level (one of a province, autonomous region, municipality directly under the central government or special administrative region), and since Chongqing is a municipality, the character "city" is appended after it; the characters labeled "county" are then matched against the 40 districts and counties under Chongqing. Proceeding in the same way, the address text to be recognized from the example above can be parsed and recognized into the standard address text in which each administrative division is complete.
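A sketch of this conversion under the assumption of English tag names and a hypothetical list of municipality names; the real catalogue of divisions (for example the districts and counties under Chongqing) would come from an administrative division table:

```python
from itertools import groupby

MUNICIPALITIES = {"北京", "上海", "天津", "重庆"}   # illustrative; province-level cities

def to_standard_address(chars, tags):
    parts = []
    for tag, group in groupby(zip(chars, tags), key=lambda pair: pair[1]):
        span = "".join(ch for ch, _ in group)
        # Suffix completion: a province-level span naming a municipality such as
        # "重庆" is completed to "重庆市" before matching lower-level divisions.
        if tag == "province" and span in MUNICIPALITIES:
            span += "市"
        parts.append(span)
    return "".join(parts)

# e.g. to_standard_address(list("重庆巫溪"), ["province"] * 2 + ["county"] * 2)
# could yield "重庆市巫溪", after which "巫溪" would be matched against the
# county-level divisions under Chongqing and completed to "巫溪县".
```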
In this embodiment, original address data are crawled from a preset data source with a web crawler tool; address expression data whose character length falls within a preset length range are screened out from the original address data and labeled to obtain model training data; an address resolution model is trained from the model training data and a preset neural network; an address text to be recognized uploaded by a user is acquired and input into the address resolution model to obtain the administrative division label of each character in the address text to be recognized; and the address text to be recognized is converted into a standard address text according to the administrative division label of each character. In this way, the computer can extract the semantic features of the whole address, and multi-level administrative division parsing of non-normalized addresses is achieved by taking the division results of the preceding and following characters into account. Compared with existing address resolution algorithms, the scheme does not depend on address normalization, characteristic words or an address dictionary, so diversified non-standard expressions can be handled. The deep-model-based method can also learn the naming and segmentation rules in existing data and apply them at inference time, which improves the parsing of non-standard Chinese addresses so that Chinese address information can be used directly by a computer for location services. In addition, the invention relates to blockchain technology, and the original address data may be stored in a blockchain.
Referring to fig. 2, a second embodiment of the address information parsing method according to the embodiment of the present invention includes:
201. crawling original address data from a preset data source by utilizing a webpage crawler tool;
202. screening address expression data with the character length within a preset length range from the original address data, and labeling the address expression data to obtain model training data;
the steps 201-202 in the present embodiment are similar to the steps 101-102 in the first embodiment, and are not described herein again.
203. Converting each character in the model training data into a one-hot code vector;
204. converting the one-hot code vector of the model training data into a low-dimensional dense word vector through a vector matrix which is pre-trained;
in this embodiment, in the process of converting each character of the model training data into a word vector, each character first needs to be converted into a one-hot vector, because the Embedding layer is a fully connected layer that takes the one-hot vector as input and whose number of hidden nodes equals the word-vector dimension. The one-hot code vector is then converted into a low-dimensional dense word vector through the pre-trained vector matrix, which overcomes the problems of the lexical gap between words and the curse of dimensionality.
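The relationship described here can be illustrated numerically: multiplying a one-hot vector by the pre-trained vector matrix is exactly an embedding lookup (the dimensions below are illustrative):

```python
import numpy as np

vocab_size, emb_dim = 5000, 300              # illustrative sizes
E = np.random.rand(vocab_size, emb_dim)      # stands in for the pre-trained vector matrix

char_index = 42                              # index of one character in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[char_index] = 1.0

dense = one_hot @ E                          # (vocab_size,) @ (vocab_size, emb_dim) -> (emb_dim,)
assert np.allclose(dense, E[char_index])     # the looked-up row is the dense word vector
```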
205. Inputting the word vector as the input of each time step of a bidirectional long and short term memory network layer in the neural network to obtain a hidden state sequence output by the forward long and short term memory network and a hidden state sequence output by the reverse long and short term memory network;
206. splicing the hidden state sequence output by the forward long and short term memory network and the hidden state sequence output by the reverse long and short term memory network to obtain a complete hidden output sequence;
in this embodiment, the encoding performed by the Bi-LSTM layer comprises the following: the Bi-LSTM layer automatically extracts sentence features, taking the char embedding sequence (x_1, x_2, x_3, ..., x_n) of the characters of a sentence as the input of each time step of the Bi-LSTM; the hidden state sequence (hf_1, hf_2, ..., hf_n) output by the forward LSTM and the hidden state sequence (hb_1, hb_2, ..., hb_n) output at the corresponding positions by the backward LSTM are then concatenated position by position, h_t = [hf_t; hb_t], to obtain the complete hidden output sequence. The output of the Bi-LSTM layer gives the score of each label for each character, and finally the label with the highest score is selected as the label of that character.
207. Inputting the hidden output sequence into a conditional random field layer in a neural network, and predicting the label of each character in the model training data;
208. acquiring an administrative division sequence of the model training data according to the labels of the characters in the model training data;
in this embodiment, after the conditional random field layer predicts the label of each character in the model training data, the administrative division labels of the characters are connected in order to obtain an administrative division sequence, that is, the sequence of division labels corresponding to the characters of the address (for example a sequence of the form "province province city city ...").
209. Judging whether at least two administrative division label segments of the same label type appear in the administrative division sequence, an administrative division label segment being a segment formed by consecutive identical administrative division labels;
210. if so, comparing the positions, in the administrative division sequence, of the segments with the same label type, and re-predicting the administrative division labels of the segment positioned later;
in this embodiment, the conditional random field layer may make prediction errors when labeling the characters of the model training data. For example, suppose the administrative division sequence predicted for the training address "shanghai shaoai jiali center" contains two separate segments of the "province" type (a label sequence of the form "province province city city ... province province"). Obviously, two segments of the same administrative division type cannot appear within one address, so the characters in the later-positioned segment need to be re-predicted.
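A minimal sketch of this repeated-segment check, assuming the label sequence is available as a list of tag strings (names are illustrative):

```python
from itertools import groupby

def has_repeated_division_type(tags):
    """True when at least two separate segments of the sequence share the same
    administrative-division type, which signals a prediction that must be redone."""
    segment_types = [tag for tag, _ in groupby(tags)]
    return len(segment_types) != len(set(segment_types))

# has_repeated_division_type(["province", "province", "city", "city", "province"]) -> True
# The characters in the later "province" segment would then be re-predicted.
```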
211. Comparing the labels predicted by the conditional random field layer for the characters of the model training data with the original labels of the model training data, and iterating, to obtain the final pre-trained address resolution model;
212. acquiring an address text to be recognized uploaded by a user, inputting the address text to be recognized into an address resolution model, and acquiring administrative division labels of characters in the address text to be recognized;
213. and converting the address text to be recognized into a standard address text according to the administrative division label of each character in the address text to be recognized.
The steps 212-213 in the present embodiment are similar to the steps 104-105 in the first embodiment, and are not described herein again.
On the basis of the previous embodiment, this embodiment describes in detail the process of training the address resolution model from the model training data and the preset neural network: the model training data are input into the embedding layer of the neural network and each character is converted into a word vector; the word vectors are taken as the input of each time step of the bidirectional long short-term memory network layer to obtain the hidden output sequence of the model training data; and the hidden output sequence is input into the conditional random field layer, the label of each character is predicted, and the predictions are compared and iterated against the original labels to obtain the final pre-trained address resolution model. In addition, a post-processing step after the conditional random field layer predicts the character labels is added: the administrative division sequence of the model training data is obtained from the character labels; whether at least two segments of the same label type appear in the division sequence is judged, a segment being formed by consecutive identical division labels; and if so, the positions of the segments of the same type are compared and the labels of the later-positioned segment are re-predicted.
Referring to fig. 3, a third embodiment of the address information resolution method according to the embodiment of the present invention includes:
301. crawling original address data from a preset data source by utilizing a webpage crawler tool;
302. screening address expression data with the character length within a preset length range from the original address data, and labeling the address expression data to obtain model training data;
303. inputting model training data into an embedding layer in a neural network, and converting each character in the model training data into a word vector;
304. inputting the word vector as the input of each time step of a bidirectional long-short term memory network layer in a neural network to obtain a hidden output sequence of model training data;
305. inputting the hidden output sequence into a conditional random field layer in a neural network, and predicting the label of each character in the model training data;
306. acquiring an administrative division sequence of the model training data according to the labels of the characters in the model training data;
307. judging whether the administrative division sequence has errors or not according to the arrangement sequence of the administrative division labels in the administrative division sequence;
308. if yes, carrying out re-prediction on the administrative division labels of the characters in the administrative division sequence;
in this embodiment, post-processing is required on the output of the CRF layer, including splicing characters whose adjacent administrative division labels are identical and checking whether an error has occurred. For example, some characters of "shanghai shao jiali center" may be mislabeled; when the arrangement order of the administrative division labels in the resulting sequence is inconsistent with the administrative hierarchy, the labels of the characters concerned are re-predicted.
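A hedged sketch of the ordering check: once a finer division level has appeared, a coarser one should not follow. The numeric ranking of the ten levels is an illustrative assumption:

```python
LEVEL_RANK = {"province": 0, "city": 1, "county": 2, "town": 3, "street": 4,
              "road": 5, "house number": 6, "village": 7, "building name": 8, "other": 9}

def order_is_valid(segment_types):
    ranks = [LEVEL_RANK[t] for t in segment_types]
    # Flag an error when a later segment is coarser than an earlier one,
    # e.g. a "province" segment appearing after a "city" segment.
    return all(a <= b for a, b in zip(ranks, ranks[1:]))

# order_is_valid(["province", "city", "county"])   -> True
# order_is_valid(["province", "city", "province"]) -> False: re-predict those labels
```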
309. Comparing the labels predicted by the conditional random field layer for the characters of the model training data with the original labels of the model training data, and iterating, to obtain the final pre-trained address resolution model;
in this embodiment, each character of the model training data first needs to be converted into a one-hot vector, and the one-hot vector is then converted into a word vector; the word vector is a distributed representation of each character of the text, conveying semantics to the computer through low-dimensional vectors. The model training data thus pass through the Embedding layer of the neural network in the form of word vectors, which are then fed into the Bi-LSTM layer.

The Bi-LSTM network is suited to sequence labeling tasks and performs the same operation on every word vector of the input sequence. That operation is a matrix multiplication which linearly maps a high-dimensional representation (for example 300 dimensions) to a low-dimensional one (for example 128 dimensions); each dimension represents a feature, so the operation can discard useless features. Each step of the computation depends on the result of the previous step, thereby encoding contextual features: the result (the features) of the previous step is used as part of the input of the next step. For example, when processing the address "Shanghai...", the feature h_{t-1} extracted in the previous step for the character "shang" (上) enters the computation for the character "hai" (海) in the next step as f(x_t, h_{t-1}) = h_t, where x_t is the feature of the character "hai" itself, f is the function applied at each step, and h_t is the feature finally extracted for "hai". Thus the feature of the previous step participates in extracting the feature of the current step, i.e. the previous step is encoded into it, and the same holds for the following steps. The features encoded with context are output as the full set of features extracted for each character.

The output of the Bi-LSTM layer is then used as the input of the CRF layer. The features output by the Bi-LSTM do not take into account the influence of the previous label on the current label; for example, if the current character is "wu" and the preceding two characters "Chongqing" form a city name, then "Wuxi" is very likely a district or town name. Therefore a CRF (conditional random field) layer is attached to the output layer of the Bi-LSTM, so that the output sequence of the Bi-LSTM becomes the observation sequence of the CRF layer; the CRF then computes the probabilistically optimal solution for the whole sequence, taking the mutual influence between sequence labels into account. The output label sequence of the CRF corresponds character by character to the input address.
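As a hedged illustration of the point about label-to-label influence: a CRF scores whole label sequences with emission scores plus transition scores between adjacent labels, instead of picking each character's label independently. All numbers below are made up:

```python
import numpy as np
from itertools import product

TAGS = ["city", "county", "town"]                 # a 3-tag toy example
emissions = np.array([[2.0, 0.5, 0.1],            # per-character tag scores from the Bi-LSTM
                      [0.3, 1.8, 0.4],
                      [0.2, 0.6, 1.5]])
transitions = np.array([[0.1, 1.0, 0.2],          # score of moving from tag i to tag j;
                        [0.0, 0.3, 1.2],          # e.g. "county" -> "town" is favoured
                        [0.0, 0.1, 0.5]])

def sequence_score(tag_ids):
    score = sum(emissions[i, t] for i, t in enumerate(tag_ids))
    score += sum(transitions[a, b] for a, b in zip(tag_ids, tag_ids[1:]))
    return score

# Brute-force search over all sequences (a real CRF uses Viterbi decoding).
best = max(product(range(len(TAGS)), repeat=3), key=sequence_score)
print([TAGS[t] for t in best], sequence_score(best))
```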
310. Acquiring an address text to be recognized uploaded by a user, inputting the address text to be recognized into an address resolution model, and acquiring administrative division labels of characters in the address text to be recognized;
311. and converting the address text to be recognized into a standard address text according to the administrative division label of each character in the address text to be recognized.
On the basis of the previous embodiment, this embodiment adds the process of handling mispredicted character labels from the conditional random field layer: the administrative division sequence of the model training data is obtained from the labels of the characters; whether the division sequence contains errors is judged from the arrangement order of the division labels in the sequence; and if so, the division labels of the characters in the sequence are re-predicted. In this way, labeling errors made by the conditional random field layer on the training data can be corrected, improving model training efficiency.
Referring to fig. 4, a fourth embodiment of the address information resolution method according to the embodiment of the present invention includes:
401. crawling original address data from a preset data source by utilizing a webpage crawler tool;
402. screening address expression data with the character length within a preset length range from the original address data, and labeling the address expression data to obtain model training data;
403. training according to the model training data and a preset neural network to obtain an address resolution model;
404. acquiring an address text to be recognized uploaded by a user, inputting the address text to be recognized into an address resolution model, and acquiring administrative division labels of characters in the address text to be recognized;
the steps 401 and 404 in this embodiment are similar to the steps 101 and 104 in the first embodiment, and are not described herein again.
405. Establishing an initially empty character cache region, and processing each character in the address text to be recognized according to the character sequence of the address text to be recognized;
406. storing a first character of an address text to be recognized in a character cache region, and determining administrative division marking of the first character;
407. judging whether the administrative division label of the first character is the same as the administrative division label of the second character or not;
408. if the labels are the same, storing the second character into the character cache region;
409. if not, outputting the first character, clearing the character buffer area and processing the next character;
410. splicing the characters marked by the same administrative regions output by the character cache region;
in this embodiment, an initially empty character cache region is set up, and the characters of the labeled address text to be recognized are stored into the cache region in text order. Taking the Chongqing address from the earlier example: first the character "chong" is placed in the cache region, and it is judged whether "chong" and "qing" carry the same administrative division label; since both are labeled "province", "qing" is also stored in the cache region. It is then judged whether "qing" and "wu" carry the same label; since "wu" is labeled "county", which differs from "qing", the two characters "chong" and "qing" are taken out of the cache region and spliced into "Chongqing", the cache region is cleared, and processing continues with the next character. After every character has been processed in this way, the address is divided into segments such as "Chongqing", the county name, the town name and the village name, and this division makes it convenient to convert the address text to be recognized into the labeled address text afterwards.
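A minimal sketch of this buffer-based splicing, following steps 405-410; the function and variable names are assumptions:

```python
def splice_same_label_characters(chars, tags):
    """Walk the recognized address character by character, buffering consecutive
    characters that share an administrative-division label, and emit each run
    as one spliced segment together with its label."""
    buffer, buffer_tag, segments = [], None, []
    for ch, tag in zip(chars, tags):
        if buffer and tag != buffer_tag:                   # label changes: flush the buffer
            segments.append(("".join(buffer), buffer_tag))
            buffer = []
        buffer.append(ch)
        buffer_tag = tag
    if buffer:                                             # flush the final run
        segments.append(("".join(buffer), buffer_tag))
    return segments

# splice_same_label_characters(list("重庆巫溪"), ["province", "province", "county", "county"])
# -> [("重庆", "province"), ("巫溪", "county")]
```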
411. And converting the address text to be recognized into a standard address text according to the administrative division label of each character in the address text to be recognized.
On the basis of the previous embodiment, this embodiment adds the process of splicing consecutive characters with the same administrative division label in the address text to be recognized: an initially empty character cache region is established and the characters of the address text to be recognized are processed in character order; the first character is stored in the cache region and its administrative division label determined; whether the label of the first character is the same as that of the second character is judged; if so, the second character is also stored in the cache region; if not, the buffered characters are output, the cache region is cleared, and the next character is processed; finally, the characters output from the cache region with the same administrative division label are spliced together. Splicing consecutive characters with the same division label in this way makes it convenient to convert the address text to be recognized into the labeled address text afterwards.
The address information resolution method in the embodiments of the present invention has been described above; the address information resolution apparatus in the embodiments of the present invention is described below with reference to FIG. 5. An embodiment of the address information resolution apparatus in the embodiment of the present invention includes:
the data crawling module 501 is configured to crawl original address data from a preset data source by using a web crawler tool;
a screening module 502, configured to screen address expression data with a character length within a preset length interval from the original address data, and label the address expression data to obtain model training data;
a model training module 503, configured to train to obtain an address resolution model according to the model training data and a preset neural network;
the model input module 504 is configured to obtain an address text to be recognized uploaded by a user, input the address text to be recognized into the address resolution model, and obtain administrative division labels of characters in the address text to be recognized;
and the standard conversion module 505 is configured to convert the address text to be recognized into a standard address text according to the administrative division label of each character in the address text to be recognized.
It is emphasized that, in order to ensure the privacy and security of the data, the address text to be recognized may be stored in a node of a block chain.
In an embodiment of the present invention, the address information analysis apparatus runs the above address information analysis method, which comprises: crawling original address data from a preset data source with a web crawler tool; screening out, from the original address data, address expression data whose character length falls within a preset length range and labeling it to obtain model training data; training an address resolution model from the model training data and a preset neural network; acquiring an address text to be recognized uploaded by a user, inputting it into the address resolution model, and obtaining the administrative division label of each character in the address text to be recognized; and converting the address text to be recognized into a standard address text according to the administrative division label of each character. In this way, the computer can extract the semantic features of the whole address, and multi-level administrative division parsing of non-normalized addresses is achieved by taking the division results of the preceding and following characters into account. Compared with existing address resolution algorithms, the scheme does not depend on address normalization, characteristic words or an address dictionary, so diversified non-standard expressions can be handled. The deep-model-based method can also learn the naming and segmentation rules in existing data and apply them at inference time, which improves the parsing of non-standard Chinese addresses so that Chinese address information can be used directly by a computer for location services. In addition, the invention relates to blockchain technology, and the original address data may be stored in a blockchain.
Referring to fig. 6, a second embodiment of an address information analyzing apparatus according to the present invention includes:
the data crawling module 501 is configured to crawl original address data from a preset data source by using a web crawler tool;
a screening module 502, configured to screen address expression data with a character length within a preset length interval from the original address data, and label the address expression data to obtain model training data;
a model training module 503, configured to train to obtain an address resolution model according to the model training data and a preset neural network;
the model input module 504 is configured to obtain an address text to be recognized uploaded by a user, input the address text to be recognized into the address resolution model, and obtain administrative division labels of characters in the address text to be recognized;
and the standard conversion module 505 is configured to convert the address text to be recognized into a standard address text according to the administrative division label of each character in the address text to be recognized.
Wherein the model training module 503 comprises:
a vector conversion unit 5031, configured to input the model training data into an embedding layer in the neural network, and convert each character in the model training data into a word vector;
a sequence unit 5032, configured to input the word vector as an input of each time step of a bidirectional long-short term memory network layer in the neural network, to obtain a hidden output sequence of the model training data;
a label prediction unit 5033, configured to input the hidden output sequence into a conditional random field layer in the neural network, predict labels of the characters in the model training data, and compare and iterate the labels with original labels of the model training data to obtain a final pre-trained address resolution model.
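A minimal sketch of the network these units describe, assuming PyTorch as the framework (the disclosure does not name one) and illustrative layer sizes, could look as follows; the conditional random field layer is only indicated by a comment, since its transition parameters and Viterbi decoding are omitted here:

    import torch
    import torch.nn as nn

    class BiLstmTagger(nn.Module):
        """Sketch of the training network: embedding layer -> bidirectional LSTM ->
        per-character emission scores that would feed a CRF layer."""

        def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_labels=9):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                                  batch_first=True, bidirectional=True)
            self.emissions = nn.Linear(2 * hidden_dim, num_labels)
            # A CRF layer would add label-transition parameters here and compute a
            # sequence-level loss against the original labels of the model training data.

        def forward(self, char_ids):
            x = self.embedding(char_ids)          # (batch, seq_len, embed_dim)
            h, _ = self.bilstm(x)                 # (batch, seq_len, 2 * hidden_dim)
            return self.emissions(h)              # per-character label scores

During training, the emission scores produced by this stack would be passed to the conditional random field layer, whose predicted labels are compared and iterated against the original labels of the model training data.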
Optionally, the vector conversion unit 5031 is specifically configured to:
converting each character in the model training data into a one-hot code vector;
and converting the one-hot code vector of the model training data into a low-dimensional dense word vector through a vector matrix which is pre-trained.
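The conversion performed by the vector conversion unit 5031 amounts to selecting one row of the pre-trained vector matrix per character; a small NumPy sketch with illustrative dimensions makes this explicit:

    import numpy as np

    vocab_size, embed_dim = 5000, 100
    pretrained_matrix = np.random.rand(vocab_size, embed_dim)   # stands in for the pre-trained vector matrix

    char_index = 42                                  # index of one character in the vocabulary
    one_hot = np.zeros(vocab_size)
    one_hot[char_index] = 1.0

    dense_vector = one_hot @ pretrained_matrix       # (100,) low-dimensional dense word vector
    assert np.allclose(dense_vector, pretrained_matrix[char_index])   # identical to a row lookup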
Optionally, the sequence unit 5032 is specifically configured to:
inputting the word vector as the input of each time step of a bidirectional long and short term memory network layer in the neural network to obtain a hidden state sequence output by a forward long and short term memory network and a hidden state sequence output by a reverse long and short term memory network;
and splicing the hidden state sequence output by the forward long and short term memory network and the hidden state sequence output by the reverse long and short term memory network to obtain a complete hidden output sequence.
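The splicing of the two directional hidden state sequences can be sketched as follows, again assuming PyTorch and illustrative dimensions; separate forward and reverse LSTMs are used here purely to make the concatenation visible (a bidirectional LSTM layer performs the same concatenation internally):

    import torch
    import torch.nn as nn

    embed_dim, hidden_dim, seq_len = 100, 128, 12
    word_vectors = torch.randn(1, seq_len, embed_dim)        # one address, one word vector per character

    forward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
    backward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    h_forward, _ = forward_lstm(word_vectors)                            # left-to-right pass
    h_backward, _ = backward_lstm(torch.flip(word_vectors, dims=[1]))    # right-to-left pass
    h_backward = torch.flip(h_backward, dims=[1])                        # realign to original character order

    hidden_sequence = torch.cat([h_forward, h_backward], dim=-1)         # complete hidden output sequence
    print(hidden_sequence.shape)                                         # torch.Size([1, 12, 256])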
Optionally, the model training module further includes a first retesting unit 5034, where the first retesting unit 5034 is specifically configured to:
acquiring an administrative division sequence of the model training data according to the labels of the characters in the model training data;
judging whether at least two administrative division label segments of the same label type appear in the administrative division sequence, wherein an administrative division label segment is a segment formed by consecutive identical administrative division labels;
if so, comparing the positions, in the administrative division sequence, of the administrative division label segments of the same label type, and re-predicting the administrative division labels in the later-positioned segment among the segments of the same label type.
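A possible sketch of this check, in which the per-character labels are first grouped into runs of identical values and any label type that has already appeared in an earlier segment is flagged for re-prediction, is shown below; the function name and example labels are illustrative:

    from itertools import groupby
    from collections import Counter

    def segments_to_repredict(label_sequence):
        """Return (label, start_index) for administrative division segments whose
        label type already appeared in an earlier segment of the sequence."""
        segments = []                        # one (label, start_index) entry per run of identical labels
        pos = 0
        for label, run in groupby(label_sequence):
            run_len = len(list(run))
            segments.append((label, pos))
            pos += run_len

        counts = Counter(label for label, _ in segments)
        seen = set()
        duplicates = []
        for label, start in segments:
            if counts[label] >= 2 and label in seen:
                duplicates.append((label, start))    # later segment of a repeated label type
            seen.add(label)
        return duplicates

    # e.g. ["city", "city", "district", "city", "city"] -> [("city", 3)]: the later "city" run is flagged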
Optionally, the model training module further includes a second retesting unit 5035, where the second retesting unit 5035 is specifically configured to:
acquiring an administrative division sequence of the model training data according to the labels of the characters in the model training data;
judging whether the administrative division sequence has errors or not according to the arrangement sequence of the administrative division labels in the administrative division sequence;
if yes, performing re-prediction on the administrative division labels of the characters in the administrative division sequence.
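This ordering check could be sketched as follows; the rank table encoding a province-city-district hierarchy is an assumption for illustration and is not taken from the disclosure:

    DIVISION_RANK = {"province": 0, "city": 1, "district": 2, "street": 3, "detail": 4}   # assumed hierarchy

    def sequence_is_ordered(division_sequence):
        """Return True if the labels never jump back to a coarser level,
        i.e. the division ranks are non-decreasing along the address."""
        ranks = [DIVISION_RANK[label] for label in division_sequence]
        return all(earlier <= later for earlier, later in zip(ranks, ranks[1:]))

    # sequence_is_ordered(["province", "city", "district"]) -> True
    # sequence_is_ordered(["city", "province", "district"]) -> False: trigger re-prediction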
The address information analysis apparatus further includes a character connection module 506, where the character connection module 506 is specifically configured to:
establishing an initially empty character cache region, and processing each character in the address text to be recognized according to the character sequence of the address text to be recognized;
storing a first character of the address text to be recognized in the character cache region, and determining an administrative division label of the first character;
judging whether the administrative division label of the first character is the same as the administrative division label of the second character or not;
if the two administrative division labels are the same, storing the second character into the character cache region;
if not, outputting the first character, clearing the character cache region, and processing the next character;
and splicing the characters marked by the same administrative regions output by the character cache region.
In this embodiment, the specific functions of each module, and the unit composition of some of the modules, are described in detail on the basis of the previous embodiment. With this apparatus, a computer can extract the semantic features of the whole address and realize multi-level administrative division analysis of non-normalized addresses by taking into account the administrative division results of the preceding and following characters. Compared with existing address resolution algorithms, the method does not depend on address normalization, feature words, or an address dictionary, so diversified non-standard expressions can be processed. The deep-model-based method can also learn the naming and segmentation rules present in existing data and apply them in model inference, so the resolution of non-standard Chinese addresses is improved and the Chinese address information can be used directly by a computer for location-based services.
Fig. 5 and fig. 6 describe the address information resolution device in the embodiment of the present invention in detail from the perspective of modular functional entities; the address information resolution device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 7 is a schematic structural diagram of an address information analyzing apparatus 700 according to an embodiment of the present invention. The address information analyzing apparatus 700 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 710 and a memory 720, as well as one or more storage media 730 (e.g., one or more mass storage devices) storing applications 733 or data 732. The memory 720 and the storage medium 730 may be transient storage or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown), and each module may include a series of instruction operations for the address information analyzing apparatus 700. Further, the processor 710 may be configured to communicate with the storage medium 730 and execute the series of instruction operations in the storage medium 730 on the address information analyzing apparatus 700 to implement the steps of the address information resolution method.
The address information resolution device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input-output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the address information resolution device configuration shown in fig. 7 does not constitute a limitation of the address information resolution device provided herein, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the address information resolution method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses, and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An address information analysis method, characterized in that the address information analysis method comprises:
crawling original address data from a preset data source by utilizing a webpage crawler tool;
screening address expression data with the character length within a preset length range from the original address data, and labeling the address expression data to obtain model training data;
training to obtain an address resolution model according to the model training data and a preset neural network;
acquiring an address text to be recognized uploaded by a user, inputting the address text to be recognized into the address resolution model, and acquiring administrative division labels of characters in the address text to be recognized;
and converting the address text to be recognized into a standard address text according to the administrative division label of each character in the address text to be recognized.
2. The method of claim 1, wherein the training an address resolution model according to the model training data and a preset neural network comprises:
inputting the model training data into an embedding layer in the neural network, and converting each character in the model training data into a word vector;
taking the word vector as the input of each time step of a bidirectional long-short term memory network layer in the neural network to obtain a hidden output sequence of the model training data;
and inputting the hidden output sequence into a conditional random field layer in the neural network, predicting the label of each character in the model training data, and comparing and iterating the label with the original label of the model training data to obtain a final pre-trained address analysis model.
3. The method of claim 2, wherein inputting the model training data into an embedding layer in the neural network, and wherein converting each character in the model training data into a word vector comprises:
converting each character in the model training data into a one-hot code vector;
and converting the one-hot code vector of the model training data into a low-dimensional dense word vector through a vector matrix which is pre-trained.
4. The method according to claim 3, wherein the obtaining the hidden output sequence of the model training data by using the word vector input as the input of each time step of the bidirectional long-short term memory network layer in the neural network comprises:
inputting the word vector as the input of each time step of a bidirectional long and short term memory network layer in the neural network to obtain a hidden state sequence output by a forward long and short term memory network and a hidden state sequence output by a reverse long and short term memory network;
and splicing the hidden state sequence output by the forward long and short term memory network and the hidden state sequence output by the reverse long and short term memory network to obtain a complete hidden output sequence.
5. The method according to claim 4, wherein after the inputting the hidden output sequence to the conditional random field layer in the neural network predicts the label of each character in the model training data, the method further comprises:
acquiring an administrative division sequence of the model training data according to the labels of the characters in the model training data;
judging whether at least two sections of administrative division marking segments with the same marking type appear in the administrative division sequence, wherein the administrative division segments are segments formed by continuous identical administrative division marks;
if so, comparing the positions of the administrative division marking segments with the same marking type in the administrative division sequence, and predicting the administrative division marks in the administrative division marking segments with the same marking type, which are positioned later in the administrative division marking segments.
6. The method according to claim 4, wherein after the inputting the hidden output sequence to the conditional random field layer in the neural network predicts the label of each character in the model training data, the method further comprises:
acquiring an administrative division sequence of the model training data according to the labels of the characters in the model training data;
judging whether the administrative division sequence has errors or not according to the arrangement sequence of the administrative division labels in the administrative division sequence;
if yes, performing re-prediction on the administrative division labels of the characters in the administrative division sequence.
7. The address information analysis method according to any one of claims 1 to 6, wherein after the obtaining of the address text to be recognized uploaded by the user, the inputting of the address text to be recognized into the address resolution model, and the obtaining of the administrative division label of each character in the address text to be recognized, the method further comprises:
establishing an initially empty character cache region, and processing each character in the address text to be recognized according to the character sequence of the address text to be recognized;
storing a first character of the address text to be recognized in the character cache region, and determining an administrative division label of the first character;
judging whether the administrative division label of the first character is the same as the administrative division label of the second character or not;
if the two administrative division labels are the same, storing the second character into the character cache region;
if not, outputting the first character, clearing the character cache region, and processing the next character;
and splicing the characters marked by the same administrative regions output by the character cache region.
8. An address information analysis device, characterized by comprising:
the data crawling module is used for crawling original address data from a preset data source by utilizing a webpage crawler tool;
the screening module is used for screening address expression data with the character length within a preset length interval from the original address data and marking the address expression data to obtain model training data;
the model training module is used for training to obtain an address resolution model according to the model training data and a preset neural network;
the model input module is used for acquiring an address text to be recognized uploaded by a user, inputting the address text to be recognized into the address resolution model and acquiring administrative division labels of characters in the address text to be recognized;
and the standard conversion module is used for converting the address text to be recognized into a standard address text according to the administrative division label of each character in the address text to be recognized.
9. An address information resolving apparatus characterized by comprising: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;
the at least one processor invoking the instructions in the memory to cause the address information resolution device to perform the steps of the address information resolution method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the address information resolution method according to any one of claims 1 to 7.
CN202011544487.1A 2020-12-23 2020-12-23 Address information analysis method, device, equipment and storage medium Pending CN112612940A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011544487.1A CN112612940A (en) 2020-12-23 2020-12-23 Address information analysis method, device, equipment and storage medium
PCT/CN2021/109698 WO2022134592A1 (en) 2020-12-23 2021-07-30 Address information resolution method, apparatus and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011544487.1A CN112612940A (en) 2020-12-23 2020-12-23 Address information analysis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112612940A true CN112612940A (en) 2021-04-06

Family

ID=75244917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011544487.1A Pending CN112612940A (en) 2020-12-23 2020-12-23 Address information analysis method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112612940A (en)
WO (1) WO2022134592A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255352A (en) * 2021-05-12 2021-08-13 北京易华录信息技术股份有限公司 Street information determination method and device and computer equipment
CN113449528A (en) * 2021-08-30 2021-09-28 企查查科技有限公司 Address element extraction method and device, computer equipment and storage medium
CN114035872A (en) * 2021-10-27 2022-02-11 北京闪送科技有限公司 Method for rapidly improving receiving and dispatching information through automatic identification and helping user
CN114091454A (en) * 2021-11-29 2022-02-25 重庆市地理信息和遥感应用中心 Method for extracting place name information and positioning space in internet text
CN114218957A (en) * 2022-02-22 2022-03-22 阿里巴巴(中国)有限公司 Method, device, equipment and storage medium for determining administrative division transition information
WO2022134592A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Address information resolution method, apparatus and device, and storage medium
CN114861658A (en) * 2022-05-24 2022-08-05 北京百度网讯科技有限公司 Address information analysis method and device, equipment and medium
CN115410158A (en) * 2022-09-13 2022-11-29 北京交通大学 Landmark extraction method based on monitoring camera
CN116522943A (en) * 2023-05-11 2023-08-01 北京微聚智汇科技有限公司 Address element extraction method and device, storage medium and computer equipment
CN116955855A (en) * 2023-09-14 2023-10-27 南京擎天科技有限公司 Low-cost cross-region address resolution model construction method and system
CN117457135A (en) * 2023-12-22 2024-01-26 四川互慧软件有限公司 Address data management method and cyclic neural network model construction method

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541472A (en) * 2023-03-22 2023-08-04 麦博(上海)健康科技有限公司 Knowledge graph construction method in medical field

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008140117A (en) * 2006-12-01 2008-06-19 National Institute Of Information & Communication Technology Apparatus for segmenting chinese character sequence to chinese word sequence
JP2010238043A (en) * 2009-03-31 2010-10-21 Mitsubishi Electric Corp Text analysis learning device
CN108763212A (en) * 2018-05-23 2018-11-06 北京神州泰岳软件股份有限公司 A kind of address information extraction method and device
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method
CN110688449A (en) * 2019-09-20 2020-01-14 京东数字科技控股有限公司 Address text processing method, device, equipment and medium based on deep learning
CN111125365B (en) * 2019-12-24 2022-01-07 京东科技控股股份有限公司 Address data labeling method and device, electronic equipment and storage medium
CN111209362A (en) * 2020-01-07 2020-05-29 苏州城方信息技术有限公司 Address data analysis method based on deep learning
CN112612940A (en) * 2020-12-23 2021-04-06 深圳壹账通智能科技有限公司 Address information analysis method, device, equipment and storage medium

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022134592A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Address information resolution method, apparatus and device, and storage medium
CN113255352A (en) * 2021-05-12 2021-08-13 北京易华录信息技术股份有限公司 Street information determination method and device and computer equipment
CN113449528A (en) * 2021-08-30 2021-09-28 企查查科技有限公司 Address element extraction method and device, computer equipment and storage medium
CN113449528B (en) * 2021-08-30 2021-11-30 企查查科技有限公司 Address element extraction method and device, computer equipment and storage medium
CN114035872A (en) * 2021-10-27 2022-02-11 北京闪送科技有限公司 Method for rapidly improving receiving and dispatching information through automatic identification and helping user
CN114091454A (en) * 2021-11-29 2022-02-25 重庆市地理信息和遥感应用中心 Method for extracting place name information and positioning space in internet text
CN114218957B (en) * 2022-02-22 2022-11-18 阿里巴巴(中国)有限公司 Method, device, equipment and storage medium for determining administrative division transition information
CN114218957A (en) * 2022-02-22 2022-03-22 阿里巴巴(中国)有限公司 Method, device, equipment and storage medium for determining administrative division transition information
CN114861658A (en) * 2022-05-24 2022-08-05 北京百度网讯科技有限公司 Address information analysis method and device, equipment and medium
CN115410158A (en) * 2022-09-13 2022-11-29 北京交通大学 Landmark extraction method based on monitoring camera
CN115410158B (en) * 2022-09-13 2023-06-30 北京交通大学 Landmark extraction method based on monitoring camera
CN116522943A (en) * 2023-05-11 2023-08-01 北京微聚智汇科技有限公司 Address element extraction method and device, storage medium and computer equipment
CN116955855A (en) * 2023-09-14 2023-10-27 南京擎天科技有限公司 Low-cost cross-region address resolution model construction method and system
CN116955855B (en) * 2023-09-14 2023-11-24 南京擎天科技有限公司 Low-cost cross-region address resolution model construction method and system
CN117457135A (en) * 2023-12-22 2024-01-26 四川互慧软件有限公司 Address data management method and cyclic neural network model construction method
CN117457135B (en) * 2023-12-22 2024-04-09 四川互慧软件有限公司 Address data management method and cyclic neural network model construction method

Also Published As

Publication number Publication date
WO2022134592A1 (en) 2022-06-30

Similar Documents

Publication Publication Date Title
CN112612940A (en) Address information analysis method, device, equipment and storage medium
CN110209830B (en) Entity linking method, apparatus, device, and computer readable storage medium
CN109885824B (en) Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium
CN109472419B (en) Method and device for establishing warning condition prediction model based on time and space and storage medium
CN111104802B (en) Method for extracting address information text and related equipment
CN112069276B (en) Address coding method, address coding device, computer equipment and computer readable storage medium
CN110134780B (en) Method, device, equipment and computer readable storage medium for generating document abstract
CN110990520B (en) Address coding method and device, electronic equipment and storage medium
CN114548298B (en) Model training method, traffic information processing method, device, equipment and storage medium
CN115438674B (en) Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment
CN108733810A (en) A kind of address date matching process and device
CN115952791A (en) Chapter-level event extraction method, device and equipment based on machine reading understanding and storage medium
CN115658837A (en) Address data processing method and device, electronic equipment and storage medium
Kim Analysis of standard vocabulary use of the open government data: the case of the public data portal of Korea
CN116414823A (en) Address positioning method and device based on word segmentation model
CN117196032A (en) Knowledge graph construction method and device for intelligent decision, electronic equipment and storage medium
CN113886512A (en) Address element analysis method and device and electronic equipment
CN109271625B (en) Pinyin spelling standardization method for Chinese place names
CN112069824B (en) Region identification method, device and medium based on context probability and citation
CN114091454A (en) Method for extracting place name information and positioning space in internet text
CN115048510A (en) Criminal name prediction method based on hierarchical legal knowledge and double-graph joint representation learning
CN116186241A (en) Event element extraction method and device based on semantic analysis and prompt learning, electronic equipment and storage medium
CN114297408A (en) Relation triple extraction method based on cascade binary labeling framework
CN111199259B (en) Identification conversion method, device and computer readable storage medium
Qiu et al. Integrating NLP and Ontology Matching into a Unified System for Automated Information Extraction from Geological Hazard Reports

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40046359

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination