WO2022134592A1 - Address information resolution method, apparatus and device, and storage medium - Google Patents

Address information resolution method, apparatus and device, and storage medium

Info

Publication number
WO2022134592A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
character
administrative division
training data
model training
Prior art date
Application number
PCT/CN2021/109698
Other languages
French (fr)
Chinese (zh)
Inventor
赵焕丽
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2022134592A1 publication Critical patent/WO2022134592A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/279 Recognition of textual entities
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to an address information resolution method, apparatus, device and storage medium.
  • a standard Chinese address should contain complete administrative divisions and be expressed in the order of administrative divisions (province/city/county/township/village), roads, streets, grades, buildings, and households.
  • such an address is parseable, so that it can be made to correspond accurately to the geographic location it denotes.
  • the main purpose of this application is to solve the technical problem that existing address parsing algorithms rely on address canonicality, feature words, and address dictionaries, and therefore parse non-standard Chinese addresses with low accuracy.
  • a first aspect of the present application provides an address information parsing method, including: crawling original address data from a preset data source with a web crawler tool; screening out, from the original address data, the address expression data whose character length falls within a preset length interval, and annotating the address expression data to obtain model training data; training an address parsing model from the model training data and a preset neural network; acquiring the to-be-recognized address text uploaded by the user, inputting it into the address parsing model, and obtaining the administrative-division label of each character in the to-be-recognized address text; and converting the to-be-recognized address text into standard address text according to the administrative-division label of each character.
  • a second aspect of the present application provides an address information parsing apparatus, including: a data crawling module for crawling original address data from a preset data source with a web crawler tool; a screening module for screening out, from the original address data, the address expression data whose character length falls within a preset length interval, and annotating the address expression data to obtain model training data; a model training module for training an address parsing model from the model training data and a preset neural network; a model input module for acquiring the to-be-recognized address text uploaded by the user, inputting it into the address parsing model, and obtaining the administrative-division label of each character in the to-be-recognized address text; and a standard conversion module configured to convert the to-be-recognized address text into standard address text according to the administrative-division label of each character.
  • a third aspect of the present application provides an address information parsing device, comprising a memory and at least one processor interconnected by a line, wherein instructions are stored in the memory; the at least one processor invokes the instructions in the memory so that the device executes the steps of the address information parsing method: crawling original address data from a preset data source with a web crawler tool; screening out, from the original address data, the address expression data whose character length falls within a preset length interval, and annotating it to obtain model training data; training an address parsing model from the model training data and a preset neural network; acquiring the to-be-recognized address text uploaded by the user, inputting it into the address parsing model, and obtaining the administrative-division label of each character; and converting the to-be-recognized address text into standard address text according to those labels.
  • a fourth aspect of the present application provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to perform the steps of the address information parsing method: crawling original address data from a preset data source with a web crawler tool; screening out, from the original address data, the address expression data whose character length falls within a preset length interval, and annotating it to obtain model training data; training an address parsing model from the model training data and a preset neural network; acquiring the to-be-recognized address text uploaded by the user, inputting it into the address parsing model, and obtaining the administrative-division label of each character; and converting the to-be-recognized address text into standard address text according to those labels.
  • in the technical solution provided in this application, original address data is crawled from a preset data source with a web crawler tool; the address expression data whose character length falls within the preset length interval is screened out of the original address data and annotated to obtain model training data; an address parsing model is trained from the model training data and a preset neural network; the to-be-recognized address text uploaded by the user is acquired and input into the address parsing model, yielding the administrative-division label of each character; and according to those labels, the to-be-recognized address text is converted into standard address text.
  • in this way, the computer can extract the semantic features of the whole address and take into account the administrative-division assignments of the preceding and following characters, thereby realizing multi-level administrative-division parsing of non-standardized addresses.
  • this scheme does not rely on address canonicality, feature words, or address dictionaries, so it can handle diverse non-canonical expressions.
  • the deep-model-based method can also learn naming and segmentation rules from existing data and apply them at inference time, which improves the parsing of non-standard Chinese addresses so that such address information can be used directly by computers for location services.
  • this application also relates to blockchain technology, and the original address data can be stored in the blockchain.
  • FIG. 1 is a schematic diagram of a first embodiment of a method for resolving address information in an embodiment of the present application
  • FIG. 2 is a schematic diagram of a second embodiment of a method for resolving address information in an embodiment of the present application
  • FIG. 3 is a schematic diagram of a third embodiment of a method for resolving address information in an embodiment of the present application
  • FIG. 4 is a schematic diagram of a fourth embodiment of a method for resolving address information in an embodiment of the present application
  • FIG. 5 is a schematic diagram of an embodiment of an address information parsing apparatus in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another embodiment of an address information parsing apparatus in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an embodiment of an address information parsing device in an embodiment of the present application.
  • the present application provides an address information parsing method, which solves the technical problem that existing address parsing algorithms rely on address canonicality, feature words, and address dictionaries, and therefore parse non-standard Chinese addresses with low accuracy.
  • the flowchart of the address information parsing method provided by an embodiment of the present application specifically includes the following steps:
  • the execution body of the present application may be an address information parsing device, or may be a terminal or a server, which is not specifically limited here.
  • the embodiments of the present application are described taking a server as the execution body by way of example.
  • the above-mentioned original address data can be stored in a node of a blockchain.
  • the preset data source may be official information websites or public address databases; the address data crawled from these sources serves as the original address data. Most of this original address data consists of Chinese addresses, which may contain irregularities that deviate from the standard administrative divisions.
  • for example, the administrative-division feature word "District" is omitted in "Xuhui Kaibin Road", and part of the administrative division ("Xuhui District") is omitted in "Shanghai Kaibin Road".
  • the levels of administrative-division information may also be jumbled, and a word such as "district" appearing in "Wumei Kindergarten" causes the non-administrative-division part of the address to share a name with an administrative division.
  • the first screening step mainly judges whether the characters in the original address data are valid UTF-8 encoded characters; the non-UTF-8 encoded characters among them, such as emoticons, are sorted out and removed to obtain clean original address data.
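The character-screening step above can be sketched as follows. The allowed character set here (CJK ideographs, ASCII letters and digits, and a few address separators) is an illustrative assumption, not the patent's exact rule:

```python
import re

# Characters plausibly part of a Chinese address: CJK ideographs,
# ASCII letters/digits, and a few separators. Everything else
# (emoji, symbols, control characters) is stripped.
_DISALLOWED = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\-#()（）]")

def clean_address(text):
    """Remove characters outside the allowed set (illustrative filter)."""
    return _DISALLOWED.sub("", text)
```

In practice the allowed set would be tuned to the data source; the point is that non-address symbols such as emoji are dropped before annotation.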
  • the preset length interval is related to specific application scenarios, and is generally set to be between 7 and 20.
  • the range of the interval can be appropriately adjusted.
  • the character-length interval is a configurable parameter and has no effect on the subsequent model training process, so only the configuration needs to be modified for different application scenarios.
  • the model requires a maximum of 128 characters.
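The length screening described above amounts to a simple configurable filter; the 7–20 default mirrors the interval mentioned in the text, and the helper name is our own:

```python
# Length bounds are configuration, not model constants; 7-20 is the
# interval typically used above, while the model itself accepts at
# most 128 characters per address.
MIN_LEN, MAX_LEN = 7, 20

def filter_by_length(addresses, lo=MIN_LEN, hi=MAX_LEN):
    """Keep only address strings whose character count lies in [lo, hi]."""
    return [a for a in addresses if lo <= len(a) <= hi]
```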
  • the labeling is mainly performed manually. The labels are administrative-division levels, ten in total: "province", "city", "district/county", "township", "street", "road", "house number", "village", "building name", and "other". Among them, "province" includes provinces, municipalities, autonomous regions, and special administrative regions; "city" includes prefecture-level cities, regions, autonomous prefectures, and leagues; "district/county" includes municipal districts, county-level cities, counties, banners, special zones, and forest areas; "township" includes towns, townships, ethnic townships, sumu, ethnic sumu, county-administered districts, and district offices; "street", like "township", belongs to the township-level administrative divisions; "road" includes roads, streets, and alleys; the remaining labels coincide with their standard names.
  • each character in the address expression data is marked by manual annotation.
  • for example, the characters of a province name are each marked "province" and those of a city name "city", so a province-city prefix is marked "province, province, city, city".
  • the model training data can be organized into the following format: "Guangdong Province/Province Shenzhen/City Bao'an District/District Xixiang Street/Street Nanchang Second New Village/Village X Lane/Road X/House Number", where each character in the model training data has a corresponding label.
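A per-character view of that annotation format can be produced with a small helper. The `segment/label` token format mirrors the example above; the function name and the Chinese sample are our own illustration:

```python
def parse_annotation(annotated):
    """Expand space-separated 'segment/label' tokens into per-character
    (character, label) pairs, e.g. '广东省/省 深圳市/市' ->
    [('广','省'), ('东','省'), ('省','省'), ('深','市'), ...]."""
    pairs = []
    for token in annotated.split():
        segment, label = token.rsplit("/", 1)
        pairs.extend((ch, label) for ch in segment)
    return pairs
```

This is the shape the sequence-labeling model consumes: one label per character, not one label per segment.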
  • the preset neural network is a Bi-LSTM-CRF neural network
  • the Bi-LSTM-CRF includes three layers of neural networks: an Embedding layer, a Bi-LSTM layer, and a CRF layer, where the Embedding layer is the embedding layer.
  • through the Embedding layer, each character in the input model training data can be mapped into a vector in a low-dimensional space.
  • the word vector is a distributed representation of each character in the text, conveying semantics to the computer through low-dimensional vectors.
  • the Bi-LSTM layer is a bidirectional long short-term memory network layer.
  • the bidirectional long short-term memory network includes two modules, a forward LSTM and a backward LSTM, which can capture long-range contextual dependencies and context-entity features, obtain richer spatiotemporal correlations between entities, and suppress from both directions the influence of noise such as interfering entities on the neural network model, greatly assisting the mining of long-term dependencies. A conditional random field (CRF) is a discriminative probabilistic model, a type of random field, often used to label or analyze sequence data such as natural-language characters or biological sequences.
  • the conditional random field is an undirected graphical model: the vertices in the graph represent random variables, and the edges between vertices represent dependencies between those variables.
  • the distribution modeled is the conditional probability of the random variable Y given the observed random variable X.
  • the graph layout of a conditional random field can in principle be arbitrary, but the commonly used layout is the linear chain, whether in training, inference, or decoding.
  • the advantage of Bi-LSTM is that it can remember context information, which greatly facilitates the mining of long-term dependencies and helps semantic understanding. Used directly for labeling tasks, however, it has a problem: Bi-LSTM is a time-series model, so its output is made only for the current character, yielding a locally optimal solution.
  • conditional random fields, on the other hand, place high demands on feature templates.
  • Bi-LSTM can obtain context information but lacks a global decoding model, while the conditional random field can generate the globally optimal solution but needs context information. This application therefore combines the Bi-LSTM and conditional random field models to build one complete model with complementary advantages.
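How the CRF layer turns per-character scores into a globally optimal label sequence can be illustrated with a tiny Viterbi decode. The label names, emission scores, and transition weights below are toy values; this is a sketch of the decoding idea, not the patent's implementation:

```python
def viterbi_decode(emissions, transitions, labels):
    """Find the label sequence maximizing the sum of per-position
    emission scores plus pairwise transition scores (CRF decoding).
    emissions: one {label: score} dict per character position;
    transitions: {(prev_label, next_label): score}, defaulting to 0."""
    prev = {lab: emissions[0].get(lab, 0.0) for lab in labels}
    backpointers = []
    for emit in emissions[1:]:
        cur, ptr = {}, {}
        for lab in labels:
            best = max(labels,
                       key=lambda p: prev[p] + transitions.get((p, lab), 0.0))
            cur[lab] = (prev[best] + transitions.get((best, lab), 0.0)
                        + emit.get(lab, 0.0))
            ptr[lab] = best
        prev = cur
        backpointers.append(ptr)
    # Backtrack from the best final label.
    last = max(labels, key=lambda l: prev[l])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Because transition scores reward plausible label orders (e.g. "city" after "province"), the decoder can prefer a sequence whose individual emissions are not all locally best, which is exactly what the per-character Bi-LSTM output alone cannot do.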
  • the address parsing model can then be used to parse different to-be-recognized addresses input by the user: after input, the model labels each character with a division level such as "province", "city", "district/county", "township", "village", or "other".
  • characters with the same label are spliced to obtain the name of each labeled administrative division. For example, if the two characters "Chong" and "Qing" are both labeled "province", they are spliced to obtain "Chongqing", and so on for the subsequent characters. After "Chongqing" is determined to be a "province", it is matched among the 34 provincial-level administrative regions to determine whether it is a province, an autonomous region, a municipality directly under the Central Government, or a special administrative region. As a municipality, the character "City" is appended after "Chongqing", and matching proceeds among the 40 districts and counties under Chongqing. By analogy, the to-be-recognized address text "No. 1 Community of Tangfang Village, Tangfang Town, Wuxi, Chongqing" can be parsed into the standard address text "No. 1 Community of Tangfang Village, Tangfang Town, Wuxi County, Chongqing City".
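The final normalization step, appending the omitted suffix after matching against the division table, might look like the sketch below. The table here is a tiny illustrative subset (the four municipalities), not a full administrative-division database:

```python
# Toy subset of the provincial-level division table; a real system
# would match against all 34 provincial-level regions.
MUNICIPALITIES = {"北京", "天津", "上海", "重庆"}

def complete_province(name):
    """Append the omitted suffix for a spliced province-level name,
    e.g. '重庆' -> '重庆市' for a municipality."""
    if name in MUNICIPALITIES:
        return name + "市"
    return name
```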
  • in the technical solution provided in this application, original address data is crawled from a preset data source with a web crawler tool; the address expression data whose character length falls within the preset length interval is screened out of the original address data and annotated to obtain model training data; an address parsing model is trained from the model training data and a preset neural network; the to-be-recognized address text uploaded by the user is acquired and input into the address parsing model, yielding the administrative-division label of each character; and according to those labels, the to-be-recognized address text is converted into standard address text.
  • in this way, the computer can extract the semantic features of the whole address and take into account the administrative-division assignments of the preceding and following characters, thereby realizing multi-level administrative-division parsing of non-standardized addresses.
  • this scheme does not rely on address canonicality, feature words, or address dictionaries, so it can handle diverse non-canonical expressions.
  • the deep-model-based method can also learn naming and segmentation rules from existing data and apply them at inference time, which improves the parsing of non-standard Chinese addresses so that such address information can be used directly by computers for location services.
  • this application also relates to blockchain technology, and the original address data can be stored in the blockchain.
  • the second embodiment of the address information parsing method in the embodiment of the present application includes:
  • Steps 201-202 in this embodiment are similar to steps 101-102 in the first embodiment, and are not repeated here.
  • in converting each character of the model training data into a word vector, each character is first converted into a one-hot vector, because the Embedding layer is a fully connected layer that takes the one-hot vector as input, with the number of middle-layer nodes equal to the word-vector dimension.
  • the one-hot vector is then converted into a low-dimensional dense word vector through a pre-trained vector matrix, which alleviates the lexical-gap and curse-of-dimensionality problems.
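The equivalence between this fully connected layer and a simple row lookup can be seen in a few lines; the dimensions and values are illustrative:

```python
def one_hot(index, size):
    """Build a one-hot row vector of the given size."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def embed(onehot, matrix):
    """Multiply a one-hot row vector by the embedding matrix: the
    fully connected layer simply selects one matrix row, i.e. the
    dense low-dimensional word vector of that character."""
    dim = len(matrix[0])
    return [sum(onehot[i] * matrix[i][d] for i in range(len(matrix)))
            for d in range(dim)]
```

Real implementations skip the multiplication and index the row directly, but the learned parameters are the same matrix.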
  • the encoding process in the Bi-LSTM layer includes: the Bi-LSTM layer automatically extracts sentence features, taking the character-embedding sequence (x_1, x_2, x_3, ..., x_n) of a sentence as the input of each time step of the Bi-LSTM; then the hidden-state sequence (h→_1, h→_2, h→_3, ..., h→_n) output by the forward LSTM and the hidden states (h←_1, h←_2, h←_3, ..., h←_n) output by the backward LSTM at each position are spliced position by position to obtain the complete hidden output sequence.
  • the Bi-LSTM layer outputs a score for each possible label of each character, and finally the label with the highest score is selected as that character's label.
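The position-wise splicing of forward and backward hidden states amounts to concatenation at each time step; the vectors below are tiny stand-ins for real hidden states:

```python
def splice_hidden(forward, backward):
    """Concatenate the forward and backward LSTM hidden states at each
    position to form the complete hidden output sequence."""
    return [f + b for f, b in zip(forward, backward)]
```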
  • the administrative-division labels of the characters in the model training data are connected to obtain the administrative-division sequence. For example, the conditional random field layer's per-character prediction for the model training data "Shanghai, Shanghai Jing'an Kerry Center" yields the administrative-division sequence "province province city city city building building building building building building building building building".
  • the conditional random field layer may predict the labels of some characters in the model training data incorrectly.
  • for example, if the administrative-division sequence obtained from the character labels is "province province city city province province building building building building building building building building building", there are two segments with the identical label "province". Clearly, one address cannot contain two separate segments with the same administrative-division label, so the characters in the segment at the later position need to be re-predicted.
  • Steps 212-213 in this embodiment are similar to steps 104-105 in the first embodiment, and are not repeated here.
  • this embodiment describes in detail the process of obtaining an address resolution model by training according to the model training data and the preset neural network.
  • converting each character in the model training data into a word vector: the model training data is input into the embedding layer of the neural network, converting each character into a word vector; the word vectors serve as the input of each time step of the bidirectional long short-term memory network layer to obtain the latent output sequence of the model training data; the latent output sequence is input into the conditional random field layer of the neural network to predict each character's label, which is compared against the original labels and iterated to obtain the final pre-trained address parsing model.
  • this embodiment adds, to the step of inputting the latent output sequence into the conditional random field layer of the neural network to predict each character's label, a post-processing of those labels: the administrative-division sequence of the model training data is obtained from the per-character labels; it is determined whether the sequence contains at least two administrative-division segments of the same label type, a segment being a run of consecutive identical labels; if so, the positions of the same-typed segments in the sequence are compared, and the labels of the segment in the later position are re-predicted.
  • the third embodiment of the address information parsing method in the embodiment of the present application includes:
  • the word vectors are used as the input of each time step of the bidirectional long short-term memory network layer in the neural network to obtain the latent output sequence of the model training data;
  • the output of the CRF layer needs post-processing, which includes splicing the characters of adjacent identical administrative-division labels and detecting erroneous predictions. For example, in a predicted sequence such as "province province city city city province province building building building", a "province" segment appears after the "city" segment; the conditional random field layer's prediction of those characters' labels is therefore wrong, and they must be predicted again.
  • a word vector is a distributed representation of each character in the text, conveying semantics to the computer through low-dimensional vectors in space.
  • after the model training data has passed through the Embedding layer of the neural network and been output in the form of word vectors, the word vectors are input to the Bi-LSTM layer.
  • the Bi-LSTM neural network is suited to sequence-labeling tasks. It performs the same operation on each word vector in the input sequence; the operation here is a matrix multiplication that linearly maps a high-dimensional vector (such as 300 dimensions) to a low-dimensional one (such as 128 dimensions). Each dimension of the result represents a feature, so this operation can remove useless features.
  • Each step of operation depends on the calculation result of the previous step, and encodes the features of the context at the same time.
  • the specific implementation of this encoding is that the operation result of the previous step (here, the feature) is used as the input of the next step. Suppose, for the address "Shanghai", the feature h_{t-1} has been extracted for the character "Shang"; with f the function used in the operation, h_t = f(h_{t-1}, x_t) is the feature finally extracted for the character "Hai". Thus when the current step's feature is extracted, the previous step's feature also participates in the operation, which is what "encoding the preceding features" means; a similar operation is done for the following features.
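The recurrence h_t = f(h_{t-1}, x_t) can be unrolled generically as below; `f` is left abstract, and the list-accumulating `f` in the test is purely illustrative, chosen to make the context encoding visible:

```python
def unroll(chars, f, h0):
    """Apply the recurrence h_t = f(h_{t-1}, x_t) along the character
    sequence, collecting the feature extracted at every step."""
    h, features = h0, []
    for x in chars:
        h = f(h, x)          # the previous step's feature feeds this step
        features.append(h)
    return features
```

With `f = lambda h, x: h + [x]`, the feature at each step is simply the list of characters seen so far, which makes explicit how earlier features participate in later ones.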
  • the features encoded with context are output as the complete set of features extracted for each character.
  • however, the features output by Bi-LSTM do not consider the influence of the previous step's label on the current step's label.
  • for example, if the current character is "Wu" and the preceding two characters "Chongqing" form a city name, then "Wuxi" is with high probability the name of a district, county, or town. Therefore a CRF (conditional random field) layer is spliced onto the output layer of the Bi-LSTM, so that the Bi-LSTM output sequence becomes the observation sequence of the CRF layer; the CRF then computes the probabilistically optimal solution for the entire sequence, taking into account the interactions between sequence labels.
  • the output tag sequence of the CRF corresponds to each character of the input address, respectively.
  • this embodiment adds the process of detecting wrongly predicted character labels at the conditional random field layer: the administrative-division sequence of the model training data is obtained; according to the arrangement order of the labels in the sequence, it is determined whether the sequence contains an error; if so, the administrative-division labels of the affected characters are re-predicted.
  • in this way, errors in the conditional random field layer's label predictions for the training data can be corrected, improving the efficiency of model training.
  • the fourth embodiment of the address information parsing method in the embodiment of the present application includes:
  • Steps 401-404 in this embodiment are similar to steps 101-104 in the first embodiment, and are not repeated here.
  • a character buffer area, initially empty, is set up, and the characters of the labeled to-be-recognized address text are stored into it in textual order. For example, for "No. 1 Community of Tangfang Village, Tangfang Town, Wuxi, Chongqing": first "Chong" is put into the buffer, and it is judged whether "Chong" and "Qing" carry the same administrative-division label; since both are labeled "province", "Qing" is stored into the buffer as well. It is then judged whether "Qing" and "Wu" carry the same label; they differ, so the two characters "Chong" and "Qing" are taken out of the buffer and spliced to obtain "Chongqing".
  • this embodiment adds a process of splicing characters marked by consecutive identical administrative divisions in the address text to be recognized.
  • each character in the to-be-recognized address text is processed in turn: the first character is stored into the character buffer area and its administrative-division label determined; it is judged whether its label equals that of the second character; if so, the second character is also stored into the buffer; if not, the buffered characters are output, the buffer is cleared, and processing continues with the next character; the characters output from the buffer with the same administrative-division label are spliced together.
  • splicing the characters with consecutive identical administrative-division labels in this way facilitates the subsequent conversion of the to-be-recognized address text into the standard address text.
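The buffer-based splicing of this embodiment can be sketched as follows; the function name is our own:

```python
def splice_by_label(chars, labels):
    """Walk the characters in order, buffering runs with the same
    administrative-division label; when the label changes, flush the
    buffer as one spliced segment."""
    segments, buf, current = [], [], None
    for ch, lab in zip(chars, labels):
        if buf and lab != current:
            segments.append(("".join(buf), current))
            buf = []
        buf.append(ch)
        current = lab
    if buf:
        segments.append(("".join(buf), current))
    return segments
```

Applied to the "Chongqing Wuxi..." example above, the first flush happens when the label changes from "province" at "Wu", emitting the spliced segment "Chongqing".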
  • an embodiment of the address information parsing apparatus in the embodiment of the present application includes:
  • a data crawling module 501 is used to crawl original address data from a preset data source by using a web crawler tool
  • a screening module 502 configured to screen out address expression data whose character length is within a preset length interval from the original address data, and annotate the address expression data to obtain model training data;
  • a model training module 503, configured to train an address parsing model according to the model training data and a preset neural network;
  • a model input module 504 configured to acquire the address text to be recognized uploaded by the user, input the address text to be recognized into the address resolution model, and obtain the administrative division labeling of each character in the address text to be recognized;
  • the standard conversion module 505 is configured to convert the to-be-recognized address text into standard address text according to the administrative division labeling of each character in the to-be-recognized address text.
  • the above-mentioned address text to be recognized can be stored in a node of a blockchain.
  • the address information parsing apparatus runs the address information parsing method, and the method includes: crawling original address data from a preset data source by using a web crawler tool; screening out, from the original address data, address expression data whose character length is within a preset length interval, and annotating the address expression data to obtain model training data; training an address parsing model according to the model training data and a preset neural network; acquiring the address text to be recognized uploaded by the user, inputting it into the address parsing model, and obtaining the administrative division label of each character in the address text to be recognized; and converting the address text to be recognized into standard address text according to the administrative division label of each character.
  • with this method, the computer can extract the semantic features of the entire address and take into account the administrative division labels of preceding and following characters, achieving multi-level administrative division parsing of non-standardized addresses.
  • this scheme does not rely on address canonicality, feature characters, or address dictionaries, so it can handle diverse non-standard expressions.
  • the deep-model-based method can also learn naming and segmentation patterns from existing data and apply them during model inference, which improves the parsing of non-standard Chinese addresses so that such Chinese address information can be used directly by computers for location services.
  • this application also relates to blockchain technology, and the original address data can be stored in the blockchain.
  • the second embodiment of the address information parsing apparatus in the embodiment of the present application includes:
  • a data crawling module 501, configured to crawl original address data from a preset data source by using a web crawler tool;
  • a screening module 502 configured to screen out address expression data whose character length is within a preset length interval from the original address data, and annotate the address expression data to obtain model training data;
  • a model training module 503, configured to train an address parsing model according to the model training data and a preset neural network;
  • a model input module 504 configured to acquire the address text to be recognized uploaded by the user, input the address text to be recognized into the address resolution model, and obtain the administrative division labeling of each character in the address text to be recognized;
  • the standard conversion module 505 is configured to convert the to-be-recognized address text into standard address text according to the administrative division labeling of each character in the to-be-recognized address text.
  • model training module 503 includes:
  • a vector conversion unit 5031, configured to input the model training data into the embedding layer of the neural network and convert each character in the model training data into a word vector;
  • a sequence unit 5032, configured to feed the word vectors into the bidirectional long short-term memory (Bi-LSTM) network layer of the neural network, one per time step, to obtain the hidden output sequence of the model training data;
  • a label prediction unit 5033, configured to input the hidden output sequence into the conditional random field (CRF) layer of the neural network, predict the label of each character in the model training data, compare the predictions with the original labels of the model training data, and iterate to obtain the final trained address parsing model.
  • the vector conversion unit 5031 is specifically used for:
  • the one-hot encoded vectors of the model training data are converted into low-dimensional dense word vectors through a pre-trained vector matrix.
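Because a one-hot vector contains a single 1, multiplying it by the pre-trained vector matrix is equivalent to selecting one row of that matrix. A minimal pure-Python sketch (the toy matrix and dimensions are illustrative, not trained values):

```python
def one_hot_to_dense(char_index, embedding_matrix):
    """Explicit one-hot-times-matrix product: maps a character index
    to its low-dimensional dense word vector."""
    vocab_size = len(embedding_matrix)
    dim = len(embedding_matrix[0])
    one_hot = [1.0 if i == char_index else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * embedding_matrix[i][d] for i in range(vocab_size))
            for d in range(dim)]

def embed(char_index, embedding_matrix):
    """Equivalent row lookup, which is how embedding layers work in practice."""
    return embedding_matrix[char_index]
```

The row lookup avoids materializing the sparse high-dimensional one-hot vector, which is why embedding layers are implemented as table lookups.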
  • sequence unit 5032 is specifically used for:
  • the word vectors are fed into the bidirectional long short-term memory network layer of the neural network, one per time step, to obtain the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM;
  • the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are spliced to obtain the complete hidden output sequence.
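The splicing of the two hidden sequences amounts to concatenating, at each character position, the forward LSTM's hidden state with the backward LSTM's hidden state for the same position. A sketch with toy vectors (the hidden states here are illustrative lists, not real LSTM outputs):

```python
def concat_hidden_sequences(forward_states, backward_states):
    """Position-wise concatenation of forward and backward hidden states.
    backward_states are assumed to already be re-aligned to the original
    character order (i.e. reversed back after the backward pass)."""
    assert len(forward_states) == len(backward_states)
    return [f + b for f, b in zip(forward_states, backward_states)]
```

Each position of the resulting sequence therefore carries information about both the preceding and the following context of that character.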
  • the model training module further includes a first re-measurement unit 5034, and the first re-measurement unit 5034 is specifically used for:
  • the model training module further includes a second re-measurement unit 5035, and the second re-measurement unit 5035 is specifically used for:
  • the address information parsing device further includes a character connection module 506, and the character connection module 506 is specifically used for:
  • this embodiment describes in detail the specific functions of each module and the unit structure of some modules.
  • with this method, the computer can extract the semantic features of the entire address and take into account the administrative division labels of preceding and following characters, achieving multi-level administrative division parsing of non-standardized addresses.
  • compared with existing address parsing algorithms, it does not depend on address canonicality, feature characters, or address dictionaries, so it can handle diverse non-standard expressions.
  • the deep-model-based method can also learn naming and segmentation patterns from existing data and apply them during model inference, which improves the parsing of non-standard Chinese addresses so that such Chinese address information can be used directly by computers for location services.
  • FIG. 7 is a schematic structural diagram of an address information parsing device provided by an embodiment of the present application.
  • the address information parsing device 700 may vary greatly in configuration or performance, and may include one or more central processing units (CPUs) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing application programs 733 or data 732.
  • the memory 720 and the storage medium 730 may be short-term storage or persistent storage.
  • the program stored in the storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the address information parsing device 700 .
  • the processor 710 may be configured to communicate with the storage medium 730, and execute a series of instruction operations in the storage medium 730 on the address information parsing device 700 to implement the steps of the above address information parsing method.
  • the address information parsing device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input/output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, and FreeBSD.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium may also be a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when run on a computer, the instructions cause the computer to execute the steps of the address information parsing method.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • in essence, the technical solutions of the present application, or the parts thereof that contribute to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media capable of storing program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Machine Translation (AREA)

Abstract

An address information resolution method, apparatus and device, and a storage medium, which are used for converting, into standard address text, address text to be identified that is uploaded by a user, and relate to the field of artificial intelligence. The method comprises: crawling original address data from a preset data source by using a web crawler tool (101); selecting, from the original address data, address representation data with a character length within a preset length interval, and labeling same to obtain model training data (102); according to the model training data and a preset neural network, performing training to obtain an address resolution model (103); acquiring address text to be identified that is uploaded by a user, and inputting the address text to be identified into the address resolution model, so as to obtain an administrative division label of each character in the address text to be identified (104); and converting the address text to be identified into standard address text according to the administrative division label of each character in the address text to be identified (105). In addition, the present invention further relates to blockchain technology, and the address text to be identified can be stored in a blockchain.

Description

Address Information Parsing Method, Apparatus, Device, and Storage Medium
This application claims priority to the Chinese patent application No. 202011544487.1, titled "Address Information Parsing Method, Apparatus, Device and Storage Medium", filed with the China Patent Office on December 23, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an address information parsing method, apparatus, device, and storage medium.
Background
Location-based services are used more and more widely in people's lives, and there is a growing demand for finding the geographic coordinates of a textual address expression quickly and accurately. A standard Chinese address should contain complete administrative divisions and be expressed in the order of administrative division (province/city/county/township/village), road and street, house number, building, and room; its feature characters are distinct, and it can be parsed by Chinese address segmentation algorithms, so it can be accurately matched to the geographic location it describes.

However, the inventors realized that non-standardized expressions of Chinese addresses make the location semantics vague or ambiguous, which prevents computers from directly understanding the geographic location described by the address information, so that such Chinese address information cannot be used directly by computers for location services. Existing address parsing algorithms (Chinese address element segmentation, thesaurus matching, feature character segmentation, etc.) rely on address canonicality, feature characters, and address dictionaries, and cannot handle non-standard Chinese addresses well, so such Chinese address information cannot be used directly by computers for location services.
SUMMARY OF THE INVENTION
The main purpose of the present application is to solve the technical problem that existing address parsing algorithms rely on address canonicality, feature characters, and address dictionaries, resulting in low accuracy when parsing non-standard Chinese addresses.
To achieve the above purpose, a first aspect of the present application provides an address information parsing method, including: crawling original address data from a preset data source by using a web crawler tool; screening out, from the original address data, address expression data whose character length is within a preset length interval, and annotating the address expression data to obtain model training data; training an address parsing model according to the model training data and a preset neural network; acquiring the address text to be recognized uploaded by a user, inputting the address text to be recognized into the address parsing model, and obtaining the administrative division label of each character in the address text to be recognized; and converting the address text to be recognized into standard address text according to the administrative division label of each character in the address text to be recognized.

A second aspect of the present application provides an address information parsing apparatus, including: a data crawling module, configured to crawl original address data from a preset data source by using a web crawler tool; a screening module, configured to screen out, from the original address data, address expression data whose character length is within a preset length interval, and to annotate the address expression data to obtain model training data; a model training module, configured to train an address parsing model according to the model training data and a preset neural network; a model input module, configured to acquire the address text to be recognized uploaded by a user, input it into the address parsing model, and obtain the administrative division label of each character in the address text to be recognized; and a standard conversion module, configured to convert the address text to be recognized into standard address text according to the administrative division label of each character.

A third aspect of the present application provides an address information parsing device, including a memory and at least one processor, where instructions are stored in the memory, and the memory and the at least one processor are interconnected by a line; the at least one processor invokes the instructions in the memory so that the address information parsing device executes the following steps of the address information parsing method: crawling original address data from a preset data source by using a web crawler tool; screening out, from the original address data, address expression data whose character length is within a preset length interval, and annotating the address expression data to obtain model training data; training an address parsing model according to the model training data and a preset neural network; acquiring the address text to be recognized uploaded by a user, inputting it into the address parsing model, and obtaining the administrative division label of each character in the address text to be recognized; and converting the address text to be recognized into standard address text according to the administrative division labels.

A fourth aspect of the present application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the following steps of the address information parsing method: crawling original address data from a preset data source by using a web crawler tool; screening out, from the original address data, address expression data whose character length is within a preset length interval, and annotating the address expression data to obtain model training data; training an address parsing model according to the model training data and a preset neural network; acquiring the address text to be recognized uploaded by a user, inputting it into the address parsing model, and obtaining the administrative division label of each character in the address text to be recognized; and converting the address text to be recognized into standard address text according to the administrative division labels.
In the technical solution provided by the present application, original address data is crawled from a preset data source by using a web crawler tool; address expression data whose character length is within a preset length interval is screened out from the original address data and annotated to obtain model training data; an address parsing model is trained according to the model training data and a preset neural network; the address text to be recognized uploaded by a user is acquired and input into the address parsing model to obtain the administrative division label of each character in the address text to be recognized; and the address text to be recognized is converted into standard address text according to the administrative division labels. With this method, the computer can extract the semantic features of the entire address and take into account the administrative division labels of preceding and following characters, achieving multi-level administrative division parsing of non-standardized addresses. Compared with existing address parsing algorithms, this scheme does not rely on address canonicality, feature characters, or address dictionaries, so it can handle diverse non-standard expressions.
The deep-model-based method can also learn naming and segmentation patterns from existing data and apply them during model inference, which improves the parsing of non-standard Chinese addresses so that such Chinese address information can be used directly by computers for location services. In addition, the present application also relates to blockchain technology, and the original address data can be stored in a blockchain.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a first embodiment of the address information parsing method in an embodiment of the present application;

FIG. 2 is a schematic diagram of a second embodiment of the address information parsing method in an embodiment of the present application;

FIG. 3 is a schematic diagram of a third embodiment of the address information parsing method in an embodiment of the present application;

FIG. 4 is a schematic diagram of a fourth embodiment of the address information parsing method in an embodiment of the present application;

FIG. 5 is a schematic diagram of an embodiment of the address information parsing apparatus in an embodiment of the present application;

FIG. 6 is a schematic diagram of another embodiment of the address information parsing apparatus in an embodiment of the present application;

FIG. 7 is a schematic diagram of an embodiment of the address information parsing device in an embodiment of the present application.
Detailed Description
The present application provides an address information parsing method, which solves the technical problem that existing address parsing algorithms rely on address canonicality, feature characters, and address dictionaries, resulting in low accuracy when parsing non-standard Chinese addresses.

In order to enable those skilled in the art to better understand the solutions of the present application, the embodiments of the present application are described below with reference to the accompanying drawings.

The terms "first", "second", "third", "fourth", etc. (if any) in the description, claims, and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used may be interchanged under appropriate circumstances, so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.
Referring to FIG. 1, the flowchart of the address information parsing method provided by an embodiment of the present application specifically includes the following.

It can be understood that the execution body of the present application may be an address information parsing apparatus, a terminal, or a server, which is not specifically limited here. The embodiments of the present application take a server as the execution body for description.

It should be emphasized that, to ensure the privacy and security of the data, the above original address data may be stored in a node of a blockchain.

In this embodiment, the preset data source may be official information websites or published address databases, from which the address data is crawled as original address data. Most of these original address data are Chinese addresses, which may be non-standard and inconsistent with the standard administrative divisions. For example, "Xuhui Kaibin Road" omits the administrative division feature character "District"; "Shanghai Kaibin Road" omits the intermediate administrative division "Xuhui District", leaving the administrative division hierarchy disordered; and the character "Qu" ("district") in "Qumei Kindergarten" causes the non-administrative-division part of the address to share a name with an administrative division.

In this embodiment, after millions of original address records are crawled from the data sources, a first screening step is performed, mainly by judging whether the characters in the original address data are UTF-8 encoded characters; non-UTF-8 characters, such as emoticons, are deleted to obtain standard original address data.
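One possible cleanup pass for this screening step can be sketched as follows. The exact filtering rule is an assumption: here the Unicode category "So" (Symbol, other) is used as a proxy for emoticons, and characters that cannot survive a UTF-8 round trip (lone surrogates) are also dropped.

```python
import unicodedata

def clean_address(text):
    """Drop characters that cannot be UTF-8 encoded (lone surrogates)
    and symbol characters such as emoji, keeping ordinary Chinese
    characters, letters, digits, and punctuation."""
    kept = []
    for ch in text:
        try:
            ch.encode("utf-8")                 # rejects lone surrogates
        except UnicodeEncodeError:
            continue
        if unicodedata.category(ch) == "So":   # "Symbol, other": emoji etc.
            continue
        kept.append(ch)
    return "".join(kept)
```

A stricter whitelist (e.g. keeping only CJK ranges, ASCII, and digits) would be an alternative design, at the cost of discarding rarer but legitimate address characters.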
102. Screen out, from the original address data, address expression data whose character length is within the preset length interval, and annotate the address expression data to obtain model training data.

In this embodiment, the preset length interval depends on the specific application scenario and is generally set between 7 and 20. For application scenarios requiring more detailed and complete addresses, the interval can be adjusted accordingly. Technically, the character length is a configurable parameter that has no effect on the subsequent model training process, so it only needs to be reconfigured for different application scenarios; the model generally requires addresses of at most 128 characters.
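The length screening is a simple configurable filter. The bounds below (7–20, with a 128-character model cap) follow the numbers given in the text; the function and variable names are illustrative.

```python
MIN_LEN, MAX_LEN = 7, 20    # preset length interval (scenario-dependent)
MODEL_MAX = 128             # maximum length the model accepts

def within_length_interval(address, lo=MIN_LEN, hi=MAX_LEN):
    """True if the address's character length falls in the configured interval,
    never exceeding the model's hard cap."""
    return lo <= len(address) <= min(hi, MODEL_MAX)

def screen(addresses):
    """Keep only address expression data within the preset length interval."""
    return [a for a in addresses if within_length_interval(a)]
```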
In this embodiment, the annotation is mainly performed manually, and the labels are mainly administrative divisions at 10 levels: "province", "city", "district/county", "township", "street", "road", "house number", "village", "building name", and "other". Here, "province" includes provinces, municipalities, autonomous regions, and special administrative regions; "city" includes prefecture-level cities, regions, autonomous prefectures, and leagues; "district/county" includes municipal districts, county-level cities, counties, banners, special zones, and forest areas; "township" includes towns, townships, ethnic townships, sumu, ethnic sumu, county-administered districts, and district offices; "street", like "township", belongs to the township-level administrative divisions; "road" includes roads, streets, and alleys; the other labels are the same as the standard names.

In this embodiment, manual annotation labels every character in the address expression data. For example, for "广东省深圳市" ("Guangdong Province, Shenzhen City"), each character can be labeled "province province province city city city". The model training data can be organized in the following format: "Guangdong Province/province Shenzhen City/city Bao'an District/district Xixiang Street/street Nanchang Second New Village/village X Lane/road No. X/house number", so that every character in the model training data has a corresponding label.
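The "segment/label" annotation format described above can be expanded into the per-character labels the model actually trains on. A sketch, using the Chinese example from the text (省 = province, 市 = city); the delimiter convention is assumed from the format shown:

```python
def expand_annotation(annotated):
    """Turn 'segment/label segment/label ...' annotated text into
    parallel lists of characters and per-character labels."""
    chars, labels = [], []
    for token in annotated.split():
        segment, label = token.rsplit("/", 1)   # split on the LAST '/'
        for ch in segment:
            chars.append(ch)
            labels.append(label)
    return chars, labels
```

Using `rsplit` guards against a "/" appearing inside a segment; every character of a segment inherits that segment's administrative division label.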
103. Train an address parsing model according to the model training data and the preset neural network.
In this embodiment, the preset neural network is a Bi-LSTM-CRF neural network. The Bi-LSTM-CRF comprises three layers: an Embedding layer, a Bi-LSTM layer, and a CRF layer. The Embedding layer maps each character of the input model training data to a vector in a low-dimensional space; such character vectors are distributed representations of the characters in the text and convey semantics to the computer through low-dimensional vectors in that space. The Bi-LSTM layer is a bidirectional long short-term memory network layer comprising two groups of modules, a forward LSTM and a backward LSTM; it can capture long-range contextual dependencies, extract entity features from both the preceding and following text, obtain more spatio-temporal correlations between entities, and suppress, from both directions, the influence of noise such as interfering entities on the neural network model, which greatly assists the mining of long-term dependencies. A conditional random field (CRF) is a discriminative probabilistic model, a type of random field, commonly used to label or analyze sequence data such as natural-language text or biological sequences. A CRF is an undirected graphical model: the vertices represent random variables, and the edges represent dependencies between them; the distribution of the label variable Y is a conditional probability given the observation variable X. In principle, the graph layout of a CRF can be chosen arbitrarily, but the commonly used layout is the linear-chain architecture, for which efficient algorithms exist for training, inference, and decoding. The advantage of the Bi-LSTM is that it remembers contextual information, which greatly assists the mining of long-term dependencies and helps semantic understanding; however, if it is used directly for the labeling task, a problem arises: as a sequential model, its output is made per character and is therefore only a locally optimal solution. A CRF, on the other hand, places high demands on its feature templates: only comprehensive templates let the model learn enough contextual information, and in practice template coverage is often incomplete. The Bi-LSTM can capture contextual information but needs a decoding model, while the CRF can produce a globally optimal solution but needs contextual features. Therefore, the present application combines the Bi-LSTM and the CRF to build a complete model in which the two complement each other.
104. Obtaining the address text to be recognized uploaded by the user, and inputting the address text to be recognized into the address parsing model, to obtain the administrative division label of each character in the address text to be recognized;
In this embodiment, after the address parsing model is obtained, it can be used to parse and recognize the various address texts to be recognized that are input by users. For example, when a user inputs "重庆巫溪塘坊镇塘坊村一社", the model labels each character in turn as "省 省 区县 区县 乡镇 乡镇 乡镇 村 村 村 其他" (province, province, district/county, district/county, township, township, township, village, village, village, other).
105. Converting the address text to be recognized into standard address text according to the administrative division labels of the characters in the address text to be recognized;
In this embodiment, characters bearing the same label are spliced together to obtain the name of each labeled administrative division. For example, since the characters "重" and "庆" are both labeled "省", the two are spliced to obtain "重庆", and subsequent characters are handled in the same way. After "重庆" is determined to be a "省"-level name, it is matched among the 34 provincial-level administrative regions to determine whether it is a province, an autonomous region, a municipality directly under the central government, or a special administrative region. Since Chongqing is a municipality, the character "市" is appended after "重庆", and matching then proceeds among the 40 districts and counties under Chongqing, and so on. In this way, the address text to be recognized, "重庆巫溪塘坊镇塘坊村一社", is parsed and recognized as the standard address text "重庆市巫溪县塘坊镇塘坊村一社".
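A minimal sketch of the splicing-and-normalization step described above. The tag names, the `MUNICIPALITIES` set, and the suffix rule are illustrative assumptions; a real implementation would match against the full tables of the 34 provincial-level regions and their districts and counties:

```python
from itertools import groupby

# Illustrative only: a tiny stand-in for the real provincial-level lookup tables.
MUNICIPALITIES = {"重庆", "北京", "上海", "天津"}  # 直辖市 (municipalities)

def splice(chars_tags):
    """Group consecutive characters that share a tag into (name, tag) spans."""
    spans = []
    for tag, grp in groupby(chars_tags, key=lambda ct: ct[1]):
        spans.append(("".join(ch for ch, _ in grp), tag))
    return spans

def normalize(spans):
    """Append the '市' suffix for municipalities; a real system would also
    normalize districts/counties, townships, etc."""
    out = []
    for name, tag in spans:
        if tag == "省" and name in MUNICIPALITIES and not name.endswith("市"):
            name += "市"  # e.g. 重庆 -> 重庆市
        out.append(name)
    return "".join(out)

tags = [("重", "省"), ("庆", "省"), ("巫", "区县"), ("溪", "区县")]
# splice(tags) -> [("重庆", "省"), ("巫溪", "区县")]
```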
In this embodiment, original address data is crawled from a preset data source using a web crawler tool; address expression data whose character length lies within a preset length interval is filtered out of the original address data and labeled to obtain model training data; an address parsing model is trained from the model training data and a preset neural network; the address text to be recognized uploaded by the user is obtained and input into the address parsing model to obtain the administrative division label of each character in the address text to be recognized; and the address text to be recognized is converted into standard address text according to the administrative division labels of its characters. With this method, the computer can extract semantic features of the entire address and take into account the division results of the preceding and following characters, thereby achieving multi-level administrative division parsing of non-standardized addresses. Compared with existing address parsing algorithms, this scheme does not depend on address regularity, feature words, or an address dictionary, and can therefore handle diverse non-standard expressions. The deep-model-based method can also learn naming and segmentation regularities from existing data and apply them during model inference, improving the parsing of non-standard Chinese addresses so that such Chinese address information can be used directly by computers for location-based services. In addition, the present application relates to blockchain technology, and the original address data may be stored in a blockchain.
Referring to FIG. 2, a second embodiment of the address information parsing method in the embodiments of the present application includes:
201. Crawling original address data from a preset data source using a web crawler tool;
202. Filtering out, from the original address data, address expression data whose character length is within a preset length interval, and labeling the address expression data to obtain model training data;
Steps 201-202 of this embodiment are similar to steps 101-102 of the first embodiment and are not repeated here.
203. Converting each character in the model training data into a one-hot vector;
204. Converting the one-hot vectors of the model training data into low-dimensional dense character vectors through a pre-trained vector matrix;
In this embodiment, the one-hot code is a one-hot vector. In the process of converting each character of the model training data into a character vector, each character must first be converted into a one-hot vector, because the Embedding layer is a fully connected layer that takes the one-hot vector as input and whose number of intermediate nodes equals the character-vector dimension. The one-hot vector is converted into a low-dimensional dense character vector through the pre-trained vector matrix, which solves the problems of the lexical gap and the curse of dimensionality.
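The equivalence between a one-hot input to a fully connected layer and a row lookup in the vector matrix can be checked with a small sketch (dimensions and values are illustrative):

```python
import numpy as np

# Multiplying a one-hot vector by the pre-trained vector matrix is equivalent
# to selecting one row of it, which is why the Embedding layer yields a
# low-dimensional dense character vector.
vocab_size, dim = 5, 3
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, dim))  # stand-in for the pre-trained matrix

one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0                 # one-hot code of the 3rd character

dense = one_hot @ E              # dense character vector
assert np.allclose(dense, E[2])  # identical to looking up row 2
```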
205. Using the character vectors as the input of each time step of the bidirectional long short-term memory network layer of the neural network, to obtain the hidden state sequence output by the forward long short-term memory network and the hidden state sequence output by the backward long short-term memory network;
206. Splicing the hidden state sequence output by the forward long short-term memory network with the hidden state sequence output by the backward long short-term memory network, to obtain the complete hidden output sequence;
In this embodiment, the encoding performed by the Bi-LSTM layer includes the following: the Bi-LSTM layer automatically extracts sentence features, taking the character-embedding sequence (x_1, x_2, x_3, …, x_n) of a sentence as the input of the Bi-LSTM at each time step; the hidden state sequence (h→_1, h→_2, h→_3, …, h→_n) output by the forward LSTM is then spliced, position by position, with the hidden states (h←_1, h←_2, h←_3, …, h←_n) output by the backward LSTM at each position, giving the complete hidden output sequence h_t = [h→_t; h←_t]. The output of the Bi-LSTM layer is a score for every candidate label of each character, and finally the label with the highest score is selected as the label of that character.
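The position-wise splicing of the two hidden state sequences can be sketched as follows (shapes and values are illustrative stand-ins for real LSTM outputs):

```python
import numpy as np

# Position-wise concatenation of forward and backward hidden states, as
# described for the Bi-LSTM layer. Shapes: n positions, `hidden` units each.
n, hidden = 4, 2
fwd = np.arange(n * hidden).reshape(n, hidden)        # (h1->, ..., hn->)
bwd = np.arange(n * hidden).reshape(n, hidden) + 100  # (h1<-, ..., hn<-)

H = np.concatenate([fwd, bwd], axis=1)  # complete hidden output sequence
# H[t] = [ht-> ; ht<-], so each position carries context from both directions
```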
207. Inputting the hidden output sequence into the conditional random field layer of the neural network, to predict the label of each character in the model training data;
208. Obtaining the administrative division sequence of the model training data according to the labels of the characters in the model training data;
In this embodiment, after the conditional random field layer has predicted the label of each character in the model training data, the administrative division labels of the characters are concatenated to obtain the administrative division sequence. For example, after the conditional random field layer labels each character of the model training data "上海省上海市上海静安嘉里中心", the administrative division sequence "省 省 省 市 市 市 建筑 建筑 建筑 建筑 建筑 建筑 建筑 建筑" is obtained.
209. Determining whether at least two administrative division label segments of the same label type appear in the administrative division sequence, where an administrative division label segment is a segment formed by consecutive identical administrative division labels;
210. If so, comparing the positions, within the administrative division sequence, of the administrative division label segments of the same label type, and re-predicting the administrative division labels in the later-positioned segment among the segments of the same label type;
In this embodiment, the conditional random field layer may mispredict the label of a character in the model training data. For example, labeling each character of the model training data "上海省上海市上海静安嘉里中心" may yield the administrative division sequence "省 省 省 市 市 市 省 省 建筑 建筑 建筑 建筑 建筑 建筑", in which two segments of the same label type appear: "省 省 省" and "省 省". Obviously, two separated segments bearing the same administrative division label cannot occur within one address, so the characters in the later-positioned segment must be re-predicted.
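A sketch of the duplicate-segment check described above, under the assumption that the labels are given as a simple list of tag strings (the function name is hypothetical):

```python
from itertools import groupby

def repeated_segment_types(tags):
    """Return the division types that appear in two or more separated segments,
    e.g. the two '省' segments in '省 省 省 市 市 市 省 省 ...'."""
    seg_types = [tag for tag, _ in groupby(tags)]  # one entry per segment
    seen, repeats = set(), set()
    for t in seg_types:
        if t in seen:
            repeats.add(t)  # this type already formed an earlier segment
        seen.add(t)
    return repeats

seq = ["省", "省", "省", "市", "市", "市", "省", "省", "建筑"]
# repeated_segment_types(seq) -> {"省"}: the later "省 省" segment is re-predicted
```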
211. Comparing the labels predicted by the conditional random field layer for the characters of the model training data with the original labels of the model training data, and iterating, to obtain the final pre-trained address parsing model;
212. Obtaining the address text to be recognized uploaded by the user, and inputting the address text to be recognized into the address parsing model, to obtain the administrative division label of each character in the address text to be recognized;
213. Converting the address text to be recognized into standard address text according to the administrative division labels of the characters in the address text to be recognized.
Steps 212-213 of this embodiment are similar to steps 104-105 of the first embodiment and are not repeated here.
On the basis of the previous embodiment, this embodiment describes in detail the process of training the address parsing model from the model training data and the preset neural network: the model training data is input into the embedding layer of the neural network, and each character of the model training data is converted into a character vector; the character vectors are used as the input of each time step of the bidirectional long short-term memory network layer of the neural network, to obtain the hidden output sequence of the model training data; and the hidden output sequence is input into the conditional random field layer of the neural network, the label of each character in the model training data is predicted and compared with the original labels of the model training data, and the final pre-trained address parsing model is obtained by iteration. This embodiment also adds a post-processing step after the conditional random field layer predicts the labels: the administrative division sequence of the model training data is obtained from the labels of its characters; it is determined whether at least two administrative division label segments of the same label type appear in the sequence, an administrative division label segment being a segment formed by consecutive identical administrative division labels; if so, the positions of the same-type segments within the sequence are compared, and the administrative division labels in the later-positioned segment are re-predicted.
Referring to FIG. 3, a third embodiment of the address information parsing method in the embodiments of the present application includes:
301. Crawling original address data from a preset data source using a web crawler tool;
302. Filtering out, from the original address data, address expression data whose character length is within a preset length interval, and labeling the address expression data to obtain model training data;
303. Inputting the model training data into the embedding layer of the neural network, and converting each character in the model training data into a character vector;
304. Using the character vectors as the input of each time step of the bidirectional long short-term memory network layer of the neural network, to obtain the hidden output sequence of the model training data;
305. Inputting the hidden output sequence into the conditional random field layer of the neural network, to predict the label of each character in the model training data;
306. Obtaining the administrative division sequence of the model training data according to the labels of the characters in the model training data;
307. Determining, according to the order in which the administrative division labels are arranged in the administrative division sequence, whether the administrative division sequence contains an error;
308. If so, re-predicting the administrative division labels of the characters in the administrative division sequence;
In this embodiment, the output of the CRF layer requires post-processing, which includes splicing the characters bearing adjacent identical administrative division labels, and erroneous predictions may occur. For example, for "上海省上海市上海静安嘉里中心", each character might be labeled "省 省 省 市 市 市 省 省 建筑 建筑 建筑 建筑 建筑 建筑". According to the order of administrative divisions in a normal address, "省" should precede "市"; but in this label sequence "省" appears after "市", so the labels predicted by the conditional random field layer for the characters of the model training data contain an error and must be re-predicted.
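The ordering check can be sketched as follows; the rank table is a simplified assumption, not the application's full administrative hierarchy:

```python
# Coarse-to-fine ranks for division labels (illustrative simplification).
RANK = {"省": 0, "市": 1, "区县": 2, "乡镇": 3, "村": 4, "建筑": 5}

def order_error(tags):
    """Return True if a coarser label (e.g. 省) appears after a finer one
    (e.g. 市), violating the normal order of an address."""
    ranked = [RANK[t] for t in tags if t in RANK]
    return any(a > b for a, b in zip(ranked, ranked[1:]))

bad = ["省", "省", "市", "市", "省", "省", "建筑"]
# order_error(bad) -> True: "省" reappears after "市", so re-prediction is needed
```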
309. Comparing the labels predicted by the conditional random field layer for the characters of the model training data with the original labels of the model training data, and iterating, to obtain the final pre-trained address parsing model;
In this embodiment, each character of the model training data must first be converted into a one-hot vector, and the one-hot vector is then converted into character-vector form; a character vector is a distributed representation of a character in the text that conveys semantics to the computer through a low-dimensional vector in space. After the model training data has been passed through the Embedding layer of the neural network and output in character-vector form, the character vectors are input into the Bi-LSTM layer. The Bi-LSTM neural network is suited to sequence labeling tasks: it performs the same operation on every character vector in the input sequence. The operation here is matrix multiplication, which linearly maps a high-dimensional matrix (e.g. 300 dimensions) to a low-dimensional one (e.g. 128 dimensions); each dimension of the matrix represents a feature, so this operation can remove useless features. Each step of the computation depends on the result of the previous step while encoding contextual features; concretely, the result of the previous step (here, a feature) serves as part of the input of the next step. For example, if the previous step extracts the feature h_{t-1} for the character "上", the next step extracts the feature for the character "海" as f(x_t, h_{t-1}) = h_t, where x_t is the feature of the character "海" itself, f is the function used in the computation, and h_t is the feature finally extracted for "海". Thus, when the features of the current step are extracted, the features of the previous step also take part in the computation, which is what is meant by "encoding the preceding context"; a similar operation handles the following context. The features encoded with context are output as the full set of features extracted for each character. The output of the Bi-LSTM layer is then used as the input of the CRF layer. The features output by the Bi-LSTM do not consider the influence of the previous label on the current label: for example, if the current character is "巫" and the two preceding characters "重庆" form a city name, then "巫溪" is very likely a district/county name or a township name. Therefore a CRF layer (conditional random field) is appended to the output layer of the Bi-LSTM, so that the output sequence of the Bi-LSTM becomes the observation sequence of the CRF layer; the CRF then computes the probabilistically optimal solution for the entire sequence, taking into account the interactions between the sequence labels. The output label sequence of the CRF corresponds character by character to the input address.
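The global decoding that the CRF layer performs can be illustrated with a standard Viterbi sketch over illustrative emission and transition scores (a generic CRF decoding example, not the application's exact implementation):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Find the label sequence maximizing the total emission + transition score.
    emissions: (n, k) per-character label scores (as output by the Bi-LSTM);
    transitions: (k, k) score of moving from label i to label j."""
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)   # best previous label for each label
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two labels (0 = "省", 1 = "市"); transitions strongly discourage 市 -> 省,
# so the globally optimal path differs from greedy per-character choices.
em = np.array([[2.0, 0.0], [1.1, 1.0], [0.0, 2.0]])
tr = np.array([[0.5, 0.0], [-5.0, 0.5]])
# viterbi(em, tr) -> [0, 0, 1]
```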
310. Obtaining the address text to be recognized uploaded by the user, and inputting the address text to be recognized into the address parsing model, to obtain the administrative division label of each character in the address text to be recognized;
311. Converting the address text to be recognized into standard address text according to the administrative division labels of the characters in the address text to be recognized.
On the basis of the preceding embodiments, this embodiment adds a process of error checking for the labels that the conditional random field layer predicts for the characters of the model training data: the administrative division sequence of the model training data is obtained from the labels of its characters; according to the order in which the administrative division labels are arranged in the sequence, it is determined whether the sequence contains an error; and if so, the administrative division labels of the characters in the sequence are re-predicted. With this method, errors in the labels predicted by the conditional random field layer for the characters of the model training data can be corrected, improving the efficiency of model training.
Referring to FIG. 4, a fourth embodiment of the address information parsing method in the embodiments of the present application includes:
401. Crawling original address data from a preset data source using a web crawler tool;
402. Filtering out, from the original address data, address expression data whose character length is within a preset length interval, and labeling the address expression data to obtain model training data;
403. Training an address parsing model from the model training data and a preset neural network;
404. Obtaining the address text to be recognized uploaded by the user, and inputting the address text to be recognized into the address parsing model, to obtain the administrative division label of each character in the address text to be recognized;
Steps 401-404 of this embodiment are similar to steps 101-104 of the first embodiment and are not repeated here.
405. Creating an initially empty character buffer, and processing each character of the address text to be recognized in the character order of that text;
406. Storing the first character of the address text to be recognized in the character buffer, and determining the administrative division label of the first character;
407. Determining whether the administrative division label of the first character is the same as the administrative division label of the second character;
408. If they are the same, storing the second character in the character buffer;
409. If they are not the same, outputting the first character, emptying the character buffer, and proceeding to process the next character;
410. Splicing the characters output from the character buffer that bear the same administrative division label;
In this embodiment, an initially empty character buffer is provided, and the labeled characters of the address text to be recognized are stored in the buffer in the order of the text itself. Take "重庆巫溪塘坊镇塘坊村一社" as an example: "重" is first placed in the buffer, and it is determined whether "重" and "庆" bear the same administrative division label. Since "重" and "庆" are both labeled "省", "庆" is stored in the buffer, and it is then determined whether "庆" and "巫" bear the same label. "巫" is labeled "区县", which differs from "庆", so the two characters "重" and "庆" are taken out of the buffer and spliced into "重庆". By processing each character in this way, "重庆巫溪塘坊镇塘坊村一社" is divided into "重庆", "巫溪", "塘坊镇", "塘坊村" and "一社", a division that facilitates the subsequent conversion of the address text to be recognized into standard address text.
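The buffer procedure described above can be sketched directly (tag values are illustrative):

```python
# Buffer-based splicing: characters bearing the same administrative-division
# label are accumulated in an initially empty buffer, which is flushed
# whenever the label changes.
def splice_with_buffer(chars_tags):
    pieces, buffer, current = [], [], None
    for ch, tag in chars_tags:
        if current is None or tag == current:
            buffer.append(ch)               # same label: keep accumulating
        else:
            pieces.append("".join(buffer))  # label changed: flush the buffer
            buffer = [ch]
        current = tag
    if buffer:
        pieces.append("".join(buffer))      # flush the final segment
    return pieces

tags = [("重", "省"), ("庆", "省"), ("巫", "区县"), ("溪", "区县"),
        ("塘", "乡镇"), ("坊", "乡镇"), ("镇", "乡镇")]
# splice_with_buffer(tags) -> ["重庆", "巫溪", "塘坊镇"]
```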
411. Converting the address text to be recognized into standard address text according to the administrative division labels of the characters in the address text to be recognized.
On the basis of the preceding embodiments, this embodiment adds a process of splicing consecutive characters of the address text to be recognized that bear the same administrative division label: an initially empty character buffer is created, and each character of the address text to be recognized is processed in the character order of the text; the first character of the address text is stored in the buffer and its administrative division label is determined; it is determined whether the label of the first character is the same as that of the second character; if so, the second character is stored in the buffer; if not, the first character is output, the buffer is emptied, and the next character is processed; finally, the characters output from the buffer that bear the same administrative division label are spliced. Splicing consecutive identically labeled characters in this way facilitates the subsequent conversion of the address text to be recognized into standard address text.
The address information parsing method in the embodiments of the present application has been described above; the address information parsing apparatus in the embodiments of the present application is described below. Referring to FIG. 5, one embodiment of the address information parsing apparatus in the embodiments of the present application includes:
a data crawling module 501, configured to crawl original address data from a preset data source using a web crawler tool;
a filtering module 502, configured to filter out, from the original address data, address expression data whose character length is within a preset length interval, and to label the address expression data to obtain model training data;
a model training module 503, configured to train an address parsing model from the model training data and a preset neural network;
a model input module 504, configured to obtain the address text to be recognized uploaded by the user, and to input the address text to be recognized into the address parsing model, obtaining the administrative division label of each character in the address text to be recognized;
a standard conversion module 505, configured to convert the address text to be recognized into standard address text according to the administrative division labels of the characters in the address text to be recognized.
It should be emphasized that, to protect the privacy and security of the data, the above address text to be recognized may be stored in a node of a blockchain.
本申请实施例中,所述地址信息解析装置运行上述地址信息解析方法,所述地址信息解析方法包括:利用网页爬虫工具从预设的数据源中爬取原始地址数据;从所述原始地址数据中筛选出字符长度在预设长度区间内的地址表述数据,并对所述地址表述数据进行标注,得到模型训练数据;根据所述模型训练数据和预设的神经网络,训练得到地址解析模型;获取用户上传的待识别地址文本,并将所述待识别地址文本输入至所述地址解析模型中,获得所述待识别地址文本中各字符的行政区划标注;根据所述待识别地址文本中各字符的行政区划标注,将所述待识别地址文本转化为标准地址文本。通过本方法,可使计算机抽取整个地址的语义特征,并考虑前后字符行政区划的划分结果,实现非规范化地址的多级行政区划解析。相比现有的地址解析算法,此方案不依赖于地址规范性、特征字以及地址词典,因此可处理多样化的非规范表达。基于深度模型的方法还可学习到已有数据中的命名与切分规律,并应用于模型推断,可提升非规范的中文地址解析效果,使得这样的中文地址信息能够被计算机直接用于位置服务。此外,本申请还涉及区块链技术,原始地址数据可存储于区块链中。In the embodiment of the present application, the address information parsing apparatus runs the address information parsing method, and the address information parsing method includes: using a web crawler tool to crawl original address data from a preset data source; Screening out address expression data whose character length is within a preset length interval, and marking the address expression data to obtain model training data; and training to obtain an address parsing model according to the model training data and a preset neural network; Obtain the address text to be recognized uploaded by the user, input the address text to be recognized into the address resolution model, and obtain the administrative division labeling of each character in the address text to be recognized; The administrative division labeling of the characters converts the address text to be recognized into standard address text. Through the method, the computer can extract the semantic features of the entire address, and consider the division results of the administrative divisions of the characters before and after, so as to realize the multi-level administrative division analysis of the non-standardized address. Compared with the existing address resolution algorithms, this scheme does not rely on address canonicality, feature words and address dictionaries, so it can handle diverse non-canonical expressions. 
A deep-model-based method can also learn the naming and segmentation patterns in existing data and apply them during model inference, improving the parsing of non-standard Chinese addresses so that such Chinese address information can be used directly by computers for location services. In addition, this application also relates to blockchain technology, and the original address data can be stored in a blockchain.
请参阅图6,本申请实施例中地址信息解析装置的第二个实施例包括:Referring to FIG. 6, the second embodiment of the address information parsing apparatus in the embodiment of the present application includes:
数据爬取模块501,用于利用网页爬虫工具从预设的数据源中爬取原始地址数据;A data crawling module 501 is used to crawl original address data from a preset data source by using a web crawler tool;
筛选模块502,用于从所述原始地址数据中筛选出字符长度在预设长度区间内的地址表述数据,并对所述地址表述数据进行标注,得到模型训练数据;A screening module 502, configured to screen out address expression data whose character length is within a preset length interval from the original address data, and annotate the address expression data to obtain model training data;
模型训练模块503,用于根据所述模型训练数据和预设的神经网络,训练得到地址解析模型;A model training module 503, configured to obtain an address parsing model by training according to the model training data and a preset neural network;
模型输入模块504,用于获取用户上传的待识别地址文本,并将所述待识别地址文本输入至所述地址解析模型中,获得所述待识别地址文本中各字符的行政区划标注;A model input module 504, configured to acquire the address text to be recognized uploaded by the user, input the address text to be recognized into the address resolution model, and obtain the administrative division labeling of each character in the address text to be recognized;
标准转化模块505,用于根据所述待识别地址文本中各字符的行政区划标注,将所述待识别地址文本转化为标准地址文本。The standard conversion module 505 is configured to convert the to-be-recognized address text into standard address text according to the administrative division labeling of each character in the to-be-recognized address text.
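Among the modules above, the length filter applied by the screening module 502 can be sketched minimally as follows; the interval bounds `min_len` and `max_len` are assumed example values, not figures taken from the application:

```python
def screen_addresses(raw_addresses, min_len=6, max_len=50):
    """Keep only address strings whose character length lies within
    the preset length interval [min_len, max_len]."""
    return [addr for addr in raw_addresses
            if min_len <= len(addr) <= max_len]

# Too-short and too-long entries are dropped; the rest become
# annotation candidates (model training data after labeling).
raw = ["广东省深圳市南山区科技园", "深圳", "x" * 200]
print(screen_addresses(raw))  # ['广东省深圳市南山区科技园']
```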
其中,所述模型训练模块503包括:Wherein, the model training module 503 includes:
向量转化单元5031,用于将所述模型训练数据输入至所述神经网络中的嵌入层中,将所述模型训练数据中的每个字符转化为字向量; Vector conversion unit 5031, for inputting the model training data into the embedding layer in the neural network, and converting each character in the model training data into a word vector;
序列单元5032,用于将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入,得到所述模型训练数据的隐输出序列;The sequence unit 5032 is used to input the word vector as the input of each time step of the bidirectional long short-term memory network layer in the neural network to obtain the implicit output sequence of the model training data;
标注预测单元5033，用于将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注，并与所述模型训练数据原有的标注进行比对和迭代，得到最终预训练的地址解析模型。The label prediction unit 5033 is configured to input the latent output sequence into the conditional random field layer of the neural network, predict the label of each character in the model training data, compare the predictions against the original labels of the model training data, and iterate, to obtain the final pre-trained address parsing model.
可选的,所述向量转化单元5031具体用于:Optionally, the vector conversion unit 5031 is specifically used for:
将所述模型训练数据中的每个字符转化为独热码向量；converting each character in the model training data into a one-hot vector;
将所述模型训练数据的独热码向量通过预训练好的向量矩阵转化为低维稠密的字向量。The one-hot code vector of the model training data is converted into a low-dimensional dense word vector through a pre-trained vector matrix.
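The two steps above (one-hot encoding, then projection through a pre-trained vector matrix) can be sketched as follows with toy sizes; since each row is one-hot, the matrix product reduces to looking up one row of the matrix per character, which is how embedding layers are implemented in practice:

```python
def chars_to_dense(char_ids, vocab_size, embed_matrix):
    """One-hot encode each character index, then project the one-hot
    vectors through a (pre-trained) vector matrix to obtain
    low-dimensional dense word vectors."""
    dense = []
    for idx in char_ids:
        one_hot = [1.0 if j == idx else 0.0 for j in range(vocab_size)]
        row = [sum(one_hot[j] * embed_matrix[j][d] for j in range(vocab_size))
               for d in range(len(embed_matrix[0]))]
        dense.append(row)
    return dense

# Toy 5-character vocabulary with 2-dimensional embeddings.
embed_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
print(chars_to_dense([2, 0], 5, embed_matrix))  # [[0.5, 0.6], [0.1, 0.2]]
```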
可选的,所述序列单元5032具体用于:Optionally, the sequence unit 5032 is specifically used for:
将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入得到正向长短期记忆网络输出的隐状态序列和反向长短期记忆网络输出的隐状态序列；Using the word vectors as the input at each time step of the bidirectional long short-term memory network layer in the neural network to obtain the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM;
将所述正向长短期记忆网络输出的隐状态序列和所述反向长短期记忆网络输出的隐状态序列进行拼接，得到完整的隐输出序列。The hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are concatenated to obtain the complete hidden output sequence.
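The splicing step above can be sketched schematically (the numbers below are stand-ins, not real LSTM outputs): at each time step, the forward hidden state is concatenated with the backward hidden state after the backward sequence has been re-aligned to the original character order:

```python
def splice_hidden_sequences(h_forward, h_backward):
    """Concatenate the forward-LSTM and backward-LSTM hidden states
    per time step to form the complete latent output sequence."""
    assert len(h_forward) == len(h_backward)
    return [hf + hb for hf, hb in zip(h_forward, h_backward)]

# Stand-in 2-dimensional hidden states for a 3-character input.
h_fwd = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
h_bwd = [[0.9, 0.8], [0.7, 0.6], [0.5, 0.4]]
print(splice_hidden_sequences(h_fwd, h_bwd))
# [[0.1, 0.2, 0.9, 0.8], [0.3, 0.4, 0.7, 0.6], [0.5, 0.6, 0.5, 0.4]]
```

Each spliced vector thus carries context from both directions of the address text.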
可选的,所述模型训练模块还包括第一重测单元5034,所述第一重测单元5034具体用于:Optionally, the model training module further includes a first re-measurement unit 5034, and the first re-measurement unit 5034 is specifically used for:
根据所述模型训练数据中各字符的标注,获得所述模型训练数据的行政区划序列;Obtain the administrative division sequence of the model training data according to the labeling of each character in the model training data;
判断所述行政区划序列中,是否出现标注类型相同的至少两段行政区划标注片段,其中,所述行政区划片段为连续相同的行政区划标注构成的片段;Judging whether there are at least two administrative division annotation fragments with the same annotation type in the sequence of administrative divisions, wherein the administrative division fragments are fragments composed of consecutive and identical administrative division annotations;
若是，则比较标注类型相同的行政区划标注片段在所述行政区划序列中的位置，并对标注类型相同的行政区划标注片段中位置靠后的行政区划标注片段中的行政区划标注进行重新预测。If so, compare the positions, within the administrative division sequence, of the administrative division annotation segments that share the same annotation type, and re-predict the administrative division labels in the later-positioned segment among those segments.
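One way to implement this duplicate-segment check is sketched below; the label names and the convention of flagging the later run for re-prediction are assumptions made for illustration. The per-character labels are first grouped into contiguous runs, then any run whose division type already occurred earlier is returned:

```python
from itertools import groupby

def segments_to_repredict(labels):
    """Group per-character division labels into contiguous segments and
    return (label, start, end) for any later segment whose type already
    appeared earlier, i.e. candidates for re-prediction."""
    segments, pos = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, pos, pos + length))  # [start, end)
        pos += length
    seen, flagged = set(), []
    for label, start, end in segments:
        if label in seen:
            flagged.append((label, start, end))
        seen.add(label)
    return flagged

# Two "PROV" runs: the later one (starting at position 4) is flagged.
print(segments_to_repredict(["PROV", "PROV", "CITY", "CITY", "PROV"]))
# [('PROV', 4, 5)]
```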
可选的,所述模型训练模块还包括第二重测单元5035,所述第二重测单元5035具体用于:Optionally, the model training module further includes a second re-measurement unit 5035, and the second re-measurement unit 5035 is specifically used for:
根据所述模型训练数据中各字符的标注,获得所述模型训练数据的行政区划序列;Obtain the administrative division sequence of the model training data according to the labeling of each character in the model training data;
根据所述行政区划序列中行政区划标注的排列顺序,判断所述行政区划序列是否存在错误;According to the arrangement order of the administrative division labels in the administrative division sequence, determine whether there is an error in the administrative division sequence;
若是,则对所述行政区划序列中字符的行政区划标注进行重新预测。If so, re-predict the administrative division labels of the characters in the administrative division sequence.
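A simple version of this order check can be written by ranking each division type in an assumed province→city→district→street hierarchy and flagging any jump back up the hierarchy; the label names and ranks are illustrative only:

```python
# Assumed hierarchy ranks; the label names are illustrative.
LEVEL = {"PROV": 0, "CITY": 1, "DIST": 2, "TOWN": 3}

def has_order_error(segment_labels):
    """Return True if the division segments ever jump back up the
    hierarchy (e.g. a province segment after a city segment), in
    which case the labels should be re-predicted."""
    ranks = [LEVEL[label] for label in segment_labels]
    return any(a > b for a, b in zip(ranks, ranks[1:]))

print(has_order_error(["PROV", "CITY", "DIST"]))  # False
print(has_order_error(["CITY", "PROV", "DIST"]))  # True
```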
其中,所述地址信息解析装置还包括字符连接模块506,所述字符连接模块506具体用于:Wherein, the address information parsing device further includes a character connection module 506, and the character connection module 506 is specifically used for:
建立初始为空的字符缓存区,按照所述待识别地址文本的字符顺序处理所述待识别地址文本中的每个字符;establishing an initially empty character buffer area, and processing each character in the address text to be recognized according to the character sequence of the address text to be recognized;
将所述待识别地址文本的第一字符存入所述字符缓存区,并确定所述第一字符的行政区划标注;storing the first character of the address text to be recognized into the character buffer area, and determining the administrative division label of the first character;
判断所述第一字符的行政区划标注与第二字符的行政区划标注是否相同;Determine whether the administrative division labeling of the first character is the same as the administrative division labeling of the second character;
若相同,则将所述第二字符存入所述字符缓存区;If the same, the second character is stored in the character buffer;
若不相同,则将所述第一字符输出,并清空所述字符缓存区,并进行下一字符的处理;If not, output the first character, clear the character buffer, and process the next character;
将所述字符缓存区输出的相同行政区划标注的字符拼接。The characters marked with the same administrative division output from the character buffer are spliced together.
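The buffer procedure described for the character connection module 506 can be sketched as follows (label names are illustrative): characters are walked in order, accumulated while their division label stays the same, and the buffer is flushed as one spliced piece whenever the label changes, plus once at the end:

```python
def splice_by_division(chars, labels):
    """Accumulate consecutive characters that share a division label in
    an initially empty buffer; flush the buffer as one piece whenever
    the label changes, and once more at the end."""
    assert len(chars) == len(labels)
    pieces, buffer, current = [], [], None
    for ch, lab in zip(chars, labels):
        if current is not None and lab != current:
            pieces.append(("".join(buffer), current))
            buffer = []
        buffer.append(ch)
        current = lab
    if buffer:
        pieces.append(("".join(buffer), current))
    return pieces

chars  = list("广东省深圳市")
labels = ["PROV", "PROV", "PROV", "CITY", "CITY", "CITY"]
print(splice_by_division(chars, labels))
# [('广东省', 'PROV'), ('深圳市', 'CITY')]
```

The spliced pieces, ordered by division level, can then be joined into the standard address text.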
本实施例在上一实施例的基础上，详细描述了各个模块的具体功能以及部分模块的单元构成，通过本装置，可使计算机抽取整个地址的语义特征，并考虑前后字符行政区划的划分结果，实现非规范化地址的多级行政区划解析。相比现有的地址解析算法，不依赖于地址规范性、特征字以及地址词典，因此可处理多样化的非规范表达。基于深度模型的方法还可学习到已有数据中的命名与切分规律，并应用于模型推断，可提升非规范的中文地址解析效果，使得这样的中文地址信息能够被计算机直接用于位置服务。On the basis of the previous embodiment, this embodiment describes in detail the specific functions of each module and the unit composition of some modules. With this apparatus, a computer can extract the semantic features of the entire address and take into account the administrative division results of the preceding and following characters, thereby realizing multi-level administrative division parsing of non-standardized addresses. Unlike existing address parsing algorithms, it does not depend on address regularity, feature words, or an address dictionary, and can therefore handle diverse non-standard expressions. The deep-model-based method can also learn the naming and segmentation patterns in existing data and apply them during model inference, improving the parsing of non-standard Chinese addresses so that such Chinese address information can be used directly by computers for location services.
上面图5和图6从模块化功能实体的角度对本申请实施例中的地址信息解析装置进行详细描述，下面从硬件处理的角度对本申请实施例中地址信息解析设备进行详细描述。FIG. 5 and FIG. 6 above describe the address information parsing apparatus in the embodiments of the present application in detail from the perspective of modular functional entities; the following describes the address information parsing device in the embodiments of the present application in detail from the perspective of hardware processing.
图7是本申请实施例提供的一种地址信息解析设备的结构示意图，该地址信息解析设备700可因配置或性能不同而产生比较大的差异，可以包括一个或一个以上处理器(central processing units,CPU)710(例如，一个或一个以上处理器)和存储器720，一个或一个以上存储应用程序733或数据732的存储介质730(例如一个或一个以上海量存储设备)。其中，存储器720和存储介质730可以是短暂存储或持久存储。存储在存储介质730的程序可以包括一个或一个以上模块(图示没标出)，每个模块可以包括对地址信息解析设备700中的一系列指令操作。更进一步地，处理器710可以设置为与存储介质730通信，在地址信息解析设备700上执行存储介质730中的一系列指令操作，以实现上述地址信息解析方法的步骤。FIG. 7 is a schematic structural diagram of an address information parsing device provided by an embodiment of the present application. The address information parsing device 700 may vary greatly in configuration or performance, and may include one or more central processing units (CPUs) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing application programs 733 or data 732. The memory 720 and the storage media 730 may provide transient or persistent storage. A program stored on a storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the address information parsing device 700. Further, the processor 710 may be configured to communicate with the storage medium 730 and execute, on the address information parsing device 700, the series of instruction operations in the storage medium 730 to implement the steps of the above address information parsing method.
地址信息解析设备700还可以包括一个或一个以上电源740，一个或一个以上有线或无线网络接口750，一个或一个以上输入输出接口760，和/或，一个或一个以上操作系统731，例如Windows Server,Mac OS X,Unix,Linux,FreeBSD等等。本领域技术人员可以理解，图7示出的地址信息解析设备结构并不构成对本申请提供的地址信息解析设备的限定，可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置。The address information parsing device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input/output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on. Those skilled in the art will understand that the structure shown in FIG. 7 does not limit the address information parsing device provided by this application, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain)，本质上是一个去中心化的数据库，是一串使用密码学方法相关联产生的数据块，每一个数据块中包含了一批次网络交易的信息，用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
本申请还提供一种计算机可读存储介质,该计算机可读存储介质可以为非易失性计算机可读存储介质,该计算机可读存储介质也可以为易失性计算机可读存储介质,所述计算机可读存储介质中存储有指令,当所述指令在计算机上运行时,使得计算机执行所述地址信息解析方法的步骤。The present application also provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium may also be a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer, make the computer execute the steps of the address information parsing method.
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统或装置、单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the above-described system, device, and unit may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
以上所述，以上实施例仅用以说明本申请的技术方案，而非对其限制；尽管参照前述实施例对本申请进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or equivalently replace some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. 一种地址信息解析方法,其中,所述地址信息解析方法包括:An address information parsing method, wherein the address information parsing method comprises:
    利用网页爬虫工具从预设的数据源中爬取原始地址数据;Use web crawler tools to crawl original address data from preset data sources;
    从所述原始地址数据中筛选出字符长度在预设长度区间内的地址表述数据,并对所述地址表述数据进行标注,得到模型训练数据;Filter out address expression data whose character length is within a preset length interval from the original address data, and annotate the address expression data to obtain model training data;
    根据所述模型训练数据和预设的神经网络,训练得到地址解析模型;According to the model training data and the preset neural network, an address resolution model is obtained by training;
    获取用户上传的待识别地址文本,并将所述待识别地址文本输入至所述地址解析模型中,获得所述待识别地址文本中各字符的行政区划标注;Obtaining the address text to be recognized uploaded by the user, inputting the address text to be recognized into the address resolution model, and obtaining the administrative division labels of each character in the address text to be recognized;
    根据所述待识别地址文本中各字符的行政区划标注,将所述待识别地址文本转化为标准地址文本。The to-be-recognized address text is converted into a standard address text according to the administrative division labeling of each character in the to-be-recognized address text.
  2. 根据权利要求1所述的地址信息解析方法,其中,所述根据所述模型训练数据和预设的神经网络,训练得到地址解析模型包括:The address information parsing method according to claim 1, wherein the obtaining an address parsing model by training according to the model training data and a preset neural network comprises:
    将所述模型训练数据输入至所述神经网络中的嵌入层中,将所述模型训练数据中的每个字符转化为字向量;The model training data is input into the embedding layer in the neural network, and each character in the model training data is converted into a word vector;
    将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入,得到所述模型训练数据的隐输出序列;The word vector input is used as the input of each time step of the bidirectional long short-term memory network layer in the neural network to obtain the latent output sequence of the model training data;
    将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注，并与所述模型训练数据原有的标注进行比对和迭代，得到最终预训练的地址解析模型。Input the latent output sequence into the conditional random field layer of the neural network, predict the label of each character in the model training data, compare the predictions against the original labels of the model training data, and iterate, to obtain the final pre-trained address parsing model.
  3. 根据权利要求2所述的地址信息解析方法，其中，所述将所述模型训练数据输入至所述神经网络中的嵌入层中，将所述模型训练数据中的每个字符转化为字向量包括：The address information parsing method according to claim 2, wherein inputting the model training data into the embedding layer of the neural network and converting each character in the model training data into a word vector comprises:
    将所述模型训练数据中的每个字符转化为独热码向量；converting each character in the model training data into a one-hot vector;
    将所述模型训练数据的独热码向量通过预训练好的向量矩阵转化为低维稠密的字向量。The one-hot code vector of the model training data is converted into a low-dimensional dense word vector through a pre-trained vector matrix.
  4. 根据权利要求3所述的地址信息解析方法，其中，所述将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入，得到所述模型训练数据的隐输出序列包括：The address information parsing method according to claim 3, wherein using the word vectors as the input at each time step of the bidirectional long short-term memory network layer in the neural network to obtain the latent output sequence of the model training data comprises:
    将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入得到正向长短期记忆网络输出的隐状态序列和反向长短期记忆网络输出的隐状态序列；Using the word vectors as the input at each time step of the bidirectional long short-term memory network layer in the neural network to obtain the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM;
    将所述正向长短期记忆网络输出的隐状态序列和所述反向长短期记忆网络输出的隐状态序列进行拼接，得到完整的隐输出序列。The hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are concatenated to obtain the complete hidden output sequence.
  5. 根据权利要求4所述的地址信息解析方法，其中，在所述将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注之后，还包括：The address information parsing method according to claim 4, wherein after inputting the latent output sequence into the conditional random field layer of the neural network and predicting the label of each character in the model training data, the method further comprises:
    根据所述模型训练数据中各字符的标注,获得所述模型训练数据的行政区划序列;Obtain the administrative division sequence of the model training data according to the labeling of each character in the model training data;
    判断所述行政区划序列中,是否出现标注类型相同的至少两段行政区划标注片段,其中,所述行政区划片段为连续相同的行政区划标注构成的片段;Judging whether there are at least two administrative division annotation fragments with the same annotation type in the sequence of administrative divisions, wherein the administrative division fragments are fragments composed of consecutive and identical administrative division annotations;
    若是，则比较标注类型相同的行政区划标注片段在所述行政区划序列中的位置，并对标注类型相同的行政区划标注片段中位置靠后的行政区划标注片段中的行政区划标注进行重新预测。If so, compare the positions, within the administrative division sequence, of the administrative division annotation segments that share the same annotation type, and re-predict the administrative division labels in the later-positioned segment among those segments.
  6. 根据权利要求4所述的地址信息解析方法，其中，在所述将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注之后，还包括：The address information parsing method according to claim 4, wherein after inputting the latent output sequence into the conditional random field layer of the neural network and predicting the label of each character in the model training data, the method further comprises:
    根据所述模型训练数据中各字符的标注,获得所述模型训练数据的行政区划序列;Obtain the administrative division sequence of the model training data according to the labeling of each character in the model training data;
    根据所述行政区划序列中行政区划标注的排列顺序,判断所述行政区划序列是否存在错误;According to the arrangement order of the administrative division labels in the administrative division sequence, determine whether there is an error in the administrative division sequence;
    若是,则对所述行政区划序列中字符的行政区划标注进行重新预测。If so, re-predict the administrative division labels of the characters in the administrative division sequence.
  7. 根据权利要求1-6中任一项所述的地址信息解析方法，其中，在所述获取用户上传的待识别地址文本，并将所述待识别地址文本输入至所述地址解析模型中，获得所述待识别地址文本中各字符的行政区划标注之后，还包括：The address information parsing method according to any one of claims 1-6, wherein after acquiring the address text to be recognized uploaded by the user, inputting the address text to be recognized into the address parsing model, and obtaining the administrative division label of each character in the address text to be recognized, the method further comprises:
    建立初始为空的字符缓存区,按照所述待识别地址文本的字符顺序处理所述待识别地址文本中的每个字符;establishing an initially empty character buffer area, and processing each character in the address text to be recognized according to the character sequence of the address text to be recognized;
    将所述待识别地址文本的第一字符存入所述字符缓存区,并确定所述第一字符的行政区划标注;storing the first character of the address text to be recognized into the character buffer area, and determining the administrative division label of the first character;
    判断所述第一字符的行政区划标注与第二字符的行政区划标注是否相同;Determine whether the administrative division labeling of the first character is the same as the administrative division labeling of the second character;
    若相同,则将所述第二字符存入所述字符缓存区;If the same, the second character is stored in the character buffer;
    若不相同,则将所述第一字符输出,并清空所述字符缓存区,并进行下一字符的处理;If not, output the first character, clear the character buffer, and process the next character;
    将所述字符缓存区输出的相同行政区划标注的字符拼接。The characters marked with the same administrative division output from the character buffer are spliced together.
  8. 一种地址信息解析设备，其中，包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机可读指令，所述处理器执行所述计算机可读指令时实现如下步骤：An address information parsing device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    利用网页爬虫工具从预设的数据源中爬取原始地址数据;Use web crawler tools to crawl original address data from preset data sources;
    从所述原始地址数据中筛选出字符长度在预设长度区间内的地址表述数据,并对所述地址表述数据进行标注,得到模型训练数据;Filter out address expression data whose character length is within a preset length interval from the original address data, and annotate the address expression data to obtain model training data;
    根据所述模型训练数据和预设的神经网络,训练得到地址解析模型;According to the model training data and the preset neural network, an address resolution model is obtained by training;
    获取用户上传的待识别地址文本,并将所述待识别地址文本输入至所述地址解析模型中,获得所述待识别地址文本中各字符的行政区划标注;Obtaining the address text to be recognized uploaded by the user, inputting the address text to be recognized into the address resolution model, and obtaining the administrative division labels of each character in the address text to be recognized;
    根据所述待识别地址文本中各字符的行政区划标注,将所述待识别地址文本转化为标准地址文本。The to-be-recognized address text is converted into a standard address text according to the administrative division labeling of each character in the to-be-recognized address text.
  9. 根据权利要求8所述的地址信息解析设备,其中,所述根据所述模型训练数据和预设的神经网络,训练得到地址解析模型的步骤时,包括:The address information parsing device according to claim 8, wherein the step of obtaining an address parsing model by training according to the model training data and a preset neural network comprises:
    将所述模型训练数据输入至所述神经网络中的嵌入层中,将所述模型训练数据中的每个字符转化为字向量;The model training data is input into the embedding layer in the neural network, and each character in the model training data is converted into a word vector;
    将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入,得到所述模型训练数据的隐输出序列;The word vector input is used as the input of each time step of the bidirectional long short-term memory network layer in the neural network to obtain the latent output sequence of the model training data;
    将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注，并与所述模型训练数据原有的标注进行比对和迭代，得到最终预训练的地址解析模型。Input the latent output sequence into the conditional random field layer of the neural network, predict the label of each character in the model training data, compare the predictions against the original labels of the model training data, and iterate, to obtain the final pre-trained address parsing model.
  10. 根据权利要求9所述的地址信息解析设备，其中，所述处理器执行所述将所述模型训练数据输入至所述神经网络中的嵌入层中，将所述模型训练数据中的每个字符转化为字向量的步骤时，包括：The address information parsing device according to claim 9, wherein when the processor performs the step of inputting the model training data into the embedding layer of the neural network and converting each character in the model training data into a word vector, the step comprises:
    将所述模型训练数据中的每个字符转化为独热码向量；converting each character in the model training data into a one-hot vector;
    将所述模型训练数据的独热码向量通过预训练好的向量矩阵转化为低维稠密的字向量。The one-hot code vector of the model training data is converted into a low-dimensional dense word vector through a pre-trained vector matrix.
  11. 根据权利要求10所述的地址信息解析设备，其中，所述处理器执行所述将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入，得到所述模型训练数据的隐输出序列的步骤时，包括：The address information parsing device according to claim 10, wherein when the processor performs the step of using the word vectors as the input at each time step of the bidirectional long short-term memory network layer in the neural network to obtain the latent output sequence of the model training data, the step comprises:
    将所述字向量输入作为所述神经网络中的双向长短期记忆网络层各个时间步的输入得到正向长短期记忆网络输出的隐状态序列和反向长短期记忆网络输出的隐状态序列；Using the word vectors as the input at each time step of the bidirectional long short-term memory network layer in the neural network to obtain the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM;
    将所述正向长短期记忆网络输出的隐状态序列和所述反向长短期记忆网络输出的隐状态序列进行拼接，得到完整的隐输出序列。The hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are concatenated to obtain the complete hidden output sequence.
  12. 根据权利要求11所述的地址信息解析设备，其中，所述处理器执行所述将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注的步骤之后，还包括：The address information parsing device according to claim 11, wherein after the processor performs the step of inputting the latent output sequence into the conditional random field layer of the neural network and predicting the label of each character in the model training data, the processor further performs:
    根据所述模型训练数据中各字符的标注,获得所述模型训练数据的行政区划序列;Obtain the administrative division sequence of the model training data according to the labeling of each character in the model training data;
    判断所述行政区划序列中,是否出现标注类型相同的至少两段行政区划标注片段,其中,所述行政区划片段为连续相同的行政区划标注构成的片段;Judging whether there are at least two administrative division annotation fragments with the same annotation type in the sequence of administrative divisions, wherein the administrative division fragments are fragments composed of consecutive and identical administrative division annotations;
    若是，则比较标注类型相同的行政区划标注片段在所述行政区划序列中的位置，并对标注类型相同的行政区划标注片段中位置靠后的行政区划标注片段中的行政区划标注进行重新预测。If so, compare the positions, within the administrative division sequence, of the administrative division annotation segments that share the same annotation type, and re-predict the administrative division labels in the later-positioned segment among those segments.
  13. 根据权利要求11所述的地址信息解析设备，其中，所述处理器执行所述将所述隐输出序列输入至所述神经网络中的条件随机场层，预测所述模型训练数据中各字符的标注的步骤之后，还包括：The address information parsing device according to claim 11, wherein after the processor performs the step of inputting the latent output sequence into the conditional random field layer of the neural network and predicting the label of each character in the model training data, the processor further performs:
    根据所述模型训练数据中各字符的标注,获得所述模型训练数据的行政区划序列;Obtain the administrative division sequence of the model training data according to the labeling of each character in the model training data;
    根据所述行政区划序列中行政区划标注的排列顺序,判断所述行政区划序列是否存在错误;According to the arrangement order of the administrative division labels in the administrative division sequence, determine whether there is an error in the administrative division sequence;
    若是,则对所述行政区划序列中字符的行政区划标注进行重新预测。If so, re-predict the administrative division labels of the characters in the administrative division sequence.
  14. 根据权利要求8-13中任一项所述的地址信息解析设备，其中，所述处理器执行所述获取用户上传的待识别地址文本，并将所述待识别地址文本输入至所述地址解析模型中，获得所述待识别地址文本中各字符的行政区划标注之后，还包括：The address information parsing device according to any one of claims 8-13, wherein after the processor performs the steps of acquiring the address text to be recognized uploaded by the user, inputting the address text to be recognized into the address parsing model, and obtaining the administrative division label of each character in the address text to be recognized, the processor further performs:
    建立初始为空的字符缓存区,按照所述待识别地址文本的字符顺序处理所述待识别地址文本中的每个字符;establishing an initially empty character buffer area, and processing each character in the address text to be recognized according to the character sequence of the address text to be recognized;
    将所述待识别地址文本的第一字符存入所述字符缓存区,并确定所述第一字符的行政区划标注;storing the first character of the address text to be recognized into the character buffer area, and determining the administrative division label of the first character;
    判断所述第一字符的行政区划标注与第二字符的行政区划标注是否相同;Determine whether the administrative division labeling of the first character is the same as the administrative division labeling of the second character;
    若相同,则将所述第二字符存入所述字符缓存区;If the same, the second character is stored in the character buffer;
    若不相同,则将所述第一字符输出,并清空所述字符缓存区,并进行下一字符的处理;If not, output the first character, clear the character buffer, and process the next character;
    将所述字符缓存区输出的相同行政区划标注的字符拼接。The characters marked with the same administrative division output from the character buffer are spliced together.
  15. A computer-readable storage medium, wherein computer instructions are stored in the computer-readable storage medium, and when the computer instructions are run on a computer, the computer is caused to perform the following steps:
    crawling raw address data from a preset data source by using a web crawler tool;
    filtering out, from the raw address data, address expression data whose character length falls within a preset length interval, and annotating the address expression data to obtain model training data;
    training an address resolution model according to the model training data and a preset neural network;
    acquiring address text to be recognized uploaded by a user, inputting the address text to be recognized into the address resolution model, and obtaining the administrative division label of each character in the address text to be recognized;
    converting the address text to be recognized into standard address text according to the administrative division label of each character in the address text to be recognized.
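The length-filtering step in the claim can be sketched as follows; the concrete interval bounds are assumed values, since the claim leaves the preset interval unspecified.

```python
def filter_by_length(raw_records, min_len=5, max_len=100):
    """Keep only address strings whose character count lies within the
    preset length interval [min_len, max_len] (bounds are assumptions)."""
    return [r for r in raw_records if min_len <= len(r) <= max_len]

# Records that are too short (likely fragments) or too long (likely noise)
# are dropped before annotation.
kept = filter_by_length(["abc", "x" * 10, "y" * 200])
```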
  16. The computer-readable storage medium according to claim 15, wherein, when the computer instructions are run on a computer, the step of training an address resolution model according to the model training data and the preset neural network comprises:
    inputting the model training data into an embedding layer of the neural network, and converting each character in the model training data into a character vector;
    feeding the character vectors as the input at each time step of a bidirectional long short-term memory layer of the neural network, to obtain a hidden output sequence of the model training data;
    inputting the hidden output sequence into a conditional random field layer of the neural network, predicting the label of each character in the model training data, and comparing the predictions with the original labels of the model training data and iterating, to obtain the final pre-trained address resolution model.
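A minimal PyTorch sketch of the claimed embedding → BiLSTM pipeline follows. For brevity a linear scoring layer stands in for the conditional random field layer, and all sizes (vocabulary, embedding dimension, hidden size, label count) are assumed values, not figures from the application.

```python
import torch
import torch.nn as nn

class AddressTagger(nn.Module):
    """Sketch of the claimed pipeline: embedding -> BiLSTM -> per-character
    label scores. A real implementation would decode the scores with a CRF
    layer; here a linear layer produces emission scores only."""
    def __init__(self, vocab_size=3000, embed_dim=64, hidden=128, n_labels=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # embedding layer
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                              bidirectional=True)              # BiLSTM layer
        self.scorer = nn.Linear(2 * hidden, n_labels)          # stand-in for CRF

    def forward(self, char_ids):
        vectors = self.embed(char_ids)        # characters -> character vectors
        hidden_seq, _ = self.bilstm(vectors)  # hidden output sequence
        return self.scorer(hidden_seq)        # per-character label scores

model = AddressTagger()
scores = model(torch.randint(0, 3000, (2, 10)))  # batch of 2 texts, 10 chars each
```

Training would then compare `scores` against the annotated labels (via the CRF loss in the claimed design) and iterate until convergence.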
  17. The computer-readable storage medium according to claim 16, wherein, when the computer instructions are run on a computer, the step of inputting the model training data into the embedding layer of the neural network and converting each character in the model training data into a character vector comprises:
    converting each character in the model training data into a one-hot vector;
    converting the one-hot vectors of the model training data into low-dimensional dense character vectors through a pre-trained vector matrix.
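The one-hot-to-dense conversion is, mathematically, a row lookup in the pre-trained vector matrix: multiplying a one-hot vector by the matrix selects exactly one row. A NumPy sketch, with a randomly initialized matrix standing in for the pre-trained one:

```python
import numpy as np

def one_hot(index, vocab_size):
    """One-hot encode a character id."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

# Stand-in for the pre-trained embedding matrix: vocab_size x embed_dim.
vocab_size, embed_dim = 3000, 64
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((vocab_size, embed_dim))

# one-hot (3000,) @ matrix (3000, 64) -> dense character vector (64,),
# identical to simply indexing row 42 of the matrix.
dense = one_hot(42, vocab_size) @ embedding_matrix
```

This equivalence is why embedding layers are implemented as table lookups rather than actual matrix multiplications.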
  18. The computer-readable storage medium according to claim 17, wherein, when the computer instructions are run on a computer, the step of feeding the character vectors as the input at each time step of the bidirectional long short-term memory layer of the neural network to obtain the hidden output sequence of the model training data comprises:
    feeding the character vectors as the input at each time step of the bidirectional long short-term memory layer of the neural network, to obtain the hidden state sequence output by the forward long short-term memory network and the hidden state sequence output by the backward long short-term memory network;
    splicing the hidden state sequence output by the forward long short-term memory network with the hidden state sequence output by the backward long short-term memory network, to obtain the complete hidden output sequence.
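The splicing step is a per-time-step concatenation of the two hidden state sequences; a NumPy sketch with assumed sequence length and hidden size:

```python
import numpy as np

seq_len, hidden = 10, 128
# Stand-ins for the hidden state sequences produced by the forward
# and backward LSTMs over the same 10-character input.
h_forward = np.random.rand(seq_len, hidden)
h_backward = np.random.rand(seq_len, hidden)

# Splice per time step: each character's complete hidden output combines
# left-to-right and right-to-left context, doubling the feature width.
h_complete = np.concatenate([h_forward, h_backward], axis=-1)
```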
  19. The computer-readable storage medium according to claim 18, wherein, when the computer instructions are run on a computer, after the step of inputting the hidden output sequence into the conditional random field layer of the neural network and predicting the label of each character in the model training data, the computer further performs:
    obtaining the administrative division sequence of the model training data according to the label of each character in the model training data;
    determining whether at least two administrative division label segments of the same label type appear in the administrative division sequence, wherein an administrative division label segment is a segment composed of consecutive identical administrative division labels;
    if so, comparing the positions, in the administrative division sequence, of the administrative division label segments of the same label type, and re-predicting the administrative division labels in the later-positioned segment among the segments of the same label type.
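Detecting duplicate segments of the same label type can be sketched as below; the label strings are illustrative, and the re-prediction itself (which is model-specific) is not shown — the function only reports the later-positioned candidates.

```python
from itertools import groupby

def repeated_type_segments(division_sequence):
    """Collapse the administrative-division sequence into (label, start, end)
    segments of consecutive identical labels, then return the segments whose
    label type already appeared earlier -- the re-prediction candidates."""
    segments, pos = [], 0
    for label, run in groupby(division_sequence):
        length = len(list(run))
        segments.append((label, pos, pos + length))
        pos += length
    seen = set()
    duplicates = []   # later-positioned segments of an already-seen type
    for seg in segments:
        if seg[0] in seen:
            duplicates.append(seg)
        else:
            seen.add(seg[0])
    return duplicates

# e.g. a sequence labelled province, province, city, city, province has a
# second "province" segment at positions 4..5, flagged for re-prediction.
```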
  20. An address information resolution apparatus, wherein the address information resolution apparatus comprises:
    a data crawling module, configured to crawl raw address data from a preset data source by using a web crawler tool;
    a filtering module, configured to filter out, from the raw address data, address expression data whose character length falls within a preset length interval, and annotate the address expression data to obtain model training data;
    a model training module, configured to train an address resolution model according to the model training data and a preset neural network;
    a model input module, configured to acquire address text to be recognized uploaded by a user, input the address text to be recognized into the address resolution model, and obtain the administrative division label of each character in the address text to be recognized;
    a standard conversion module, configured to convert the address text to be recognized into standard address text according to the administrative division label of each character in the address text to be recognized.
PCT/CN2021/109698 2020-12-23 2021-07-30 Address information resolution method, apparatus and device, and storage medium WO2022134592A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011544487.1A CN112612940A (en) 2020-12-23 2020-12-23 Address information analysis method, device, equipment and storage medium
CN202011544487.1 2020-12-23

Publications (1)

Publication Number Publication Date
WO2022134592A1 true WO2022134592A1 (en) 2022-06-30

Family

ID=75244917

Country Status (2)

Country Link
CN (1) CN112612940A (en)
WO (1) WO2022134592A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541472A (en) * 2023-03-22 2023-08-04 麦博(上海)健康科技有限公司 Knowledge graph construction method in medical field

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112612940A (en) * 2020-12-23 2021-04-06 深圳壹账通智能科技有限公司 Address information analysis method, device, equipment and storage medium
CN113255352A (en) * 2021-05-12 2021-08-13 北京易华录信息技术股份有限公司 Street information determination method and device and computer equipment
CN113449528B (en) * 2021-08-30 2021-11-30 企查查科技有限公司 Address element extraction method and device, computer equipment and storage medium
CN114035872A (en) * 2021-10-27 2022-02-11 北京闪送科技有限公司 Method for rapidly improving receiving and dispatching information through automatic identification and helping user
CN114091454A (en) * 2021-11-29 2022-02-25 重庆市地理信息和遥感应用中心 Method for extracting place name information and positioning space in internet text
CN114218957B (en) * 2022-02-22 2022-11-18 阿里巴巴(中国)有限公司 Method, device, equipment and storage medium for determining administrative division transition information
CN114861658B (en) * 2022-05-24 2023-07-25 北京百度网讯科技有限公司 Address information analysis method and device, equipment and medium
CN115410158B (en) * 2022-09-13 2023-06-30 北京交通大学 Landmark extraction method based on monitoring camera
CN116522943A (en) * 2023-05-11 2023-08-01 北京微聚智汇科技有限公司 Address element extraction method and device, storage medium and computer equipment
CN116955855B (en) * 2023-09-14 2023-11-24 南京擎天科技有限公司 Low-cost cross-region address resolution model construction method and system
CN117457135B (en) * 2023-12-22 2024-04-09 四川互慧软件有限公司 Address data management method and cyclic neural network model construction method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008140117A (en) * 2006-12-01 2008-06-19 National Institute Of Information & Communication Technology Apparatus for segmenting chinese character sequence to chinese word sequence
JP2010238043A (en) * 2009-03-31 2010-10-21 Mitsubishi Electric Corp Text analysis learning device
CN108763212A (en) * 2018-05-23 2018-11-06 北京神州泰岳软件股份有限公司 A kind of address information extraction method and device
CN110377686A (en) * 2019-07-04 2019-10-25 浙江大学 A kind of address information Feature Extraction Method based on deep neural network model
CN110688449A (en) * 2019-09-20 2020-01-14 京东数字科技控股有限公司 Address text processing method, device, equipment and medium based on deep learning
CN111125365A (en) * 2019-12-24 2020-05-08 京东数字科技控股有限公司 Address data labeling method and device, electronic equipment and storage medium
CN111209362A (en) * 2020-01-07 2020-05-29 苏州城方信息技术有限公司 Address data analysis method based on deep learning
CN112612940A (en) * 2020-12-23 2021-04-06 深圳壹账通智能科技有限公司 Address information analysis method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112612940A (en) 2021-04-06

Similar Documents

Publication Publication Date Title
WO2022134592A1 (en) Address information resolution method, apparatus and device, and storage medium
CN109885824B (en) Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium
Zhang et al. Semantic NLP-based information extraction from construction regulatory documents for automated compliance checking
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN112329467B (en) Address recognition method and device, electronic equipment and storage medium
CN111104802B (en) Method for extracting address information text and related equipment
CN112597296B (en) Abstract generation method based on plan mechanism and knowledge graph guidance
CN112035511A (en) Target data searching method based on medical knowledge graph and related equipment
CN116204674B (en) Image description method based on visual concept word association structural modeling
CN110990520A (en) Address coding method and device, electronic equipment and storage medium
CN115658837A (en) Address data processing method and device, electronic equipment and storage medium
Yin et al. Pinpointing locational focus in microblogs
CN113312498B (en) Text information extraction method for embedding knowledge graph by undirected graph
CN114091454A (en) Method for extracting place name information and positioning space in internet text
CN112069824B (en) Region identification method, device and medium based on context probability and citation
CN111859984B (en) Intention mining method, device, equipment and storage medium
CN116386895B (en) Epidemic public opinion entity identification method and device based on heterogeneous graph neural network
CN117010398A (en) Address entity identification method based on multi-layer knowledge perception
CN112417812B (en) Address standardization method and system and electronic equipment
CN113157866B (en) Data analysis method, device, computer equipment and storage medium
CN113468881B (en) Address standardization method and device
CN116186241A (en) Event element extraction method and device based on semantic analysis and prompt learning, electronic equipment and storage medium
CN115048510A (en) Criminal name prediction method based on hierarchical legal knowledge and double-graph joint representation learning
Qiu et al. Integrating NLP and Ontology Matching into a Unified System for Automated Information Extraction from Geological Hazard Reports
CN113190596B (en) Method and device for mixing and matching place name and address

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908597

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31.10.2023)