CN109977395A - Method, apparatus, electronic device, and readable storage medium for processing address text - Google Patents

Method, apparatus, electronic device, and readable storage medium for processing address text

Info

Publication number
CN109977395A
Authority
CN
China
Prior art keywords
address
word
level
place name
administrative area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910114666.2A
Other languages
Chinese (zh)
Inventor
谢晓东
刘洋
袁树明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN201910114666.2A
Publication of CN109977395A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a readable storage medium for processing address text, so as to improve on related-art methods of processing address text. The method comprises: dividing address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, where the administrative area dictionary includes multiple administrative area words and the place name suffix dictionary includes multiple place name suffix words; marking each address word in the address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label, and marking each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with a preset label; and inputting each address word in the address word sequence, together with the label marked for that address word, into an address level prediction model to obtain the address level of each address word in the address word sequence.

Description

Method, apparatus, electronic device, and readable storage medium for processing address text
Technical field
Embodiments of the present application relate to the technical field of information processing, and in particular to a method, an apparatus, an electronic device, and a readable storage medium for processing address text.
Background
In large-scale information processing systems built on address text, it is vital that an electronic device accurately identify which address words an address text contains and which address level each of those address words belongs to (for example: country, province, city, district, town, village, and so on). The two core steps of this process are word segmentation and labeling: segmentation splits the address text into individual words, and labeling assigns the corresponding address level to each word.
For the segmentation step, one technical solution provided by the related art segments the address text with a segmentation model based on a general corpus, such as an HMM (Hidden Markov Model) or a CRF (Conditional Random Field) model. Its drawbacks are as follows. A segmentation model based on a general corpus does not recognize the place name nouns contained in address text well, so segmentation quality is poor when the model is applied to address text. In addition, during labeling, the address words contained in an address text depend on one another, which does not strictly satisfy the HMM assumption that observations (that is, address words) are independent, so labeling performance has room for improvement.
Another related-art solution for processing address text first segments the address text with a pure HMM (Hidden Markov Model) and then labels it with a combination of rules and an HMM. Its drawbacks are as follows. The HMM is a segmentation model based on a general corpus, so segmentation quality is poor when it is applied to address text. In addition, during labeling, the address words contained in an address text depend on one another, which does not strictly satisfy the HMM assumption that observations (that is, address words) are independent, so labeling performance has room for improvement.
It can be seen that related-art methods of processing address text need improvement.
Summary of the invention
Embodiments of the present application provide a method, an apparatus, an electronic device, and a readable storage medium for processing address text, so as to improve on related-art methods of processing address text.
A first aspect of the embodiments of the present application provides a method for processing address text, the method comprising:
dividing address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, where the administrative area dictionary includes multiple administrative area words and the place name suffix dictionary includes multiple place name suffix words;
marking each address word in the address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label, and marking each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with a preset label; and
inputting each address word in the address word sequence, together with the label marked for that address word, into an address level prediction model to obtain the address level of each address word in the address word sequence.
Optionally, after the address level of each address word in the address word sequence is obtained, the method further comprises:
for the address words in the address word sequence, comparing whether the address levels of every two adjacent address words are the same; and
merging two adjacent address words whose address levels are the same, and determining the address level of the merged address word to be the address level of the two adjacent address words before merging.
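The optional merging step above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `merge_adjacent`, the `(word, level)` tuple representation, and the sample levels are assumptions made for the example. Applying the pairwise rule from left to right also collapses longer runs of same-level words into one.

```python
from typing import List, Tuple

def merge_adjacent(tagged: List[Tuple[str, str]], sep: str = "") -> List[Tuple[str, str]]:
    """Merge neighbouring address words that share the same address level.

    `tagged` is a list of (address_word, address_level) pairs (assumed shape).
    `sep` joins merged words; "" suits Chinese text, " " suits English.
    """
    merged: List[Tuple[str, str]] = []
    for word, level in tagged:
        if merged and merged[-1][1] == level:
            # Same level as the previous word: merge and keep that level.
            prev_word, _ = merged[-1]
            merged[-1] = (prev_word + sep + word, level)
        else:
            merged.append((word, level))
    return merged

# Hypothetical predicted levels, for illustration only.
example = [("Kowloon", "CT"), ("Kimberley", "RD"), ("Road", "RD"), ("1", "LV")]
result = merge_adjacent(example, sep=" ")
```

A single pass is enough because each word is compared with the already-merged previous word.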
Optionally, the method further comprises:
obtaining multiple sample address words, each sample address word carrying a pre-marked label and a pre-marked address level; and
training a preset model according to the multiple sample address words and the pre-marked labels and pre-marked address levels that they carry, to obtain the address level prediction model.
Optionally, obtaining multiple sample address words, each sample address word carrying a pre-marked label and a pre-marked address level, comprises:
for each sample address text among multiple sample address texts, performing the following steps:
dividing the sample address text into a sample address word sequence based on the multiple administrative area words included in the administrative area dictionary and the multiple place name suffix words included in the place name suffix dictionary;
marking each sample address word in the sample address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label, and marking each sample address word in the sample address word sequence that matches neither any administrative area word nor any place name suffix word with a preset label; and
marking each sample address word in the sample address word sequence with a corresponding address level according to the relevance between the sample address words in the sample address word sequence and the labels marked for the address words in that sequence.
Optionally, dividing the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary comprises:
dividing the address text to be processed into an initial word sequence based on the administrative area dictionary; and
based on the place name suffix dictionary, further dividing each address word in the initial word sequence that matches no administrative area word, to obtain the address word sequence.
Optionally, the address level label includes an administrative area word level label and a place name suffix word level label; marking each address word in the address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label comprises:
marking each address word in the initial word sequence that matches an administrative area word with a corresponding administrative area word level label; and
marking each address word in the initial word sequence that matches no administrative area word but matches a place name suffix word with a corresponding place name suffix word level label.
Optionally, dividing the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary comprises:
dividing the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary, in combination with a statistics-based segmentation approach.
A second aspect of the embodiments of the present application provides an apparatus for processing address text, the apparatus comprising:
a word segmentation module, configured to divide address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, where the administrative area dictionary includes multiple administrative area words and the place name suffix dictionary includes multiple place name suffix words;
a pre-labeling module, configured to mark each address word in the address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label, and to mark each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with a preset label; and
an address level prediction module, configured to input each address word in the address word sequence, together with the label marked for that address word, into an address level prediction model to obtain the address level of each address word in the address word sequence.
Optionally, the apparatus further comprises:
a comparison module, configured to compare, for the address words in the address word sequence, whether the address levels of every two adjacent address words are the same; and
a merging module, configured to merge two adjacent address words whose address levels are the same, and to determine the address level of the merged address word to be the address level of the two adjacent address words before merging.
Optionally, the apparatus further comprises:
an obtaining module, configured to obtain multiple sample address words, each sample address word carrying a pre-marked label and a pre-marked address level; and
a training module, configured to train a preset model according to the multiple sample address words and the pre-marked labels and pre-marked address levels that they carry, to obtain the address level prediction model.
Optionally, the obtaining module comprises:
a division submodule, configured to divide, for each sample address text among multiple sample address texts, the sample address text into a sample address word sequence based on the multiple administrative area words included in the administrative area dictionary and the multiple place name suffix words included in the place name suffix dictionary;
a first labeling submodule, configured to mark, for each sample address text among the multiple sample address texts, each sample address word in the sample address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label, and to mark each sample address word in the sample address word sequence that matches neither any administrative area word nor any place name suffix word with a preset label; and
a second labeling submodule, configured to mark, for each sample address text among the multiple sample address texts, each sample address word in the sample address word sequence with a corresponding address level according to the relevance between the sample address words in the sample address word sequence and the labels marked for the address words in that sequence.
Optionally, the word segmentation module comprises:
a preliminary division submodule, configured to divide the address text to be processed into an initial word sequence based on the administrative area dictionary; and
a re-division submodule, configured to further divide, based on the place name suffix dictionary, each address word in the initial word sequence that matches no administrative area word, to obtain the address word sequence.
Optionally, the word segmentation module comprises:
a division submodule, configured to divide the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary, in combination with a statistics-based segmentation approach.
A third aspect of the embodiments of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the method described in the first aspect of the present application are implemented.
A fourth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program that is stored in the memory and can run on the processor; when the processor executes the program, the steps of the method described in the first aspect of the present application are implemented.
With the method for processing address text provided by the embodiments of the present application, segmentation is performed based on an administrative area dictionary and a place name suffix dictionary that reflect the characteristics of address text, each address word obtained by segmentation is marked with a corresponding label, and the address level of each address word is then predicted based on those labels. Because the address level prediction takes into account an additional factor, namely the label marked for each address word, the accuracy of address level prediction is improved.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for processing address text according to one embodiment of the present application;
Fig. 2 is a flowchart of a method for processing address text according to another embodiment of the present application;
Fig. 3 is a flowchart of a method for processing address text according to another embodiment of the present application;
Fig. 4 is a flowchart of a method for building an address level prediction model according to another embodiment of the present application;
Fig. 5 is a flowchart of a method for processing address text according to another embodiment of the present application;
Fig. 6 is a schematic diagram of an apparatus for processing address text according to one embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Evidently, the described embodiments are some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Referring to Fig. 1, Fig. 1 is a flowchart of a method for processing address text according to one embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
Step S11: divide address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, where the administrative area dictionary includes multiple administrative area words and the place name suffix dictionary includes multiple place name suffix words.
This embodiment takes into account a characteristic of address text: address text contains a large number of place name nouns, and these nouns usually arise from administrative divisions or carry special suffixes. An administrative area dictionary and a place name suffix dictionary are therefore introduced into the segmentation of address text, and the address text is segmented based on these two dictionaries to improve segmentation quality.
The administrative area dictionary is a set of administrative area words and includes multiple administrative area words. An administrative area word describes an administrative region, for example: Sichuan Province, Nanjing, Shanghai City, Hong Kong Special Administrative Region, and so on.
The place name suffix dictionary is a set of place name suffix words and includes multiple place name suffix words. A place name suffix word is the suffix of a place name. Place name suffix words may include the suffixes of the administrative area words above, such as "province" and "city", and may also include the suffixes of other place names, such as "number", "building", and "floor".
In one embodiment, based on the administrative area dictionary and the place name suffix dictionary, a dictionary-based segmentation approach (for example: forward maximum matching (FMM), backward maximum matching (BMM), or bidirectional scanning) is used to divide the text to be processed into multiple address words, and the sequence formed by these address words is the address word sequence. Dictionary-based segmentation approaches are described in the related art and are not detailed here.
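As an illustration of the dictionary-based approach mentioned above, the following is a minimal sketch of forward maximum matching. The function name, the fallback to single characters for out-of-dictionary text, and the sample Chinese dictionary entries are assumptions made for this example and are not taken from the application.

```python
from typing import Iterable, List, Optional

def forward_max_match(text: str, dictionary: Iterable[str],
                      max_len: Optional[int] = None) -> List[str]:
    """Segment `text` by repeatedly taking the longest dictionary word
    that starts at the current position (forward maximum matching)."""
    vocab = set(dictionary)
    if max_len is None:
        max_len = max(map(len, vocab), default=1)
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical dictionary entries: "Sichuan Province", "Chengdu City", "Sichuan".
sample_dictionary = {"四川省", "成都市", "四川"}
segments = forward_max_match("四川省成都市高新区", sample_dictionary)
```

Backward maximum matching follows the same idea but scans from the end of the text; bidirectional scanning runs both and picks one result by a heuristic.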
In another embodiment, referring to Fig. 2, Fig. 2 is a flowchart of a method for processing address text according to another embodiment of the present application. As shown in Fig. 2, the method includes steps S12 and S13 and the following step:
Step S11': divide the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary, in combination with a statistics-based segmentation approach.
To improve segmentation quality, this embodiment combines the dictionary-based segmentation approach with a statistics-based segmentation approach. Specifically, the address text is segmented based on the administrative area dictionary and the place name suffix dictionary, in combination with a statistics-based approach (for example: a hidden Markov model (HMM), a conditional random field (CRF), or the open-source jieba segmentation tool).
In another embodiment, referring to Fig. 3, Fig. 3 is a flowchart of a method for processing address text according to another embodiment of the present application. As shown in Fig. 3, the method includes steps S12 and S13 and the following steps:
Step S111: divide the address text to be processed into an initial word sequence based on the multiple administrative area words included in the administrative area dictionary;
Step S112: based on the multiple place name suffix words included in the place name suffix dictionary, further divide each address word in the initial word sequence that matches no administrative area word, to obtain the address word sequence.
In a specific implementation, the segmentation of the address text includes two sub-steps. First, the address text to be processed is preliminarily segmented based on the administrative area dictionary, in combination with a statistics-based segmentation approach. Then, the address word sequence obtained from the preliminary segmentation is segmented again based on the place name suffix dictionary, in combination with a statistics-based segmentation approach, yielding the address word sequence that makes up the address text to be processed.
By way of example, the address text to be processed is "Kowloon Tsim Sha Tsui Kimberley Road No. 45-47 Kimberley Plaza 1st Floor". Preliminary segmentation of this address text yields the address word sequence "Kowloon / Tsim Sha Tsui / Kimberley Road / 45-47 / No. / Kimberley Plaza / 1 / Floor". Segmenting the preliminarily segmented address word sequence again yields the address word sequence that makes up the address text to be processed: "Kowloon / Tsim Sha Tsui / Kimberley / Road / 45-47 / No. / Kimberley / Plaza / 1 / Floor".
Comparing the word sequence after preliminary segmentation with the word sequence after re-segmentation shows that re-segmentation divides the address word "Kimberley Road" into the two address words "Kimberley" and "Road", and divides the address word "Kimberley Plaza" into the two address words "Kimberley" and "Plaza".
With the above segmentation scheme, the address text is segmented using an administrative area dictionary and a place name suffix dictionary that reflect the characteristics of address text, which improves segmentation quality. Furthermore, on the basis of these two dictionaries, the dictionary-based segmentation approach is combined with the statistics-based segmentation approach to segment the address text, which improves segmentation quality further.
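The re-segmentation sub-step can be sketched as below, using an English rendering of the example address. This covers only the dictionary part: the statistics-based component that the embodiment combines with it is omitted. The function name, the exact-membership test against the administrative area words, and the "longest suffix first" rule are assumptions made for illustration.

```python
from typing import Iterable, List

def split_by_suffix(initial_words: Iterable[str],
                    admin_words: Iterable[str],
                    suffix_words: Iterable[str]) -> List[str]:
    """Re-segment words that match no administrative area word by
    splitting off a trailing place name suffix word, if there is one."""
    admin = set(admin_words)
    # Longer suffixes are tried first so the most specific one wins.
    suffixes = sorted(set(suffix_words), key=len, reverse=True)
    result: List[str] = []
    for word in initial_words:
        if word in admin:
            result.append(word)  # administrative area words stay intact
            continue
        for suffix in suffixes:
            stem = word[:-len(suffix)].strip() if word.endswith(suffix) else ""
            if stem:
                result.extend([stem, suffix])
                break
        else:
            result.append(word)  # no suffix to split off
    return result

initial = ["Kowloon", "Tsim Sha Tsui", "Kimberley Road", "45-47", "No.",
           "Kimberley Plaza", "1", "Floor"]
resegmented = split_by_suffix(
    initial,
    admin_words={"Kowloon", "Tsim Sha Tsui"},
    suffix_words={"Road", "No.", "Plaza", "Floor"},
)
```

A word that consists only of a suffix, such as "No." here, is left as is because there is no stem to separate from it.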
Step S12: mark each address word in the address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label, and mark each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with a preset label.
After step S11 is performed, the address text to be processed has been divided into multiple address words, which form the address word sequence. Each address word in the address word sequence is compared with each administrative area word in the administrative area dictionary and each place name suffix word in the place name suffix dictionary, and a corresponding label is marked for the address word according to the comparison result.
In one embodiment, the address level label includes an administrative area word level label and a place name suffix word level label. In step S12, marking each address word in the address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label comprises:
marking each address word in the initial word sequence that matches an administrative area word with a corresponding administrative area word level label; and
marking each address word in the initial word sequence that matches no administrative area word but matches a place name suffix word with a corresponding place name suffix word level label.
In this embodiment, the administrative area word level label refers to the level of each administrative area word in the administrative area dictionary, that is, the level prescribed in advance when the administrative divisions were made. By way of example, Table 1 lists the administrative area word level labels.
Table 1: List of administrative area word level labels
Level number   Level name   English abbreviation   Description
0              Country      CTR                    Country, or an equivalent level in the administrative sense
2              Province     ST                     Province, municipality directly under the central government, or an equivalent level in the administrative sense
3              City         CT                     Prefecture-level city, or an equivalent level in the administrative sense
5              District     DS                     District, county, or an equivalent level in the administrative sense
6              Town         TN                     Town, township, or an equivalent level in the administrative sense
7              Village      VG                     Village, or an equivalent level in the administrative sense
10             Street       RD                     Lane, section, road, and similar ranks, or an equivalent level
The place name suffix word level label refers to the level of each place name suffix word in the place name suffix dictionary. In addition to the levels of administrative area words, it also includes the levels of other place name suffix words. By way of example, Table 2 lists the place name suffix word level labels.
Table 2: List of place name suffix word level labels
In this embodiment, each address word in the address word sequence is compared with each administrative area word in the administrative area dictionary and each place name suffix word in the place name suffix dictionary, and, according to the comparison result, one of the following three labeling modes is selected to label the address word:
1) If the address word matches an administrative area word in the administrative area dictionary, the address word is marked with the corresponding address level label according to the level of the matching administrative area word;
2) If the address word matches a place name suffix word in the place name suffix dictionary, the address word is marked with the corresponding address level label according to the level of the matching place name suffix word;
3) If the address word matches neither any administrative area word in the administrative area dictionary nor any place name suffix word in the place name suffix dictionary, the address word is marked with the preset label.
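The three labeling modes above can be sketched as a small lookup routine. The names below (`label_word`, `ADMIN_LEVELS`, `SUFFIX_LEVELS`) and the use of exact dictionary lookup plus a regular expression for numeric words are assumptions made for illustration; the application does not spell out how matching is performed. The label values are the abbreviations used in the worked example of this section.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical lookup tables for this example only.
ADMIN_LEVELS: Dict[str, str] = {"Kowloon": "CT", "Tsim Sha Tsui": "TN"}
SUFFIX_LEVELS: Dict[str, str] = {"Road": "RD-SF", "No.": "HN-SF",
                                 "Plaza": "POI-SF", "Floor": "LV-SF"}
NUMERIC = re.compile(r"\d+(-\d+)?")  # assumed rule for words such as "45-47"
PRESET_LABEL = "NULL"

def label_word(word: str) -> str:
    """Return the pre-label for one address word."""
    if word in ADMIN_LEVELS:      # mode 1: administrative area word
        return ADMIN_LEVELS[word]
    if word in SUFFIX_LEVELS:     # mode 2: place name suffix word
        return SUFFIX_LEVELS[word]
    if NUMERIC.fullmatch(word):   # mode 2, numeric suffix class (assumed)
        return "CN-SF"
    return PRESET_LABEL           # mode 3: no match

def label_sequence(words: List[str]) -> List[Tuple[str, str]]:
    return [(w, label_word(w)) for w in words]

labeled = label_sequence(["Kowloon", "Tsim Sha Tsui", "Kimberley", "Road",
                          "45-47", "No.", "Kimberley", "Plaza", "1", "Floor"])
```

Checking the administrative area dictionary before the suffix dictionary mirrors the embodiment in which suffix labels are applied only to words that match no administrative area word.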
By way of example, the address text to be processed is "Kowloon Tsim Sha Tsui Kimberley Road No. 45-47 Kimberley Plaza 1st Floor". First, the address text to be processed is preliminarily segmented, yielding the initial word sequence "Kowloon / Tsim Sha Tsui / Kimberley Road / 45-47 / No. / Kimberley Plaza / 1 / Floor".
For the address word "Kowloon" in this initial word sequence, the administrative area word that matches it is determined to be "Kowloon City", and the level of the administrative area word "Kowloon City" is "city (CT)", so the address word "Kowloon" is marked with the administrative area word level label "city (CT)".
For the address word "Tsim Sha Tsui" in this initial word sequence, the administrative area word that matches it is determined to be "Tsim Sha Tsui Town", and the level of the administrative area word "Tsim Sha Tsui Town" is "town (TN)", so the address word "Tsim Sha Tsui" is marked with the administrative area word level label "town (TN)".
For the address word "Kimberley Road" in this initial word sequence, it is determined that the address word matches neither any administrative area word nor any place name suffix word, that is, no administrative area word or place name suffix word matches "Kimberley Road", so the address word "Kimberley Road" is marked with the preset label "empty (NULL)".
The other address words in this initial word sequence are labeled in the same way, giving the following preliminary labeling result:
Kowloon/CT Tsim Sha Tsui/TN Kimberley Road/NULL 45-47/NULL No./NULL Kimberley Plaza/NULL 1/NULL Floor/NULL
Then, the initial word sequence is segmented again, yielding the address word sequence that makes up the text to be processed: "Kowloon / Tsim Sha Tsui / Kimberley / Road / 45-47 / No. / Kimberley / Plaza / 1 / Floor".
For the address word "Kimberley" in this address word sequence, it is determined that the address word matches neither any administrative area word nor any place name suffix word, that is, no administrative area word or place name suffix word matches "Kimberley", so the address word "Kimberley" is marked with the preset label "empty (NULL)".
For the address word "Road" in this address word sequence, the place name suffix word that matches it is determined to be "road", and the level of the place name suffix word "road" is "road (RD-SF)", so the address word "Road" is marked with the place name suffix word level label "road (RD-SF)".
Similarly, for the address word "45-47" in this address word sequence, the place name suffix word that matches it is determined to be "digit", and the level of the place name suffix word "digit" is "digit (CN-SF)", so the address word "45-47" is marked with the place name suffix word level label "digit (CN-SF)".
For the address word "No." in this address word sequence, the place name suffix word that matches it is determined to be "house number", and the level of the place name suffix word "house number" is "number (HN-SF)", so the address word "No." is marked with the place name suffix word level label "number (HN-SF)".
The other address words in this address word sequence are labeled in the same way, giving the following final labeling result:
Kowloon/CT Tsim Sha Tsui/TN Kimberley/NULL Road/RD-SF 45-47/CN-SF No./HN-SF Kimberley/NULL Plaza/POI-SF 1/CN-SF Floor/LV-SF
With the above scheme for labeling the address word sequence, each address word is marked with the address level label corresponding to the administrative area word or place name suffix word that it matches, which lays the foundation for subsequently determining the address level of each address word in the address word sequence.
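The re-segmentation and labeling walked through above can be sketched as follows. This is a minimal illustration, not the patented implementation: the toy dictionaries `ADMIN_LABELS` and `SUFFIX_LABELS`, the `NUMERAL` pattern, and the function names `resegment` and `pre_label` are all hypothetical, and English words with a trailing-suffix test stand in for the Chinese text of the original example.

```python
import re

# Hypothetical toy dictionaries; real dictionaries would be far larger.
ADMIN_LABELS = {"Kowloon": "CT", "Tsim Sha Tsui": "TN"}
SUFFIX_LABELS = {"Road": "RD-SF", "No.": "HN-SF", "Square": "POI-SF", "Floor": "LV-SF"}
# Assumption: numerals such as "45-47" are recognised by a pattern.
NUMERAL = re.compile(r"\d+(-\d+)?")
DEFAULT_LABEL = "NULL"


def resegment(initial_words):
    """Split a trailing place name suffix off every word that is not an administrative area word."""
    result = []
    for word in initial_words:
        if word in ADMIN_LABELS:
            result.append(word)
            continue
        for suffix in SUFFIX_LABELS:
            stem = word[: -len(suffix)].strip()
            if word.endswith(suffix) and stem:
                result.extend([stem, suffix])
                break
        else:
            # No suffix could be split off; keep the word unchanged.
            result.append(word)
    return result


def pre_label(word):
    """Return the address level label of the matching dictionary entry, or the default label."""
    if word in ADMIN_LABELS:
        return ADMIN_LABELS[word]
    if word in SUFFIX_LABELS:
        return SUFFIX_LABELS[word]
    if NUMERAL.fullmatch(word):
        return "CN-SF"
    return DEFAULT_LABEL


initial = ["Kowloon", "Tsim Sha Tsui", "Kimberley Road", "45-47",
           "No.", "Kimberley Square", "1", "Floor"]
words = resegment(initial)
print(" ".join(f"{w}/{pre_label(w)}" for w in words))
```

Administrative area words are checked first so that they are never split, mirroring the order in which the two dictionaries are applied in the description.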
Step S13: input each address word in the address word sequence, together with the label marked on each address word in the address word sequence, into an address level prediction model, and obtain the address level of each address word in the address word sequence.
In one embodiment, refer to Fig. 4, which is a flowchart of a method for building the address level prediction model provided by another embodiment of the application. As shown in Fig. 4, the address level prediction model is obtained through the following steps:
Step S41: obtain multiple sample address words, each sample address word carrying a pre-marked label and a pre-marked address level;
Step S42: train a preset model according to the multiple sample address words and the pre-marked labels and pre-marked address levels that they carry, and obtain the address level prediction model.
In one embodiment, step S41 includes:
For each sample address text among multiple sample address texts, the following steps are executed:
Dividing the sample address text into a sample address word sequence, based on the multiple administrative area words included in the administrative area dictionary and the multiple place name suffix words included in the place name suffix dictionary;
Marking each sample address word in the sample address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label, and marking each sample address word in the sample address word sequence that matches neither any administrative area word nor any place name suffix word with the default label;
Marking each sample address word in the sample address word sequence with the corresponding address level, according to the relevance between the sample address words in the sample address word sequence and the labels marked on the address words in the sequence.
In this embodiment, a sample address text and an address text to be processed are both, in essence, address texts. To obtain the address level prediction model, a portion of the available address texts is first taken as sample address texts. Each sample address text is then treated as the address text to be processed in steps S11 to S12, and those steps are executed on it, yielding the multiple sample address words that make up the sample address text and the pre-marked label carried by each of them.
For example, a sample address text is "Kowloon Tsim Sha Tsui Kimberley Road No. 45-47 Kimberley Square 1st Floor". After steps S11 to S12, the resulting sample address words and the labels marked on them are shown in Table 3.
Table 3: Multiple sample address words and the labels marked on each

Sample address word    Label marked on the sample address word
Kowloon                CT
Tsim Sha Tsui          TN
Kimberley              NULL
Road                   RD-SF
45-47                  CN-SF
No.                    HN-SF
Kimberley              NULL
Square                 POI-SF
1                      CN-SF
Floor                  LV-SF
Then, for each sample address text, each sample address word is marked with the corresponding address level, according to the relevance between the sample address words that make up the sample address text and the labels marked on them. Marking the address level can be done manually: the annotator analyzes the relevance between the sample address words, combines it with the label marked on each address word, judges the address level of each sample address word, and marks that address level.
Specifically, for each sample address word of the sample address text: if the label marked on it in step S12 is an address level label (an administrative area word level label or a place name suffix word level label), the address level of the sample address word is marked as the address level that this label represents. If the label marked on it in step S12 is the default label (for example, NULL), the address level of the sample address word is marked according to the relevance between it and the adjacent sample address words, together with the address level labels marked on those adjacent sample address words.
For example, a sample address text is "Kowloon Tsim Sha Tsui Kimberley Road No. 45-47 Kimberley Square 1st Floor", and steps S11 to S12 yield, among others, the sample address words "Kowloon", "Kimberley" and "Road". In step S12 the address level label marked on "Kowloon" is "CT", so the address level marked for "Kowloon" is "CT". Likewise, the address level label marked on "Road" in step S12 is "RD-SF", so the address level marked for "Road" is "RD". In step S12 the default label "NULL" is marked on "Kimberley", while the adjacent sample address word "Road" carries the address level label "RD-SF". Because "Kimberley" and "Road" are adjacent, and adjacent words of this kind usually share the same address level, the address level marked for "Kimberley" is "RD". The same method is applied to the other sample address words; the resulting sample address words and the labels and address levels marked on them are shown in Table 4.
Table 4: Multiple sample address words and the labels and address levels marked on each
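The annotation rule described above can be approximated in code, for instance to pre-fill address levels before a human annotator reviews them. The sketch below is a heuristic under stated assumptions, not the procedure of the application: the function name `derive_levels` is hypothetical, a word with the default label inherits the level of the word that follows it (which reproduces the "Kimberley" to "RD" reasoning), and numerals labeled "CN-SF" are treated the same way, an assumption modeled on outputs such as "41/HN" in the prediction example of this description.

```python
def derive_levels(labeled_words):
    """Map (word, label) pairs to (word, level) pairs.

    Assumptions (hypothetical, for illustration only):
    - an "X-SF" suffix label stands for address level "X";
    - "NULL" and "CN-SF" words take the level of the next word to the right.
    """
    levels = []
    for _, label in labeled_words:
        if label in ("NULL", "CN-SF"):
            levels.append(None)          # resolved from the right-hand neighbour below
        elif label.endswith("-SF"):
            levels.append(label[:-3])    # "RD-SF" -> "RD"
        else:
            levels.append(label)         # administrative labels such as "CT" are kept
    # Walk right to left so that each open slot inherits from the word after it.
    for i in range(len(levels) - 2, -1, -1):
        if levels[i] is None:
            levels[i] = levels[i + 1]
    return [(word, level) for (word, _), level in zip(labeled_words, levels)]


sample = [("Kowloon", "CT"), ("Tsim Sha Tsui", "TN"), ("Kimberley", "NULL"),
          ("Road", "RD-SF"), ("45-47", "CN-SF"), ("No.", "HN-SF")]
for word, level in derive_levels(sample):
    print(word, level)
```

A trailing word with the default label has no right-hand neighbour and is left as `None`, which flags it for manual annotation.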
After the multiple sample address words and the pre-marked labels and pre-marked address levels that they carry have been obtained, a preset model (for example, a CRF model) is trained. The trained preset model is able to predict the address level of each address word contained in a single address text; this trained preset model is the address level prediction model.
With the above technical solution, a new kind of feature is added as input to the preset model during training, namely the pre-marked label carried by each sample address word. The resulting address level prediction model therefore has a higher recognition rate and accuracy, which improves its recognition capability and recognition precision.
After the address level prediction model has been obtained, steps S11 to S12 are executed on the address text to be processed to obtain the address words that make up the address text and the label marked on each of them. These address words and their labels are then input into the address level prediction model, which outputs the address level of each address word that makes up the address text to be processed.
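To illustrate how the pre-marked label can serve as an additional input feature, the sketch below builds one feature dictionary per address word, a representation that many CRF toolkits accept. It is only an assumption about how such input might be prepared: the feature names (`word`, `label`, `-1:label` and so on), the one-word context window, and the function names are hypothetical, and the actual training call is left to whichever CRF library is used.

```python
def word_features(sequence, i):
    """Build the feature dictionary for the i-th (word, pre_label) pair of a sequence."""
    word, label = sequence[i]
    feats = {
        "word": word,
        "label": label,                    # the pre-marked label is the added feature
        "is_digit": word[:1].isdigit(),
    }
    if i > 0:
        prev_word, prev_label = sequence[i - 1]
        feats["-1:word"] = prev_word
        feats["-1:label"] = prev_label
    else:
        feats["BOS"] = True                # beginning of the address word sequence
    if i < len(sequence) - 1:
        next_word, next_label = sequence[i + 1]
        feats["+1:word"] = next_word
        feats["+1:label"] = next_label
    else:
        feats["EOS"] = True                # end of the address word sequence
    return feats


def sequence_features(sequence):
    """Feature dictionaries for every address word in one address word sequence."""
    return [word_features(sequence, i) for i in range(len(sequence))]


seq = [("Kimberley", "NULL"), ("Road", "RD-SF")]
X = sequence_features(seq)   # model input for one sample
y = ["RD", "RD"]             # pre-marked address levels, the training targets
print(X[0])
```

Including the neighbouring labels lets the model see that a word with the default label is followed by a road suffix, which is the cue the manual annotation relies on.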
For example, the address text to be processed is "Kowloon Jordan Pilkem Street No. 41 1st Floor". After steps S11 to S12, the address words and the labels marked on them are "Kowloon/CT Jordan/TN Pil/NULL kem/NULL Street/RD-SF 41/CN-SF No./HN-SF 1st/CN-SF Floor/LV-SF". This labeled sequence is input into the address level prediction model, and the output obtained is "Kowloon/CT Jordan/TN Pil/RD kem/RD Street/RD 41/HN No./HN 1st/LV Floor/LV". This achieves the prediction of the address level of each address word that makes up the address text to be processed.
With the above technical solution, word segmentation is performed on the basis of the administrative area dictionary and the place name suffix dictionary, which reflect the characteristics of address text, and each address word obtained by segmentation is marked with a corresponding label. The address level of each address word is then predicted on the basis of these labels. Because the label marked on each address word is added as a new factor in the prediction of address levels, the accuracy of address level prediction is improved.
In another embodiment of the application, in combination with the above embodiments, refer to Fig. 5, which is a flowchart of the method for processing address text provided by another embodiment of the application. As shown in Fig. 5, in addition to steps S11 to S13, the method further includes the following steps:
Step S14: for the address words in the address word sequence, compare whether the address levels of every two adjacent address words are the same;
Step S15: merge two adjacent address words whose address levels are the same, and determine the address level of the merged address word to be the address level of the two adjacent address words before merging.
In this embodiment, the address level prediction result can be further optimized. Specifically, after step S13 has been executed and the address level of each address word has been determined, the address words that make up the address text to be processed are checked to see whether two adjacent address words have the same address level. If they do, the two address words are merged and the address level of the merged address word is kept unchanged.
For example, the address text to be processed is "Kowloon Jordan Pilkem Street No. 41 1st Floor". After steps S11 to S13 the result is "Kowloon/CT Jordan/TN Pil/RD kem/RD Street/RD 41/HN No./HN 1st/LV Floor/LV". Here the address words "Pil" and "kem" are adjacent and both have the address level "RD", and the address words "kem" and "Street" are adjacent and both have the address level "RD", so merging them gives "Pilkem Street/RD". Similarly, the address words "41" and "No.", and the address words "1st" and "Floor", are processed in the same way, giving "No. 41/HN" and "1st Floor/LV".
Thus, for the address text to be processed "Kowloon Jordan Pilkem Street No. 41 1st Floor", the final address level prediction result is "Kowloon/CT Jordan/TN Pilkem Street/RD No. 41/HN 1st Floor/LV".
With the above technical solution, adjacent address words with the same address level are merged and the address level of the merged address word is kept unchanged, which simplifies the address level prediction result for the address text to be processed.
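Steps S14 and S15 can be sketched with a single pass that groups consecutive address words by address level. This is a minimal illustration; the function name `merge_adjacent` and its `sep` parameter are hypothetical. Chinese address words would be joined with an empty separator, whereas the English stand-in words used here are joined with a space.

```python
from itertools import groupby


def merge_adjacent(tagged, sep=""):
    """Merge runs of adjacent (word, level) pairs that share the same address level."""
    merged = []
    # groupby only groups consecutive items, so non-adjacent words are never merged.
    for level, group in groupby(tagged, key=lambda pair: pair[1]):
        merged.append((sep.join(word for word, _ in group), level))
    return merged


predicted = [("Kowloon", "CT"), ("Tsim Sha Tsui", "TN"), ("Kimberley", "RD"),
             ("Road", "RD"), ("45-47", "HN"), ("No.", "HN")]
print(merge_adjacent(predicted, sep=" "))
```

Grouping whole runs gives the same result as repeatedly merging pairs of adjacent address words, and the merged word keeps the level shared by its parts.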
Based on the same inventive concept, one embodiment of the application provides a device for processing address text. Refer to Fig. 6, which is a schematic diagram of the device for processing address text provided by one embodiment of the application. As shown in Fig. 6, the device includes:
A word segmentation module 601, configured to divide the address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, the administrative area dictionary including multiple administrative area words and the place name suffix dictionary including multiple place name suffix words;
A pre-labeling module 602, configured to mark each address word in the address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label, and to mark each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with the default label;
An address level prediction module 603, configured to input each address word in the address word sequence, together with the label marked on each address word in the address word sequence, into an address level prediction model, and to obtain the address level of each address word in the address word sequence.
Optionally, the device further includes:
A comparison module, configured to compare, for the address words in the address word sequence, whether the address levels of every two adjacent address words are the same;
A merging module, configured to merge two adjacent address words whose address levels are the same, and to determine the address level of the merged address word to be the address level of the two adjacent address words before merging.
Optionally, the device further includes:
An obtaining module, configured to obtain multiple sample address words, each sample address word carrying a pre-marked label and a pre-marked address level;
A training module, configured to train a preset model according to the multiple sample address words and the pre-marked labels and pre-marked address levels that they carry, and to obtain the address level prediction model.
Optionally, the obtaining module includes:
A dividing submodule, configured to divide, for each sample address text among multiple sample address texts, the sample address text into a sample address word sequence based on the multiple administrative area words included in the administrative area dictionary and the multiple place name suffix words included in the place name suffix dictionary;
A first marking submodule, configured to mark, for each sample address text among the multiple sample address texts, each sample address word in the sample address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label, and to mark each sample address word in the sample address word sequence that matches neither any administrative area word nor any place name suffix word with the default label;
A second marking submodule, configured to mark, for each sample address text among the multiple sample address texts, each sample address word in the sample address word sequence with the corresponding address level, according to the relevance between the sample address words in the sample address word sequence and the labels marked on the address words in the sequence.
Optionally, the word segmentation module includes:
A preliminary dividing submodule, configured to divide the address text to be processed into an initial word sequence based on the administrative area dictionary;
A re-dividing submodule, configured to divide, based on the place name suffix dictionary, each address word in the initial word sequence that does not match any administrative area word, and to obtain the address word sequence.
Optionally, the word segmentation module includes:
A dividing submodule, configured to divide the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary, in combination with a word segmentation method based on statistical information.
Based on the same inventive concept, another embodiment of the application provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the steps of the method described in any of the above embodiments of the application are implemented.
Based on the same inventive concept, another embodiment of the application provides an electronic device including a memory, a processor, and a computer program that is stored in the memory and can run on the processor. When the processor executes the program, the steps of the method described in any of the above embodiments of the application are implemented.
Because the device embodiment is basically similar to the method embodiment, it is described relatively briefly; for the relevant details, refer to the corresponding parts of the method embodiment.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to one another.
Those skilled in the art should understand that the embodiments of the application may be provided as a method, a device, or a computer program product. The embodiments of the application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The embodiments of the application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to the embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, such that a series of operation steps is executed on the computer or other programmable terminal device to produce computer-implemented processing. The instructions executed on the computer or other programmable terminal device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the application.
Finally, it should be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or terminal device that includes the element.
The method, device, storage medium, and electronic device for processing address text provided by the application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the above embodiments is intended only to help in understanding the method of the application and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the application, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the application.

Claims (10)

1. A method for processing address text, characterized in that the method comprises:
Dividing the address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, the administrative area dictionary including multiple administrative area words and the place name suffix dictionary including multiple place name suffix words;
Marking each address word in the address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label, and marking each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with the default label;
Inputting each address word in the address word sequence, together with the label marked on each address word in the address word sequence, into an address level prediction model, and obtaining the address level of each address word in the address word sequence.
2. The method according to claim 1, characterized in that, after the address level of each address word in the address word sequence is obtained, the method further comprises:
For the address words in the address word sequence, comparing whether the address levels of every two adjacent address words are the same;
Merging two adjacent address words whose address levels are the same, and determining the address level of the merged address word to be the address level of the two adjacent address words before merging.
3. The method according to claim 1, characterized in that the method further comprises:
Obtaining multiple sample address words, each sample address word carrying a pre-marked label and a pre-marked address level;
Training a preset model according to the multiple sample address words and the pre-marked labels and pre-marked address levels that they carry, and obtaining the address level prediction model.
4. The method according to claim 3, characterized in that obtaining multiple sample address words, each sample address word carrying a pre-marked label and a pre-marked address level, comprises:
For each sample address text among multiple sample address texts, executing the following steps:
Dividing the sample address text into a sample address word sequence, based on the multiple administrative area words included in the administrative area dictionary and the multiple place name suffix words included in the place name suffix dictionary;
Marking each sample address word in the sample address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label, and marking each sample address word in the sample address word sequence that matches neither any administrative area word nor any place name suffix word with the default label;
Marking each sample address word in the sample address word sequence with the corresponding address level, according to the relevance between the sample address words in the sample address word sequence and the labels marked on the address words in the sequence.
5. The method according to claim 1, characterized in that dividing the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary comprises:
Dividing the address text to be processed into an initial word sequence based on the administrative area dictionary;
Dividing, based on the place name suffix dictionary, each address word in the initial word sequence that does not match any administrative area word, and obtaining the address word sequence.
6. The method according to claim 5, characterized in that the address level label includes an administrative area word level label and a place name suffix word level label, and marking each address word in the address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label comprises:
Marking each address word in the initial word sequence that matches an administrative area word with the corresponding administrative area word level label;
Marking each address word in the initial word sequence that does not match any administrative area word but matches a place name suffix word with the corresponding place name suffix word level label.
7. The method according to claim 1, characterized in that dividing the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary comprises:
Dividing the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary, in combination with a word segmentation method based on statistical information.
8. A device for processing address text, characterized in that the device comprises:
A word segmentation module, configured to divide the address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, the administrative area dictionary including multiple administrative area words and the place name suffix dictionary including multiple place name suffix words;
A pre-labeling module, configured to mark each address word in the address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label, and to mark each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with the default label;
An address level prediction module, configured to input each address word in the address word sequence, together with the label marked on each address word in the address word sequence, into an address level prediction model, and to obtain the address level of each address word in the address word sequence.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the steps of the method as claimed in claim 1 are implemented when the program is executed by a processor.
10. An electronic device including a memory, a processor, and a computer program that is stored in the memory and can run on the processor, characterized in that the steps of the method as claimed in claim 1 are implemented when the processor executes the program.
CN201910114666.2A 2019-02-14 2019-02-14 Handle method, apparatus, electronic equipment and the readable storage medium storing program for executing of address text Withdrawn CN109977395A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910114666.2A CN109977395A (en) 2019-02-14 2019-02-14 Handle method, apparatus, electronic equipment and the readable storage medium storing program for executing of address text


Publications (1)

Publication Number Publication Date
CN109977395A true CN109977395A (en) 2019-07-05

Family

ID=67076968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910114666.2A Withdrawn CN109977395A (en) 2019-02-14 2019-02-14 Handle method, apparatus, electronic equipment and the readable storage medium storing program for executing of address text

Country Status (1)

Country Link
CN (1) CN109977395A (en)


Similar Documents

Publication Publication Date Title
CN109977395A (en) Method, apparatus, electronic device and readable storage medium for processing address text
CN105718586B (en) Word segmentation method and device
CN105869642B (en) Error correction method and device for speech text
US8819012B2 (en) Accessing anchors in voice site content
WO2017215370A1 (en) Method and apparatus for constructing decision model, computer device and storage device
CN110991187B (en) Entity linking method, device, electronic equipment and medium
WO2016165538A1 (en) Address data management method and device
CN108959242A (en) Target entity recognition method and device based on Chinese character part-of-speech features
CN108121700A (en) Keyword extraction method, device and electronic equipment
US20090067719A1 (en) System and method for automatic segmentation of ASR transcripts
CN106202041A (en) Method and apparatus for solving the entity alignment problem in knowledge graphs
JPWO2008152805A1 (en) Image recognition apparatus and image recognition method
CN106570180A (en) Artificial intelligence based voice searching method and device
CN105096934A (en) Method for constructing speech feature library as well as speech synthesis method, device and equipment
US11669567B2 (en) Method and system for providing audio content
CN112395390B (en) Training corpus generation method of intention recognition model and related equipment thereof
US20130262090A1 (en) System and method for reducing semantic ambiguity
Fiscus et al. Multiple Dimension Levenshtein Edit Distance Calculations for Evaluating Automatic Speech Recognition Systems During Simultaneous Speech.
CN110990520B (en) Address coding method and device, electronic equipment and storage medium
CN105701083A (en) Text representation method and device
CN106205613A (en) Navigation speech recognition method and system
Gildea Optimal parsing strategies for linear context-free rewriting systems
CN113868351A (en) Address clustering method and device, electronic equipment and storage medium
CN107426610A (en) Video information synchronization method and device
CN112149419A (en) Method, device and system for normalized automatic naming of fields

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190705
