CN109977395A - Method, apparatus, electronic device, and readable storage medium for processing address text
- Publication number
- CN109977395A (application CN201910114666.2A)
- Authority
- CN
- China
- Prior art keywords
- address
- word
- level
- place name
- administrative area
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
Embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a readable storage medium for processing address text, so as to improve on the methods for processing address text in the related art. The method includes: dividing an address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, where the administrative area dictionary includes a plurality of administrative area words and the place name suffix dictionary includes a plurality of place name suffix words; labeling each address word in the address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label, and labeling each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with a preset label; and inputting each address word in the address word sequence, together with the label assigned to that address word, into an address level prediction model to obtain the address level of each address word in the address word sequence.
Description
Technical field
The present application relates to the technical field of information processing, and in particular to a method, an apparatus, an electronic device, and a readable storage medium for processing address text.
Background
In large-scale information processing systems built on address text, it is vital that an electronic device accurately identify which address words an address text contains and which address level each of those address words belongs to (for example: country, province, city, district, town, village). The two core steps of this process are word segmentation and labeling: segmentation cuts the address text into individual words, and labeling assigns each word its corresponding address level.
For the segmentation step, one technical solution provided by the related art is to segment the address text with a segmentation model based on a general corpus, such as an HMM (Hidden Markov Model) or a CRF (Conditional Random Field) model. Its drawbacks are as follows. A segmentation model based on a general corpus is poor at recognizing the place name nouns contained in address text, so it segments address text poorly. In addition, during labeling, the address words contained in an address text depend on one another, which does not strictly satisfy the HMM assumption that observations (here, the address words) are independent, so its labeling performance has room for improvement.
Another technical solution provided by the related art for processing address text is to first segment the address text with a pure HMM (Hidden Markov Model) and then label it with a combination of rules and an HMM. Its drawbacks are as follows. The HMM is a segmentation model based on a general corpus, so it segments address text poorly. In addition, during labeling, the address words contained in an address text depend on one another, which does not strictly satisfy the HMM assumption that observations (here, the address words) are independent, so its labeling performance has room for improvement.
It can be seen that the methods for processing address text in the related art need improvement.
Summary of the invention
Embodiments of the present application provide a method, an apparatus, an electronic device, and a readable storage medium for processing address text, so as to improve on the methods for processing address text in the related art.
A first aspect of the embodiments of the present application provides a method for processing address text, the method comprising:
dividing an address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, where the administrative area dictionary includes a plurality of administrative area words and the place name suffix dictionary includes a plurality of place name suffix words;
labeling each address word in the address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label, and labeling each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with a preset label; and
inputting each address word in the address word sequence, together with the label assigned to that address word, into an address level prediction model to obtain the address level of each address word in the address word sequence.
Optionally, after the address level of each address word in the address word sequence is obtained, the method further comprises:
for the address words in the address word sequence, comparing whether the address levels of every two adjacent address words are the same; and
merging two adjacent address words that have the same address level, and setting the address level of the merged address word to the address level that the two adjacent address words had before merging.
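The optional merging step can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions rather than the implementation of this application: the function name `merge_adjacent_same_level`, the `(word, level)` tuple representation, and the sample words and level codes are all hypothetical. A single left-to-right pass compares each address word with the most recently kept one, so a run of more than two same-level words also collapses into a single word. Words are concatenated without a separator, which suits Chinese address text.

```python
def merge_adjacent_same_level(words_with_levels):
    """Merge neighbouring (word, level) pairs that share an address level."""
    merged = []
    for word, level in words_with_levels:
        if merged and merged[-1][1] == level:
            # Same level as the previously kept word: concatenate the text
            # and keep the shared level for the merged word.
            previous_word, _ = merged[-1]
            merged[-1] = (previous_word + word, level)
        else:
            merged.append((word, level))
    return merged


# Hypothetical predicted levels for an already segmented address.
predicted = [("四川省", "ST"), ("人民", "RD"), ("路", "RD"), ("8", "HN"), ("号", "HN")]
result = merge_adjacent_same_level(predicted)
```

Here `result` groups the road name with its suffix and the house number with its suffix, while the province word is left unchanged.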
Optionally, the method further comprises:
obtaining a plurality of sample address words, each sample address word carrying a pre-assigned label and a pre-assigned address level; and
training a preset model according to the plurality of sample address words and the pre-assigned labels and pre-assigned address levels that they carry, to obtain the address level prediction model.
Optionally, obtaining a plurality of sample address words, each sample address word carrying a pre-assigned label and a pre-assigned address level, comprises:
performing the following steps for each sample address text among a plurality of sample address texts:
dividing the sample address text into a sample address word sequence based on the plurality of administrative area words included in the administrative area dictionary and the plurality of place name suffix words included in the place name suffix dictionary;
labeling each sample address word in the sample address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label, and labeling each sample address word in the sample address word sequence that matches neither any administrative area word nor any place name suffix word with a preset label; and
labeling each sample address word in the sample address word sequence with a corresponding address level, according to the relationships among the sample address words in the sample address word sequence and the labels assigned to the address words in the address word sequence.
Optionally, dividing the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary comprises:
dividing the address text to be processed into an initial word sequence based on the administrative area dictionary; and
based on the place name suffix dictionary, further dividing each address word in the initial word sequence that does not match any administrative area word, to obtain the address word sequence.
Optionally, the address level label includes an administrative area word level label and a place name suffix word level label; and labeling each address word in the address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label comprises:
labeling each address word in the initial word sequence that matches an administrative area word with a corresponding administrative area word level label; and
labeling each address word in the initial word sequence that matches no administrative area word but matches a place name suffix word with a corresponding place name suffix word level label.
Optionally, dividing the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary comprises:
dividing the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary, in combination with a segmentation approach based on statistical information.
A second aspect of the embodiments of the present application provides an apparatus for processing address text, the apparatus comprising:
a word segmentation module, configured to divide an address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, where the administrative area dictionary includes a plurality of administrative area words and the place name suffix dictionary includes a plurality of place name suffix words;
a pre-labeling module, configured to label each address word in the address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label, and to label each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with a preset label; and
an address level prediction module, configured to input each address word in the address word sequence, together with the label assigned to that address word, into an address level prediction model to obtain the address level of each address word in the address word sequence.
Optionally, the apparatus further comprises:
a comparison module, configured to compare, for the address words in the address word sequence, whether the address levels of every two adjacent address words are the same; and
a merging module, configured to merge two adjacent address words that have the same address level, and to set the address level of the merged address word to the address level that the two adjacent address words had before merging.
Optionally, the apparatus further comprises:
an obtaining module, configured to obtain a plurality of sample address words, each sample address word carrying a pre-assigned label and a pre-assigned address level; and
a training module, configured to train a preset model according to the plurality of sample address words and the pre-assigned labels and pre-assigned address levels that they carry, to obtain the address level prediction model.
Optionally, the obtaining module comprises:
a dividing submodule, configured to divide, for each sample address text among a plurality of sample address texts, the sample address text into a sample address word sequence based on the plurality of administrative area words included in the administrative area dictionary and the plurality of place name suffix words included in the place name suffix dictionary;
a first labeling submodule, configured to, for each sample address text among the plurality of sample address texts, label each sample address word in the sample address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label, and label each sample address word in the sample address word sequence that matches neither any administrative area word nor any place name suffix word with a preset label; and
a second labeling submodule, configured to, for each sample address text among the plurality of sample address texts, label each sample address word in the sample address word sequence with a corresponding address level, according to the relationships among the sample address words in the sample address word sequence and the labels assigned to the address words in the address word sequence.
Optionally, the word segmentation module comprises:
a preliminary dividing submodule, configured to divide the address text to be processed into an initial word sequence based on the administrative area dictionary; and
a re-dividing submodule, configured to further divide, based on the place name suffix dictionary, each address word in the initial word sequence that does not match any administrative area word, to obtain the address word sequence.
Optionally, the word segmentation module comprises:
a dividing submodule, configured to divide the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary, in combination with a segmentation approach based on statistical information.
A third aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described in the first aspect of the present application.
A fourth aspect of the embodiments of the present application provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the method described in the first aspect of the present application.
With the method for processing address text provided by the embodiments of the present application, word segmentation is performed based on an administrative area dictionary and a place name suffix dictionary that reflect the characteristics of address text, each address word obtained by segmentation is assigned a corresponding label, and the address level of each address word is then predicted based on those labels. Because the address level prediction takes into account an additional factor, namely the label assigned to each address word, the accuracy of address level prediction is improved.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for processing address text according to an embodiment of the present application;
Fig. 2 is a flowchart of a method for processing address text according to another embodiment of the present application;
Fig. 3 is a flowchart of a method for processing address text according to another embodiment of the present application;
Fig. 4 is a flowchart of a method for building an address level prediction model according to another embodiment of the present application;
Fig. 5 is a flowchart of a method for processing address text according to another embodiment of the present application;
Fig. 6 is a schematic diagram of an apparatus for processing address text according to an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Evidently, the described embodiments are some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Referring to Fig. 1, Fig. 1 is a flowchart of a method for processing address text according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
Step S11: dividing an address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, where the administrative area dictionary includes a plurality of administrative area words and the place name suffix dictionary includes a plurality of place name suffix words.
This embodiment takes into account a characteristic of address text: address text contains a large number of place name nouns, which usually either arise from administrative divisions or carry a distinctive suffix. The administrative area dictionary and the place name suffix dictionary are therefore introduced into the segmentation of address text, and the address text is segmented based on these two dictionaries to improve the segmentation result.
The administrative area dictionary is a set of administrative area words and includes a plurality of administrative area words. An administrative area word describes an administrative region, for example: Sichuan Province, Nanjing, Shanghai City, Hong Kong Special Administrative Region.
The place name suffix dictionary is a set of place name suffix words and includes a plurality of place name suffix words. A place name suffix word is the suffix of a place name. Place name suffix words may include the suffixes of the above administrative area words, such as "province" and "city", and may also include the suffixes of other place names, such as "number", "building", and "floor".
In one embodiment, the text to be processed is divided into a plurality of address words based on the administrative area dictionary and the place name suffix dictionary, using a dictionary-based segmentation approach (for example, forward maximum matching (FMM), backward maximum matching (BMM), or bidirectional scanning); the sequence formed by these address words is the address word sequence. For dictionary-based segmentation approaches, reference may be made to the related art, and details are not repeated here.
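As an illustration of dictionary-based segmentation, the following Python sketch implements forward maximum matching over the union of the two dictionaries. It is a sketch under assumptions, not the implementation of this application: the function name `forward_max_match`, the toy dictionary contents, and the sample address are hypothetical, and grouping consecutive unmatched characters into a single word is a design choice made here for readability rather than something the text prescribes.

```python
def forward_max_match(text, dictionary):
    """Scan left to right, always taking the longest dictionary entry that
    starts at the current position. Characters covered by no entry are
    collected into runs and emitted as unknown words."""
    max_len = max((len(entry) for entry in dictionary), default=1)
    words, unknown, i = [], "", 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                match = text[i:i + size]
                break
        if match is None:
            unknown += text[i]
            i += 1
            continue
        if unknown:
            words.append(unknown)
            unknown = ""
        words.append(match)
        i += len(match)
    if unknown:
        words.append(unknown)
    return words


# Hypothetical toy dictionaries, for illustration only.
admin_dict = {"四川省", "成都市", "武侯区"}
suffix_dict = {"路", "号", "楼"}
segments = forward_max_match("四川省成都市武侯区人民路8号", admin_dict | suffix_dict)
```

Backward maximum matching would scan from the end of the text instead, and a bidirectional scan would run both directions and pick one result by some tie-breaking rule.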
In another embodiment, referring to Fig. 2, Fig. 2 is a flowchart of a method for processing address text according to another embodiment of the present application. As shown in Fig. 2, the method includes steps S12 to S13 and the following step:
Step S11': dividing the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary, in combination with a segmentation approach based on statistical information.
To optimize the segmentation result, this embodiment combines the dictionary-based segmentation approach with a segmentation approach based on statistical information. Specifically, the address text is segmented based on the administrative area dictionary and the place name suffix dictionary, in combination with an approach based on statistical information (for example, a hidden Markov model (HMM), a conditional random field (CRF), or the open-source jieba segmentation tool).
In another embodiment, referring to Fig. 3, Fig. 3 is a flowchart of a method for processing address text according to another embodiment of the present application. As shown in Fig. 3, the method includes steps S12 to S13 and the following steps:
Step S111: dividing the address text to be processed into an initial word sequence based on the plurality of administrative area words included in the administrative area dictionary;
Step S112: based on the plurality of place name suffix words included in the place name suffix dictionary, further dividing each address word in the initial word sequence that does not match any administrative area word, to obtain the address word sequence.
In a specific implementation, the segmentation of the address text includes two sub-steps. First, the address text to be processed is preliminarily segmented based on the administrative area dictionary, in combination with a segmentation approach based on statistical information. Then, the address word sequence obtained by preliminary segmentation is segmented again based on the place name suffix dictionary, in combination with a segmentation approach based on statistical information, to obtain the address word sequence that makes up the address text to be processed.
For example, the address text to be processed is "Kowloon Tsim Sha Tsui Kimberley Road No. 45-47 Kimberley Plaza 1st Floor". Preliminary segmentation of this address text yields the address word sequence "Kowloon / Tsim Sha Tsui / Kimberley Road / 45-47 / No. / Kimberley Plaza / 1 / Floor". Segmenting that sequence again yields the address word sequence that makes up the address text to be processed: "Kowloon / Tsim Sha Tsui / Kimberley / Road / 45-47 / No. / Kimberley / Plaza / 1 / Floor".
Comparing the word sequence after preliminary segmentation with the word sequence after re-segmentation shows that re-segmentation divides the address word "Kimberley Road" into the two address words "Kimberley" and "Road", and divides the address word "Kimberley Plaza" into the two address words "Kimberley" and "Plaza".
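The two sub-steps can be sketched in Python as follows. This is a simplified illustration under assumptions: it uses dictionary lookup only and leaves out the statistical segmentation that the text combines with each sub-step, and the names `split_by_dictionary` and `two_stage_segment`, the toy dictionaries, and the sample address are hypothetical.

```python
def split_by_dictionary(text, dictionary):
    """Longest-match scan; text not covered by any entry is kept as runs."""
    max_len = max((len(entry) for entry in dictionary), default=1)
    words, unknown, i = [], "", 0
    while i < len(text):
        match = None
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                match = text[i:i + size]
                break
        if match is None:
            unknown += text[i]
            i += 1
            continue
        if unknown:
            words.append(unknown)
            unknown = ""
        words.append(match)
        i += len(match)
    if unknown:
        words.append(unknown)
    return words


def two_stage_segment(text, admin_dict, suffix_dict):
    """Stage 1 splits on administrative area words; stage 2 re-splits only
    the words that matched no administrative area word, using suffixes."""
    initial = split_by_dictionary(text, admin_dict)
    final = []
    for word in initial:
        if word in admin_dict:
            final.append(word)  # already an administrative area word
        else:
            final.extend(split_by_dictionary(word, suffix_dict))
    return initial, final


# Hypothetical toy dictionaries and address, for illustration only.
admin_dict = {"四川省", "成都市", "武侯区"}
suffix_dict = {"路", "号"}
initial, final = two_stage_segment("四川省成都市武侯区人民路8号", admin_dict, suffix_dict)
```

In this sketch the first stage leaves the street part as one unmatched word, and the second stage separates the road name and the house number from their suffixes, mirroring the way re-segmentation separates a place name from its suffix.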
The above segmentation scheme uses the administrative area dictionary and the place name suffix dictionary, which reflect the characteristics of address text, to segment the address text, improving the segmentation result. Furthermore, on the basis of these two dictionaries, combining the dictionary-based segmentation approach with the segmentation approach based on statistical information improves the segmentation result further.
Step S12: labeling each address word in the address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label, and labeling each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with a preset label.
After step S11 is performed, the address text to be processed has been divided into a plurality of address words, which make up the address word sequence. Each address word in the address word sequence is compared with each administrative area word in the administrative area dictionary and each place name suffix word in the place name suffix dictionary, and the address word is labeled with a corresponding label according to the comparison result.
In one embodiment, the address level label includes an administrative area word level label and a place name suffix word level label. In step S12, labeling each address word in the address word sequence that matches an administrative area word or a place name suffix word with a corresponding address level label includes:
labeling each address word in the initial word sequence that matches an administrative area word with a corresponding administrative area word level label; and
labeling each address word in the initial word sequence that matches no administrative area word but matches a place name suffix word with a corresponding place name suffix word level label.
In this embodiment, an administrative area word level label denotes the level of an administrative area word in the administrative area dictionary, that is, the level prescribed in advance by the administrative division. By way of example, Table 1 lists the administrative area word level labels.
Table 1: Administrative area word level labels
Level number | Level name | English abbreviation | Description |
0 | Country | CTR | Country, or an administratively equivalent level |
2 | Province | ST | Province, municipality directly under the central government, or an administratively equivalent level |
3 | City | CT | Prefecture-level city, or an administratively equivalent level |
5 | District | DS | District, county, or an administratively equivalent level |
6 | Town | TN | Town, township, or an administratively equivalent level |
7 | Village | VG | Village, or an administratively equivalent level |
10 | Street | RD | Lane, road section, road, and similar, or an equivalent level |
A place name suffix word level label denotes the level of a place name suffix word in the place name suffix dictionary. In addition to the levels of administrative area words, these levels include the levels of other place name suffix words. By way of example, Table 2 lists the place name suffix word level labels.
Table 2: Place name suffix word level labels
In this embodiment, each address word in the address word sequence is compared with each administrative area word in the administrative area dictionary and each place name suffix word in the place name suffix dictionary, and, according to the comparison result, one of the following three labeling modes is selected to label the address word:
1) if the address word matches an administrative area word in the administrative area dictionary, the address word is labeled with the address level label corresponding to the level of the matched administrative area word;
2) if the address word matches a place name suffix word in the place name suffix dictionary, the address word is labeled with the address level label corresponding to the level of the matched place name suffix word;
3) if the address word matches neither any administrative area word in the administrative area dictionary nor any place name suffix word in the place name suffix dictionary, the address word is labeled with the preset label.
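The three labeling modes can be sketched in Python as a lookup with a fallback. This is a minimal sketch under assumptions: matching is reduced to exact dictionary lookup, whereas the matching described in the text may be looser, and the function name `pre_label` and the dictionary contents are hypothetical. The label codes ST, CT, DS, RD-SF, HN-SF, and NULL are the ones that appear in this document.

```python
# Hypothetical dictionaries mapping each word to its level label.
ADMIN_LEVELS = {"四川省": "ST", "成都市": "CT", "武侯区": "DS"}
SUFFIX_LEVELS = {"路": "RD-SF", "号": "HN-SF"}
PRESET_LABEL = "NULL"


def pre_label(address_words):
    """Attach a label to every address word using the three modes."""
    labeled = []
    for word in address_words:
        if word in ADMIN_LEVELS:
            label = ADMIN_LEVELS[word]    # mode 1: administrative area word
        elif word in SUFFIX_LEVELS:
            label = SUFFIX_LEVELS[word]   # mode 2: place name suffix word
        else:
            label = PRESET_LABEL          # mode 3: no match, preset label
        labeled.append((word, label))
    return labeled


labeled = pre_label(["四川省", "成都市", "武侯区", "人民", "路", "8", "号"])
```

Checking the administrative area dictionary before the suffix dictionary reflects the order in which the two label types are described above; the resulting (word, label) pairs are the kind of input that the address level prediction model consumes.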
For example, the address text to be processed is "Kowloon Tsim Sha Tsui Kimberley Road No. 45-47 Kimberley Plaza 1st Floor". First, the address text to be processed is preliminarily segmented to obtain the initial word sequence "Kowloon / Tsim Sha Tsui / Kimberley Road / 45-47 / No. / Kimberley Plaza / 1 / Floor".
For the address word "Kowloon" in the initial word sequence, the administrative area word matching "Kowloon" is determined to be "Kowloon City", whose level is "City (CT)"; the address word "Kowloon" is therefore labeled with the administrative area word level label "City (CT)".
For the address word "Tsim Sha Tsui" in the initial word sequence, the administrative area word matching "Tsim Sha Tsui" is determined to be "Tsim Sha Tsui Town", whose level is "Town (TN)"; the address word "Tsim Sha Tsui" is therefore labeled with the administrative area word level label "Town (TN)".
For the address word "Kimberley Road" in the initial word sequence, it is determined that "Kimberley Road" matches neither any administrative area word nor any place name suffix word, that is, no administrative area word or place name suffix word matches "Kimberley Road"; the address word "Kimberley Road" is therefore labeled with the preset label "empty (NULL)".
The other address words in the initial word sequence are labeled in the same way, giving the following preliminary labeling result:
Kowloon/CT Tsim Sha Tsui/TN Kimberley Road/NULL 45-47/NULL No./NULL Kimberley Plaza/NULL 1/NULL Floor/NULL
Then, the initial word sequence is segmented again to obtain the address word sequence that makes up the text to be processed: "Kowloon / Tsim Sha Tsui / Kimberley / Road / 45-47 / No. / Kimberley / Plaza / 1 / Floor".
For the address word "Kimberley" in this address word sequence, it is determined that "Kimberley" matches neither any administrative area word nor any place name suffix word, that is, no administrative area word or place name suffix word matches "Kimberley"; the address word "Kimberley" is therefore labeled with the preset label "empty (NULL)".
For the address word "Road" in this address word sequence, the place name suffix word matching "Road" is determined to be "road", whose level is "road (RD-SF)"; the address word "Road" is therefore labeled with the place name suffix word level label "road (RD-SF)".
Similarly, for the address word "45-47" in this address word sequence, the place name suffix word matching "45-47" is determined to be "numeral", whose level is "numeral (CN-SF)"; the address word "45-47" is therefore labeled with the place name suffix word level label "numeral (CN-SF)".
For the address word "No." in this address word sequence, the place name suffix word matching "No." is determined to be "house number", whose level is "house number (HN-SF)"; the address word "No." is therefore labeled with the place name suffix word level label "house number (HN-SF)".
The other address words in this address word sequence are labeled in the same way, giving the following final labeling result:
Kowloon/CT Tsim Sha Tsui/TN Kimberley/NULL Road/RD-SF 45-47/CN-SF No./HN-SF Kimberley/NULL Plaza/POI-SF 1/CN-SF Floor/LV-SF
With the above labeling scheme for the address word sequence, each address word is labeled with the address level label corresponding to the administrative area word or place name suffix word that it matches, which lays the foundation for subsequently determining the address level of each address word in the address word sequence.
Step S13: by each address word in the address word sequence and to each address in the address word sequence
The label that word marks respectively, input address level prediction model, each address word obtained in the address word sequence are respective
Address level.
In one embodiment, refer to Fig. 4, which is a flowchart of a method for building the address level prediction model provided by another embodiment of the application. As shown in Fig. 4, the above address level prediction model is obtained through the following steps:
Step S41: obtain multiple sample address words, each sample address word carrying a pre-marked label and a pre-marked address level;
Step S42: train a preset model according to the multiple sample address words and the pre-marked labels and pre-marked address levels that they respectively carry, to obtain the address level prediction model.
In one embodiment, step S41 includes:
For each sample address text among multiple sample address texts, performing the following steps:
Dividing the sample address text into a sample address word sequence based on the multiple administrative area words included in the administrative area dictionary and the multiple place name suffix words included in the place name suffix dictionary;
Marking each sample address word in the sample address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label, and marking each sample address word in the sample address word sequence that matches neither any administrative area word nor any place name suffix word with the default label;
Marking each sample address word in the sample address word sequence with the corresponding address level according to the relevance between the sample address words in the sample address word sequence and the labels marked for the address words in the address word sequence.
In this embodiment, the sample address text and the address text to be processed are both essentially address texts. To obtain the above address level prediction model, a portion of the address texts is first taken as sample address texts. Each sample address text is then treated as the address text to be processed in steps S11 to S12, and steps S11 to S12 are performed on it, yielding the multiple sample address words that make up the sample address text and the pre-marked labels that they respectively carry.
For example, a sample address text is "Kowloon, Tsim Sha Tsui, Kimberley Road No. 45-47, Kimberley Plaza, 1st Floor". After steps S11 to S12, the resulting sample address words and the labels marked for them are shown in Table 3.
Table 3: Multiple sample address words and the labels marked for them
Sample address word | Label marked for the sample address word |
Kowloon | CT |
Tsim Sha Tsui | TN |
Kimberley | NULL |
Road | RD-SF |
45-47 | CN-SF |
No. | HN-SF |
Kimberley | NULL |
Plaza | POI-SF |
1 | CN-SF |
Floor | LV-SF |
Then, for each sample address text, each sample address word is marked with the corresponding address level according to the relevance between the sample address words that make up the sample address text and the labels marked for the address words. The address level marking can be done manually: the relevance between the sample address words is analyzed by hand and combined with the labels marked for the address words to judge the address level of each sample address word, which is then marked accordingly.
Specifically, for each sample address word that makes up the sample address text: if the label marked for the sample address word in step S12 is an address level label (an administrative area word level label or a place name suffix word level label), the address level of the sample address word is marked as the address level characterized by that address level label; if the label marked for the sample address word in step S12 is the default label (for example, NULL), the address level of the sample address word is marked according to the relevance between the sample address word and its adjacent sample address words and the address level labels marked for those adjacent sample address words.
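Although the text describes this annotation as a manual process, the rule itself can be sketched in code. The sketch below rests on assumptions: a word carrying the default label inherits the level of the word that follows it, and words labeled "CN-SF" are treated the same way, which is only inferred from the example output in this document where numerals end up with the level of the following suffix. The name `annotate_levels` is hypothetical.

```python
from typing import List, Optional

# Assumption: labels in this set take their level from a neighbouring word.
INHERITING_LABELS = {"NULL", "CN-SF"}


def annotate_levels(labels: List[str]) -> List[Optional[str]]:
    """Derive an address level for each word from its pre-marked label."""
    n = len(labels)
    levels: List[Optional[str]] = [None] * n
    # Walk right to left so the level of the following word is already known.
    for i in range(n - 1, -1, -1):
        label = labels[i]
        if label in INHERITING_LABELS:
            levels[i] = levels[i + 1] if i + 1 < n else None
        elif label.endswith("-SF"):
            levels[i] = label[:-3]  # "RD-SF" characterises level "RD"
        else:
            levels[i] = label  # administrative area labels such as "CT"
    # Fallback: a trailing word with nothing after it copies its left neighbour.
    for i in range(1, n):
        if levels[i] is None:
            levels[i] = levels[i - 1]
    return levels


labels = ["CT", "TN", "NULL", "NULL", "RD-SF", "CN-SF", "HN-SF", "CN-SF", "LV-SF"]
print(annotate_levels(labels))
```

A human annotator can of course override such a rule wherever adjacent words do not share a level; the sketch covers only the common case described above.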
For example, a sample address text is "Kowloon, Tsim Sha Tsui, Kimberley Road No. 45-47, Kimberley Plaza, 1st Floor". After steps S11 to S12, the sample address words "Kowloon", "Kimberley" and "Road" are obtained. In step S12, the address level label marked for the sample address word "Kowloon" is "CT", so the address level marked for "Kowloon" is "CT". Likewise, the address level label marked for the sample address word "Road" in step S12 is "RD-SF", so the address level marked for "Road" is "RD". In step S12, the sample address word "Kimberley" is marked with the default label "NULL", while the address level label marked for its adjacent sample address word "Road" is "RD-SF". Because "Kimberley" is adjacent to "Road" and the two usually share the same address level, the address level marked for "Kimberley" is "RD". The same method is applied to the other sample address words; the resulting sample address words, together with the labels and address levels marked for them, are shown in Table 4.
Table 4: Multiple sample address words and the labels and address levels marked for them
After the multiple sample address words and the pre-marked labels and pre-marked address levels that they respectively carry are obtained, a preset model (for example, a CRF model) is trained. The trained preset model is able to predict the address level of each address word contained in a single address text; this trained preset model is the address level prediction model.
With the above technical solution, a new type of feature is added as input to the preset model during training: the pre-marked label carried by each sample address word. The resulting address level prediction model therefore has a higher recognition rate and accuracy, which improves its recognition capability and recognition precision.
After the address level prediction model is obtained, for an address text to be processed, steps S11 to S12 are performed to obtain the address words that make up the address text to be processed and the labels marked for them. These address words and labels are then input into the address level prediction model to obtain the address level of each address word that makes up the address text to be processed.
For example, the address text to be processed is "Kowloon, Jordan, Pilkem Street No. 41, 1st Floor". After steps S11 to S12, the address words and the labels marked for them are "Kowloon/CT Jordan/TN Pil/NULL kem/NULL Street/RD-SF 41/CN-SF No./HN-SF 1/CN-SF Floor/LV-SF". Taking this sequence as input to the address level prediction model gives the output "Kowloon/CT Jordan/TN Pil/RD kem/RD Street/RD 41/HN No./HN 1/LV Floor/LV", which completes the prediction of the address level of each address word making up the address text to be processed.
With the above technical solution, word segmentation is performed based on an administrative area dictionary and a place name suffix dictionary that reflect the characteristics of address text, and each address word obtained by segmentation is marked with a corresponding label. The address level of each address word is then predicted based on those labels. Because the label marked for each address word is introduced as an additional factor during address level prediction, the accuracy of address level prediction is improved.
In another embodiment of the application, in combination with the above embodiments, refer to Fig. 5, which is a flowchart of a method for processing address text provided by another embodiment of the application. As shown in Fig. 5, in addition to steps S11 to S13, the method further includes the following steps:
Step S14: for the address words in the address word sequence, compare whether the address levels of every two adjacent address words are the same;
Step S15: merge two adjacent address words whose address levels are the same, and determine the address level of the merged address word to be the address level of the two adjacent address words before merging.
In this embodiment, the address level prediction result can be further optimized. Specifically, after step S13 is performed and the address level of each address word is determined, the address words making up the address text to be processed are checked to see whether any two adjacent address words have the same address level. If so, the two address words are merged and the address level of the merged address word is kept unchanged.
For example, the address text to be processed is "Kowloon, Jordan, Pilkem Street No. 41, 1st Floor". After steps S11 to S13, the result is "Kowloon/CT Jordan/TN Pil/RD kem/RD Street/RD 41/HN No./HN 1/LV Floor/LV". Here the address words "Pil" and "kem" are adjacent and both have the address level "RD", and the address words "kem" and "Street" are adjacent and both have the address level "RD"; merging them gives "Pilkem Street/RD". Likewise, the address words "41" and "No.", and the address words "1" and "Floor", are handled in the same way, giving "No. 41/HN" and "1st Floor/LV".
Thus, for the address text to be processed "Kowloon, Jordan, Pilkem Street No. 41, 1st Floor", the final address level prediction result is "Kowloon/CT Jordan/TN Pilkem Street/RD No. 41/HN 1st Floor/LV".
With the above technical solution, adjacent address words with the same address level are merged and the address level of the merged address word is kept unchanged, which simplifies the address level prediction result for the address text to be processed.
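Steps S14 and S15 amount to collapsing runs of adjacent words that share a level. A minimal sketch follows; the function name `merge_same_level` is hypothetical, and the Chinese tokens are an assumed reconstruction of the translated example address (Kowloon, Jordan, Pilkem Street, No. 41, 1st Floor), used here because Chinese words concatenate without separators.

```python
from typing import List, Tuple


def merge_same_level(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge adjacent (word, level) pairs whose levels are identical."""
    merged: List[Tuple[str, str]] = []
    for word, level in tagged:
        if merged and merged[-1][1] == level:
            # Same level as the previous word: extend it, keep the level.
            merged[-1] = (merged[-1][0] + word, level)
        else:
            merged.append((word, level))
    return merged


# Assumed reconstruction of the model output for the example address.
predicted = [
    ("九龙", "CT"), ("佐敦", "TN"),
    ("庇", "RD"), ("利金", "RD"), ("街", "RD"),
    ("41", "HN"), ("号", "HN"),
    ("一", "LV"), ("楼", "LV"),
]
print(merge_same_level(predicted))
```

Because each word is compared with the already merged run before it, a run of three or more words at the same level collapses in a single left-to-right pass.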
Based on the same inventive concept, an embodiment of the application provides a device for processing address text. Refer to Fig. 6, which is a schematic diagram of the device for processing address text provided by an embodiment of the application. As shown in Fig. 6, the device includes:
A word segmentation module 601, configured to divide the address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, the administrative area dictionary including multiple administrative area words and the place name suffix dictionary including multiple place name suffix words;
A pre-labeling module 602, configured to mark each address word in the address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label, and to mark each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with the default label;
An address level prediction module 603, configured to input each address word in the address word sequence, together with the label marked for each address word in the address word sequence, into an address level prediction model to obtain the address level of each address word in the address word sequence.
Optionally, the device further includes:
A comparison module, configured to compare, for the address words in the address word sequence, whether the address levels of every two adjacent address words are the same;
A merging module, configured to merge two adjacent address words whose address levels are the same, and to determine the address level of the merged address word to be the address level of the two adjacent address words before merging.
Optionally, the device further includes:
An obtaining module, configured to obtain multiple sample address words, each sample address word carrying a pre-marked label and a pre-marked address level;
A training module, configured to train a preset model according to the multiple sample address words and the pre-marked labels and pre-marked address levels that they respectively carry, to obtain the address level prediction model.
Optionally, the obtaining module includes:
A division submodule, configured to divide, for each sample address text among multiple sample address texts, the sample address text into a sample address word sequence based on the multiple administrative area words included in the administrative area dictionary and the multiple place name suffix words included in the place name suffix dictionary;
A first marking submodule, configured to mark, for each sample address text among the multiple sample address texts, each sample address word in the sample address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label, and to mark each sample address word in the sample address word sequence that matches neither any administrative area word nor any place name suffix word with the default label;
A second marking submodule, configured to mark, for each sample address text among the multiple sample address texts, each sample address word in the sample address word sequence with the corresponding address level according to the relevance between the sample address words in the sample address word sequence and the labels marked for the address words in the address word sequence.
Optionally, the word segmentation module includes:
A preliminary division submodule, configured to divide the address text to be processed into an initial word sequence based on the administrative area dictionary;
A re-division submodule, configured to divide, based on the place name suffix dictionary, each address word in the initial word sequence that does not match any administrative area word, to obtain the address word sequence.
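The preliminary division followed by re-division can be sketched as two passes of a dictionary-driven splitter. This is an illustration under assumptions: longest-match scanning is chosen here for simplicity, the dictionary contents and function names are hypothetical, and the Chinese address string is an assumed reconstruction of the translated example (Kowloon, Tsim Sha Tsui, Kimberley Road No. 45-47, Kimberley Plaza, 1st Floor).

```python
from typing import Iterable, List, Set


def split_by_dictionary(text: str, dictionary: Iterable[str]) -> List[str]:
    """Cut text so every dictionary word found becomes its own piece."""
    # Try longer entries first so that longer matches win over shorter ones.
    entries = sorted((w for w in dictionary if w), key=len, reverse=True)
    pieces: List[str] = []
    buffer = ""
    i = 0
    while i < len(text):
        match = next((w for w in entries if text.startswith(w, i)), None)
        if match is not None:
            if buffer:
                pieces.append(buffer)  # flush text seen since the last match
                buffer = ""
            pieces.append(match)
            i += len(match)
        else:
            buffer += text[i]
            i += 1
    if buffer:
        pieces.append(buffer)
    return pieces


def segment(text: str, admin_words: Set[str], suffix_words: Set[str]) -> List[str]:
    """Preliminary division by administrative areas, then re-division by suffixes."""
    result: List[str] = []
    for piece in split_by_dictionary(text, admin_words):
        if piece in admin_words:
            result.append(piece)  # administrative area words are kept whole
        else:
            result.extend(split_by_dictionary(piece, suffix_words))
    return result


admin = {"九龙", "尖沙咀"}            # hypothetical administrative area dictionary
suffix = {"道", "号", "广场", "楼"}   # hypothetical place name suffix dictionary
print(segment("九龙尖沙咀金巴利道45-47号金巴利广场1楼", admin, suffix))
```

Splitting on administrative areas first keeps a suffix character that happens to occur inside an administrative area name from breaking that name apart.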
Optionally, the word segmentation module includes:
A division submodule, configured to divide the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary, in combination with a word segmentation approach based on statistical information.
Based on the same inventive concept, another embodiment of the application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the method described in any of the above embodiments of the application are implemented.
Based on the same inventive concept, another embodiment of the application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the steps of the method described in any of the above embodiments of the application are implemented.
Because the device embodiment is basically similar to the method embodiment, it is described relatively briefly; for related details, refer to the corresponding description of the method embodiment.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to one another.
Those skilled in the art should understand that the embodiments of the application may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The embodiments of the application are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operation steps are performed on the computer or other programmable terminal device to produce computer-implemented processing. The instructions executed on the computer or other programmable terminal device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the embodiments of the present application have been described, once a person skilled in the art knows bases
This creative concept, then additional changes and modifications can be made to these embodiments.So the following claims are intended to be interpreted as
Including preferred embodiment and all change and modification within the scope of the embodiments of the present application.
Finally, it is to be noted that, herein, relational terms such as first and second and the like be used merely to by
One entity or operation are distinguished with another entity or operation, without necessarily requiring or implying these entities or operation
Between there are any actual relationship or orders.Moreover, the terms "include", "comprise" or its any other variant meaning
Covering non-exclusive inclusion, so that process, method, article or terminal device including a series of elements not only wrap
Those elements are included, but also including other elements that are not explicitly listed, or further includes for this process, method, article
Or the element that terminal device is intrinsic.In the absence of more restrictions, being wanted by what sentence "including a ..." limited
Element, it is not excluded that there is also other identical elements in process, method, article or the terminal device for including the element.
The method, device, storage medium, and electronic device for processing address text provided by the application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the above embodiments is intended only to help in understanding the method of the application and its core idea. Meanwhile, for those of ordinary skill in the art, the specific implementations and the scope of application may change in accordance with the idea of the application. In summary, the content of this specification should not be construed as limiting the application.
Claims (10)
1. A method for processing address text, characterized in that the method comprises:
Dividing an address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, the administrative area dictionary including multiple administrative area words and the place name suffix dictionary including multiple place name suffix words;
Marking each address word in the address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label, and marking each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with a default label;
Inputting each address word in the address word sequence, together with the label marked for each address word in the address word sequence, into an address level prediction model to obtain the address level of each address word in the address word sequence.
2. The method according to claim 1, characterized in that, after the address level of each address word in the address word sequence is obtained, the method further comprises:
For the address words in the address word sequence, comparing whether the address levels of every two adjacent address words are the same;
Merging two adjacent address words whose address levels are the same, and determining the address level of the merged address word to be the address level of the two adjacent address words before merging.
3. The method according to claim 1, characterized in that the method further comprises:
Obtaining multiple sample address words, each sample address word carrying a pre-marked label and a pre-marked address level;
Training a preset model according to the multiple sample address words and the pre-marked labels and pre-marked address levels that they respectively carry, to obtain the address level prediction model.
4. The method according to claim 3, characterized in that obtaining multiple sample address words, each sample address word carrying a pre-marked label and a pre-marked address level, comprises:
For each sample address text among multiple sample address texts, performing the following steps:
Dividing the sample address text into a sample address word sequence based on the multiple administrative area words included in the administrative area dictionary and the multiple place name suffix words included in the place name suffix dictionary;
Marking each sample address word in the sample address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label, and marking each sample address word in the sample address word sequence that matches neither any administrative area word nor any place name suffix word with the default label;
Marking each sample address word in the sample address word sequence with the corresponding address level according to the relevance between the sample address words in the sample address word sequence and the labels marked for the address words in the address word sequence.
5. The method according to claim 1, characterized in that dividing the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary comprises:
Dividing the address text to be processed into an initial word sequence based on the administrative area dictionary;
Dividing, based on the place name suffix dictionary, each address word in the initial word sequence that does not match any administrative area word, to obtain the address word sequence.
6. The method according to claim 5, characterized in that the address level label includes an administrative area word level label and a place name suffix word level label, and marking each address word in the address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label comprises:
Marking each address word in the initial word sequence that matches an administrative area word with the corresponding administrative area word level label;
Marking each address word in the initial word sequence that does not match any administrative area word but matches a place name suffix word with the corresponding place name suffix word level label.
7. The method according to claim 1, characterized in that dividing the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary comprises:
Dividing the address text to be processed into an address word sequence based on the administrative area dictionary and the place name suffix dictionary, in combination with a word segmentation approach based on statistical information.
8. A device for processing address text, characterized in that the device comprises:
A word segmentation module, configured to divide an address text to be processed into an address word sequence based on an administrative area dictionary and a place name suffix dictionary, the administrative area dictionary including multiple administrative area words and the place name suffix dictionary including multiple place name suffix words;
A pre-labeling module, configured to mark each address word in the address word sequence that matches an administrative area word or a place name suffix word with the corresponding address level label, and to mark each address word in the address word sequence that matches neither any administrative area word nor any place name suffix word with a default label;
An address level prediction module, configured to input each address word in the address word sequence, together with the label marked for each address word in the address word sequence, into an address level prediction model to obtain the address level of each address word in the address word sequence.
9. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the steps of the method as claimed in claim 1 are implemented.
10. An electronic device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor implements the steps of the method as claimed in claim 1 when executing the program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910114666.2A CN109977395A (en) | 2019-02-14 | 2019-02-14 | Handle method, apparatus, electronic equipment and the readable storage medium storing program for executing of address text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109977395A true CN109977395A (en) | 2019-07-05 |
Family
ID=67076968
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910114666.2A Withdrawn CN109977395A (en) | 2019-02-14 | 2019-02-14 | Handle method, apparatus, electronic equipment and the readable storage medium storing program for executing of address text |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109977395A (en) | Handle method, apparatus, electronic equipment and the readable storage medium storing program for executing of address text | |
CN105718586B (en) | Word segmentation method and device | |
CN105869642B (en) | Error correction method and device for speech text | |
US8819012B2 (en) | Accessing anchors in voice site content | |
WO2017215370A1 (en) | Method and apparatus for constructing decision model, computer device and storage device | |
CN110991187B (en) | Entity linking method, device, electronic equipment and medium | |
WO2016165538A1 (en) | Address data management method and device | |
CN108959242A (en) | Target entity recognition method and device based on Chinese character part-of-speech features | |
CN108121700A (en) | Keyword extraction method, device, and electronic equipment | |
US20090067719A1 (en) | System and method for automatic segmentation of ASR transcripts | |
CN106202041A (en) | Method and apparatus for solving the entity alignment problem in knowledge graphs | |
JPWO2008152805A1 (en) | Image recognition apparatus and image recognition method | |
CN106570180A (en) | Artificial intelligence based voice searching method and device | |
CN105096934A (en) | Method for constructing speech feature library as well as speech synthesis method, device and equipment | |
US11669567B2 (en) | Method and system for providing audio content | |
CN112395390B (en) | Training corpus generation method of intention recognition model and related equipment thereof | |
US20130262090A1 (en) | System and method for reducing semantic ambiguity | |
Fiscus et al. | Multiple Dimension Levenshtein Edit Distance Calculations for Evaluating Automatic Speech Recognition Systems During Simultaneous Speech. | |
CN110990520B (en) | Address coding method and device, electronic equipment and storage medium | |
CN105701083A (en) | Text representation method and device | |
CN106205613A (en) | Navigation speech recognition method and system | |
Gildea | Optimal parsing strategies for linear context-free rewriting systems | |
CN113868351A (en) | Address clustering method and device, electronic equipment and storage medium | |
CN107426610A (en) | Video information synchronization method and device | |
CN112149419A (en) | Method, device and system for normalized automatic naming of fields |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20190705 |