CN104239355B - Search-engine-oriented data processing method and device - Google Patents
Search-engine-oriented data processing method and device
- Publication number
- CN104239355B (CN201310250057.2A)
- Authority
- CN
- China
- Prior art keywords
- participle
- word
- speech
- investigating
- speech tagging
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Remote Sensing (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a search-engine-oriented data processing method and device. The method comprises: segmenting an address query string to obtain the segment set of the address query string; adding geographic-word part-of-speech tags to the address query string; and judging whether the address query string is segmented at the positions where the geographic-word part-of-speech tags were added and, if so, obtaining an unregistered word from the segments in the segment set. Correspondingly, an embodiment of the invention also provides a device that implements the method. The invention improves the efficiency and accuracy of unregistered-word identification.
Description
Technical field
The present invention relates to the technical field of search-engine data processing, and more specifically to a search-engine-oriented data processing method and device.
Background
An address search engine is a vertical search engine that provides address search services to users through an address segmentation dictionary built by collecting, organizing and processing address information. To return accurate search results, the address segmentation dictionary has to be improved continuously. One way to improve it is to identify addresses that are not yet included in the dictionary and add their information to it. An address that is not included in the address segmentation dictionary is called an unregistered word.
Existing methods for identifying unregistered address words are based either on morphological rules or on statistics. Rule-based methods rely on fixed, manually written word-formation rules; however, new words in address queries, especially trade names and brand names, have no fixed form, so such methods are prone to omissions and inaccurate identification. Statistical methods estimate how likely individual characters are to form a word from the frequency with which they occur together; since most new address words occur with low frequency, these methods also produce inaccurate results.
Summary of the invention
In view of this, the present invention provides a search-engine-oriented data processing method and device, in order to improve the efficiency and accuracy of unregistered-word identification.
An embodiment of the present invention provides a search-engine-oriented data processing method, the method comprising:
segmenting an address query string to obtain the segment set of the address query string;
adding geographic-word part-of-speech tags to the address query string;
judging whether the address query string is segmented at the positions where the geographic-word part-of-speech tags were added and, if so, obtaining an unregistered word from the segments in the segment set.
Further, an embodiment of the present invention also provides a search-engine-oriented data processing device, the device comprising:
a segmentation unit, configured to segment an address query string and obtain the segment set of the address query string;
a tagging unit, configured to add geographic-word part-of-speech tags to the address query string;
a segmentation position judging unit, configured to judge whether the address query string is cut by the segmentation unit at the positions where the geographic-word part-of-speech tags were added and, if so, to trigger the unregistered word acquiring unit;
an unregistered word acquiring unit, configured to obtain an unregistered word from the segments in the segment set.
An embodiment of the present invention provides a technical solution for search-engine-oriented data processing. The solution segments the address query string to obtain its segment set and adds geographic-word part-of-speech tags to the address query string. It then judges whether the address query string is segmented at the positions where the tags were added, which determines whether the segmentation result is ambiguous with respect to the geographic-word tagging result. If the address query string is segmented at the tag positions, the segmentation result and the tagging result are not ambiguous and the address query string contains an unregistered word, which is then obtained from the segments in the segment set. The solution largely overcomes the low accuracy of the existing rule-based and statistical methods and thus improves the accuracy of unregistered-word identification. In addition, it requires no large-scale indexing or statistical processing, so unregistered words are discovered quickly.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a search-engine-oriented data processing method disclosed in an embodiment of the present invention;
Fig. 2 is a flowchart of a method, disclosed in an embodiment of the present invention, for judging whether an address query string is segmented at the positions where geographic-word part-of-speech tags were added;
Fig. 3 is a flowchart of a method, disclosed in an embodiment of the present invention, for judging whether a single-character segment can be combined with the segment before or after it;
Fig. 4 is a schematic diagram of the composition of a search-engine-oriented data processing device disclosed in an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, a search-engine-oriented data processing method disclosed in an embodiment of the present invention comprises the following steps:
Step 10: Segment the address query string to obtain the segment set of the address query string.
In practice, the address query string may be obtained from a user address query log or from any other file that records address query strings; this does not affect the implementation of the embodiment.
Step 20: Add geographic-word part-of-speech tags to the address query string.
Specifically, the address query string is segmented into geographic words, and a geographic-word part-of-speech tag is added after each segment obtained in this way.
In practice, step 10 and step 20 may be executed simultaneously or in either order; this does not affect the implementation of the embodiment.
Step 30: Judge whether the address query string is segmented at the positions where the geographic-word part-of-speech tags were added. If so, go to step 40; if not, end the flow.
Step 40: Obtain an unregistered word from the segments in the segment set.
Step 30 determines whether the segmentation result of the address query string is ambiguous with respect to the geographic-word tagging result. If there is no ambiguity, the address query string contains an unregistered word.
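The control flow of steps 10 to 40 can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `find_unregistered_words` and the four callables passed to it are hypothetical stand-ins for the segmenter, the geographic tagger, the position check of step 30 and the extraction of step 40.

```python
from typing import Callable, List, Sequence


def find_unregistered_words(
    query: str,
    segment: Callable[[str], List[str]],                  # step 10 (hypothetical callable)
    tag_positions: Callable[[str], List[int]],            # step 20: offsets where a tag is added
    cut_at_tags: Callable[[Sequence[str], Sequence[int]], bool],  # step 30
    extract: Callable[[Sequence[str]], List[str]],        # step 40
) -> List[str]:
    """Run steps 10 to 40 on one address query string."""
    segments = segment(query)            # step 10: segment set
    positions = tag_positions(query)     # step 20: geographic-word tag positions
    if not cut_at_tags(segments, positions):
        return []                        # step 30 answered "no": end the flow
    return extract(segments)             # step 40: obtain unregistered words
```

Steps 10 and 20 are independent of each other in this sketch, which matches the remark that they may run in either order.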
The above is a search-engine-oriented data processing method provided by an embodiment of the present invention. The method determines whether an address query string contains an unregistered word by judging whether the segmentation result is ambiguous with respect to the geographic-word tagging result; when the two are not ambiguous, the unregistered word is obtained from the segments in the segment set and their part-of-speech tags. The method largely overcomes the low accuracy of the existing rule-based and statistical methods and thus improves the accuracy of unregistered-word identification. It also requires no large-scale indexing or statistical processing, so unregistered words are discovered quickly.
In practice, a preset segmentation dictionary may be used to perform mechanical (dictionary-based) segmentation of the address query string, with a hidden Markov algorithm used to eliminate segmentation ambiguity. Because the entries of the preset segmentation dictionary already carry part-of-speech tags (see Table 1), the segmentation step also yields the part of speech of each segment. For example, for the address query string "Tianjin Aolanjide Hotel", the segmentation described here produces the segment set "Tianjin/S Aolanji/H De/H Hotel/U", where S, H and U are parts of speech whose meanings are listed in Table 1: S marks "Tianjin" as a provincial-level geographic word, H marks "Aolanji" and "De" as core words, and U marks "Hotel" as a suffix category word.
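A minimal sketch of dictionary-based segmentation that returns each segment together with its part-of-speech tag is shown below. It uses forward maximum matching, which is one common form of mechanical segmentation and is chosen here as an assumption; the hidden Markov disambiguation step is not modeled. The toy `LEXICON` and the Chinese strings are reconstructed from the translated example and are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def forward_max_match(text: str, lexicon: Dict[str, str]) -> List[Tuple[str, Optional[str]]]:
    """Greedy longest-match segmentation; unknown characters become
    single-character segments with no tag."""
    max_len = max(map(len, lexicon), default=1)
    result: List[Tuple[str, Optional[str]]] = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in lexicon or size == 1:
                result.append((word, lexicon.get(word)))
                i += size
                break
    return result


# Hypothetical dictionary entries with the tags used in the example.
LEXICON = {"天津": "S", "奥蓝际": "H", "德": "H", "酒店": "U"}
print(forward_max_match("天津奥蓝际德酒店", LEXICON))
```

Because the tags are stored with the dictionary entries, segmentation and part-of-speech tagging happen in the same pass.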
In practice, geographic-word part-of-speech tags may be added to the address query string according to the tagging rules shown in Table 2; that is, a tag is added only after those words in the address query string that are geographic words. For example, tagging "Tianjin Aolanjide Hotel" according to the rules in Table 2 gives "Tianjin/CS Aolanjide Hotel/OP".
Table 1: Part-of-speech tags of segments
As explained above, segmenting the address query string "Tianjin Aolanjide Hotel" gives the segment set "Tianjin/S Aolanji/H De/H Hotel/U", and geographic-word tagging gives "Tianjin/CS Aolanjide Hotel/OP". These results show that "Tianjin Aolanjide Hotel" is segmented at the positions where the geographic-word part-of-speech tags were added (the position between "Tianjin" and "Ao", and the position after the last character of "Hotel"). This means that "Tianjin Aolanjide Hotel" contains an unregistered word, which must then be obtained from its segment set. By contrast, segmenting the address query string "Liuliqiao Dong" (Liuli Flyover East) gives the segment set "Liuli/H Qiaodong/N", while geographic-word tagging gives "Liuliqiao/Q Dong". The query string is not segmented at the position where the geographic-word tag was added (between "qiao" and "Dong"), so the two results are ambiguous and no unregistered word is extracted.
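The comparison in the two examples above can be sketched as a boundary check: every position at which a geographic-word tag was added must also be a segment boundary. The sketch covers only the case where segmentation is at least as fine as tagging. The function names and the Chinese strings (reconstructed from the translated examples) are assumptions.

```python
from itertools import accumulate
from typing import List, Optional, Set, Tuple


def segment_boundaries(segments: List[str]) -> Set[int]:
    """Character offsets at which a segment ends."""
    return set(accumulate(len(s) for s in segments))


def tag_positions(tagged: List[Tuple[str, Optional[str]]]) -> Set[int]:
    """Character offsets after each unit that received a geographic-word tag."""
    positions: Set[int] = set()
    offset = 0
    for text, tag in tagged:
        offset += len(text)
        if tag is not None:
            positions.add(offset)
    return positions


def cut_at_every_tag(segments: List[str], tagged: List[Tuple[str, Optional[str]]]) -> bool:
    """True if the query string is segmented at every tag position."""
    return tag_positions(tagged) <= segment_boundaries(segments)


# "Tianjin Aolanjide Hotel": tag positions 2 and 8 are both segment boundaries.
print(cut_at_every_tag(["天津", "奥蓝际", "德", "酒店"], [("天津", "CS"), ("奥蓝际德酒店", "OP")]))
# "Liuliqiao Dong": tag position 3 falls inside the segment "桥东".
print(cut_at_every_tag(["六里", "桥东"], [("六里桥", "Q"), ("东", None)]))
```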
Table 2: Address level table
In practice, whether the address query string is segmented at the positions where the geographic-word part-of-speech tags were added is judged as follows: check whether the segments in the segment set of the address query string satisfy Rule 1 or Rule 2 below. If they do, the address query string is segmented at the tag positions.
Rule 1: The length of the segment equals the length of the part of the address query string that runs from the first character of the segment to the first geographic-word part-of-speech tag after that character;
Alternatively,
Rule 2: The length of the segment is less than the length of the part of the address query string that runs from the first character of the segment to the first geographic-word part-of-speech tag after that character, but the length of the combined segment obtained by joining the segment with other segments equals the length of that part of the address query string.
Note that the segmentation result of an address query string is usually finer than its geographic-word tagging result, so the usual case is that the combined length of several segments equals the length of the part of the address query string from the first character of the combined segment to the first geographic-word part-of-speech tag after that character. It cannot be ruled out, however, that the geographic-word tagging result is finer than the segmentation result. In that case, if a segment in the segment set satisfies Rule 3 below, the address query string is also considered to be segmented at the tag positions:
Rule 3: The length of the segment is greater than the length of the part of the address query string from the first character of the segment to the first geographic-word part-of-speech tag after that character, but equals the length of the part of the address query string from the first character of the segment to the N-th (N ≥ 2) geographic-word part-of-speech tag after that character.
Taking the address query string "Tianjin Aolanjide Hotel" as an example, the method for judging whether an address query string is segmented at the positions where the geographic-word part-of-speech tags were added is described in detail below with reference to Fig. 2. The segment set is "Tianjin/S Aolanji/H De/H Hotel/U" and the geographic-word tagging result is "Tianjin/OS Aolanjide Hotel/OP". The method comprises:
Step 301: Read the segment "Tianjin" from the segment set;
Step 302: Compare the length of the segment "Tianjin" with the length of the partial address query string "Tianjin", which runs from the character "Tian" in the address query string "Tianjin Aolanjide Hotel" to the first geographic-word part-of-speech tag after "Tian". The two are equal; go to step 303;
Step 303: Read the segment "Aolanji" from the segment set;
Step 304: Compare the length of the segment "Aolanji" with the length of the partial address query string "Aolanjide Hotel", which runs from the character "Ao" in the address query string "Tianjin Aolanjide Hotel" to the first geographic-word part-of-speech tag after "Ao". The segment is shorter than the partial address query string; go to step 305;
Step 305: Read the segment "De" from the segment set;
Step 306: Combine the segments "Aolanji" and "De" to obtain the combined segment "Aolanjide";
Step 307: Compare the length of the combined segment "Aolanjide" with the length of the partial address query string "Aolanjide Hotel". The combined segment is shorter than the partial address query string; go to step 308;
Step 308: Read the segment "Hotel" from the segment set;
Step 309: Combine the segments "Aolanji", "De" and "Hotel" to obtain the combined segment "Aolanjide Hotel";
Step 310: Compare the length of the combined segment "Aolanjide Hotel" with the length of the partial address query string "Aolanjide Hotel". The two are equal. Since the first segment in the segment set satisfies Rule 1 above and the remaining segments satisfy Rule 2 above, the address query string "Tianjin Aolanjide Hotel" is judged to be segmented at the positions where the geographic-word part-of-speech tags were added.
Taking the address query string "Liuliqiao Dong" (Liuli Flyover East) as a second example, the judgment is briefly described again. The segment set is "Liuli/H Qiaodong/N" and the geographic-word tagging result is "Liuliqiao/Q Dong":
Read the segment "Liuli" from the segment set. Since "Liuli" is shorter than "Liuliqiao", read the segment "Qiaodong" from the segment set and combine "Liuli" with "Qiaodong" to obtain the combined segment "Liuliqiao Dong". Since "Liuliqiao Dong" is longer than "Liuliqiao", the segments in the segment set satisfy none of the three rules above. The address query string "Liuliqiao Dong" is therefore judged not to be segmented at the position where the geographic-word part-of-speech tag was added.
The method provided by the embodiment for judging whether an address query string is segmented at the positions where the geographic-word part-of-speech tags were added has been described in detail above. The method for obtaining unregistered words from the segments in the segment set and their part-of-speech tags is described in detail below with specific examples.
In a specific implementation, obtaining the unregistered word from the segments in the segment set and their part-of-speech tags in step 40 specifically means obtaining the unregistered word from the segments in the segment set that consecutively satisfy Rule 2 above, together with their part-of-speech tags.
In practice, this specifically comprises:
traversing the segments in the segment set that consecutively satisfy Rule 2; if a single-character segment is found, judging from the part-of-speech tags of the segments whether the single-character segment can be combined with the segment before or after it; and joining the segments that can be combined in the order in which they appear in the address query string and outputting the result as an unregistered word.
Note that, while traversing the segments that consecutively satisfy Rule 2, once a single-character segment has been found: if several segments precede the single character, only the adjacent preceding segment is checked for combination. For example, given segment 1, segment 2 and segment 3 (a single character), only whether segment 3 can be combined with segment 2 is judged. If several segments follow the single character, the following segments are checked one after another. For example, given segment 1 (a single character), segment 2 and segment 3, it is first judged whether segment 1 can be combined with segment 2; if so, it is then judged whether segment 1, segment 2 and segment 3 can be combined, and so on, until a segment that cannot be combined is found or the last of the segments that consecutively satisfy Rule 2 has been processed, at which point the flow ends.
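The traversal just described can be sketched as follows. The sketch assumes that each segment is a (text, tag) pair and that the combination test is supplied as a callable; the name `merge_around_single_chars` and the permissive predicates in the usage lines are hypothetical. The second usage line uses Latin letters as stand-ins for Chinese characters.

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[str, str]  # (text, part-of-speech tag)


def merge_around_single_chars(
    run: Sequence[Segment],
    can_combine: Callable[[Segment], bool],
) -> List[str]:
    """Scan a run of consecutive Rule 2 segments and join each single-character
    segment with its combinable neighbours."""
    found: List[str] = []
    last_used = -1  # index of the last segment already placed in a group
    i = 0
    while i < len(run):
        if len(run[i][0]) != 1:
            i += 1
            continue
        group = [run[i][0]]
        # Only the adjacent preceding segment is considered.
        if i - 1 > last_used and can_combine(run[i - 1]):
            group.insert(0, run[i - 1][0])
        # Following segments are added one by one until one cannot be combined.
        j = i + 1
        while j < len(run) and can_combine(run[j]):
            group.append(run[j][0])
            j += 1
        if len(group) > 1:
            found.append("".join(group))
        last_used = j - 1
        i = j
    return found


print(merge_around_single_chars([("奥蓝际", "H"), ("德", "B"), ("酒店", "U")], lambda seg: True))
print(merge_around_single_chars(
    [("AB", "P"), ("C", "H"), ("DE", "P"), ("FG", "E"), ("HI", "U")],
    lambda seg: seg[1] != "E",
))
```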
With reference to Fig. 3, the method provided by the embodiment for judging, from the part-of-speech tag of a segment, whether a single-character segment can be combined with the segment before or after it is described in detail below. The segment before or after the single character is referred to as the candidate segment. The method comprises:
Step 4021: Judge whether the candidate segment is a single character. If so, the single character can be combined with the candidate segment; if not, go to step 4022;
Step 4022: Judge whether the candidate segment consists of three or more single characters. If so, the segments cannot be combined; if not, go to step 4023 and step 4025;
Step 4023: Judge whether the part of speech of the candidate segment is village. If so, go to step 4024; if not, go to step 4027;
Step 4024: Judge whether the last character of the candidate segment is a character that denotes a village (for example, the characters meaning village or township). If so, the segments cannot be combined; if not, they can be combined;
Step 4025: Judge whether the part of speech of the candidate segment is road. If so, go to step 4026; if not, go to step 4027;
Step 4026: Judge whether the last character of the candidate segment is a character that denotes a road (for example, the characters meaning road, street, lane or line). If so, the segments cannot be combined; if not, they can be combined;
In practice, a segment whose part of speech is road or village, such as "Great River" in "Great River Village", may appear in the segmentation dictionary as part of "Great River Road", "Great River Street", "Great River Shop" and so on. In this case it is therefore also necessary to judge whether the last character of such a segment has an obvious geographic feature. If the last character means "village" or "township", or "road" or "street", the segment is a village-level place name or a road name, and the single character cannot be combined with it.
Step 4027: Judge whether the part-of-speech tag of the candidate segment is one of core word, determiner, point-of-interest word and category word. If not, the segments cannot be combined; if so, go to step 4028;
Step 4028: Judge whether the candidate segment is a high-frequency word. If so, the segments cannot be combined; if not, they can be combined.
The query frequency of the candidate segment can be recorded in the preset segmentation dictionary, and whether the candidate segment is a high-frequency word can be judged from it.
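Steps 4021 to 4028 amount to a short decision procedure, sketched below under several assumptions. The `Candidate` class, the descriptive part-of-speech names and the sets of village and road characters are hypothetical. The length cutoff is exposed as the parameter `max_len` because the step text speaks of "three or more" characters while the worked example combines a three-character segment; the default used here follows the worked example.

```python
from dataclasses import dataclass

# Assumed examples of characters that denote a village or a road.
VILLAGE_TAILS = set("村乡庄")
ROAD_TAILS = set("路街巷")
# Descriptive stand-ins for the tags of step 4027.
COMBINABLE_POS = {"core", "determiner", "poi", "category"}


@dataclass
class Candidate:
    text: str
    pos: str
    high_frequency: bool = False


def can_combine(c: Candidate, max_len: int = 3) -> bool:
    """Decide whether a single character may be joined with candidate c."""
    if len(c.text) == 1:                 # step 4021
        return True
    if len(c.text) > max_len:            # step 4022 (cutoff is an assumption)
        return False
    if c.pos == "village":               # steps 4023 and 4024
        return c.text[-1] not in VILLAGE_TAILS
    if c.pos == "road":                  # steps 4025 and 4026
        return c.text[-1] not in ROAD_TAILS
    if c.pos not in COMBINABLE_POS:      # step 4027
        return False
    return not c.high_frequency          # step 4028


print(can_combine(Candidate("奥蓝际", "core")))
print(can_combine(Candidate("大河村", "village")))
```

Passing `max_len=2` reproduces the stricter reading of step 4022, in which a three-character candidate is rejected.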
The method provided by the embodiment for judging whether a single character can be combined with the segments before and after it has been described above. It is illustrated below with the "Tianjin Aolanjide Hotel" example introduced earlier.
In the segment set of the address query string "Tianjin Aolanjide Hotel", the segments that satisfy Rule 2 above are "Aolanji/H De/B Hotel/U", so the unregistered word has to be found among these segments. Specifically: traversing "Aolanji/H De/B Hotel/U" finds the single character "De/B". The segment before "De/B" is "Aolanji/H", which consists of three single characters, whose part of speech is neither village nor road but core word, and which is not a high-frequency word; "Aolanji/H" and "De/B" are therefore combined into "Aolanjide". The segment after "De/B" is "Hotel/U", which consists of two single characters, whose part of speech is neither village nor road but category word, and which is not a high-frequency word; "Aolanjide" and "Hotel" are therefore combined into "Aolanjide Hotel". Since "Hotel" is the last segment, "Aolanjide Hotel" is output as the unregistered word.
The method provided by the embodiment is further illustrated with the address query string "Foshan Remex RSL Badminton Stadium". The segmentation system produces the segment set "Foshan/C Remex/P Ya/H Shilong/P Feather/E Arena/U", where "Ya" and "Shilong" together spell the brand name RSL, and "Feather" and "Arena" together form "Badminton Stadium". The geographic-level tagging system produces the tagging result "Foshan/OC Remex RSL Badminton Stadium/OP". The segments "Remex/P Ya/H Shilong/P Feather/E Arena/U" satisfy Rule 2 above. Traversing them finds the single character "Ya", whose adjacent segments are "Remex/P" and "Shilong/P". Both are shorter than three characters, are point-of-interest words and are not high-frequency words, so both can be combined with "Ya", giving "Remex RSL". The segment after "Shilong" is "Feather", a short high-frequency word of a non-geographic category, which cannot be combined; identification of the first unregistered word therefore ends with the result "Remex RSL". Traversal then continues from "Feather", which is not a single character, and reaches "Arena", which is not a single character either and is a category word, so the flow ends. In total, one unregistered word, "Remex RSL", is found in "Foshan Remex RSL Badminton Stadium".
The above is the search-engine-oriented data processing method provided by an embodiment of the present invention. The device provided by an embodiment of the present invention for implementing the above method is described in detail below with reference to the drawings.
Referring to Fig. 4, a search-engine-oriented data processing device provided by an embodiment of the present invention comprises:
a segmentation unit 50, configured to segment an address query string and obtain the segment set of the address query string;
a tagging unit 51, configured to add geographic-word part-of-speech tags to the address query string;
a segmentation position judging unit 52, configured to judge whether the address query string is cut by the segmentation unit at the positions where the geographic-word part-of-speech tags were added and, if so, to trigger the unregistered word acquiring unit 53;
an unregistered word acquiring unit 53, configured to obtain an unregistered word from the segments in the segment set.
In practical applications, the participle position judgment unit 52 is specifically configured to:
Judge whether the participles in the participle set of the address query string satisfy the following rules; if they do, the address query string is cut by the participle unit at the positions of the added geographical-word part-of-speech tags:
Rule 1: the length of a participle in the participle set is equal to the length of the part of the address query string that runs from the first character of the participle to the first geographical-word part-of-speech tag after that first character;
or,
Rule 2: the length of a participle in the participle set is less than the length of the part of the address query string that runs from the first character of the participle to the first geographical-word part-of-speech tag after that first character, but the length of the combined participle obtained by combining the participle with other participles is equal to the length of that part of the address query string.
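The two rules can be made concrete with a small Python sketch. It is only one possible reading of the rules: the function name, the labels "rule1" and "rule2", and the encoding of each geographical-word tag as the character offset just after the text it covers are all assumptions introduced here, not details given in the patent.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence


def classify_participles(participles: Sequence[str],
                         tag_offsets: Sequence[int]) -> List[Optional[str]]:
    """Label each participle "rule1", "rule2" or None.

    participles: the segmenter output, in order; joined together they
                 reproduce the address query string.
    tag_offsets: sorted character offsets, each marking the position just
                 after the text covered by one geographical-word tag
                 (an assumed encoding, chosen for this sketch).
    """
    starts, pos = [], 0
    for p in participles:
        starts.append(pos)
        pos += len(p)
    ends = {s + len(p) for s, p in zip(starts, participles)}

    labels: List[Optional[str]] = []
    for start, p in zip(starts, participles):
        i = bisect_right(tag_offsets, start)  # first tag after the first character
        if i == len(tag_offsets):
            labels.append(None)               # no tag follows this participle
            continue
        span = tag_offsets[i] - start         # first character up to that tag
        if len(p) == span:
            labels.append("rule1")
        elif len(p) < span and tag_offsets[i] in ends:
            # Shorter than the span, but this participle plus the ones that
            # follow it end exactly at the tag.
            labels.append("rule2")
        else:
            labels.append(None)
    return labels


# Query "abcdef", tagged after "ab" and after "abcdef".
print(classify_participles(["ab", "c", "d", "ef"], [2, 6]))
```

A participle labelled None either straddles a tag or is followed by no tag at all; under this reading, such a participle shows that the segmenter did not cut the query at that tag position.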
In practical applications, the unregistered word acquiring unit 53 is specifically configured to:
Obtain unregistered words according to the participles in the participle set that consecutively satisfy Rule 2 and according to their part-of-speech tags.
Preferably, in practical applications, the unregistered word acquiring unit 53 specifically comprises:
A single-character finding subunit, configured to traverse the participles in the participle set that consecutively satisfy Rule 2 and, if a single-character participle is found, to trigger the unregistered word obtaining subunit;
A participle combination judgment subunit, configured to judge, according to the part-of-speech tags of the participles, whether the single-character participle can be combined with the participle before or after it;
An unregistered word obtaining subunit, configured to combine the participles that the participle combination judgment subunit judges to be combinable, in the order in which they appear in the address query string, and to output the result as an unregistered word.
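The traversal performed by these subunits might be sketched as follows. This is a simplified illustration, not the patent's full flow: the function name, the (text, tag) tuple layout and the caller-supplied `can_combine` predicate are assumptions, and the termination conditions used in the worked examples of the description are not reproduced.

```python
from typing import Callable, List, Sequence, Tuple

Participle = Tuple[str, str]  # (text, part-of-speech tag)


def collect_unregistered_words(
    run: Sequence[Participle],
    can_combine: Callable[[Participle], bool],
) -> List[str]:
    """run holds participles that consecutively satisfy Rule 2, in query order."""
    if not run:
        return []
    # joined[i] is True when run[i] and run[i + 1] belong to the same word.
    joined = [False] * (len(run) - 1)
    for i, (text, _tag) in enumerate(run):
        if len(text) != 1:
            continue  # only single-character participles start a combination
        if i > 0 and can_combine(run[i - 1]):
            joined[i - 1] = True
        if i + 1 < len(run) and can_combine(run[i + 1]):
            joined[i] = True

    words: List[str] = []
    group = [run[0][0]]
    for i in range(1, len(run)):
        if joined[i - 1]:
            group.append(run[i][0])
            continue
        if len(group) > 1:
            words.append("".join(group))  # keep the order of the query string
        group = [run[i][0]]
    if len(group) > 1:
        words.append("".join(group))
    return words


# Toy data: Latin placeholders stand in for the characters of a real query.
run = [("y", "core"), ("l", "core"), ("RSL", "core"),
       ("badminton", "core"), ("hall", "classifier")]
short_enough = lambda participle: len(participle[0]) <= 3  # toy predicate
print(collect_unregistered_words(run, short_enough))
```

Participles are concatenated without separators because Chinese address strings carry no spaces between words; a real implementation would plug in a predicate based on part-of-speech tags rather than the toy length check used here.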
Preferably, the participle before or after the single-character participle is referred to as the participle to be examined, and the participle combination judgment subunit specifically comprises:
A single-character judgment subunit, configured to judge whether the participle to be examined is a single character; if so, it can be combined; if not, the word length judgment subunit is triggered;
The word length judgment subunit, configured to judge whether the participle to be examined consists of three or more characters; if so, it cannot be combined; if not, a first part-of-speech judgment subunit is triggered to judge whether the part of speech of the participle to be examined is village, and a second part-of-speech judgment subunit is triggered to judge whether the part of speech of the participle to be examined is road;
If the part of speech of the participle to be examined is village and its last character is a character denoting a village, it cannot be combined;
If the part of speech of the participle to be examined is village but its last character is not a character denoting a village, it can be combined;
If the part of speech of the participle to be examined is road and its last character is a character denoting a street, it cannot be combined;
If the part of speech of the participle to be examined is road but its last character is not a character denoting a street, it can be combined;
If the part of speech of the participle to be examined is neither village nor road, a third part-of-speech judgment subunit is triggered to judge whether the part-of-speech tag of the participle to be examined is core word, determiner, point-of-interest word or classifier; if not, it cannot be combined; if so, and the participle to be examined is not a high-frequency word, it can be combined.
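The chain of checks above amounts to a small decision function, sketched below in Python. The tag names, the suffix character sets and the high-frequency word set are placeholders assumed for this illustration; the patent does not list them in this passage. The text also does not say what happens when a participle with an eligible tag is a high-frequency word; the sketch assumes it cannot be combined.

```python
from typing import AbstractSet

# Illustrative placeholders, not taken from the patent.
VILLAGE_SUFFIXES = {"村"}
STREET_SUFFIXES = {"路", "街"}
ELIGIBLE_TAGS = {"core", "determiner", "poi", "classifier"}


def may_combine(text: str, tag: str,
                high_frequency: AbstractSet[str] = frozenset()) -> bool:
    """Decide whether the neighbouring participle being examined may be
    combined with a single-character participle."""
    if len(text) == 1:
        return True                      # single characters always combine
    if len(text) >= 3:
        return False                     # three or more characters never do
    if tag == "village":
        return text[-1] not in VILLAGE_SUFFIXES
    if tag == "road":
        return text[-1] not in STREET_SUFFIXES
    if tag in ELIGIBLE_TAGS:
        # Assumption: a high-frequency word is treated as not combinable.
        return text not in high_frequency
    return False


print(may_combine("翎", "core"))
print(may_combine("李村", "village"))
```

Ordering the checks from cheapest (length) to most specific (tag and word frequency) follows the order in which the subunits trigger one another in the text.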
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be understood by reference to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method. The same applies to system embodiments, which essentially correspond to the method embodiments.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention.
Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the embodiments of the present invention. Therefore, the embodiments of the present invention are not limited to the embodiments shown herein, but are to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A search-engine-oriented data processing method, characterized in that the method comprises:
Segmenting an address query string to obtain the participle set of the address query string;
Adding geographical-word part-of-speech tags to the address query string;
Judging whether the address query string is segmented at the positions of the added geographical-word part-of-speech tags and, if so, obtaining unregistered words according to the participles in the participle set.
2. The method according to claim 1, characterized in that judging whether the address query string is segmented at the positions of the added geographical-word part-of-speech tags specifically comprises:
Judging whether the participles in the participle set of the address query string satisfy the following rules; if they do, the address query string is segmented at the positions of the added geographical-word part-of-speech tags:
Rule 1: the length of a participle in the participle set is equal to the length of the part of the address query string that runs from the first character of the participle to the first geographical-word part-of-speech tag after that first character;
or,
Rule 2: the length of a participle in the participle set is less than the length of the part of the address query string that runs from the first character of the participle to the first geographical-word part-of-speech tag after that first character, but the length of the combined participle obtained by combining the participle with other participles is equal to the length of that part of the address query string.
3. The method according to claim 2, characterized in that obtaining unregistered words according to the participles in the participle set specifically comprises:
Obtaining unregistered words according to the participles in the participle set that consecutively satisfy Rule 2 and according to their part-of-speech tags.
4. The method according to claim 3, characterized in that obtaining unregistered words according to the participles in the participle set that consecutively satisfy Rule 2 and according to their part-of-speech tags specifically comprises:
Traversing the participles in the participle set that consecutively satisfy Rule 2; if a single-character participle is found, judging, according to the part-of-speech tags of the participles, whether the single-character participle can be combined with the participle before or after it; and combining the participles that can be combined, in the order in which they appear in the address query string, and outputting the result as an unregistered word.
5. The method according to claim 4, characterized in that the participle before or after the single-character participle is referred to as the participle to be examined, and judging, according to the part-of-speech tags of the participles, whether the single-character participle can be combined with the participle before or after it specifically comprises:
Judging whether the participle to be examined is a single character; if so, it can be combined; if not, judging whether the participle to be examined consists of three or more characters; if so, it cannot be combined; if not, judging whether the part of speech of the participle to be examined is village and judging whether the part of speech of the participle to be examined is road;
If the part of speech of the participle to be examined is village and its last character is a character denoting a village, it cannot be combined;
If the part of speech of the participle to be examined is village but its last character is not a character denoting a village, it can be combined;
If the part of speech of the participle to be examined is road and its last character is a character denoting a street, it cannot be combined;
If the part of speech of the participle to be examined is road but its last character is not a character denoting a street, it can be combined;
If the part of speech of the participle to be examined is neither village nor road, judging whether the part-of-speech tag of the participle to be examined is core word, determiner, point-of-interest word or classifier; if not, it cannot be combined; if so, and the participle to be examined is not a high-frequency word, it can be combined.
6. A search-engine-oriented data processing device, characterized in that the device comprises:
A participle unit, configured to segment an address query string and obtain the participle set of the address query string;
A marking unit, configured to add geographical-word part-of-speech tags to the address query string;
A participle position judgment unit, configured to judge whether the address query string is cut by the participle unit at the positions of the added geographical-word part-of-speech tags and, if so, to trigger the unregistered word acquiring unit;
An unregistered word acquiring unit, configured to obtain unregistered words according to the participles in the participle set.
7. The device according to claim 6, characterized in that the participle position judgment unit is specifically configured to:
Judge whether the participles in the participle set of the address query string satisfy the following rules; if they do, the address query string is cut by the participle unit at the positions of the added geographical-word part-of-speech tags:
Rule 1: the length of a participle in the participle set is equal to the length of the part of the address query string that runs from the first character of the participle to the first geographical-word part-of-speech tag after that first character;
or,
Rule 2: the length of a participle in the participle set is less than the length of the part of the address query string that runs from the first character of the participle to the first geographical-word part-of-speech tag after that first character, but the length of the combined participle obtained by combining the participle with other participles is equal to the length of that part of the address query string.
8. The device according to claim 7, characterized in that the unregistered word acquiring unit is specifically configured to:
Obtain unregistered words according to the participles in the participle set that consecutively satisfy Rule 2 and according to their part-of-speech tags.
9. The device according to claim 8, characterized in that the unregistered word acquiring unit specifically comprises:
A single-character finding subunit, configured to traverse the participles in the participle set that consecutively satisfy Rule 2 and, if a single-character participle is found, to trigger the unregistered word obtaining subunit;
A participle combination judgment subunit, configured to judge, according to the part-of-speech tags of the participles, whether the single-character participle can be combined with the participle before or after it;
An unregistered word obtaining subunit, configured to combine the participles that the participle combination judgment subunit judges to be combinable, in the order in which they appear in the address query string, and to output the result as an unregistered word.
10. The device according to claim 9, characterized in that the participle before or after the single-character participle is referred to as the participle to be examined, and the participle combination judgment subunit specifically comprises:
A single-character judgment subunit, configured to judge whether the participle to be examined is a single character; if so, it can be combined; if not, the word length judgment subunit is triggered;
The word length judgment subunit, configured to judge whether the participle to be examined consists of three or more characters; if so, it cannot be combined; if not, a first part-of-speech judgment subunit is triggered to judge whether the part of speech of the participle to be examined is village, and a second part-of-speech judgment subunit is triggered to judge whether the part of speech of the participle to be examined is road;
If the part of speech of the participle to be examined is village and its last character is a character denoting a village, it cannot be combined;
If the part of speech of the participle to be examined is village but its last character is not a character denoting a village, it can be combined;
If the part of speech of the participle to be examined is road and its last character is a character denoting a street, it cannot be combined;
If the part of speech of the participle to be examined is road but its last character is not a character denoting a street, it can be combined;
If the part of speech of the participle to be examined is neither village nor road, a third part-of-speech judgment subunit is triggered to judge whether the part-of-speech tag of the participle to be examined is core word, determiner, point-of-interest word or classifier; if not, it cannot be combined; if so, and the participle to be examined is not a high-frequency word, it can be combined.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310250057.2A CN104239355B (en) | 2013-06-21 | 2013-06-21 | The data processing method and device of Search Engine-Oriented |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104239355A CN104239355A (en) | 2014-12-24 |
CN104239355B CN104239355B (en) | 2018-09-11 |
Family ID=52227438
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310250057.2A Active CN104239355B (en) | The data processing method and device of Search Engine-Oriented | 2013-06-21 | 2013-06-21 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104239355B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104679850B (en) * | 2015-02-13 | 2018-05-29 | 深圳市华傲数据技术有限公司 | Address structure method and device |
CN108763212A (en) * | 2018-05-23 | 2018-11-06 | 北京神州泰岳软件股份有限公司 | A kind of address information extraction method and device |
CN110110327B (en) * | 2019-04-26 | 2021-06-22 | 网宿科技股份有限公司 | Text labeling method and equipment based on counterstudy |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1197885A2 (en) * | 2000-10-12 | 2002-04-17 | QAS Limited | Method of and apparatus for retrieving data representing a postal address from a database of postal addresses |
CN101154226A (en) * | 2006-09-27 | 2008-04-02 | 腾讯科技(深圳)有限公司 | Method for adding unlisted word to word stock of input method and its character input device |
CN102298585A (en) * | 2010-06-24 | 2011-12-28 | 高德软件有限公司 | Address splitting and level marking method and device |
CN103186524A (en) * | 2011-12-30 | 2013-07-03 | 高德软件有限公司 | Address name identification method and device |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101510221B (en) * | 2009-02-17 | 2012-05-30 | 北京大学 | Enquiry statement analytical method and system for information retrieval |
US8271525B2 (en) * | 2009-10-09 | 2012-09-18 | Verizon Patent And Licensing Inc. | Apparatuses, methods and systems for a smart address parser |
CN102929870B (en) * | 2011-08-05 | 2016-06-29 | 北京百度网讯科技有限公司 | A kind of set up the method for participle model, the method for participle and device thereof |
CN103077164B (en) * | 2012-12-27 | 2016-05-11 | 新浪网技术(中国)有限公司 | Text analyzing method and text analyzer |
Non-Patent Citations (1)
Title |
---|
地址要素识别机制的地名地址分词算法;赵阳阳 等;《测绘科学》;20130319;第38卷(第5期);第74-76页 * |
Also Published As
Publication number | Publication date |
---|---|
CN104239355A (en) | 2014-12-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| TR01 | Transfer of patent right | Effective date of registration: 20200514. Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province. Patentee after: Alibaba (China) Co.,Ltd. Address before: 102200, No. 8, No., Changsheng Road, Changping District science and Technology Park, Beijing, China. 1-5. Patentee before: AUTONAVI SOFTWARE Co.,Ltd. |