CN104239355B - Search-engine-oriented data processing method and device - Google Patents

Info

Publication number
CN104239355B
Authority
CN
China
Prior art keywords
participle
word
speech
investigating
speech tagging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310250057.2A
Other languages
Chinese (zh)
Other versions
CN104239355A (en)
Inventor
郭涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Autonavi Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Autonavi Software Co Ltd
Priority to CN201310250057.2A
Publication of CN104239355A
Application granted
Publication of CN104239355B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a search-engine-oriented data processing method and device. The method includes: segmenting an address query string to obtain the segment set of the address query string; adding geographic word part-of-speech tags to the address query string; and judging whether the address query string is segmented at the positions of the added geographic word part-of-speech tags and, if so, obtaining unregistered words from the segments in the segment set. Correspondingly, embodiments of the invention also provide a device that implements the method. The invention improves the efficiency and accuracy of unregistered word identification.

Description

Search-engine-oriented data processing method and device
Technical field
The present invention relates to the technical field of search engine data processing, and more specifically to a search-engine-oriented data processing method and device.
Background technology
An address search engine is a vertical search engine that provides address search services to users through an address segmentation dictionary built by collecting, organizing, and processing address information. To return accurate search results, the address segmentation dictionary has to be improved continually. One way to improve it is to identify addresses that are not yet included in the dictionary and add their information to it. An address that is not included in the address segmentation dictionary is called an unregistered word.
Existing approaches identify unregistered address words with morphological rules or with statistics. Rule-based methods rely on manually written, fixed morphological rules; new words in address query strings, especially business names and brand names, have no fixed form, so these methods are prone to omissions and misidentification. Statistical methods estimate from co-occurrence frequency how likely single characters are to form a word; because most new address words occur with low frequency, the statistical results are likewise inaccurate.
Summary of the invention
In view of this, the present invention provides a search-engine-oriented data processing method and device to improve the efficiency and accuracy of unregistered word identification.
An embodiment of the present invention provides a search-engine-oriented data processing method, the method including:
segmenting an address query string to obtain the segment set of the address query string;
adding geographic word part-of-speech tags to the address query string;
judging whether the address query string is segmented at the positions of the added geographic word part-of-speech tags and, if so, obtaining unregistered words from the segments in the segment set.
Further, an embodiment of the present invention provides a search-engine-oriented data processing device, the device including:
a segmentation unit, configured to segment an address query string to obtain the segment set of the address query string;
a tagging unit, configured to add geographic word part-of-speech tags to the address query string;
a segmentation position judging unit, configured to judge whether the address query string is cut by the segmentation unit at the positions of the added geographic word part-of-speech tags and, if so, to trigger the unregistered word obtaining unit;
an unregistered word obtaining unit, configured to obtain unregistered words from the segments in the segment set.
An embodiment of the present invention provides a technical solution for search-engine-oriented data processing. The solution segments an address query string to obtain its segment set and adds geographic word part-of-speech tags to the address query string. By judging whether the address query string is segmented at the positions of the added geographic word part-of-speech tags, it determines whether the segmentation result is ambiguous with respect to the geographic word part-of-speech tagging result. If the address query string is segmented at those positions, the two results are not ambiguous and the address query string contains an unregistered word, which is then obtained from the segments in the segment set. The solution improves on the low accuracy with which existing rule-based and statistical methods find unregistered words. It also needs no large-scale corpus indexing or statistical processing, so it finds unregistered words quickly.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a search-engine-oriented data processing method disclosed in an embodiment of the present invention;
Fig. 2 is a flow chart of a method, disclosed in an embodiment of the present invention, for judging whether an address query string is segmented at the positions of the added geographic word part-of-speech tags;
Fig. 3 is a flow chart of a method, disclosed in an embodiment of the present invention, for judging whether a single-character segment can be combined with the segment before or after it;
Fig. 4 is a schematic diagram of the composition of a search-engine-oriented data processing device disclosed in an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, a search-engine-oriented data processing method disclosed in an embodiment of the present invention includes the following steps:
Step 10: Segment the address query string to obtain the segment set of the address query string;
In practical applications, the address query string can be obtained from a user address query log or from any other file that records address query strings; the source does not affect the implementation of the embodiment.
Step 20: Add geographic word part-of-speech tags to the address query string;
Specifically, geographic word part-of-speech tags are added to the address query string as follows:
The address query string is cut into units of geographic words, and a geographic word part-of-speech tag is added after each unit obtained by the cut.
In practical applications, step 10 and step 20 can be executed simultaneously or in either order; the order does not affect the implementation of the embodiment.
Step 30: Judge whether the address query string is segmented at the positions of the added geographic word part-of-speech tags; if so, go to step 40; if not, end the flow;
Step 40: Obtain unregistered words from the segments in the segment set.
Step 30 determines whether the segmentation result of the address query string is ambiguous with respect to the geographic word part-of-speech tagging result; if there is no ambiguity, the address query string contains an unregistered word.
The above is a search-engine-oriented data processing method provided by an embodiment of the present invention. The method determines whether the address query string contains an unregistered word by judging whether the segmentation result is ambiguous with respect to the geographic word part-of-speech tagging result; when the two are not ambiguous, it obtains unregistered words from the segments in the segment set and their part-of-speech tags. This improves on the low accuracy with which existing rule-based and statistical methods find unregistered words. It also needs no large-scale corpus indexing or statistical processing, so unregistered words are found quickly.
In practical applications, a preset segmentation dictionary can be used to segment the address query string mechanically, with a hidden Markov algorithm eliminating segmentation ambiguity. Because the entries in the preset segmentation dictionary already carry part-of-speech tags (see Table 1), segmentation also yields the part of speech of each segment. For example, for the address query string "Tianjin Aolanjide Hotel", the segmentation described here produces the segment set "Tianjin/S Aolanji/H De/H Hotel/U", where S, H, and U are part-of-speech tags whose meanings are given in Table 1: S marks "Tianjin" as a provincial-level geographic word, H marks "Aolanji" and "De" as core words, and U marks "Hotel" as a suffix category word.
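The dictionary-based step can be illustrated with forward maximum matching, one common form of mechanical segmentation. This is a minimal sketch rather than the patent's implementation: the function name, the `max_len` parameter, and the four-entry dictionary are hypothetical, the Chinese strings are a reconstruction of the translated example and should be read as an assumption, and the hidden Markov disambiguation step is omitted.

```python
# Hypothetical mini-dictionary: word -> part-of-speech tag (S, H, U as in the example).
# The Chinese strings are an assumed reconstruction of "Tianjin Aolanjide Hotel".
DICTIONARY = {"天津": "S", "奥蓝际": "H", "德": "H", "酒店": "U"}


def forward_max_match(query, dictionary, max_len=4):
    """Greedy longest-match segmentation; returns (word, tag) pairs in order."""
    segments, i = [], 0
    while i < len(query):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(query) - i), 0, -1):
            word = query[i:i + size]
            if word in dictionary:
                segments.append((word, dictionary[word]))
                break
        else:
            # No dictionary entry starts here: emit the character alone
            # with a placeholder tag.
            segments.append((query[i], "?"))
            size = 1
        i += size
    return segments


print(forward_max_match("天津奥蓝际德酒店", DICTIONARY))
```

Because every dictionary entry carries its tag, the part of speech comes out of the same lookup that produces the segment, which is the point the paragraph above makes.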
In practical applications, geographic word part-of-speech tags can be added to the address query string according to the tagging rules shown in Table 2; that is, a tag is added only after the words in the address query string that are geographic words. For example, tagging "Tianjin Aolanjide Hotel" according to the rules in Table 2 yields "Tianjin/CS Aolanjide Hotel/OP".
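For the later position check it helps to turn a tagged string into character offsets. The sketch below assumes the tagging result is a plain string in which each tag is written as a slash followed by uppercase ASCII letters, and that the query text itself contains no such sequence; the function name and the reconstructed Chinese strings are hypothetical.

```python
import re


def tag_boundaries(annotated):
    """Split 'text/TAG' annotation into the plain text and (end_offset, tag) pairs."""
    plain, boundaries, pos = "", [], 0
    for match in re.finditer(r"/([A-Z]+)", annotated):
        plain += annotated[pos:match.start()]            # text before this tag
        boundaries.append((len(plain), match.group(1)))  # tag sits at this offset
        pos = match.end()
    plain += annotated[pos:]                             # trailing text without a tag
    return plain, boundaries


# Assumed reconstruction of "Tianjin/CS Aolanjide Hotel/OP".
print(tag_boundaries("天津/CS奥蓝际德酒店/OP"))
```

Each offset counts the characters of the query string that precede the tag, so a tag position can be compared directly with the cumulative lengths of the segments.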
Table 1: Part-of-speech tags of segments
According to the above, segmenting the address query string "Tianjin Aolanjide Hotel" yields the segment set "Tianjin/S Aolanji/H De/H Hotel/U", and geographic word part-of-speech tagging yields "Tianjin/CS Aolanjide Hotel/OP". These results show that "Tianjin Aolanjide Hotel" is segmented at the positions of the added geographic word part-of-speech tags (the position between "Tianjin" and "Ao", and the position after "Hotel"). This indicates that "Tianjin Aolanjide Hotel" contains an unregistered word, which must then be obtained from its segment set. By contrast, segmenting the address query string "Liuliqiaodong" (Liuliqiao East) yields the segment set "Liuli/H Qiaodong/N", while geographic word part-of-speech tagging yields "Liuliqiao/Q Dong". Here the string is not segmented at the position of the added tag (between "qiao" and "dong"); the two results are ambiguous, so no unregistered word is obtained.
Table 2: Address level table
In practical applications, whether the address query string is segmented at the positions of the added geographic word part-of-speech tags is judged as follows: judge whether the segments in the segment set of the address query string satisfy Rule 1 or Rule 2 below; if they do, the address query string is segmented at the positions of the added geographic word part-of-speech tags:
Rule 1: The length of a segment in the segment set equals the length of the part of the address query string that runs from the first character of the segment to the first geographic word part-of-speech tag after that character;
or,
Rule 2: The length of a segment in the segment set is less than the length of the part of the address query string that runs from the first character of the segment to the first geographic word part-of-speech tag after that character, but the combined segment obtained by joining the segment with other segments has a length equal to that of this part of the address query string.
Note that the segmentation result of an address query string is usually finer-grained than its geographic word part-of-speech tagging result. The usual case is therefore that several segments join into a combined segment whose length equals the length of the part of the address query string from the first character of the combined segment to the first geographic word part-of-speech tag after it. The opposite case, in which the tagging result is finer-grained than the segmentation result, is not excluded. In that case, if the segments in the segment set satisfy Rule 3 below, the address query string is also segmented at the positions of the added geographic word part-of-speech tags:
Rule 3: The length of a segment in the segment set is greater than the length of the part of the address query string that runs from the first character of the segment to the first geographic word part-of-speech tag after that character, but equals the length of the part that runs from the first character of the segment to the N-th (N ≥ 2) geographic word part-of-speech tag after that character.
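The three rules can be read as a comparison of character offsets. The following sketch groups segments by the rule they satisfy; it assumes segments are given as plain strings in query order and tags as a sorted list of end offsets, and it leaves any text after the last tag unchecked, which is a choice made here and is not stated in the text. The function name and the reconstructed Chinese strings are hypothetical.

```python
def classify_segments(segments, tag_ends):
    """Return [(rule, [segments])] groups, or None when no rule is satisfied.

    segments: segment strings in query order.
    tag_ends: sorted character offsets at which a geographic tag was added.
    """
    groups, start, i = [], 0, 0
    while i < len(segments):
        later = [t for t in tag_ends if t > start]
        if not later:
            break  # assumption: text after the last tag is not checked
        target = later[0]  # first tag after the segment's first character
        group, end = [], start
        while i < len(segments) and end < target:
            group.append(segments[i])
            end += len(segments[i])
            i += 1
        if end == target:
            rule = 1 if len(group) == 1 else 2   # exact fit alone, or after joining
        elif len(group) == 1 and end in later:
            rule = 3                             # one segment reaches the N-th tag
        else:
            return None                          # a segment straddles a tag position
        groups.append((rule, group))
        start = end
    return groups


print(classify_segments(["天津", "奥蓝际", "德", "酒店"], [2, 8]))
print(classify_segments(["六里", "桥东"], [3]))
```

Grouping the segments by rule is convenient because the next stage works on the consecutive segments that satisfy Rule 2.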
Taking the address query string "Tianjin Aolanjide Hotel" as an example, the method for judging whether an address query string is segmented at the positions of the added geographic word part-of-speech tags is described in detail below with reference to Fig. 2. The segment set is "Tianjin/S Aolanji/H De/H Hotel/U" and the geographic word part-of-speech tagging result is "Tianjin/CS Aolanjide Hotel/OP". The method includes:
Step 301: Read the segment "Tianjin" from the segment set;
Step 302: Compare the length of the segment "Tianjin" with the length of the part of the address query string "Tianjin Aolanjide Hotel" that runs from the character "Tian" to the first geographic word part-of-speech tag after it, namely "Tianjin". The two are equal; go to step 303;
Step 303: Read the segment "Aolanji" from the segment set;
Step 304: Compare the length of the segment "Aolanji" with the length of the part of the address query string "Tianjin Aolanjide Hotel" that runs from the character "Ao" to the first geographic word part-of-speech tag after it, namely "Aolanjide Hotel". The segment is shorter than this part of the address query string; go to step 305;
Step 305: Read the segment "De" from the segment set;
Step 306: Combine the segment "Aolanji" with the segment "De" to obtain the combined segment "Aolanjide";
Step 307: Compare the length of the combined segment "Aolanjide" with the length of the partial address query string "Aolanjide Hotel". The combined segment is shorter; go to step 308;
Step 308: Read the segment "Hotel" from the segment set;
Step 309: Combine the segments "Aolanji", "De", and "Hotel" to obtain the combined segment "Aolanjide Hotel";
Step 310: Compare the length of the combined segment "Aolanjide Hotel" with the length of the partial address query string "Aolanjide Hotel". The two are equal. Because the first segment in the segment set satisfies Rule 1 and the remaining segments satisfy Rule 2, the address query string "Tianjin Aolanjide Hotel" is judged to be segmented at the positions of the added geographic word part-of-speech tags.
Taking the address query string "Liuliqiaodong" (Liuliqiao East) as an example, the method for judging whether an address query string is segmented at the positions of the added geographic word part-of-speech tags is briefly illustrated again. The segment set is "Liuli/H Qiaodong/N" and the geographic word part-of-speech tagging result is "Liuliqiao/Q Dong":
Read the segment "Liuli" from the segment set. Because "Liuli" is shorter than "Liuliqiao", read the segment "Qiaodong" from the segment set and combine it with "Liuli" to obtain the combined segment "Liuliqiaodong". Because "Liuliqiaodong" is longer than "Liuliqiao", the segments in the segment set satisfy none of the three rules above, so the address query string "Liuliqiaodong" is judged not to be segmented at the position of the added geographic word part-of-speech tag.
The method for judging whether an address query string is segmented at the positions of the added geographic word part-of-speech tags has been described in detail above. The method for obtaining unregistered words from the segments in the segment set and their part-of-speech tags is described in detail below with specific examples.
In a specific implementation, step 40 obtains unregistered words from the segments in the segment set and their part-of-speech tags; specifically, it obtains unregistered words from the consecutive segments in the segment set that satisfy Rule 2 and from their part-of-speech tags.
In practical applications, obtaining unregistered words from the consecutive segments that satisfy Rule 2 and their part-of-speech tags specifically includes:
Traverse the consecutive segments in the segment set that satisfy Rule 2. If a single-character segment is found, judge from the part-of-speech tags whether it can be combined with the segment before it or the segments after it, then join the segments that can be combined in the order in which they appear in the address query string and output the result as an unregistered word.
Note that, when a single-character segment is found during the traversal, if several segments precede it, only its adjacent preceding segment is tested. For example, for segment 1, segment 2, segment 3 (single character), only whether segment 3 can be combined with segment 2 is judged. If several segments follow the single character, each following segment is tested in turn. For example, for segment 1 (single character), segment 2, segment 3, first judge whether segment 1 can be combined with segment 2; if it can, then judge whether segment 1, segment 2, and segment 3 can be combined, and so on, until a segment that cannot be combined is found or the last of the consecutive segments satisfying Rule 2 has been processed, at which point the flow ends.
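The traversal just described can be sketched as follows. The function name is hypothetical, and `can_combine` is a stand-in predicate for the part-of-speech test on a neighbouring segment; in this sketch it receives only the segment text, and the traversal resumes after the last combined segment once a word has been output. The Chinese strings are an assumed reconstruction of the translated example.

```python
def find_unregistered(segments, can_combine):
    """Scan consecutive Rule 2 segments and join neighbours around single characters.

    segments: segment strings in query order.
    can_combine: callable taking a neighbouring segment and returning True
                 when the single character may be joined with it.
    """
    found, i = [], 0
    while i < len(segments):
        if len(segments[i]) != 1:
            i += 1
            continue
        # Only the adjacent preceding segment is tested.
        lo = i - 1 if i > 0 and can_combine(segments[i - 1]) else i
        # Following segments are tested one after another until one fails.
        hi = i
        while hi + 1 < len(segments) and can_combine(segments[hi + 1]):
            hi += 1
        if hi > lo:
            found.append("".join(segments[lo:hi + 1]))
        i = hi + 1  # continue after the last combined segment
    return found


print(find_unregistered(["奥蓝际", "德", "酒店"], lambda seg: True))
```

With a predicate that accepts every neighbour, the three segments join into one candidate word; a stricter predicate stops the extension at the first segment it rejects.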
With reference to Fig. 3, the method for judging, from the part-of-speech tags of the segments, whether a single-character segment can be combined with the segment before or after it is described in detail below. The segment before or after the single character is referred to as the segment under examination. The method includes:
Step 4021: Judge whether the segment under examination is a single character; if so, the single character can be combined with the segment under examination; if not, go to step 4022;
Step 4022: Judge whether the segment under examination consists of three or more characters; if so, it cannot be combined; if not, go to step 4023 and step 4025;
Step 4023: Judge whether the part of speech of the segment under examination is village; if so, go to step 4024; if not, go to step 4027;
Step 4024: Judge whether the last character of the segment under examination is a character that denotes a village (for example, a character meaning village, township, or hamlet); if so, it cannot be combined; if not, it can be combined;
Step 4025: Judge whether the part of speech of the segment under examination is road; if so, go to step 4026; if not, go to step 4027;
Step 4026: Judge whether the last character of the segment under examination is a character that denotes a road (for example, a character meaning road, street, lane, or line); if so, it cannot be combined; if not, it can be combined;
In practical applications, for a segment whose part of speech is road or village, such as "Dahe" in "Dahe Village", the segmentation dictionary may also contain "Dahe Road", "Dahe Street", "Dahe Shop", and so on. In this case it is therefore also necessary to judge whether the last character of such a segment has an obvious geographic feature: if the last character means village, township, or road, the segment is a village-level place name or a road name, and the single character cannot be combined with it.
Step 4027: Judge whether the part-of-speech tag of the segment under examination is one of core word, determiner, point-of-interest word, or category word; if not, it cannot be combined; if so, go to step 4028;
Step 4028: Judge whether the segment under examination is a high-frequency word; if so, it cannot be combined; if not, it can be combined.
The query frequency of the segment under examination can be recorded in the preset segmentation dictionary, and whether the segment is a high-frequency word can be judged from it.
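Steps 4021 to 4028 form a short decision chain, sketched below. Everything named here is hypothetical: the `Segment` class, the descriptive part-of-speech labels that stand in for the tag letters of Table 1, the example suffix characters, and the `length_limit` parameter. The limit defaults to three as step 4022 is worded and is kept adjustable because the exact threshold depends on how "three or more" is read.

```python
from dataclasses import dataclass

# Assumed example suffix characters; the actual lists are not given in the text.
VILLAGE_SUFFIXES = {"村", "乡", "庄"}
ROAD_SUFFIXES = {"路", "街", "巷"}
# Descriptive labels standing in for the Table 1 tag letters.
COMBINABLE_POS = {"core", "determiner", "poi", "category"}


@dataclass
class Segment:
    text: str
    pos: str
    high_freq: bool = False  # would be looked up from dictionary query frequency


def can_combine(seg, length_limit=3):
    """Decide whether a single character may be joined with the segment under examination."""
    if len(seg.text) == 1:                 # step 4021: another single character
        return True
    if len(seg.text) >= length_limit:      # step 4022: too long to combine
        return False
    if seg.pos == "village":               # steps 4023 and 4024
        return seg.text[-1] not in VILLAGE_SUFFIXES
    if seg.pos == "road":                  # steps 4025 and 4026
        return seg.text[-1] not in ROAD_SUFFIXES
    if seg.pos not in COMBINABLE_POS:      # step 4027
        return False
    return not seg.high_freq               # step 4028


print(can_combine(Segment("酒店", "category")))
print(can_combine(Segment("王村", "village")))
```

The order of the tests matters: the length and suffix checks filter out complete place names before the part-of-speech and frequency checks are reached.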
The method for judging whether a single character can be combined with the segments before and after it has been described above. The method is illustrated below with the "Tianjin Aolanjide Hotel" example introduced earlier.
In the segment set of the address query string "Tianjin Aolanjide Hotel", the segments that satisfy Rule 2 are "Aolanji/H De/B Hotel/U", so unregistered words are sought among these segments as follows. Traversing "Aolanji/H De/B Hotel/U" finds the single character "De/B". The segment before "De/B" is "Aolanji/H", which consists of three characters; its part of speech is neither village nor road but core word, and it is not a high-frequency word, so "Aolanji/H" and "De/B" are combined into "Aolanjide". The segment after "De/B" is "Hotel/U", which consists of two characters; its part of speech is neither village nor road but category word, and it is not a high-frequency word, so "Aolanjide" is combined with "Hotel" to obtain "Aolanjide Hotel". Because "Hotel" is the last segment, "Aolanjide Hotel" is output as the unregistered word.
The method is described below with the address query string "Foshan Remex RSL Badminton Stadium" as a further example. The segmentation system produces the segment set "Foshan/C Remex/P Ya/H Shilong/P Yumao/E Qiuguan/U" for this string (here "Ya" and "Shilong" together make up "RSL", and "Yumao" (feather) and "Qiuguan" (arena) together make up "Badminton Stadium"), and the geographic level tagging system produces the geographic level tagging "Foshan/OC Remex RSL Badminton Stadium/OP". The segments "Remex/P Ya/H Shilong/P Yumao/E Qiuguan/U" satisfy Rule 2. Traversing them finds the single character "Ya", whose adjacent segments are "Remex/P" and "Shilong/P". Both are shorter than three characters, and both are point-of-interest words that are not high-frequency words, so "Remex/P" and "Shilong/P" can be combined with "Ya" to give "Remex RSL". The segment after "Shilong" is "Yumao", a short, high-frequency, non-geographic word that cannot be combined, so identification of the first unregistered word ends with the result "Remex RSL". The traversal then continues from "Yumao", which is not a single character, and reaches "Qiuguan", which is also not a single character and is a category word, so the flow ends. One unregistered word, "Remex RSL", is finally found in "Foshan Remex RSL Badminton Stadium".
The search-engine-oriented data processing method provided by the embodiments of the present invention has been described above. The device that implements this method is described in detail below with reference to the drawings.
Referring to Fig. 4, a search-engine-oriented data processing device provided by an embodiment of the present invention includes:
a segmentation unit 50, configured to segment an address query string to obtain the segment set of the address query string;
a tagging unit 51, configured to add geographic word part-of-speech tags to the address query string;
a segmentation position judging unit 52, configured to judge whether the address query string is cut by the segmentation unit at the positions of the added geographic word part-of-speech tags and, if so, to trigger the unregistered word obtaining unit 53;
an unregistered word obtaining unit 53, configured to obtain unregistered words from the segments in the segment set.
In practical applications, the segmentation position judgment unit 52 is specifically configured to:
Judge whether the segments in the segment set of the address query string satisfy the following rules; if they do, the address query string is cut by the segmentation unit at the positions of the added geographic word part-of-speech tags:
Rule 1: the length of a segment in the segment set equals the length of the part of the address query string that runs from the first character of the segment to the first geographic word part-of-speech tag after that first character;
Or,
Rule 2: the length of a segment in the segment set is less than the length of the part of the address query string that runs from the first character of the segment to the first geographic word part-of-speech tag after that first character, but the length of the combined segment obtained by combining the segment with other segments equals the length of that part of the address query string.
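One literal reading of Rule 1 and Rule 2 can be sketched in Python as below. The function name `classify_segments`, the list-of-strings representation, and the placeholder strings (one letter per character) are assumptions made for illustration; the source does not prescribe a data structure. The sketch treats the end of each geographically tagged span as a tag position and checks whether each segment, alone or combined with the segments that follow it, ends exactly at the next tag position.

```python
from itertools import accumulate
from typing import List, Optional


def classify_segments(
    segments: List[str], geo_spans: List[str]
) -> List[Optional[str]]:
    """Label each segment 'rule1', 'rule2', or None (neither rule holds)."""
    if any(not s for s in segments) or "".join(segments) != "".join(geo_spans):
        raise ValueError("segments and geo_spans must cover the same query string")
    # Exclusive end offset of every tagged span, i.e. where a tag sits.
    tag_positions = list(accumulate(len(s) for s in geo_spans))
    seg_ends = list(accumulate(len(s) for s in segments))
    seg_end_set = set(seg_ends)
    labels: List[Optional[str]] = []
    start = 0
    for seg, end in zip(segments, seg_ends):
        # First tag position after the segment's first character.
        next_tag = next(p for p in tag_positions if p > start)
        remaining = next_tag - start
        if len(seg) == remaining:
            labels.append("rule1")
        elif len(seg) < remaining and next_tag in seg_end_set:
            # Combining with the following segments reaches the tag exactly.
            labels.append("rule2")
        else:
            labels.append(None)  # the segment straddles a tag position
        start = end
    return labels


# Placeholder lengths mirror the example: 2, 1, 1, 2, 2, 2 characters (assumed).
labels = classify_segments(
    ["FS", "L", "Y", "SL", "YM", "QG"], ["FS", "LYSLYMQG"]
)
```

Under this reading, the last segment of a tagged span always satisfies Rule 1, and the segments before it inside the same span satisfy Rule 2. A query counts as cut at the tag positions only when no segment is labeled `None`.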
In practical applications, the unregistered word obtaining unit 53 is specifically configured to:
Obtain an unregistered word according to the consecutive segments in the segment set that satisfy Rule 2 and their part-of-speech tags.
Preferably, in practical applications, the unregistered word obtaining unit 53 specifically includes:
A single-character finding subunit, configured to traverse the consecutive segments in the segment set that satisfy Rule 2 and, if a single-character segment is found, to trigger the unregistered word obtaining subunit;
A segment combination judgment subunit, configured to judge, according to the part-of-speech tags of the segments, whether the single-character segment can be combined with the segment before or after it;
An unregistered word obtaining subunit, configured to combine the segments that the segment combination judgment subunit has judged combinable, in the order in which they appear in the address query string, and to output the result as an unregistered word.
Preferably, the segment immediately before or after the single-character segment is called the segment under examination, and the segment combination judgment subunit specifically includes:
A single-character judgment subunit, configured to judge whether the segment under examination is a single character; if so, it can be combined; if not, the word-length judgment subunit is triggered;
The word-length judgment subunit, configured to judge whether the segment under examination consists of three or more characters; if so, it cannot be combined; if not, a first part-of-speech judgment subunit is triggered to judge whether the part of speech of the segment under examination is village, and a second part-of-speech judgment subunit is triggered to judge whether the part of speech of the segment under examination is road;
If the part of speech of the segment under examination is village and its last character is a character denoting a village, it cannot be combined;
If the part of speech of the segment under examination is village but its last character is not a character denoting a village, it can be combined;
If the part of speech of the segment under examination is road and its last character is a character denoting a street, it cannot be combined;
If the part of speech of the segment under examination is road but its last character is not a character denoting a street, it can be combined;
If the part of speech of the segment under examination is neither village nor road, a third part-of-speech judgment subunit is triggered to judge whether the part-of-speech tag of the segment under examination is core word, determiner, point-of-interest word, or classifier; if not, it cannot be combined; if so, and the segment under examination is not a high-frequency word, it can be combined.
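The chain of judgments above amounts to a small decision procedure, sketched here in Python under stated assumptions. The `Segment` type, the tag names (`"village"`, `"road"`, `"core"`, `"determiner"`, `"poi"`, `"classifier"`), and the suffix sets are hypothetical: the text does not list the actual tag codes or which characters denote a village or a street, so 村, 路, and 街 are used purely as illustrative values.

```python
from dataclasses import dataclass

# Hypothetical values; the source does not enumerate these sets or tag names.
VILLAGE_SUFFIXES = frozenset({"村"})
STREET_SUFFIXES = frozenset({"路", "街"})
COMBINABLE_TAGS = frozenset({"core", "determiner", "poi", "classifier"})


@dataclass(frozen=True)
class Segment:
    text: str                     # characters of the segment
    tag: str                      # part-of-speech tag (names assumed)
    high_frequency: bool = False  # whether the word is a high-frequency word


def can_combine(neighbor: Segment) -> bool:
    """Decide whether the segment under examination may join a single character."""
    if len(neighbor.text) == 1:
        return True   # a single character can always be combined
    if len(neighbor.text) >= 3:
        return False  # three or more characters: cannot be combined
    if neighbor.tag == "village":
        # Combinable only if it does not end in a village-denoting character.
        return neighbor.text[-1] not in VILLAGE_SUFFIXES
    if neighbor.tag == "road":
        # Combinable only if it does not end in a street-denoting character.
        return neighbor.text[-1] not in STREET_SUFFIXES
    # Neither village nor road: tag must be in the allowed set, word not high-frequency.
    return neighbor.tag in COMBINABLE_TAGS and not neighbor.high_frequency
```

For instance, a two-character point-of-interest word that is not high-frequency passes, while the same word flagged as high-frequency does not, which matches how "lion dragon" and "feather" are treated in the worked example.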
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be understood by reference to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Likewise, because the system embodiments essentially correspond to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiments.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the embodiments of the present invention. Therefore, the embodiments of the present invention are not limited to the embodiments shown herein, but are to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A search-engine-oriented data processing method, characterized in that the method comprises:
Segmenting an address query string to obtain a segment set of the address query string;
Adding geographic word part-of-speech tags to the address query string;
Judging whether the address query string is segmented at the positions of the added geographic word part-of-speech tags, and if so, obtaining an unregistered word according to the segments in the segment set.
2. The method according to claim 1, characterized in that judging whether the address query string is segmented at the positions of the added geographic word part-of-speech tags specifically comprises:
Judging whether the segments in the segment set of the address query string satisfy the following rules; if they do, the address query string is segmented at the positions of the added geographic word part-of-speech tags:
Rule 1: the length of a segment in the segment set equals the length of the part of the address query string that runs from the first character of the segment to the first geographic word part-of-speech tag after that first character;
Or,
Rule 2: the length of a segment in the segment set is less than the length of the part of the address query string that runs from the first character of the segment to the first geographic word part-of-speech tag after that first character, but the length of the combined segment obtained by combining the segment with other segments equals the length of that part of the address query string.
3. The method according to claim 2, characterized in that obtaining an unregistered word according to the segments in the segment set specifically comprises:
Obtaining an unregistered word according to the consecutive segments in the segment set that satisfy Rule 2 and their part-of-speech tags.
4. The method according to claim 3, characterized in that obtaining an unregistered word according to the consecutive segments in the segment set that satisfy Rule 2 and their part-of-speech tags specifically comprises:
Traversing the consecutive segments in the segment set that satisfy Rule 2; if a single-character segment is found, judging, according to the part-of-speech tags of the segments, whether the single-character segment can be combined with the segment before or after it; and combining the segments that can be combined, in the order in which they appear in the address query string, and outputting the result as an unregistered word.
5. The method according to claim 4, characterized in that the segment immediately before or after the single-character segment is called the segment under examination, and judging, according to the part-of-speech tags of the segments, whether the single-character segment can be combined with the segment before or after it specifically comprises:
Judging whether the segment under examination is a single character; if so, it can be combined; if not, judging whether the segment under examination consists of three or more characters; if so, it cannot be combined; if not, judging whether the part of speech of the segment under examination is village and judging whether the part of speech of the segment under examination is road;
If the part of speech of the segment under examination is village and its last character is a character denoting a village, it cannot be combined;
If the part of speech of the segment under examination is village but its last character is not a character denoting a village, it can be combined;
If the part of speech of the segment under examination is road and its last character is a character denoting a street, it cannot be combined;
If the part of speech of the segment under examination is road but its last character is not a character denoting a street, it can be combined;
If the part of speech of the segment under examination is neither village nor road, judging whether the part-of-speech tag of the segment under examination is core word, determiner, point-of-interest word, or classifier; if not, it cannot be combined; if so, and the segment under examination is not a high-frequency word, it can be combined.
6. A search-engine-oriented data processing device, characterized in that the device comprises:
A segmentation unit, configured to segment an address query string to obtain a segment set of the address query string;
A tagging unit, configured to add geographic word part-of-speech tags to the address query string;
A segmentation position judgment unit, configured to judge whether the address query string is cut by the segmentation unit at the positions of the added geographic word part-of-speech tags, and if so, to trigger an unregistered word obtaining unit;
The unregistered word obtaining unit, configured to obtain an unregistered word according to the segments in the segment set.
7. The device according to claim 6, characterized in that the segmentation position judgment unit is specifically configured to:
Judge whether the segments in the segment set of the address query string satisfy the following rules; if they do, the address query string is cut by the segmentation unit at the positions of the added geographic word part-of-speech tags:
Rule 1: the length of a segment in the segment set equals the length of the part of the address query string that runs from the first character of the segment to the first geographic word part-of-speech tag after that first character;
Or,
Rule 2: the length of a segment in the segment set is less than the length of the part of the address query string that runs from the first character of the segment to the first geographic word part-of-speech tag after that first character, but the length of the combined segment obtained by combining the segment with other segments equals the length of that part of the address query string.
8. The device according to claim 7, characterized in that the unregistered word obtaining unit is specifically configured to:
Obtain an unregistered word according to the consecutive segments in the segment set that satisfy Rule 2 and their part-of-speech tags.
9. The device according to claim 8, characterized in that the unregistered word obtaining unit specifically comprises:
A single-character finding subunit, configured to traverse the consecutive segments in the segment set that satisfy Rule 2 and, if a single-character segment is found, to trigger the unregistered word obtaining subunit;
A segment combination judgment subunit, configured to judge, according to the part-of-speech tags of the segments, whether the single-character segment can be combined with the segment before or after it;
An unregistered word obtaining subunit, configured to combine the segments that the segment combination judgment subunit has judged combinable, in the order in which they appear in the address query string, and to output the result as an unregistered word.
10. The device according to claim 9, characterized in that the segment immediately before or after the single-character segment is called the segment under examination, and the segment combination judgment subunit specifically comprises:
A single-character judgment subunit, configured to judge whether the segment under examination is a single character; if so, it can be combined; if not, the word-length judgment subunit is triggered;
The word-length judgment subunit, configured to judge whether the segment under examination consists of three or more characters; if so, it cannot be combined; if not, a first part-of-speech judgment subunit is triggered to judge whether the part of speech of the segment under examination is village, and a second part-of-speech judgment subunit is triggered to judge whether the part of speech of the segment under examination is road;
If the part of speech of the segment under examination is village and its last character is a character denoting a village, it cannot be combined;
If the part of speech of the segment under examination is village but its last character is not a character denoting a village, it can be combined;
If the part of speech of the segment under examination is road and its last character is a character denoting a street, it cannot be combined;
If the part of speech of the segment under examination is road but its last character is not a character denoting a street, it can be combined;
If the part of speech of the segment under examination is neither village nor road, a third part-of-speech judgment subunit is triggered to judge whether the part-of-speech tag of the segment under examination is core word, determiner, point-of-interest word, or classifier; if not, it cannot be combined; if so, and the segment under examination is not a high-frequency word, it can be combined.
CN201310250057.2A 2013-06-21 2013-06-21 The data processing method and device of Search Engine-Oriented Active CN104239355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310250057.2A CN104239355B (en) 2013-06-21 2013-06-21 The data processing method and device of Search Engine-Oriented

Publications (2)

Publication Number Publication Date
CN104239355A CN104239355A (en) 2014-12-24
CN104239355B true CN104239355B (en) 2018-09-11

Family

ID=52227438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310250057.2A Active CN104239355B (en) 2013-06-21 2013-06-21 The data processing method and device of Search Engine-Oriented

Country Status (1)

Country Link
CN (1) CN104239355B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20200514

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 102200, No. 8, No., Changsheng Road, Changping District science and Technology Park, Beijing, China. 1-5

Patentee before: AUTONAVI SOFTWARE Co.,Ltd.