CN102023984B - Method and device for screening duplicated entity data - Google Patents


Info

Publication number
CN102023984B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2009101705511A
Other languages
Chinese (zh)
Other versions
CN102023984A (en
Inventor
莫正华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taobao China Software Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN2009101705511A priority Critical patent/CN102023984B/en
Publication of CN102023984A publication Critical patent/CN102023984A/en
Priority to HK11105866.7A priority patent/HK1152126A1/en
Application granted granted Critical
Publication of CN102023984B publication Critical patent/CN102023984B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a method and a system for screening duplicated entity data. The method comprises the following steps: 1, a server acquires entity data to be screened; 2, the server compares the name of the entity data to be screened, one by one, with the names of a predetermined amount of entity data in a database to obtain a score; 3, the server determines, from the comparison score and a preset standard score, whether the entity data to be screened duplicates the compared entity data in the database; and 4, the server adds the entity data to be screened that is not duplicated into the database. With this method, duplicated entity data can be screened out efficiently.

Description

Method and apparatus for screening duplicated entity data
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and apparatus for screening duplicated entity data.
Background
Search engine technology uses specific computer programs to collect information on the Internet according to a certain strategy and, after organizing and processing that information, provides a retrieval service to users. Since search engines first appeared on the Internet, websites providing search services have gone further and launched life search, so as to better serve users looking for information about their surroundings. Life search means that the search engine targets well-defined everyday-life information and processes it in depth, which brings great convenience to users. For example, by choosing labels such as a life category and a region and then using the search engine, a user can easily find categorized life information nearby. At present many types of information can be searched, including housing, job opportunities, train tickets, goods trading, and food and drink.
A website that provides life search needs a large amount of rich data stored in its database in advance. Taking life search for entities such as shops as an example, the website needs to collect information on as many shops as possible beforehand. Moreover, as the life-service industry develops, the website must continually expand and update the shop data in its database.
The resulting problem is this: if shop data to be imported is identical to existing shop data in the site database, that is, the shop to be imported already exists in the database, then importing the new shop data duplicates data in the site database. These duplicates obviously waste valuable database storage space. Furthermore, when a user initiates a shop search and duplicate shop data exists in the site database, the website returns the duplicates to the user, yet the returned records actually represent fewer distinct shops. Such search results are not what the user wants, and they harm the user's search experience. To protect the user experience and provide an accurate search service, the uniqueness of every shop in the database must be guaranteed.
However, the prior art offers no method for efficiently screening duplicated entity data.
Summary of the invention
The purpose of the embodiments of the present application is to provide a method and apparatus for screening duplicated entity data, so that duplicated entity data can be screened efficiently.
To solve the above technical problem, the embodiments of the present application provide a method and system for screening duplicated entity data, implemented as follows:
A method for screening duplicated entity data, comprising:
S1: a server obtains entity data to be screened;
S2: the server compares the name of the entity data to be screened, one by one, with the names of a predetermined amount of entity data in a database, in the following manner:
using preset segmentation dictionaries for different parts of speech, segmenting the entity name to be screened and the entity names in the database into words and determining the part of speech of each word;
filling the segmented and tagged shop name to be screened and the segmented and tagged entity name in the database into a predetermined template, respectively;
obtaining an entity-name comparison score by comparing whether the words at the corresponding part-of-speech positions in the template are identical between the shop name to be screened and the entity shop name in the database;
S3: the server judges, from the comparison score and a preset standard score, whether the entity data to be screened duplicates the compared entity data in the database;
S4: the server adds the entity data to be screened that is judged not to be a duplicate into the database.
A device for screening duplicated entity data, comprising:
an acquiring unit, configured to obtain entity data to be screened;
a name comparing unit, configured to compare the name of the entity data to be screened, one by one, with the names of a predetermined amount of entity data in a database, in the following manner:
using preset segmentation dictionaries for different parts of speech, segmenting the entity name to be screened and the entity names in the database into words and determining the part of speech of each word; filling the segmented and tagged shop name to be screened and the segmented and tagged entity name in the database into a predetermined template, respectively; and obtaining an entity-name comparison score by comparing whether the words at the corresponding part-of-speech positions in the template are identical between the shop name to be screened and the entity shop name in the database;
a judging unit, configured to judge, from the comparison score and a preset standard score, whether the entity data to be screened duplicates the compared entity data in the database;
an adding unit, configured to add the entity data judged not to be a duplicate into the database.
As can be seen from the technical solution provided by the embodiments of the present application above, the server obtains entity data to be screened, compares its entity name one by one with those of a predetermined amount of entity data in the database and scores each comparison, and then judges from the comparison score and a preset standard score whether the entity data to be screened duplicates entity data in the database. Duplicated entity data can thus be screened out efficiently, and entity data that is not duplicated is added to the database.
Description of the drawings
To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some of the embodiments recorded in the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a structural diagram of the system to which the present application relates;
Fig. 2 is a flowchart of an embodiment of the method for screening duplicated entity data of the present application;
Fig. 3 is a topology diagram of the PostgreSQL database cluster of the present application;
Fig. 4 is a flowchart of the entity-name comparison in S202 of the present application;
Fig. 5 is a schematic diagram of the priority-matching principle that the present application applies to the classification of institution words;
Fig. 6 is a block diagram of one server embodiment of the present application;
Fig. 7 is a block diagram of another server embodiment of the present application;
Fig. 8 is a block diagram of another server embodiment of the present application;
Fig. 9 is a block diagram of another server embodiment of the present application;
Fig. 10 is a block diagram of another server embodiment of the present application.
Detailed description of the embodiments
The embodiments of the present application provide a method and system for screening duplicated entity data.
To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application without creative effort shall fall within the scope of protection of the present invention.
The system architecture to which the present application relates may be as shown in Fig. 1 and comprises a server and a database.
An embodiment of the method for screening duplicated entity data of the present application is introduced below. Fig. 2 shows the flow of this method embodiment. As shown in Fig. 2:
S201: the server obtains entity data to be screened.
The entity here may be a shop, and the following description takes a shop entity as the example.
The data of a shop entity to be screened comprises attribute data such as the shop name, shop address and shop telephone number. This data serves as the basis for deciding whether the shop to be screened duplicates an existing shop in the database.
S202: the server compares the entity name of the entity data to be screened, one by one, with those of a predetermined amount of entity data in the database. Each comparison comprises:
using preset segmentation dictionaries for different parts of speech, segmenting the entity name to be screened and the entity names in the database into words and determining the part of speech of each word, where the segmentation and tagging of the entity names in the database can be completed in advance;
filling the segmented and tagged shop name to be screened and the segmented and tagged entity name in the database into a predetermined template, respectively;
obtaining an entity-name comparison score by comparing whether the words at the corresponding part-of-speech positions in the template are identical between the shop name to be screened and the entity shop name in the database.
S203: the server judges, from the comparison score and a preset standard score, whether the entity data to be screened duplicates the entity data in the database.
The entity-name comparison in S202 is described in detail below. It compares the name of the entity to be screened with an entity name in the database and comprises the steps shown in Fig. 4:
S401: using preset segmentation dictionaries for different parts of speech, segment the entity name to be screened and the entity names in the database into words and determine the part of speech of each word.
The shop name is segmented with the segmentation dictionaries: each word in the shop name is matched to the predefined dictionary that contains it, and the word is tagged with the part of speech of that dictionary (in this example the parts of speech are labeled S, P, C, O, N, X and so on).
For example, a shop name such as "Xi'an Huade Catering Co., Ltd. Dekeshi Gaoxin Restaurant" yields, after shop-name segmentation, the result "Xi'an_S Huade_X Catering Co., Ltd._O Dekeshi_N Gaoxin_S Restaurant_O".
To segment an entity name and identify the attribute (place, institution, industry and so on) of each resulting word, sub-dictionaries classified by part of speech must be saved in advance, such as a region dictionary, place dictionary, institution dictionary, industry dictionary, direction dictionary, main shop name dictionary and branch name dictionary. To improve segmentation accuracy, a brand sub-dictionary is added; it collects the brand names of shops, for example McDonald's, KFC and Waipojia.
In the example of this embodiment, an entity name (such as a shop name) is split along the dimensions listed in Table 2:
Table 2 (table image not reproduced in this text)
Each of the sub-dictionaries is introduced below:
The region dictionary: it contains the names of all provinces, cities, towns, townships and villages in China. For each name, the dictionary may additionally keep a form with the administrative suffix removed; for example, it can contain both "Zhejiang Province" and "Zhejiang", both "Hangzhou City" and "Hangzhou", both "Xiaoshan City" and "Xiaoshan", and both "Huaxi Village" and "Huaxi". The suffixes themselves (province, city, town, banner, village, township) are also included in this dictionary as region words. This dictionary is relatively fixed and does not change much.
The place dictionary: it contains place nouns ending in words such as street, road, lane, residential community, li, nong (alley), avenue, village, market, square, mansion, factory building, shop and crossing. These ending words themselves are also entries in this dictionary.
The institution dictionary: all institution words are divided into 8 categories: comprehensive services, entertainment and leisure, shopping plazas, food and drink, living information, travel/tourism, landmarks, and company/factory. Shop classification also has fuzzy areas: a shop may belong both to the living information category and to the travel/tourism category. In such cases the classification of institution words follows a priority-matching principle, as shown in Fig. 5, in which the priority of the institution word categories runs from high to low in the direction of the arrow. For example, the institution word "Internet bar" may appear in the entertainment and leisure category and may also appear in living information. According to the priorities in Fig. 5, entertainment and leisure has a higher priority than living information, so "Internet bar" is an institution word of the entertainment and leisure category. The present application does not restrict the method of classifying institutions or the priority-matching principle for institution categories.
The industry dictionary: words such as shop, branch, yuan (garden), pavilion and head office are shared by several industry categories and can be put into a separate dictionary. For example, the shop names "Hangzhou East Building Materials Company Xiaoshan Branch" and "Zhiweiguan Shangcheng Branch" both contain the word "branch", but "branch" belongs to company/factory in one case and to food and drink in the other. The word "branch" therefore cannot be assigned to either of those dictionaries and must be indexed in a separate "institution without industry" dictionary file. Words similar to "branch" include team, building, shop, city, seat, society, hang, store, association, hall, room, pu and so on.
The main shop name dictionary and branch name dictionary: to improve segmentation accuracy, a brand sub-dictionary is added. It collects the brand names of all shops on Koubei (www.koubei.com), for example McDonald's, KFC and Waipojia.
Words in this dictionary have the following characteristics: brand words shorter than two characters or longer than six characters are not included, and a word must not already exist in the shop industry dictionary or the institution dictionary.
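The two inclusion rules for brand words can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name `build_brand_dictionary`, its parameters and the sample words are hypothetical, each letter stands in for one character of a shop name, and length is counted in characters.

```python
def build_brand_dictionary(candidates, industry_words, institution_words,
                           min_len=2, max_len=6):
    """Keep only candidate brand words that satisfy the stated rules.

    A word is kept when its length is between min_len and max_len
    characters and it is not already an industry or institution word.
    """
    reserved = set(industry_words) | set(institution_words)
    return {
        word for word in candidates
        if min_len <= len(word) <= max_len and word not in reserved
    }


# Hypothetical sample data; each letter stands in for one character.
brands = build_brand_dictionary(
    candidates=["A", "KFC", "SHOP", "TOOLONGX"],
    industry_words=["FOOD"],
    institution_words=["SHOP"],
)
```

Here "A" is too short, "TOOLONGX" is too long, and "SHOP" is already an institution word, so only "KFC" remains.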
The parts of speech are introduced here:
N: brand word
O: institution word
C: industry word
S: region word
P: place word
X: word that could not be separated (unrecognized)
SN: branch name brand word
PS: region word
FN: weighted word
As an example of FN: in "XX Bath Center of a certain hotel", the part of speech of XX is FN; likewise, in "XX Parking Lot of a certain company", the part of speech of XX is FN.
Using the above segmentation dictionaries to segment the entity name to be screened and the entity names in the database and to determine parts of speech can be carried out as follows:
A1: using the preset segmentation dictionaries for different parts of speech, perform forward segmentation and reverse segmentation on the entity name to be screened and the entity names in the database.
Forward segmentation segments the shop name from left to right according to the dictionaries; reverse segmentation segments it from right to left. For example, for the shop named "potato restaurant space flight bridge shop", forward segmentation gives "potato_restaurant_space flight_bridge_shop", while reverse segmentation gives "potato_restaurant_boat_overline bridge_shop". The words obtained by forward and reverse segmentation of "potato restaurant space flight bridge shop" are shown in Table 3 and Table 4 below:
Word            Word length   Part of speech
Potato          2             N
Restaurant      2             C
Space flight    2             N
Bridge          1             N
Shop            1             N
Table 3. Forward segmentation result
Word            Word length   Part of speech
Potato          2             N
Restaurant      2             C
Boat            1             N
Overline bridge 2             N
Shop            1             N
Table 4. Reverse segmentation result
Segmentation here uses a short-circuit method, so each word has only one part of speech. Short-circuiting means that, when a piece of text is looked up in the dictionaries, the lookup stops as soon as a matching word is found and matching then moves on to the next word. This guarantees that the part of speech assigned from the dictionaries is unambiguous.
The maximum match length can be set to 10 characters. Working from left to right, up to 10 characters are taken at once and checked against the dictionary. If these 10 characters form a word, the word is tagged with its part of speech and this step ends. If not, the first 9 characters are taken and checked in the same way; if a matching word is found in the dictionary, this step ends. Otherwise the string is shortened by one more character and matched against the dictionary again, and so on until only one character remains in the string.
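The matching procedure just described, applied from the left and from the right, can be sketched as below. This is an illustrative sketch rather than the patent's code: the function names, the `lexicon` mapping from word to part of speech, and the toy data are assumptions, and each letter stands in for one character of a shop name. A single character that matches nothing is tagged X.

```python
MAX_LEN = 10  # maximum number of characters tried in one match


def forward_segment(name, lexicon, max_len=MAX_LEN):
    """Left-to-right maximum matching; returns a list of (word, pos)."""
    tokens, i = [], 0
    while i < len(name):
        for size in range(min(max_len, len(name) - i), 0, -1):
            piece = name[i:i + size]
            if piece in lexicon or size == 1:
                tokens.append((piece, lexicon.get(piece, "X")))
                i += size
                break
    return tokens


def backward_segment(name, lexicon, max_len=MAX_LEN):
    """Right-to-left maximum matching; returns a list of (word, pos)."""
    tokens, j = [], len(name)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = name[j - size:j]
            if piece in lexicon or size == 1:
                tokens.append((piece, lexicon.get(piece, "X")))
                j -= size
                break
    tokens.reverse()
    return tokens


# Toy lexicon: "EF" and "FG" overlap, so the two directions disagree.
lexicon = {"AB": "N", "CD": "C", "EF": "N", "FG": "N"}
forward = forward_segment("ABCDEFGH", lexicon)
backward = backward_segment("ABCDEFGH", lexicon)
```

With this toy lexicon the forward pass splits the overlapping part as "EF" + "G" and the backward pass as "E" + "FG", which is the kind of disagreement the disambiguation step has to resolve.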
As the example above shows, forward segmentation and reverse segmentation can produce different results, so a procedure is needed to choose which result to adopt as the final segmentation. That procedure is the disambiguation described next.
A2: perform disambiguation on the words obtained by forward and reverse segmentation to obtain a unique segmentation result.
The disambiguation here comprises the following three rules:
Rule I: prefer the result that separates out a non-X word.
For example, for the candidate results "Shanghai_S South Station_P Restaurant_O" and "Up_X Hainan_S Station_X Restaurant_O", the ambiguous parts are "Shanghai_S South Station_P" and "Up_X Hainan_S Station_X". The result that first separates out a non-X word, "Shanghai_S South Station_P", is taken as correct.
Rule II: prefer the result with the fewest X words. For example, segmenting "gourmet shop" forward and in reverse according to A1 gives "gourmet_C shop_O" and "fine_X food shop_O" respectively. The former, "gourmet_C shop_O", contains the fewest X words, so according to Rule II "gourmet_C shop_O" is taken.
Rule III: take the reverse segmentation result as the correct result.
Word            Word length   Part of speech
Potato          2             N
Restaurant      2             C
Boat            1             N
Overline bridge 2             N
Shop            1             N
Table 5. Segmentation result obtained after disambiguation
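One way to read the three disambiguation rules is as an ordered fallback chain, sketched below. The interpretation of Rule I (look at the first position where the two results differ and prefer the side whose word there is not X) is an assumption, as are the function name and the English glosses used as sample words.

```python
def disambiguate(forward, backward):
    """Choose between two segmentations, each a list of (word, pos)."""
    if forward == backward:
        return forward
    # Rule I: at the first differing position, prefer the side whose
    # word has a recognized (non-X) part of speech.
    for f, b in zip(forward, backward):
        if f != b:
            f_known, b_known = f[1] != "X", b[1] != "X"
            if f_known != b_known:
                return forward if f_known else backward
            break
    # Rule II: prefer the result containing fewer X words.
    fx = sum(1 for _, pos in forward if pos == "X")
    bx = sum(1 for _, pos in backward if pos == "X")
    if fx != bx:
        return forward if fx < bx else backward
    # Rule III: otherwise take the reverse segmentation result.
    return backward


rule1 = disambiguate(
    [("Shanghai", "S"), ("South Station", "P"), ("Restaurant", "O")],
    [("Up", "X"), ("Hainan", "S"), ("Station", "X"), ("Restaurant", "O")],
)
rule3 = disambiguate(
    [("Potato", "N"), ("Restaurant", "C"), ("Space flight", "N"),
     ("Bridge", "N"), ("Shop", "N")],
    [("Potato", "N"), ("Restaurant", "C"), ("Boat", "N"),
     ("Overline bridge", "N"), ("Shop", "N")],
)
```

For the potato example every word is recognized in both directions, so Rules I and II do not decide and the reverse result is returned, matching Table 5.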
In addition, the segmentation priority of the dictionaries can be set as follows: brand (N), institution (O), industry (C), institution without industry (O), place (P), region (S). For example, suppose the word "little shop" appears in both the brand dictionary and the institution dictionary, and there is a shop named "gourmet little shop". The segmentation result is then "gourmet_C little shop_N" rather than "gourmet_C little shop_O", because words in the N dictionary have a higher priority than words in the O dictionary.
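The priority order combines naturally with the short-circuit lookup: the dictionaries are consulted from highest to lowest priority and the first hit decides the part of speech. A small sketch follows, in which the dictionary contents and the name `tag_word` are hypothetical.

```python
# Dictionaries listed from highest to lowest priority, in the order
# given in the text. The contents are hypothetical sample data.
PRIORITY_ORDER = [
    ("N", {"little shop"}),             # brand
    ("O", {"little shop", "center"}),   # institution
    ("C", {"gourmet"}),                 # industry
    ("O", {"branch"}),                  # institution without industry
    ("P", {"south station"}),           # place
    ("S", {"shanghai"}),                # region
]


def tag_word(word, dictionaries=PRIORITY_ORDER):
    """Return the part of speech of the first dictionary containing word."""
    for pos, words in dictionaries:
        if word in words:
            return pos  # short-circuit: later dictionaries are not consulted
    return "X"  # not found in any dictionary
```

With this sample data, `tag_word("little shop")` yields "N" even though the word is also listed as an institution word.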
The method may further comprise A3: single-character word processing.
After disambiguation some parts may still be unrecognized. When an unrecognized part tagged X is only one character long, single-character word processing is needed to determine its part of speech. For example:
If the right single-character word is a catering-industry suffix and the left word is an industry word or a brand word, the left word and the right word are merged, and the left word's part of speech prevails.
In the example above, the segmentation result is "space flight_C bridge_X shop_O"; after processing, the middle word "bridge" is merged with the words to its left and right into a new word, "space flight bridge shop_C".
If the right word is a catering-industry prefix and the word that follows it (the first word after the right word) is an industry word or a brand word, the right word and the following word are merged, and the following word's part of speech prevails.
If the right word is an institution suffix and the left word is an institution word or a brand word, the left word and the right word are merged, and the left word's part of speech prevails.
If the right word is an institution prefix and the word that follows it is an institution word or a brand word, the right word and the following word are merged, and the following word's part of speech prevails.
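The four merge rules share one shape: a single-character X word is attached either to the word on its left (suffix case) or to the word that follows it (prefix case), and the neighbour's part of speech is kept. The sketch below encodes that shape. The rule tables `suffix_rules` and `prefix_rules`, which map a character to the neighbour parts of speech it may attach to, are hypothetical, and the sketch does not reproduce the further merge with a following institution word shown in the "space flight bridge shop" example.

```python
def merge_single_chars(tokens, suffix_rules, prefix_rules):
    """Merge single-character X words into a neighbouring word.

    tokens: list of (word, pos). suffix_rules / prefix_rules map a
    character to the set of neighbour parts of speech it may join.
    """
    out, i = [], 0
    while i < len(tokens):
        word, pos = tokens[i]
        if pos == "X" and len(word) == 1:
            # Suffix case: join the word on the left, keep its tag.
            if word in suffix_rules and out and out[-1][1] in suffix_rules[word]:
                left_word, left_pos = out.pop()
                out.append((left_word + word, left_pos))
                i += 1
                continue
            # Prefix case: join the following word, keep its tag.
            if (word in prefix_rules and i + 1 < len(tokens)
                    and tokens[i + 1][1] in prefix_rules[word]):
                next_word, next_pos = tokens[i + 1]
                out.append((word + next_word, next_pos))
                i += 2
                continue
        out.append((word, pos))
        i += 1
    return out


# Each letter stands in for one character; "G" is a hypothetical
# catering suffix that may follow an industry (C) or brand (N) word.
merged = merge_single_chars(
    [("EF", "C"), ("G", "X"), ("H", "O")],
    suffix_rules={"G": {"C", "N"}},
    prefix_rules={},
)
```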
S402: fill the segmented and tagged shop name to be screened and the segmented and tagged entity name in the database into a predetermined template, respectively.
For example, the template may take the following form, where the number before each part of speech is that part of speech's position in the template:
0:N 1:S 2:P 3:PS 4:C (describing main shop industry) 5:O 6:SN (describing the industry of SN) 7:C 8:O 9:FN
The segmentation results obtained above, that is, the segmented and tagged shop name to be screened and the segmented and tagged entity name in the database, are filled into the template according to part of speech.
For example, the shops to be compared are "Zuodeng Coronis International Beauty Image Shop" and "Zuodeng Coronis".
"Zuodeng Coronis International Beauty Image Shop" is segmented as: 0-N|1-C|2-SN|3-O (brand | industry | branch | institution)
"Zuodeng Coronis" is segmented as: 0-N (brand)
After the two shop names to be compared are filled into the template, the result is:
Table 6 (table image not reproduced in this text)
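Filling the ten-slot template can be sketched as follows. Because the template has two C slots and two O slots, some convention is needed to decide which one a word goes into; this sketch assumes, without confirmation from the text, that C and O words appearing after a branch name (SN) describe the branch and go into slots 7 and 8, while earlier ones go into slots 4 and 5. The function name and sample words are hypothetical.

```python
TEMPLATE = ["N", "S", "P", "PS", "C", "O", "SN", "C", "O", "FN"]


def fill_template(tokens):
    """Place (word, pos) tokens into the ten template slots."""
    slots = [""] * len(TEMPLATE)
    seen_branch = False
    for word, pos in tokens:
        if pos == "SN":
            seen_branch = True
        if pos in ("C", "O") and seen_branch:
            index = 7 if pos == "C" else 8   # assumed: describes the branch
        elif pos in TEMPLATE:
            index = TEMPLATE.index(pos)      # first slot with this tag
        else:
            continue                         # e.g. X words have no slot
        slots[index] += word
    return slots


full = fill_template([("brand", "N"), ("beauty", "C"),
                      ("flagship", "SN"), ("shop", "O")])
short = fill_template([("brand", "N")])
```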
S403: obtain the entity-name comparison score by comparing whether the words at the corresponding part-of-speech positions in the template are identical between the shop name to be screened and the entity shop name in the database.
Comparing whether the words at the corresponding part-of-speech positions in the template are identical yields the entity-name comparison score. For example:
The shop name to be screened and the entity shop name that have been filled into the template are compared position by position; 1 means identical and 0 means different. The result for each position is stored in a binary array, and the binary array is converted to a decimal number that is returned as part of the comparison result.
During the pairwise comparison, the shops are also marked with whether their branch names, main industries and main institution words are in an equal state. "Equal state" here means, for example, whether the two branch names are both empty or both contain data.
The shop-name comparison score is determined by the comparison results for the individual parts of speech of the shop name. The comparison rules are as follows:
(comparison rule table; image not reproduced in this text)
As for how the entity-name comparison score is obtained, only some illustrative examples are given here:
I: if the main bodies of the two shop names are unequal, 0 is returned directly;
II: if the comparison result is 511 (511 is obtained by binary conversion; for the conversion, see the discussion below) and the comparison results on the 4 positions are all 1, the shop names are judged to be completely equal and 100 is returned;
Otherwise:
the comparison result is ANDed with 16 and the result is not equal to 16: return 75; or,
the comparison result is ANDed with 32 or with 64, and if one or more of the results is 0: return 75; or,
in the comparison result, the state of the branch name (SN) is returned as 0: return 75; or,
the comparison result is ANDed with 1, and if the result is 0: return 2; or,
the comparison result is ANDed with 8, and if the result is 0: return 2; or,
if the comparison result is greater than 0: return 75; or,
in all other cases: return 0.
Comparing the words of each part of speech in Table 6 according to the above rules gives Table 7 below, where y means identical:
N   S   P   PS  C   O   SN  C   O   FN
y   y   y   y   y   y   y   y   y   y
Table 7
The comparison result for each part of speech is as shown above.
If the two words of a given part of speech are identical, the comparison result is y and its weight is 1; the leading brand word N is excluded from the score. The remaining 9-bit comparison result is 111111111, which gives $1\times2^8 + 1\times2^7 + 1\times2^6 + 1\times2^5 + 1\times2^4 + 1\times2^3 + 1\times2^2 + 1\times2^1 + 1\times2^0 = 511$. The final value is 511, which indicates that the shop names are completely equal, so 100 is returned.
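The conversion from per-slot comparison results to a decimal value can be sketched as below. The sketch assumes that slot 0 (the brand word N) is compared separately and that the remaining nine slots are read from left to right as bits from most significant to least significant; it also treats two empty slots as equal. The comparison rule table of the source is not available here and may treat missing values differently, and the function name is hypothetical.

```python
def compare_templates(slots_a, slots_b):
    """Compare two filled ten-slot templates.

    Returns (brand_equal, value): whether slot 0 (brand) matches, and
    the nine remaining slot comparisons packed into an integer with
    the leftmost slot as the most significant bit.
    """
    brand_equal = slots_a[0] == slots_b[0]
    value = 0
    for a, b in zip(slots_a[1:], slots_b[1:]):
        value = (value << 1) | (1 if a == b else 0)
    return brand_equal, value


same = ["brand", "", "", "", "beauty", "shop", "", "", "", ""]
other = ["brand", "", "", "", "beauty", "salon", "", "", "", ""]

all_equal = compare_templates(same, same)        # nine 1-bits
one_differs = compare_templates(same, other)     # slot 5 (O) differs
```

When all nine slots agree the value is 511; when only slot 5 (the main institution word) differs, the bit of weight 16 is cleared and the value is 495.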
As another example, the shops to be compared are as follows:
Shop name                     Shop address                                                         Telephone number
Crystal Moon Design Center    No. 217, Wenyi Road (diagonally opposite Hangzhou Normal College)    88828282
Crystal Moon Hair Art Salon   No. 217, Wenyi Road                                                  88828282
Table 8
To compare the two shops in the table above, their shop names are first segmented:
"Crystal Moon Design Center" is segmented as: 1-N|2-O (brand | institution)
"Crystal Moon Hair Art Salon" is segmented as: 1-N|2-O (brand | institution)
According to the shop-name comparison rules, the parts of speech after segmentation are compared as in the following table:
N   S   P   PS  C   O   SN  C   O   FN
y   y   y   y   y   n   y   y   y   y
Table 9
According to the comparison result, the main institution words of the two shop names are unequal. The comparison result is 495, and ANDing it with $2^4$ (16) gives 0, so the shop names are unequal.
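The bit arithmetic for the value 495 can be checked directly. In binary 495 is 111101111, so the only cleared bit is the one with weight 16, which under a left-to-right reading of the nine compared slots corresponds to the main institution word O. The slot labels `C2` and `O2` below are hypothetical names for the second C and O slots, and the helper name is an assumption.

```python
# Labels for the nine compared slots, most significant bit first.
SLOT_NAMES = ["S", "P", "PS", "C", "O", "SN", "C2", "O2", "FN"]


def differing_slots(value):
    """List the slots whose comparison bit is 0 in a 9-bit value."""
    return [
        name for i, name in enumerate(SLOT_NAMES)
        if not value & (1 << (8 - i))
    ]


institution_bit = 495 & 16      # bit for the main institution word
industry_bit = 495 & 32         # bit for the main industry word
mismatches = differing_slots(495)
```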
However, for the example in Table 8, the shop address and telephone number make it clear that the two entities "Crystal Moon Design Center" and "Crystal Moon Hair Art Salon" are duplicates. For this situation, on top of the entity-name comparison score, the method may further comprise determining whether entity data is duplicated from the entity-name comparison result together with the telephone number, or from the entity-name comparison result together with the address. In the example of Table 8, although the name comparison score indicates no duplication, the telephone numbers are identical, so the result can be determined to be a duplicate. Likewise, when the name comparison score indicates no duplication but the addresses are identical, the result can be determined to be a duplicate.
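The fallback to telephone number or address might be combined with the name score as in the sketch below. The text does not spell out the exact combination, so the decision logic here is an assumption: a score at or above the standard score counts as a duplicate, and otherwise an identical phone number or address is enough. The function and parameter names are hypothetical.

```python
def is_duplicate(name_score, standard_score,
                 same_phone=False, same_address=False):
    """Decide duplication from the name score plus phone/address."""
    if name_score >= standard_score:
        return True
    # Name comparison alone says "not duplicated": fall back to the
    # other attributes, as in the Table 8 example.
    return same_phone or same_address


# Table 8 example: names score as unequal, phone numbers are identical.
crystal_moon = is_duplicate(name_score=0, standard_score=75, same_phone=True)
```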
S401 may further comprise:
S401': pattern matching is performed on the segmented and tagged shop name to be screened and on the entity name in the database; the pattern rule that matches the name's part-of-speech sequence is looked up in the pattern rule file, and the words in the name are tagged with parts of speech accordingly.
After S401, some words in a shop name may still be unrecognized, and a word tagged N may be either the shop's main name or its branch name; branch names and main names play different roles in shop-name comparison. The segmentation result therefore needs to be matched to different patterns. For example, "Xi'an_S Huade_X Catering Co., Ltd._OD Dekeshi_N Gaoxin_S Restaurant_OD" has the part-of-speech sequence S-X-O-N-S-O. In most results with the sequence S-X-O-N-S-O, the X in second position can be regarded as a brand and the N in fourth position as a branch name SN, so the sequence S-X-O-N-S-O can be matched to the rule 0-S|1-N|2-O|3-SN|4-S|5-O. Statistics show that nearly 400 such rules can be enumerated, covering more than 90% of the total.
Some common rules are listed below:
(rule listing; images not reproduced in this text)
Here, X denotes a word whose part of speech has not been determined.
The above is an excerpt from the rule file. Each rule is divided by an equals sign into a left part and a right part. The left part is a segmentation result (the sequence of part-of-speech tags after the shop name has been segmented and tagged, in the order in which the words appear in the shop name); the right part is the pattern to which the left-hand segmentation result is matched according to probability statistics.
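Looking up and applying such a rule can be sketched as follows. The line syntax used here, the tag sequence joined by hyphens on the left of the equals sign and `index-tag` items joined by `|` on the right, is assumed from the S-X-O-N-S-O example; the actual file format is not shown in the text, and the function names are hypothetical.

```python
def parse_rule(line):
    """Parse 'S-X-O=0-S|1-N|2-O' into (('S','X','O'), ['S','N','O'])."""
    left, right = line.strip().split("=", 1)
    pattern = tuple(left.split("-"))
    labels = [item.split("-", 1)[1] for item in right.split("|")]
    return pattern, labels


def apply_rules(tokens, rules):
    """Retag (word, pos) tokens when their tag sequence matches a rule."""
    pattern = tuple(pos for _, pos in tokens)
    labels = rules.get(pattern)
    if labels is None:
        return tokens  # no matching rule: keep the original tags
    return [(word, label) for (word, _), label in zip(tokens, labels)]


rules = dict([parse_rule("S-X-O-N-S-O=0-S|1-N|2-O|3-SN|4-S|5-O")])
retagged = apply_rules(
    [("Xi'an", "S"), ("Huade", "X"), ("Catering Co., Ltd.", "O"),
     ("Dekeshi", "N"), ("Gaoxin", "S"), ("Restaurant", "O")],
    rules,
)
```

After the rule is applied, the unrecognized second word is tagged as the brand (N) and the fourth word as the branch name (SN).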
The rule file is generated as follows:
B1: The word segmentation module segments and part-of-speech tags the shop names of all existing Yellow Pages shops.
B2: The segmentation results are grouped by part-of-speech sequence.
For example, all segmentation results whose tag sequence is "X-O-O" are placed in one group, and the number of results in each group is counted.
B3: Low-frequency groups whose count does not exceed a threshold are discarded and receive no further processing, because these shop names share no common pattern and therefore have no reference value.
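Steps B2 and B3 amount to counting tag sequences and dropping the rare ones. A minimal sketch follows; the threshold value is not given in the text, so `min_count` is a caller-supplied assumption, as is the function name.

```python
from collections import Counter


def frequent_tag_patterns(segmented_names, min_count):
    """Group segmented names by tag sequence and keep only the frequent groups.

    segmented_names: iterable of lists of (word, tag) pairs, one list per shop name.
    min_count: groups whose size does not exceed this value are discarded (B3).
    """
    counts = Counter(
        "-".join(tag for _, tag in name) for name in segmented_names
    )
    return {pattern: n for pattern, n in counts.items() if n > min_count}
```

The surviving patterns are the candidates that would then be reviewed and turned into the left-hand sides of rules.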
Manual review may also be introduced, relying on human experience to decide which part of speech the "X" tokens in a segmentation result should be assigned. It is also possible that tokens previously tagged "C", "O", or "P" should, on review, be given other parts of speech.
The purpose of rule matching is to catch judgments that the machine may have gotten wrong: the rule file is built in advance with manual screening and verification, so that machine misjudgments are corrected effectively during automated processing, which improves the correctness of the final tagging.
During this step, whenever a rule is obtained, the matching pattern rule is looked up in the pattern rules file; once it is found, parts of speech can be assigned to the tokens in the shop name.
The method embodiment may further comprise:
S204: The server adds entity data judged not to be duplicated to the database.
In S202, taking a shop as the entity, the existing Yellow Pages shop data can be stored in a PostgreSQL database, because PostgreSQL supports full-text indexing and distributed clustering. Fig. 3 shows the topology of the PostgreSQL database cluster. Before shop deduplication is carried out, all existing Yellow Pages shop data must first be stored on four data nodes. The data node on which a given shop record is stored is determined by a predefined hash rule, as follows:
The shop id is a 32-character GUID, and the first character of this primary key is taken. If it is 0~3, the record is stored on the newkoubei node; if it is 4~7, on the datanode1 node; if it is 8~b, on the datanode2 node; if it is c~f, on the newkoubei node.
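The routing rule above can be sketched as a lookup on the first hexadecimal character of the shop id. The node names are copied from the text as written, including newkoubei appearing for both the first and the last range; the function and table names are hypothetical.

```python
# Ranges and node names exactly as listed in the text.
_RANGES = (
    ("0123", "newkoubei"),
    ("4567", "datanode1"),
    ("89ab", "datanode2"),
    ("cdef", "newkoubei"),
)
NODE_BY_FIRST_CHAR = {ch: node for chars, node in _RANGES for ch in chars}


def node_for_shop(shop_id):
    """Return the data node that stores the shop with the given GUID."""
    if not shop_id:
        raise ValueError("empty shop id")
    first = shop_id[0].lower()
    if first not in NODE_BY_FIRST_CHAR:
        raise ValueError(f"shop id does not start with a hex digit: {shop_id!r}")
    return NODE_BY_FIRST_CHAR[first]
```

Because GUID characters are roughly uniformly distributed, splitting the sixteen hex digits into four equal ranges spreads the records evenly across the ranges.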
In S202, the predetermined amount of entity data in the database may be all entity data in the database. In that case, S202 compares the entity data to be screened with every entity record in the database one by one.
Alternatively, the predetermined amount of entity data may be a subset of the entity data in the database. In the present embodiment, for example, this subset is the suspect entity data screened from all entity data in the database. In that case, S202 compares the entity data to be screened one by one with the suspect entity data that was screened out.
Taking a shop as the entity, the subset is specifically the suspect shops. Suspect shops are found by a full-text match using the full-text index mechanism of the PostgreSQL database described above, which returns all matching shop records.
During system construction, all Yellow Pages shop data can be imported into the distributed database shown in Fig. 3; according to the data distribution hash algorithm, the shop data is spread evenly across the four data nodes of Fig. 3.
Each data node can hold a table of Yellow Pages shops, whose structure can be as follows:
Figure GDA00002711190400151
Table 1
The indexcontent field in the table above is of type tsvector. During system initialization, the shop name and address can be segmented, and the resulting space-delimited string is used as the initial value of the indexcontent field.
The procedure for finding suspect shops consists of the following steps:
Join the shop name and branch name of the Yellow Pages shop to be looked up with spaces, generating a space-delimited token string;
Match the token string obtained above against the indexcontent field (an inverted-index type field) in Table 1;
Merge the query results from each data node (four data nodes in this example) to generate the suspect-shop result set.
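The fan-out and merge in the last step can be sketched as below. Each data node is represented by a plain callable that takes the token string and returns rows; in a real system this would wrap a database connection. The row key `"id"` and the function name are assumptions made for illustration.

```python
def find_suspect_shops(node_queries, token_string):
    """Query every data node and merge the rows into one suspect-shop set.

    node_queries: iterable of callables, one per data node; each takes the
        token string and returns an iterable of row dicts.
    Rows are keyed by their "id" field so a shop is reported only once.
    """
    merged = {}
    for run_query in node_queries:
        for row in run_query(token_string):
            merged.setdefault(row["id"], row)
    return list(merged.values())
```

Passing callables keeps the merge logic independent of the database driver, which also makes it easy to exercise with stand-in functions.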
The suspect-shop result set is obtained on each data node by matching the token string against the indexcontent field in Table 1. In detail, the space-separated query string generated in the previous step is first combined with the id of the city in which the shop is located (city) to generate an SQL statement. For example, for a shop named "Crystal Moon Design Center", segmenting the shop name gives "Crystal Moon_N Hair Styling_X Design Center_OB"; the part-of-speech tags are then removed and the tokens are reassembled into a space-delimited string, from which the SQL statement is assembled:
Figure GDA00002711190400171
Each data node is queried with this SQL statement; the resulting set contains the suspected duplicates needed for the next step of shop deduplication.
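Since the original statement survives only as a figure, the following is a hedged sketch of what assembling such a query might look like. The table name `yellowpage_shop` is hypothetical, and the use of `plainto_tsquery` with the `simple` configuration (which requires all tokens to match) is an assumption; only the `indexcontent` and `city` fields come from the text. The function returns a parameterized statement in the `%s` style accepted by common PostgreSQL drivers rather than interpolating values into the SQL.

```python
def build_suspect_query(tokens, city_id, table="yellowpage_shop"):
    """Build a parameterized full-text query for suspect shops in one city.

    tokens: words of the shop name with part-of-speech tags already removed.
    Returns (sql, params) for use with a DB-API cursor's execute().
    """
    if not table.isidentifier():
        raise ValueError("table must be a plain identifier")
    token_string = " ".join(tokens)  # space-delimited query string
    sql = (
        f"SELECT * FROM {table} "
        "WHERE city = %s AND indexcontent @@ plainto_tsquery('simple', %s)"
    )
    return sql, (city_id, token_string)
```

Keeping the city id and the token string as parameters avoids quoting problems when shop names contain apostrophes or other special characters.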
As can be seen from the technical solution provided by the above embodiments of the present application, the server obtains entity data to be screened, compares the name of the entity data to be screened one by one with the names of a predetermined amount of entity data in the database to obtain a score, and judges from the comparison score and a preset standard score whether the entity data to be screened duplicates the entity data in the database. Duplicated entity data can thus be screened out efficiently.
An embodiment of a server of the present invention is introduced below. Fig. 6 shows a block diagram of this server embodiment, which comprises:
An acquiring unit 61, for obtaining entity data to be screened;
A name comparing unit 62, for comparing the name of the entity data to be screened one by one with the names of a predetermined amount of entity data in the database in the following manner:
Segmenting the entity name to be screened and the entity names in the database and determining parts of speech, using preset word segmentation dictionaries for different parts of speech; filling the segmented and tagged name to be screened and the entity names in the database into a predetermined template; and comparing whether the words occupying the same part-of-speech slot of the template are identical in the name to be screened and in the database entity name, thereby obtaining the name comparison score. The segmentation and part-of-speech tagging of the entity names in the database can be completed in advance;
A judging unit 63, for judging from the comparison score and a preset standard score whether the entity data to be screened duplicates the compared entity data in the database;
An adding unit 64, for adding entity data judged not to be duplicated to the database.
Preferably, the predetermined amount of entity data in the database comprises all entity data or a subset of the entity data in the database.
Preferably, the subset comprises data of suspect entities screened from all entity data in the database.
Preferably, as shown in Fig. 7, the server may further comprise a screening unit 65, for joining the shop name and branch name of the shop to be looked up with spaces to generate a space-delimited token string;
matching the token string thus obtained against the indexcontent field (an inverted-index type field) in Table 1;
and merging the query results from the four data nodes to generate the suspect-shop result set.
Preferably, as shown in Fig. 8, the name comparing unit 62 of the server comprises:
A word segmentation unit 621, which segments the entity name to be screened and the entity names in the database and determines parts of speech, using preset word segmentation dictionaries for different parts of speech;
A template unit 622, which fills the segmented and tagged shop name to be screened and the entity names in the database into a predetermined template;
A scoring unit 623, which obtains the name comparison score by comparing whether the words occupying the same part-of-speech slot of the template are identical in the shop name to be screened and in the database entity name.
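The template and scoring units can be illustrated with the sketch below. The template is modeled as a dictionary keyed by part-of-speech tag, and the score as a weighted share of slots whose words agree. The weights, the normalization to the range 0 to 1, and both function names are assumptions of this sketch; the section does not specify how the per-slot results are combined.

```python
def fill_template(tagged_tokens):
    """Place each word in the template slot named after its part-of-speech tag."""
    template = {}
    for word, tag in tagged_tokens:
        template.setdefault(tag, []).append(word)
    return template


def name_score(tagged_a, tagged_b, weights):
    """Score two tagged names by comparing the words in matching template slots.

    weights: hypothetical importance of each tag, e.g. {"N": 3, "O": 1}.
    A slot earns its weight only when both names hold the same words in it.
    Slots that are empty in both names are ignored.
    """
    slots_a, slots_b = fill_template(tagged_a), fill_template(tagged_b)
    total = earned = 0.0
    for tag, weight in weights.items():
        if tag not in slots_a and tag not in slots_b:
            continue
        total += weight
        if slots_a.get(tag) == slots_b.get(tag):
            earned += weight
    return earned / total if total else 0.0
```

For instance, two names that share the brand word but differ in the organization word earn only the brand slot's weight, so the outcome depends on how heavily the brand slot is weighted relative to the others.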
Preferably, as shown in Fig. 9, the word segmentation unit 621 comprises a forward segmentation unit 6211 and a reverse segmentation unit 6212, which respectively perform forward segmentation and reverse segmentation on the entity name to be screened and the entity names in the database, using the preset word segmentation dictionaries for different parts of speech;
Correspondingly, the word segmentation unit 621 further comprises a disambiguation unit 6213, for performing disambiguation on the words obtained by forward and reverse segmentation to obtain a unique segmentation result.
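One common way to realize forward and reverse segmentation is dictionary-based maximum matching, sketched below. This is an illustration of the general technique, not the specific procedure of units 6211 to 6213: the tie-breaking heuristics in `disambiguate` (prefer fewer tokens, then fewer single-character tokens, then the reverse result) are assumptions, and part-of-speech tagging is omitted for brevity.

```python
def forward_max_match(text, dictionary, max_len):
    """Segment left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:  # single chars always accepted
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len):
    """Segment right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.insert(0, piece)
                j -= size
                break
    return tokens


def disambiguate(forward, backward):
    """Pick one segmentation using simple, assumed heuristics."""
    if forward == backward:
        return forward
    if len(forward) != len(backward):
        return forward if len(forward) < len(backward) else backward
    singles_f = sum(len(t) == 1 for t in forward)
    singles_b = sum(len(t) == 1 for t in backward)
    return backward if singles_b <= singles_f else forward
```

The two passes disagree exactly where the text is ambiguous, which is why running both and reconciling them gives a more reliable result than either pass alone.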
Preferably, as shown in Fig. 10, the server may further comprise a pattern matching unit 624, for performing pattern matching on the segmented and tagged shop name to be screened and on the entity names in the database, looking up the pattern rule matching each name in a preset pattern rules file, and thereby assigning parts of speech to the tokens in the name.
For convenience of description, the above device is described in terms of separate functional units. When implementing the present invention, of course, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product can be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention or in parts of those embodiments.
The embodiments in this specification are described progressively; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, refer to the description of the method embodiment.
The present invention can be used in many general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The present invention can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
Although the present invention has been described by way of embodiments, those of ordinary skill in the art will appreciate that the invention admits many variations and modifications without departing from its spirit, and it is intended that the appended claims cover such variations and modifications.

Claims (14)

1. A method for screening duplicated entity data, characterized by comprising:
S1: a server obtains entity data to be screened;
S2: the server compares the name of the entity data to be screened one by one with the names of entity data in a database in the following manner:
segmenting the entity name to be screened and the entity names in the database and determining parts of speech, using preset word segmentation dictionaries for different parts of speech;
filling the segmented and tagged shop name to be screened and the entity names in the database into a predetermined template;
comparing whether the words occupying the same part-of-speech slot of the template are identical in the shop name to be screened and in the database entity name, and obtaining the name comparison score from the comparison result for each part of speech of the shop name;
S3: the server judges from the comparison score and a preset standard score whether the entity data to be screened duplicates the compared entity data in the database;
S4: the server adds entity data to be screened that is judged not to be duplicated to the database.
2. The method of claim 1, characterized in that the predetermined amount of entity data in the database comprises all entity data or a subset of the entity data in the database.
3. The method of claim 2, characterized in that the subset comprises suspect entity data screened from all entity data in the database.
4. The method of claim 3, characterized in that the screened suspect entity data is obtained in the following manner:
joining the shop name and branch name of the Yellow Pages shop to be looked up with spaces, generating a space-delimited token string;
matching the token string thus obtained against an inverted-index type field;
merging the query results from each data node to generate the suspect-shop result set.
5. The method of claim 1, characterized in that the server segmenting the entity name to be screened and the entity names in the database and determining parts of speech, using preset word segmentation dictionaries for different parts of speech, comprises:
the server performing forward segmentation and reverse segmentation on the entity name to be screened and the entity names in the database, using the preset word segmentation dictionaries for different parts of speech;
the server performing disambiguation on the words obtained by forward and reverse segmentation to obtain a unique segmentation result.
6. The method of claim 1, characterized in that, after the server segments the entity name to be screened and the entity names in the database and determines parts of speech using the preset word segmentation dictionaries for different parts of speech, the method further comprises:
the server performing pattern matching on the segmented and tagged shop name to be screened and on the entity names in the database, looking up the pattern rule matching each name in a preset pattern rules file, and thereby assigning parts of speech to the tokens in the entity name to be screened and in the entity names in the database.
7. A device for screening duplicated entity data, characterized by comprising:
an acquiring unit, for obtaining entity data to be screened;
a name comparing unit, for comparing the name of the entity data to be screened one by one with the names of a predetermined amount of entity data in a database in the following manner:
segmenting the entity name to be screened and the entity names in the database and determining parts of speech, using preset word segmentation dictionaries for different parts of speech; filling the segmented and tagged shop name to be screened and the entity names in the database into a predetermined template; comparing whether the words occupying the same part-of-speech slot of the template are identical in the shop name to be screened and in the database entity name, and obtaining the name comparison score from the comparison result for each part of speech of the shop name;
a judging unit, for judging from the comparison score and a preset standard score whether the entity data to be screened duplicates the compared entity data in the database;
an adding unit, for adding entity data judged not to be duplicated to the database.
8. The device of claim 7, characterized in that the predetermined amount of entity data in the database comprises all entity data or a subset of the entity data in the database.
9. The device of claim 8, characterized in that the subset comprises data of suspect entities screened from all entity data in the database.
10. The device of claim 8, characterized by further comprising a screening unit, for joining the shop name and branch name of the Yellow Pages shop to be looked up with spaces, generating a space-delimited token string;
matching the token string thus obtained against an inverted-index type field;
merging the query results from each data node to generate the suspect-shop result set.
11. The device of claim 7, characterized by further comprising an adding unit, for adding entity data judged not to be duplicated to the database.
12. The device of claim 7, characterized in that the name comparing unit comprises:
a word segmentation unit, which segments the entity name to be screened and the entity names in the database and determines parts of speech, using preset word segmentation dictionaries for different parts of speech;
a template unit, which fills the segmented and tagged shop name to be screened and the entity names in the database into a predetermined template;
a scoring unit, which obtains the name comparison score by comparing whether the words occupying the same part-of-speech slot of the template are identical in the shop name to be screened and in the database entity name.
13. The device of claim 12, characterized in that the word segmentation unit comprises a forward segmentation unit and a reverse segmentation unit, which respectively perform forward segmentation and reverse segmentation on the entity name to be screened and the entity names in the database, using the preset word segmentation dictionaries for different parts of speech;
correspondingly, the word segmentation unit further comprises a disambiguation unit, for performing disambiguation on the words obtained by forward and reverse segmentation to obtain a unique segmentation result.
14. The device of claim 12, characterized in that the name comparing unit further comprises a pattern matching unit, for performing pattern matching on the segmented and tagged shop name to be screened and on the entity names in the database, looking up the pattern rule matching each name in a preset pattern rules file, and thereby assigning parts of speech to the tokens in the entity name to be screened and in the entity names in the database.
CN2009101705511A 2009-09-10 2009-09-10 Method and device for screening duplicated entity data Active CN102023984B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN2009101705511A CN102023984B (en) 2009-09-10 2009-09-10 Method and device for screening duplicated entity data
HK11105866.7A HK1152126A1 (en) 2009-09-10 2011-06-10 Method and apparatus of discriminating reduplicate entity data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009101705511A CN102023984B (en) 2009-09-10 2009-09-10 Method and device for screening duplicated entity data

Publications (2)

Publication Number Publication Date
CN102023984A CN102023984A (en) 2011-04-20
CN102023984B true CN102023984B (en) 2013-12-04

Family

ID=43865292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009101705511A Active CN102023984B (en) 2009-09-10 2009-09-10 Method and device for screening duplicated entity data

Country Status (2)

Country Link
CN (1) CN102023984B (en)
HK (1) HK1152126A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101499056A (en) * 2008-01-28 2009-08-05 徐文新 Backward reference sentence pattern language analysis method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘哲等.一种中文地址类相似重复信息的检测方法.《小型微型计算机系统》.2008,第29卷(第4期),第726-729页. *

Also Published As

Publication number Publication date
HK1152126A1 (en) 2012-02-17
CN102023984A (en) 2011-04-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1152126

Country of ref document: HK

C14 Grant of patent or utility model
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: GR

Ref document number: 1152126

Country of ref document: HK

TR01 Transfer of patent right

Effective date of registration: 20211110

Address after: Room 554, floor 5, building 3, No. 969, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee after: Taobao (China) Software Co., Ltd

Address before: P.O. Box 847, 4th floor, capital building, Grand Cayman, British Cayman Islands

Patentee before: Alibaba Group Holdings Limited
