CN102768659B - Method and system for identifying repeated account - Google Patents


Info

Publication number
CN102768659B
Authority
CN
China
Prior art keywords
account
feature
similarity
information
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110113252.1A
Other languages
Chinese (zh)
Other versions
CN102768659A (en)
Inventor
冯景华
陈超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201110113252.1A
Publication of CN102768659A
Priority to HK12113367.4A (HK1172706A1)
Application granted
Publication of CN102768659B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for identifying repeated accounts. The method includes: acquiring feature information of a first account and a second account saved by a website server; calculating the similarity between the features in the feature information of the first account and the features in the feature information of the second account; using the obtained similarities as input parameters of a preset identifying model, calculating the similarity between the feature information of the first account and the feature information of the second account according to the preset identifying model, and judging, according to the obtained similarity, whether the first account and the second account are repeated accounts. The method and the system solve the problem that repeated accounts cannot be identified in the prior art, identify repeated accounts accurately, and improve operating speed.

Description

Method and system for automatically identifying repeated accounts
Technical field
The present application relates to the field of Internet information, and in particular to a method and system for automatically identifying repeated accounts.
Background
In current Internet use, duplicate information is one of the problems that most degrades the user's search experience and increases the search load on search engine servers. On an e-commerce website, repeated accounts cause buyers to duplicate effort when contacting sellers, and can also prevent some good sellers' information from being exposed. In addition, a large number of repeated accounts increases the load on the search engine when users query information and slows the search down.
In the prior art, repeated accounts are generally identified by the following steps:
S1: the server obtains an account to be identified;
S2: the server compares the name of the account to be identified, one by one, with the names of a predetermined number of accounts in a database, as follows:
Using preset word segmentation dictionaries for different parts of speech, the name of the account to be identified and the account names in the database are segmented into words and the part of speech of each word is determined;
The business name corresponding to the account to be identified and the physical store names in the database, segmented and tagged in this way, are each filled into a predetermined template;
A name comparison score is obtained by checking whether the words of the business name of the account to be identified are identical to the words with the corresponding part of speech in the template for each physical store name in the database;
S3: the server compares the score against a predetermined standard to judge whether the account to be identified repeats an account in the database;
S4: the server adds an account to be identified that is judged not to be a repeat to the database.
The above method identifies repeated accounts by judging whether account names are identical. However, as those skilled in the art will appreciate, a seller account in e-commerce generally comprises multiple items of feature information, such as the account name, the name of the company to which the account belongs, the company introduction, contact details, and access behavior. An identical account name alone cannot accurately establish that an account is a repeat. For example, account A is named Apple and its company mainly sells apples such as red Fuji apples, while account B is also named Apple and its company mainly sells electronic products such as the iPhone and iPad. The feature information of accounts A and B is clearly different, yet a comparison of account names alone would treat A and B as repeated accounts and misidentify them. Because repeated accounts are identified inaccurately, many of them persist and the search load on the search engine server is not relieved. A scheme is therefore urgently needed that improves the accuracy of account identification, thereby reducing the search load on the search engine server and speeding up search.
Summary of the invention
The present application aims to provide a method and system for automatically identifying repeated accounts, in order to solve the problem that the prior art cannot correctly identify repeated accounts and therefore increases the search load on search engine servers.
According to one aspect of the application, a method for automatically identifying repeated accounts is provided, comprising: obtaining feature information of a first account and a second account saved by a server of a website; calculating the similarity between each feature parameter of a feature in the feature information of the first account and the corresponding feature parameter of the corresponding feature in the feature information of the second account; fitting the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account; and judging, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are repeated accounts.
According to another aspect of the application, a system for automatically identifying repeated accounts is provided, comprising: an acquiring unit, configured to obtain feature information of a first account and a second account saved by a server of a website, where the feature information comprises one or a combination of the following features: the basic information feature of an account, the product information feature of the products published by the account, and the behavior information feature of the account; a computing unit, configured to calculate the similarity between each feature parameter of a feature in the feature information of the first account and the corresponding feature parameter of the corresponding feature in the feature information of the second account, and to fit the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account; and a judging unit, configured to judge, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are repeated accounts.
The application has the following beneficial effects:
1) The application judges whether two accounts are repeats by fitting the similarities of multiple features of the two accounts. This effectively avoids supplying wrong duplicate information to users because of inaccurate judgment, achieves accurate identification of repeated accounts, further reduces the processing load on the search engine server when it handles user query requests, and increases search speed;
2) The feature information in the application comprises multiple features, for example the basic information feature of an account, the product information feature of the products published by the account, and the behavior information feature of the account. With this feature information, similarity can be calculated along multiple dimensions, which avoids relying on a single dimension when calculating repeated accounts and improves the accuracy of repeated account identification;
3) By training a recognition model, the application reduces the number of calculation loops, which increases the operating speed of the system when identifying repeated accounts and saves computing time.
Brief description of the drawings
The drawings described here are provided for a further understanding of the application and form a part of it. The illustrative embodiments of the application and their description serve to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a preferred structural schematic diagram of the system for automatically identifying repeated accounts according to an embodiment of the application;
Fig. 2 is another preferred structural schematic diagram of the system for automatically identifying repeated accounts according to an embodiment of the application;
Fig. 3 is a preferred flowchart of the method for automatically identifying repeated accounts according to an embodiment of the application;
Fig. 4 is another preferred flowchart of the method for automatically identifying repeated accounts according to an embodiment of the application.
Detailed description of the embodiments
The application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that, provided they do not conflict, the embodiments in the application and the features in the embodiments may be combined with one another.
Before describing further details of the embodiments, a suitable computing architecture that can be used to implement the principles of the application is described with reference to Fig. 1. In the following description, unless otherwise indicated, the embodiments are described with reference to symbolic representations of actions and operations performed by one or more computers. It will be appreciated that such actions and operations, sometimes referred to as computer-executed, include the manipulation by the computer's processing unit of electrical signals that represent data in a structured form. This manipulation transforms the data or maintains it at locations in the computer's memory system, which reconfigures or otherwise alters the operation of the computer in a manner understood by those skilled in the art. The data structures in which data is maintained are physical locations of the memory that have particular properties defined by the format of the data. Although the application is described in this context, the description is not meant to be limiting; as those skilled in the art will understand, aspects of the actions and operations described below may also be implemented in hardware.
Turning to the drawings, in which like reference numerals refer to like elements, the principles of the application are shown implemented in a suitable computing environment. The following description is based on the described embodiments and should not be taken as limiting the application with regard to alternative embodiments not explicitly described herein.
Fig. 1 is a schematic diagram of an example computer architecture that can be used for these devices. The architecture depicted is only one example of a suitable environment, given for purposes of illustration, and implies no limitation on the scope of use or functionality of the application. Nor should the computing system be interpreted as having any dependency on or requirement for any one or combination of the components shown in Fig. 1.
The principles of the application can operate with other general-purpose or special-purpose computing or communication environments or configurations. Examples of well-known computing systems, environments, and configurations suitable for the application include, but are not limited to, personal computers, servers, multiprocessor systems, microprocessor-based systems, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
In its most basic configuration, the system 100 for automatically identifying repeated accounts in Fig. 1 generally includes at least one processing unit 102 and a memory 104. The processing unit 102 may be, but is not limited to, a microprocessor (MCU), a programmable logic device (FPGA), and the like. The memory 104 may be volatile (such as RAM), non-volatile (such as ROM or flash memory), or some combination of the two. In this specification and the claims, a "system for automatically identifying repeated accounts" is defined as any hardware component, or combination of hardware components, that executes software, firmware, or microcode to implement a function. The system 100 may even be distributed, in order to implement distributed functions.
As used in this application, the terms "module", "component", and "unit" may refer to software objects or routines that execute on the system 100. The different components, modules, units, engines, and services described herein may be implemented as objects or processes that execute on the system 100 (for example, as separate threads). Although the systems and methods described herein are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
As used in this application, "word segmentation" and "part-of-speech tagging" are common methods of natural language processing. Word segmentation divides a sequence of Chinese text into meaningful words. Part-of-speech tagging assigns a suitable part of speech, such as verb or noun, to each word obtained by segmentation. Parts of speech commonly used in e-commerce include product words, model words, and brand words. In this application, word segmentation and part-of-speech tagging are performed by the system. The application is of course not limited to this: these operations may also be performed manually, or by a combination of manual work and the system.
The system 100 may also include a communication unit 106 that allows the host to communicate with other systems and devices over a network 108. The communication unit 106 may be a wired transmission device, such as a wired network communication interface and chip, or a wireless transmission device, such as an RF, infrared, or Bluetooth device.
Embodiment 1
Fig. 2 is another preferred structural schematic diagram of the system for automatically identifying repeated accounts according to an embodiment of the application. Preferably, each component shown in Fig. 2 may be, but is not limited to being, implemented by the processing unit 102 shown in Fig. 1. As shown in Fig. 2, the system comprises: an acquiring unit 202, configured to obtain feature information of a first account and a second account saved by a server of a website, where the feature information comprises one or a combination of the following features: the basic information feature of an account, the product information feature of the products published by the account, and the behavior information feature of the account; a computing unit 204, configured to calculate the similarity between each feature parameter of a feature in the feature information of the first account and the corresponding feature parameter of the corresponding feature in the feature information of the second account, and to fit the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account; and a judging unit 206, configured to judge, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are repeated accounts.
In a preferred embodiment of the application, whether two accounts are repeats is judged by fitting the similarities of multiple features of the two accounts. This effectively avoids supplying wrong duplicate information to users because of inaccurate judgment, achieves accurate identification of repeated accounts, further improves the user experience of web search, e-commerce, and similar services, reduces the processing load on the search engine server when it handles query requests, and increases query speed. In addition, the feature information in the application comprises multiple features, for example the basic information feature of an account, the product information feature of the products published by the account, and the behavior information feature of the account. With this feature information, similarity can be calculated along multiple dimensions, which avoids relying on a single dimension when calculating repeated accounts and improves the accuracy of repeated account identification.
Preferably, the computing unit 204 comprises a first acquisition module 2041, a second acquisition module 2042, a selection module 2043, and a first computing module 2044, connected in sequence. In a preferred embodiment of the application, these modules calculate the similarity between feature parameters using the cosine-angle method, described as follows:
When calculating the similarity between each feature parameter of a feature in the feature information of the first account and the corresponding feature parameter of the corresponding feature in the feature information of the second account, the first acquisition module 2041 obtains a first group of keywords $A_1, A_2, \ldots, A_M$ produced by word segmentation of a first feature parameter, and a first group of weights $W_{A1}, W_{A2}, \ldots, W_{AM}$ obtained by part-of-speech tagging the first group of keywords and assigning a weight to each keyword according to its part of speech, where the first feature parameter is one feature parameter of a feature in the feature information of the first account. The second acquisition module 2042 obtains a second group of keywords $B_1, B_2, \ldots, B_N$ produced by word segmentation of a second feature parameter, and a second group of weights $W_{B1}, W_{B2}, \ldots, W_{BN}$ obtained by part-of-speech tagging the second group of keywords and assigning a weight to each keyword according to its part of speech, where the second feature parameter is one feature parameter of a feature in the feature information of the second account.
After these parameters are obtained, the selection module 2043 selects the keywords $C_1, \ldots, C_H$ ($H \geq 1$) that are common to the first group of keywords and the second group of keywords, together with their corresponding weights $W_{C1}, \ldots, W_{CH}$. The first computing module 2044 then calculates the similarity $df$ between the first feature parameter and the second feature parameter by the following formula:
$$df = \frac{d1}{\sqrt{da \times db}}$$
where
$$d1 = W_{C1} \times W_{C1} + \cdots + W_{CH} \times W_{CH}$$
$$da = W_{A1} \times W_{A1} + \cdots + W_{AM} \times W_{AM}$$
$$db = W_{B1} \times W_{B1} + \cdots + W_{BN} \times W_{BN}$$
The cosine-angle method above weights keywords differently when calculating the similarity between feature parameters, rather than treating every keyword alike, and so yields an accurate similarity between two feature parameters. The cosine-angle method is of course only one example; the application is not limited to it, and similarity may also be calculated by other similar methods.
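The cosine-angle calculation can be illustrated with a minimal Python sketch. It assumes the formula has the usual cosine form, df = d1 / sqrt(da × db), and that a keyword shared by both groups carries the same part-of-speech weight on each side, as d1 implies. The function name `parameter_similarity` and the dictionary representation of the keyword groups are illustrative assumptions, not part of the application.

```python
import math


def parameter_similarity(weights_a, weights_b):
    """Cosine-angle similarity df between two feature parameters.

    weights_a and weights_b map each segmented keyword to the weight
    assigned from its part of speech (hypothetical representation).
    """
    # Keywords C_1..C_H that appear in both groups.
    common = weights_a.keys() & weights_b.keys()
    # d1: sum of squared weights of the common keywords. Using a*b
    # equals W_C * W_C when both sides assign the same weight.
    d1 = sum(weights_a[k] * weights_b[k] for k in common)
    da = sum(w * w for w in weights_a.values())
    db = sum(w * w for w in weights_b.values())
    if da == 0 or db == 0:
        return 0.0  # an empty keyword group has no similarity
    return d1 / math.sqrt(da * db)


# Two account names that share the brand word but differ elsewhere.
first = {"apple": 2.0, "fruit": 1.0}
second = {"apple": 2.0, "phone": 1.0}
similarity = parameter_similarity(first, second)
```

With these illustrative weights, d1 = 4 and da = db = 5, so the similarity is 4/5; identical keyword groups give 1 and groups with no common keyword give 0.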
As shown in Fig. 2, the computing unit 204 further comprises a second computing module 2045. When fitting the similarities between the feature parameters of a first feature of the first account and the feature parameters of the corresponding second feature of the second account, the second computing module 2045 may use linear fitting, that is, the fitting may be performed by the following formula:
$$d = c1 \times W_{c1} + c2 \times W_{c2} + \cdots + cq \times W_{cq}, \quad q \geq 1$$
where $d$ is the similarity between the first feature of the first account and the corresponding second feature of the second account;
$c1, c2, \ldots, cq$ are the similarities between the feature parameters of the first feature and the corresponding feature parameters of the second feature;
$W_{c1}, W_{c2}, \ldots, W_{cq}$ are the pre-assigned weights.
Linear fitting is of course only one approach, and the application is not limited to it.
For example, the basic information feature of the first account comprises the feature parameters company address (A1), company introduction (A2), and company telephone (A3), and the basic information feature of the second account comprises the feature parameters company address (B1), company introduction (B2), and company telephone (B3). To calculate the similarity between the basic information feature of the first account and that of the second account, the first computing module 2044 first calculates the similarity C1 between A1 and B1, the similarity C2 between A2 and B2, and the similarity C3 between A3 and B3; the second computing module 2045 then obtains the similarity between the two basic information features by linear fitting of C1, C2, and C3. In a concrete implementation, the cosine-angle method may be used to calculate the similarity between each parameter in the basic information feature of the first account and the corresponding parameter in that of the second account; for the detailed process, and for the specific fitting process, see the calculation process relating to Tables 1 to 4 in Embodiment 3.
In the above preferred embodiment, the similarity between a pair of features is obtained by fitting the similarities of their individual feature parameters, which ensures the accuracy of the similarity calculated for that pair of features.
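The linear fitting of parameter similarities into one feature similarity can be sketched as follows. The function name `feature_similarity` and the numeric similarities and weights are hypothetical values chosen for illustration; the application does not state concrete weights here.

```python
def feature_similarity(param_sims, weights):
    """Linear fit d = c1*W_c1 + c2*W_c2 + ... + cq*W_cq (q >= 1)."""
    if not param_sims or len(param_sims) != len(weights):
        raise ValueError("need one pre-assigned weight per parameter similarity")
    return sum(c * w for c, w in zip(param_sims, weights))


# Hypothetical similarities for company address, introduction, telephone
# (C1, C2, C3) and hypothetical pre-assigned weights.
sims = [0.9, 0.5, 1.0]
weights = [0.5, 0.3, 0.2]
basic_info_similarity = feature_similarity(sims, weights)
```

Here the fitted similarity is 0.9 × 0.5 + 0.5 × 0.3 + 1.0 × 0.2 = 0.8. Choosing weights that sum to 1 keeps the result in the same range as the inputs, which is a design convenience rather than a requirement stated in the text.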
Further, the judging unit 206 comprises a third computing module 2061 and a judging module 2062, connected in sequence. When judging, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are repeated accounts, the third computing module 2061 takes the similarities between each feature of the first account and the corresponding feature of the second account as the input parameters of a predetermined recognition model, and calculates the similarity between the feature information of the first account and the feature information of the second account by means of the predetermined recognition model; the judging module 2062 then judges, according to the similarity obtained, whether the first account and the second account are repeated accounts.
Preferably, the third computing module 2061 comprises a training submodule and a calculating submodule, connected in sequence. When the similarity between the feature information of the first account and the feature information of the second account is calculated by the predetermined recognition model, the training submodule first trains the predetermined recognition model with a predetermined number of training parameters, where each training parameter comprises, as input parameters, the similarities between the features of two accounts and, as the output parameter, a preset similarity between those two accounts. The calculating submodule then takes the similarity between each feature in the feature information of the first account and the corresponding feature in the feature information of the second account as input parameters, and obtains the similarity between the feature information of the first account and that of the second account from the trained recognition model. Training the recognition model reduces the number of calculation loops, which increases the operating speed of the system when identifying repeated accounts and saves computing time. For the specific training process in this preferred embodiment, see the calculation process relating to Tables 1 to 4 in Embodiment 3.
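The training and calculation steps can be sketched in Python as follows. The passage above does not say what kind of model the recognition model is, so this sketch assumes a simple logistic model trained by gradient descent purely for illustration; the names `train_model` and `predict`, the learning rate, the epoch count, and the training samples are all hypothetical.

```python
import math


def predict(model, feature_sims):
    """Overall account similarity in (0, 1) from per-feature similarities."""
    weights, bias = model
    z = bias + sum(w * s for w, s in zip(weights, feature_sims))
    return 1.0 / (1.0 + math.exp(-z))


def train_model(samples, epochs=500, lr=0.5):
    """Fit the assumed logistic model.

    samples: list of (feature_sims, preset_similarity) pairs, i.e. the
    input parameters and the preset output parameter of each training item.
    """
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for sims, target in samples:
            err = predict((weights, bias), sims) - target
            # Gradient step on the cross-entropy loss.
            for i in range(n):
                weights[i] -= lr * err * sims[i]
            bias -= lr * err
    return weights, bias


# Hypothetical training parameters: similarities of the basic, product,
# and behavior features, with a preset similarity for each account pair.
training = [
    ([0.90, 0.80, 1.0], 1.0),
    ([0.95, 0.90, 1.0], 1.0),
    ([0.20, 0.10, 0.0], 0.0),
    ([0.30, 0.20, 0.0], 0.0),
]
model = train_model(training)
score_similar = predict(model, [0.9, 0.9, 1.0])
score_different = predict(model, [0.1, 0.1, 0.0])
```

Once trained, the model is evaluated with a single weighted sum per account pair, which is one way to read the statement that training saves calculation loops.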
In addition, the judging module 2062 comprises a judging submodule, configured to judge whether the similarity between the feature information of the first account and the feature information of the second account is greater than a predetermined threshold and, when it is, to judge that the first account and the second account are repeated accounts. In a preferred embodiment of the application, this threshold test identifies repeated accounts effectively. The judging approach in the application is of course not limited to this.
Preferably, the acquiring unit 202 comprises at least one of the following: a first acquisition module 2021, configured to obtain the basic information of the first account and the second account; to perform word segmentation and part-of-speech tagging on the basic information of the first account, and to assign a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the first account, so as to obtain the basic information feature of the first account; and to perform word segmentation and part-of-speech tagging on the basic information of the second account, and to assign a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the second account, so as to obtain the basic information feature of the second account; a second acquisition module 2022, configured to obtain the product information of the first account and the second account; to perform word segmentation and part-of-speech tagging on the product information of the first account, to compute percentage statistics, according to the tagged part of speech, for each keyword obtained by segmenting the product information of the first account, and to take the statistics as the product information feature of the products published by the first account; and to perform word segmentation and part-of-speech tagging on the product information of the second account, to compute percentage statistics, according to the tagged part of speech, for each keyword obtained by segmenting the product information of the second account, and to take the statistics as the product information feature of the products published by the second account; or a third acquisition module 2023, configured to obtain the identification information (Cookie ID) used when the first account and the second account log in to the website, to take the obtained Cookie ID of the first account as the behavior information feature of the first account, and to take the obtained Cookie ID of the second account as the behavior information feature of the second account. In a preferred embodiment of the application, these steps yield useful feature information and make the similarity judgment more accurate.
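The two text-based features can be sketched as follows, assuming word segmentation and part-of-speech tagging have already produced a list of (keyword, part of speech) pairs. The weight table `POS_WEIGHTS`, the function names, and the reading of "percentage statistics" as each tagged keyword's share of all keywords are illustrative assumptions; the application does not fix them in this passage.

```python
from collections import Counter

# Hypothetical weights per part of speech (brand, product, model words).
POS_WEIGHTS = {"brand": 3.0, "product": 2.0, "model": 1.5}


def basic_info_feature(tagged_tokens, pos_weights=POS_WEIGHTS, default=1.0):
    """Map each keyword to the weight assigned from its part of speech."""
    return {word: pos_weights.get(pos, default) for word, pos in tagged_tokens}


def product_info_feature(tagged_tokens):
    """Share of each (keyword, part of speech) pair among all keywords."""
    counts = Counter(tagged_tokens)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


# Tokens as a hypothetical segmenter and tagger might return them.
tokens = [
    ("apple", "brand"),
    ("phone", "product"),
    ("apple", "brand"),
    ("case", "product"),
]
basic = basic_info_feature(tokens)
product = product_info_feature(tokens)
```

For the behavior information feature no text processing is needed in this reading: the Cookie ID recorded at login is used directly as the feature value.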
Preferably, the system further comprises a communication unit 208, configured to send indication information to the user after the first account and the second account are judged to be repeated accounts, where the indication information indicates that the first account and the second account are repeated accounts. In a preferred embodiment of the application, this notification lets the user manage accounts flexibly and improves the user experience.
Embodiment 2
Based on the system for automatically identifying repeated accounts shown in Fig. 1 and Fig. 2, the application also provides a method for automatically identifying repeated accounts. As shown in Fig. 3, the method in this embodiment comprises:
S302: obtaining feature information of a first account and a second account saved by a server of a website; preferably, step S302 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1 or the acquiring unit 202 in Fig. 2;
S304: calculating the similarity between each feature parameter of a feature in the feature information of the first account and the corresponding feature parameter of the corresponding feature in the feature information of the second account; preferably, step S304 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1 or the computing unit 204 in Fig. 2;
S306: fitting the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account; preferably, step S306 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1 or the computing unit 204 in Fig. 2;
S308: judging, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are repeated accounts; preferably, step S308 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1 or the judging unit 206 in Fig. 2.
In this preferred embodiment of the application, whether two accounts are repeated is judged by fitting the similarities of multiple features between the two accounts. This effectively avoids supplying erroneous duplicate information to the user because of inaccurate judgment, achieves accurate identification of repeated accounts, and further improves the user experience in services such as web search and e-commerce.
Preferably, the feature information comprises at least one of the following features: a basic information feature of the account, a product information feature of the products published by the account, or a behavior information feature of the account. Because the feature information in the application comprises multiple features, similarity can be calculated from several dimensions, which avoids relying on a single dimension when identifying repeated accounts and improves identification accuracy.
Preferably, the first acquisition module 2041, the second acquisition module 2042, the selection module 2043, and the first computing module 2044 in Fig. 2 use the cosine-angle method to calculate the similarity between feature parameters. That is, the similarity between a first feature parameter of a feature in the feature information of the first account and a second feature parameter of the corresponding feature in the feature information of the second account is calculated by the following steps:
S1, obtaining a first group of keywords $A_1, A_2, \ldots, A_M$ by segmenting the first feature parameter into words, and obtaining a first group of weights $W_{A1}, W_{A2}, \ldots, W_{AM}$ by tagging the part of speech of each keyword in the first group and assigning each keyword a weight according to its part of speech;
S2, obtaining a second group of keywords $B_1, B_2, \ldots, B_N$ by segmenting the second feature parameter into words, and obtaining a second group of weights $W_{B1}, W_{B2}, \ldots, W_{BN}$ by tagging the part of speech of each keyword in the second group and assigning each keyword a weight according to its part of speech;
S3, selecting the keywords $C_1, \ldots, C_H$ ($H \geq 1$) that are common to the first group and the second group, together with their corresponding weights $W_{C1}, \ldots, W_{CH}$;
S4, calculating the similarity $df$ between the first feature parameter and the second feature parameter by the following formula:
$$df = \frac{d1}{\sqrt{da} \times \sqrt{db}}$$
where
$$d1 = W_{C1} \times W_{C1} + \cdots + W_{CH} \times W_{CH}$$
$$da = W_{A1} \times W_{A1} + \cdots + W_{AM} \times W_{AM}$$
$$db = W_{B1} \times W_{B1} + \cdots + W_{BN} \times W_{BN}$$
The cosine-angle method weights keywords differently when calculating the similarity between feature parameters, rather than treating every keyword alike, and so yields a more accurate similarity between two feature parameters. Of course, the cosine-angle method is only an example; the application is not limited to it, and the similarity may also be calculated by other similar methods.
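Steps S1 to S4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: word segmentation and part-of-speech weighting are assumed to have already produced a keyword-to-weight mapping for each feature parameter, and the function name `weighted_cosine_similarity` and the sample keywords are hypothetical. Following the worked example in Embodiment 3, d1 is taken as the sum, over the shared keywords, of the product of the two weights of each keyword, and the denominator uses the square roots of da and db.

```python
import math


def weighted_cosine_similarity(weights_a, weights_b):
    """Cosine similarity between two keyword -> weight mappings."""
    # Keywords present in both feature parameters (step S3).
    common = weights_a.keys() & weights_b.keys()
    # d1: sum over shared keywords of the product of their two weights.
    d1 = sum(weights_a[k] * weights_b[k] for k in common)
    # da, db: sum of squared weights of each keyword group.
    da = sum(w * w for w in weights_a.values())
    db = sum(w * w for w in weights_b.values())
    if da == 0 or db == 0:
        return 0.0  # an empty parameter shares nothing with the other
    return d1 / (math.sqrt(da) * math.sqrt(db))


# Hypothetical weighted keywords for two company-address parameters.
address_a = {"west": 1.0, "lake": 2.0, "road": 0.5}
address_b = {"west": 1.0, "lake": 2.0, "street": 0.5}
similarity = weighted_cosine_similarity(address_a, address_b)
```

With these sample inputs the shared keywords contribute 5.0 and each group has a squared-weight sum of 5.25, so the result is 5.0 / 5.25. Two parameters with no keyword in common score 0.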
Preferably, the second computing module 2045 may use linear fitting to fit the similarities between the feature parameters of a first feature of the first account and the feature parameters of the corresponding second feature of the second account, as follows:
$$d = c_1 \times W_{c1} + c_2 \times W_{c2} + \cdots + c_q \times W_{cq}, \quad q \geq 1$$
where $d$ is the similarity between the first feature of the first account and the corresponding second feature of the second account;
$c_1, c_2, \ldots, c_q$ are the similarities between the feature parameters of the first feature and the corresponding feature parameters of the second feature;
$W_{c1}, W_{c2}, \ldots, W_{cq}$ are the pre-assigned weights.
Of course, linear fitting is only one possible approach, and the application is not limited to it.
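The linear fit amounts to a weighted sum of the per-parameter similarities. The short Python sketch below illustrates it; the function name `fit_feature_similarity`, the sample similarities, and the sample weights are all hypothetical.

```python
def fit_feature_similarity(param_similarities, weights):
    """Weighted linear combination d = c1*W1 + ... + cq*Wq, with q >= 1."""
    if len(param_similarities) != len(weights) or not weights:
        raise ValueError("need one weight per parameter similarity, q >= 1")
    return sum(c * w for c, w in zip(param_similarities, weights))


# Hypothetical similarities for address, introduction, and telephone.
c = [0.9, 0.8, 1.0]
w = [0.5, 0.3, 0.2]  # illustrative weights chosen to sum to 1
d = fit_feature_similarity(c, w)
```

If each similarity lies in [0, 1] and the weights are non-negative and sum to 1, the fitted value d also lies in [0, 1], which keeps it comparable with the per-parameter similarities.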
For example, the basic information feature (first feature) of the first account comprises the parameters company address (A1), company introduction (A2), and company telephone (A3), and the basic information feature (second feature) of the second account comprises the parameters company address (B1), company introduction (B2), and company telephone (B3). To calculate the similarity between the two basic information features, the first computing module 2041 first calculates the similarity C1 between A1 and B1, the similarity C2 between A2 and B2, and the similarity C3 between A3 and B3; the fitting module 2042 then fits C1, C2, and C3 to obtain the similarity between the basic information feature of the first account and that of the second account. In a concrete implementation, the cosine-angle method may be used to calculate the similarity between corresponding parameters of the two basic information features. For the detailed calculation, and for the fitting process, see the calculation relating to Tables 1 to 4 in Embodiment 3.
In the above preferred embodiment, the similarity between a pair of features is obtained by fitting the similarities of their individual parameters, which ensures the accuracy of the similarity calculated between the pair of features.
Preferably, the step of judging whether the first account and the second account are repeated accounts according to the similarity between each feature of the first account and the corresponding feature of the second account comprises: using the similarity between each feature of the first account and the corresponding feature of the second account as an input parameter of a predetermined recognition model, and calculating the similarity between the feature information of the first account and the feature information of the second account with the predetermined recognition model; and judging whether the first account and the second account are repeated accounts according to the obtained similarity.
Preferably, the step of calculating the similarity between the feature information of the first account and the feature information of the second account with the predetermined recognition model comprises: training the predetermined recognition model with a predetermined number of training parameters, where each training parameter comprises, as input parameters, the similarities between the features of two accounts and, as an output parameter, a preset similarity between those two accounts; and using the similarity between each feature in the feature information of the first account and the corresponding feature in the feature information of the second account as input parameters, obtaining the similarity between the feature information of the first account and that of the second account from the trained recognition model. By training the recognition model, the application reduces the number of calculation loops, which increases the operation speed of the system when identifying repeated accounts and saves computing time. For the concrete training process in this preferred embodiment, see the calculation relating to Tables 1 to 4 in Embodiment 3.
Preferably, the step of judging whether the first account and the second account are repeated accounts according to the obtained similarity comprises: judging whether the similarity between the feature information of the first account and the feature information of the second account is greater than a predetermined threshold; and, if it is greater than the predetermined threshold, judging that the first account and the second account are repeated accounts.
Preferably, the basic information features of the first account and the second account may be obtained by, but not limited to, the first acquisition module 2021 as follows: acquiring the basic information of the first account and the second account; performing word segmentation and part-of-speech tagging on the basic information of the first account, and assigning a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the first account, to obtain the basic information feature of the first account; and performing word segmentation and part-of-speech tagging on the basic information of the second account, and assigning a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the second account, to obtain the basic information feature of the second account.
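The weight-assignment step can be sketched in Python as follows. The sketch assumes that a word segmenter and part-of-speech tagger have already produced (keyword, part of speech) pairs; the tag names, the `POS_WEIGHTS` table, the default weight, and the function name are hypothetical. The weight values mirror the company-name example given in Embodiment 3, and the simplification that a weight depends only on the part of speech is an assumption of this sketch.

```python
# Hypothetical part-of-speech weight table; the values mirror the
# company-name example in Embodiment 3 and are not prescribed in general.
POS_WEIGHTS = {
    "region": 1.95,
    "core_name": 3.1,
    "industry": 0.8,
    "generic": 0.4,
    "common": 0.2,
}


def basic_info_feature(tagged_tokens, pos_weights=POS_WEIGHTS, default=0.1):
    """Map each segmented keyword to a weight chosen by its part of speech."""
    feature = {}
    for word, pos in tagged_tokens:
        weight = pos_weights.get(pos, default)
        # If a keyword repeats, keep the largest weight assigned to it.
        feature[word] = max(feature.get(word, 0.0), weight)
    return feature


# Stand-in for the output of a word segmenter and part-of-speech tagger.
tagged_name = [
    ("Hangzhou", "region"),
    ("Jia Hua", "core_name"),
    ("science and technology", "industry"),
    ("limited", "generic"),
    ("company", "common"),
]
name_feature = basic_info_feature(tagged_name)
```

The resulting keyword-to-weight mapping is the form of input that a weighted cosine comparison between two accounts would consume.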
Preferably, the product information features of the products published by the first account and the second account may be obtained by, but not limited to, the processing unit 102 in Fig. 1 or the second acquisition module 2022 in Fig. 2 as follows: acquiring the product information of the first account and the second account; performing word segmentation and part-of-speech tagging on the product information of the first account, computing percentage statistics, according to the tagged part of speech, over the keywords obtained by segmenting the product information of the first account, and using the statistics as the product information feature of the products published by the first account; and performing word segmentation and part-of-speech tagging on the product information of the second account, computing percentage statistics, according to the tagged part of speech, over the keywords obtained by segmenting the product information of the second account, and using the statistics as the product information feature of the products published by the second account. In this preferred embodiment of the application, these steps yield useful feature information and make the similarity judgment more accurate.
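The percentage statistics can be sketched in Python as follows. The sketch assumes that segmentation and tagging have already reduced each published offer to one product keyword; the function name and the sample offers are hypothetical, with counts chosen to reproduce the 40%, 35%, and 25% shares used in the example of Embodiment 3.

```python
from collections import Counter


def product_info_feature(product_keywords):
    """Return each product keyword's share of all published products."""
    counts = Counter(product_keywords)
    total = sum(counts.values())
    if total == 0:
        return {}  # an account with no offers has an empty feature
    return {product: n / total for product, n in counts.items()}


# Hypothetical keywords extracted from 20 published offers.
offers = ["mobile phone"] * 8 + ["MP3"] * 7 + ["digital camera"] * 5
feature = product_info_feature(offers)
```

The shares sum to 1 and can serve directly as the weights of the product keywords.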
Preferably, the behavior information features of the first account and the second account may be obtained by, but not limited to, the processing unit 102 in Fig. 1 or the third acquisition module 2023 in Fig. 2 as follows: acquiring the identification information (Cookie ID) used when the first account and the second account log in to the website, using the acquired Cookie ID of the first account as the behavior information feature of the first account, and using the acquired Cookie ID of the second account as the behavior information feature of the second account. In this preferred embodiment of the application, these steps yield useful feature information and make the similarity judgment more accurate.
Preferably, after the first account and the second account are judged to be repeated accounts, the above method further comprises: sending indication information to the user, which may be done by, but is not limited to, the communication unit 106 in Fig. 1 or the communication unit 208 in Fig. 2, where the indication information indicates that the first account and the second account are repeated accounts. In this preferred embodiment of the application, this notification lets the user manage accounts flexibly and improves the user experience.
Embodiment 3
Based on the system for automatically identifying repeated accounts shown in Fig. 1 and Fig. 2, the present application also provides another method for automatically identifying repeated accounts. As shown in Fig. 4, the method in this embodiment comprises:
S402-S406, acquiring account basic information, user historical behavior information, product information, and the like (this stage may be called the information collection and processing stage). Preferably, steps S402-S406 may be performed by, but are not limited to, the processing unit 102 in Fig. 1 or the acquiring unit 202 in Fig. 2.
Preferably, the basic information of an account includes, but is not limited to, the company name, introduction, contact details, and geographic location.
Preferably, the product information corresponding to an account is obtained by extracting the offer information published by the account.
Preferably, the user historical behavior information of an account is obtained from the Cookie ID used when the account logs in to the website.
S408-S414, extracting the basic information feature of the account from the account basic information, extracting the behavior information feature of the account from the user historical behavior information, and extracting the product information feature of the products published by the account from the product information (this stage may be called the information featurization stage). Preferably, steps S408-S414 may be performed by, but are not limited to, the processing unit 102 in Fig. 1 or the computing unit 204 in Fig. 2.
Preferably, after the basic information is collected, word segmentation and part-of-speech tagging are performed on it by text processing to form the required basic information feature.
Preferably, word segmentation and part-of-speech tagging are performed on the product information, and statistics are computed over the tagged information to obtain the product information feature.
Preferably, the acquired Cookie ID of an account is used as the behavior information feature of the account. In this way, the historical behavior of users is analyzed to find associations between accounts, giving the behavior information feature of the account.
S416, automatically identifying whether accounts are repeated by means of machine learning; all repeated accounts can be identified from the machine learning result. Preferably, step S416 may be performed by, but is not limited to, the processing unit 102 in Fig. 1, or the computing unit 204 and the judging unit 206 in Fig. 2.
Preferably, the three kinds of features obtained by featurization describe an account from multiple dimensions; the next step is to calculate the similarity between corresponding features. The methods are as follows:
1) The similarity between basic information features is calculated by the cosine-angle method, and these similarity values are then fitted by machine learning to obtain the final similarity between the basic information features.
Specifically, featurizing the basic information yields a group of basic information feature sequences, each comprising feature ids and the weight corresponding to each id, where the weight is calculated from the frequency with which the id occurs and the part of speech of the id. The cosine-angle algorithm is then applied to the feature sequences to calculate a similarity for each basic information feature. Fitting these similarities yields the final similarity between the basic information features. For the concrete operations, see the example described below with reference to Tables 1 to 4.
2) The share that the products common to the two accounts make up of each account's published products is counted, and the similarity of the product distribution over those common products is calculated; the product of the distribution similarity and the product share gives the similarity between the product information features.
Preferably, the similarity between product information features may also be calculated with the cosine-angle algorithm. Specifically, the id of each kind of product is obtained first, and the product's share of the total quantity, obtained by statistics, is used as the weight of that id. The product ids and their weights form a product information feature sequence, and the cosine-angle algorithm is then used to calculate the similarity. For the concrete operations, see the example described below with reference to Tables 1 to 4.
3) Using information such as historical behavior information and contact details, whether multiple accounts are associated can be determined, which gives the similarity between the behavior information features of those accounts.
After obtaining the above three similarities, the application uses an SVM (Support Vector Machine) recognition model to fit the features and obtain the similarity between two accounts. For example, a portion of the accounts is first extracted and labeled in pairs; the three kinds of features described above are extracted for these accounts, the labels input by the user are received, and an SVM recognition model for repeated accounts is learned. At classification time, the three features of two accounts are input, and the SVM recognition model outputs a similarity value representing the degree to which the two accounts are repeated; a pair whose value is above a certain threshold is identified as repeated. By a method such as vector clustering, all accounts can then be classified to obtain the final result, which can be used by each product line. Of course, the application is not limited to feature recognition with an SVM recognition model and can also be implemented with other recognition models.
In this preferred embodiment, the application identifies repeated accounts registered by the same company or individual, which makes it easier for both the user and the platform to manage multiple accounts. After identifying repeated accounts, the website platform can notify the user, state clearly which of the user's accounts are repeated, remind the user to modify and manage them, and accept the user's feedback. Further, if the feedback instructs that the repeated accounts be merged but the merge instruction is incorrect, the website platform can correct the merge instruction by a preset program so as to better carry out the merge indicated by the user.
Based on the methods and systems for automatically identifying repeated accounts described in the above embodiments, a concrete example of automatically identifying repeated accounts is described below.
Suppose there are 4 companies, whose details are shown in Tables 1 to 4 below:
Table 1
Table 2
Table 3
Table 4
For the above 4 accounts, the basic information features, behavior information features, and product information features are obtained by the above method; then, from these three kinds of features, the pairwise similarity between accounts is calculated by the SVM recognition model. In this process, labels input by the user can be received, for example the similarity relations among accounts A, B, C, and D: A B 1; A C 1; A D 0; B D 0; C D 1 (where 0 means not repeated and 1 means repeated). Before SVM training, the feature information of the four accounts A, B, C, and D is extracted.
Taking account A as an example, the featurization of basic information is described below.
1) For the basic information feature of an account, word segmentation and part-of-speech tagging are first performed on the basic information of each account, and weights are assigned. Taking the company name as an example, segmenting the company name of account A, "Hangzhou Jia Hua Science and Technology Ltd.", gives: Hangzhou, Jia Hua, science and technology, limited, company. Part-of-speech tagging gives: Hangzhou (administrative region), Jia Hua (core organization name), science and technology (industry), limited (generic word), company (common word). Each word is then given a weight according to factors such as its part of speech (the weight information can be entered in advance by the user). Suppose the result is: Hangzhou=1.95, Jia Hua=3.1, science and technology=0.8, limited=0.4, company=0.2. The other dimensions of the basic information, such as the company introduction and contact details, can be featurized in the same way. For the product information feature of the products published by this account, the text techniques described above extract the products of A as mobile phone, MP3, digital camera, and so on, with shares of 40%, 35%, and 25% respectively. From these statistics, the product information feature is: mobile phone=0.4, MP3=0.35, digital camera=0.25. The behavior information feature of this account comprises the userid of the account, its commonly used cookieid, and the like.
2) After featurization, the similarity of the corresponding features between two accounts is calculated. Taking account A and account B (similarity relation A B 1) as an example, the following describes how the cosine-angle algorithm calculates the similarity of the company names of A and B. After featurization, the company name feature of A is: Hangzhou=1.95, Jia Hua=3.1, science and technology=0.8, limited=0.4, company=0.2; the company name feature of B is: Hangzhou=1.95, Jia Hua=3.1, science and technology=0.8, limited=0.4, company=0.2, sales department=0.6.
Here the cosine-angle calculation is illustrated with the company name. From the above, the features common to the company names of accounts A and B are: Hangzhou=1.95, Jia Hua=3.1, science and technology=0.8, limited=0.4, company=0.2. First, the score of the common features is calculated as the sum of the products of the corresponding weights of each common feature, that is, d1=1.95*1.95+3.1*3.1+0.8*0.8+0.4*0.4+0.2*0.2. Then the scores of A and B are calculated separately as the sum of the squared weights of all their features: da=1.95*1.95+3.1*3.1+0.8*0.8+0.4*0.4+0.2*0.2, and db=1.95*1.95+3.1*3.1+0.8*0.8+0.4*0.4+0.2*0.2+0.6*0.6. The final score is df=d1/(sqrt(da)*sqrt(db)), where sqrt(da) denotes the square root of da.
By the cosine-angle algorithm, the similarity between the company names of A and B is obtained as 0.96. Similarly, the similarity between the other basic information features of A and B, such as the company introduction and contact details, can be calculated by the same method. Finally, the similarities between the individual basic information features of accounts A and B are fitted with weight parameters to obtain the final similarity between the basic information features of A and B. In this embodiment, linear fitting may be used. Specifically, suppose the weight of the company name similarity c1 is 0.55, the weight of the company introduction similarity c2 is 0.35, and the weight of the contact details similarity c3 is 0.1; the similarity d of the basic information feature is then d=c1*0.55+c2*0.35+c3*0.1, for example 0.948. Further, if the contact details are identical, the two accounts are more likely to be repeated, and the similarity d can be processed further; for example, the final score of the basic information feature similarity is d=d*0.73+0.27.
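As a check on the arithmetic, the company-name example can be evaluated directly from the weights listed above. This is an illustrative calculation only: the values c2 = 0.92 and c3 = 0.98 are hypothetical, chosen so that the fitted result reproduces the 0.948 quoted in the text, and d' is merely a label for the adjusted score.

```latex
\begin{aligned}
d1 = da &= 1.95^2 + 3.1^2 + 0.8^2 + 0.4^2 + 0.2^2 = 14.2525 \\
db &= 14.2525 + 0.6^2 = 14.6125 \\
df &= \frac{d1}{\sqrt{da}\,\sqrt{db}} = \sqrt{\frac{14.2525}{14.6125}} \approx 0.988 \\
d &= 0.55 \times 0.96 + 0.35 \times 0.92 + 0.1 \times 0.98 = 0.948 \\
d' &= 0.73 \times 0.948 + 0.27 \approx 0.962
\end{aligned}
```

Evaluated literally, the cosine formula gives about 0.988 for these weights, somewhat above the 0.96 quoted above; the text does not explain the difference, so the 0.96 figure is carried into the fit unchanged. The adjustment 0.73 × d + 0.27 maps any d in [0, 1] into [0.27, 1], so it raises the score whenever the contact details match.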
Similarly, the above cosine-angle calculation and fitting process can be used to calculate the similarity of the other corresponding features of account A and account B, namely the similarity between the product information features and the similarity between the behavior information features. Finally, the similarities of the three features are obtained; for example, the similarities of the three features of account A and account B are 0.948, 0.87, and 0.95 respectively.
After the similarities between the features of all labeled pairs have been calculated, the SVM model is trained. For example, the training sample corresponding to the similarity relation A B 1 is (0.948, 0.87, 0.95, 1): (0.948, 0.87, 0.95) are the input parameters when training the SVM model, and 1 is the desired output value. The internal parameters of the SVM model are adjusted using these input parameters and output values, which achieves the purpose of training. Similarly, the SVM model can be further trained with the samples for the similarity relations A C 1, A D 0, B D 0, and C D 1. The more training samples are used, the more accurately the internal parameters of the SVM model can be adjusted.
Once the SVM model has been trained, it can be used to judge a pair of accounts. For example, to judge whether accounts B and C are repeated, the three kinds of feature information of B and C are first extracted by the above method, and the similarities of the corresponding features of B and C are calculated, for example (0.927, 0.865, 0.94). These three values are input to the SVM model, which returns a value, for example 0.97. If this return value is greater than the set threshold, accounts B and C are judged to be repeated accounts.
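The training and judging steps above can be sketched as follows. The sketch is hedged in several ways: the description does not state the kernel, the training procedure, or how the model output is scaled to values such as 0.97, so a linear SVM trained by plain subgradient descent on the hinge loss is used here only to keep the example dependency-free, and the raw decision value is compared with a threshold of 0. The function names, the hyperparameters, and every training row except the A-B pair are hypothetical.

```python
def train_linear_svm(samples, labels, epochs=1000, lr=0.1, reg=0.01):
    """Fit w, b by full-batch subgradient descent on the hinge loss."""
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # sample lies inside the margin
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decision_value(model, x):
    """Signed score of a pair; larger means more likely repeated."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def is_repeated(model, x, threshold=0.0):
    return decision_value(model, x) > threshold


# Each sample: (basic, product, behavior) feature similarities of one pair.
# Only the A-B row comes from the description; the rest are made up.
train_x = [
    (0.948, 0.87, 0.95),  # A-B, labeled repeated
    (0.93, 0.85, 0.90),   # hypothetical repeated pair
    (0.90, 0.88, 0.92),   # hypothetical repeated pair
    (0.20, 0.10, 0.00),   # hypothetical distinct pair
    (0.30, 0.15, 0.05),   # hypothetical distinct pair
]
train_y = [1, 1, 1, -1, -1]  # +1 repeated, -1 not repeated
model = train_linear_svm(train_x, train_y)
bc_repeated = is_repeated(model, (0.927, 0.865, 0.94))  # the B-C pair
```

The labels use +1 and -1 rather than 1 and 0 because the hinge loss is defined on signed labels. A production system would more likely rely on an established SVM library and a much larger labeled sample.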
The above is only an example; in a real project, a large number of labeled account samples would be used for learning.
Of course, repeated accounts can also be identified by simply matching member information or by manual review, but the efficiency is very low, and neither precision nor recall is high.
In view of the current technical challenges and the need to optimize resource allocation and improve the search experience, the application develops a model that automatically identifies repeated accounts. With high-precision, high-recall automatic identification, multiple repeated accounts registered by the same company or individual are identified, and the identification results can be applied to each product line.
Obviously, those skilled in the art should understand that each module or step of the application described above can be implemented with a general-purpose computing device. The modules or steps can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the application is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the application and are not intended to limit it. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principle of the application shall fall within its scope of protection.

Claims (14)

1. A method for automatically identifying repeated accounts, characterized by comprising:
acquiring the feature information of a first account and a second account saved by the server of a website, wherein the feature information comprises a combination of the following features: a basic information feature of the account, a product information feature of the products published by the account, and a behavior information feature of the account;
calculating the similarity between each feature parameter of a feature in the feature information of the first account and each feature parameter of the corresponding feature in the feature information of the second account;
fitting the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account;
judging whether the first account and the second account are repeated accounts according to the similarity between each feature of the first account and the corresponding feature of the second account;
wherein the step of judging whether the first account and the second account are repeated accounts according to the similarity between each feature of the first account and the corresponding feature of the second account comprises: using the similarity between each feature of the first account and the corresponding feature of the second account as an input parameter of a predetermined recognition model, and calculating the similarity between the feature information of the first account and the feature information of the second account with the predetermined recognition model; and judging whether the first account and the second account are repeated accounts according to the obtained similarity;
after judging that the first account and the second account are repeated accounts, sending indication information to a user, wherein the indication information indicates that the first account and the second account are repeated accounts.
2. The method according to claim 1, characterized in that the similarity between a first feature parameter of a feature in the feature information of the first account and a second feature parameter of the corresponding feature in the feature information of the second account is calculated by the following steps:
obtaining a first group of keywords $A_1, A_2, \ldots, A_M$ by segmenting the first feature parameter into words, and obtaining a first group of weights $W_{A1}, W_{A2}, \ldots, W_{AM}$ by tagging the part of speech of each keyword in the first group and assigning each keyword a weight according to its part of speech;
obtaining a second group of keywords $B_1, B_2, \ldots, B_N$ by segmenting the second feature parameter into words, and obtaining a second group of weights $W_{B1}, W_{B2}, \ldots, W_{BN}$ by tagging the part of speech of each keyword in the second group and assigning each keyword a weight according to its part of speech;
selecting the keywords $C_1, \ldots, C_H$ ($H \geq 1$) that are common to the first group and the second group, together with their corresponding weights $W_{C1}, \ldots, W_{CH}$;
calculating the similarity $df$ between the first feature parameter and the second feature parameter by the following formula:
$$df = \frac{d1}{\sqrt{da} \times \sqrt{db}}$$
where
$$d1 = W_{C1} \times W_{C1} + \cdots + W_{CH} \times W_{CH}$$
$$da = W_{A1} \times W_{A1} + \cdots + W_{AM} \times W_{AM}$$
$$db = W_{B1} \times W_{B1} + \cdots + W_{BN} \times W_{BN}$$
3. The method according to claim 1, characterized in that the similarities between each feature parameter of a first feature of the first account and each feature parameter of the corresponding second feature of the second account are fitted by the following formula:
$d = c_1 \times W_{c1} + c_2 \times W_{c2} + \cdots + c_q \times W_{cq}, \quad q \geq 1$
wherein $d$ is the similarity between the first feature of the first account and the corresponding second feature of the second account;
$c_1, c_2, \ldots, c_q$ are the similarities between each feature parameter of the first feature and the corresponding feature parameter of the second feature;
$W_{c1}, W_{c2}, \ldots, W_{cq}$ are pre-assigned weights.
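The weighted combination in this claim is a plain weighted sum, which can be sketched as follows. The function name and the sample numbers are illustrative assumptions; the claim does not say how the pre-assigned weights are chosen or whether they sum to 1.

```python
def feature_similarity(param_similarities, weights):
    """d = c_1*W_c1 + ... + c_q*W_cq for one feature (q >= 1)."""
    if not weights or len(param_similarities) != len(weights):
        raise ValueError("need one pre-assigned weight per feature parameter")
    return sum(c * w for c, w in zip(param_similarities, weights))


# Hypothetical example: a basic-information feature with two parameters
# (say, company name and address) and weights 0.6 and 0.4.
d = feature_similarity([0.9, 0.5], [0.6, 0.4])
```

If the weights are chosen to sum to 1 and every parameter similarity lies in [0, 1], d also stays in [0, 1].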
4. The method according to claim 1, characterized in that the step of calculating the similarity between the feature information of the first account and the feature information of the second account by the predetermined recognition model comprises:
Training the predetermined recognition model with a predetermined number of training parameters, wherein each training parameter comprises: as input parameters, the similarities between the features of two accounts, and, as an output parameter, a preset similarity between the two accounts;
Using the similarity between each feature in the feature information of the first account and the corresponding feature in the feature information of the second account as an input parameter, and obtaining the similarity between the feature information of the first account and the feature information of the second account from the trained predetermined recognition model.
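The train-then-score flow of this claim can be illustrated with a very small model. The claim does not name the type of recognition model, so the single-layer logistic model below, its learning rate, its epoch count and the training samples are all assumptions chosen only to keep the sketch self-contained.

```python
import math


def predict(model, feature_sims):
    """Overall account similarity in (0, 1) from per-feature similarities."""
    weights, bias = model
    z = bias + sum(w * x for w, x in zip(weights, feature_sims))
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=500, lr=0.5):
    """Fit the stand-in model by stochastic gradient descent.

    samples: list of (per-feature similarities, preset similarity) pairs.
    """
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for feature_sims, target in samples:
            error = predict((weights, bias), feature_sims) - target
            weights = [w - lr * error * x for w, x in zip(weights, feature_sims)]
            bias -= lr * error
    return weights, bias


# Hypothetical training parameters: [basic, product, behaviour] similarities
# as input, a preset similarity between the two accounts as output.
training = [
    ([0.9, 0.8, 1.0], 1.0),
    ([0.8, 0.9, 1.0], 1.0),
    ([0.1, 0.2, 0.0], 0.0),
    ([0.2, 0.1, 0.0], 0.0),
]
model = train(training)
score = predict(model, [0.85, 0.9, 1.0])
```

Any regressor or classifier that maps a vector of feature similarities to one score could take the place of this stand-in.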
5. The method according to claim 1, characterized in that the step of judging, according to the obtained similarity, whether the first account and the second account are repeated accounts comprises:
Judging whether the similarity between the feature information of the first account and the feature information of the second account is greater than a predetermined threshold;
If the similarity between the feature information of the first account and the feature information of the second account is greater than the predetermined threshold, judging that the first account and the second account are repeated accounts.
6. The method according to any one of claims 1 to 5, characterized in that the basic information features of the first account and the second account are obtained by the following method:
Obtaining the basic information of the first account and the second account;
Performing word segmentation and part-of-speech tagging on the basic information of the first account, and assigning a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the first account, to obtain the basic information feature of the first account;
Performing word segmentation and part-of-speech tagging on the basic information of the second account, and assigning a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the second account, to obtain the basic information feature of the second account.
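The segmentation, tagging and weighting steps of this claim can be sketched as below. A real system would use a proper word segmenter and part-of-speech tagger; here whitespace splitting and a tiny hand-written lexicon stand in for both, and the tag names and weights are hypothetical values, not figures from the patent.

```python
# ASSUMPTION: toy lexicon and weights, for illustration only.
POS_LEXICON = {
    "hangzhou": "place",
    "alibaba": "org",
    "trading": "noun",
    "company": "noun",
}
POS_WEIGHTS = {"org": 3.0, "place": 2.0, "noun": 1.0, "other": 0.0}


def basic_info_feature(text):
    """Map each keyword of one basic-information field to its weight."""
    feature = {}
    for token in text.lower().split():           # stand-in for word segmentation
        pos = POS_LEXICON.get(token, "other")    # stand-in for part-of-speech tagging
        weight = POS_WEIGHTS[pos]
        if weight > 0:                           # drop words that carry no weight
            feature[token] = weight
    return feature


feature = basic_info_feature("Hangzhou Alibaba Trading Company")
```

The resulting keyword-to-weight mapping is the form of input that a keyword-based similarity between two accounts would consume.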
7. The method according to any one of claims 1 to 5, characterized in that the product information features of the products released by the first account and the second account are obtained by the following method:
Obtaining the product information of the first account and the second account;
Performing word segmentation and part-of-speech tagging on the product information of the first account, computing percentage statistics, according to the tagged part of speech, for each keyword obtained by segmenting the product information of the first account, and using the statistics as the product information feature of the products released by the first account;
Performing word segmentation and part-of-speech tagging on the product information of the second account, computing percentage statistics, according to the tagged part of speech, for each keyword obtained by segmenting the product information of the second account, and using the statistics as the product information feature of the products released by the second account.
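The percentage statistics in this claim can be read as each keyword's share of all keywords found in an account's product listings. The sketch below follows that reading, which is an assumption; whitespace splitting stands in for word segmentation, and the `keep` predicate is a hypothetical placeholder for filtering by part of speech.

```python
from collections import Counter


def product_info_feature(product_titles, keep=lambda token: True):
    """Return each kept keyword's share of all kept keywords."""
    counts = Counter(
        token
        for title in product_titles
        for token in title.lower().split()   # stand-in for word segmentation
        if keep(token)                       # stand-in for part-of-speech filtering
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {keyword: count / total for keyword, count in counts.items()}


# Hypothetical listings released by one account.
shares = product_info_feature(["LED lamp", "LED bulb"])
# shares == {"led": 0.5, "lamp": 0.25, "bulb": 0.25}
```

Two accounts that release the same kinds of products produce similar share distributions even when the individual listings differ.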
8. The method according to any one of claims 1 to 5, characterized in that the behavioural information features of the first account and the second account are obtained by the following method:
Obtaining the identification information (Cookie ID) used when the first account and the second account log in to the website;
Using the obtained Cookie ID of the first account as the behavioural information feature of the first account, and using the obtained Cookie ID of the second account as the behavioural information feature of the second account.
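A minimal sketch of this claim collects the Cookie IDs seen at login for each account. The claim does not say how two behavioural features are compared, so the Jaccard overlap below is an assumption, as are the record layout and all names.

```python
def behaviour_feature(login_records, account):
    """Set of Cookie IDs the given account has logged in with."""
    return {cookie_id for acc, cookie_id in login_records if acc == account}


def cookie_overlap(cookies_a, cookies_b):
    """ASSUMPTION: Jaccard overlap as the behavioural similarity."""
    if not cookies_a or not cookies_b:
        return 0.0
    return len(cookies_a & cookies_b) / len(cookies_a | cookies_b)


# Hypothetical login log: (account, Cookie ID) pairs.
records = [("acct1", "ck-1"), ("acct1", "ck-2"), ("acct2", "ck-2")]
first = behaviour_feature(records, "acct1")
second = behaviour_feature(records, "acct2")
similarity = cookie_overlap(first, second)   # 1 shared ID out of 2 distinct IDs -> 0.5
```

A shared Cookie ID suggests that both accounts were used from the same browser, which is why it is a useful behavioural signal.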
9. A system for automatically identifying repeated accounts, characterized by comprising:
An acquiring unit, configured to obtain the feature information of a first account and a second account saved by a server of a website, wherein the feature information comprises a combination of the following features: the basic information feature of an account, the product information feature of the products released by the account, and the behavioural information feature of the account;
A computing unit, configured to calculate the similarity between each feature parameter of a feature in the feature information of the first account and each feature parameter of the corresponding feature in the feature information of the second account, and to fit the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account;
A judging unit, configured to judge, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are repeated accounts;
wherein the judging unit comprises: a third computing module, configured to use the similarity between each feature of the first account and the corresponding feature of the second account as an input parameter of a predetermined recognition model, and to calculate the similarity between the feature information of the first account and the feature information of the second account by the predetermined recognition model; and a judging module, configured to judge, according to the obtained similarity, whether the first account and the second account are repeated accounts;
A communication unit, configured to send indication information to a user after it is judged that the first account and the second account are repeated accounts, wherein the indication information indicates that the first account and the second account are repeated accounts.
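The way the judging unit and the communication unit of this claim cooperate can be sketched as follows. The class and function names, the averaging stand-in for the recognition model, the feature keys and the 0.8 threshold are all hypothetical; the per-feature similarities are assumed to have been produced already by the acquiring and computing units.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class JudgingUnit:
    model: Callable[[List[float]], float]   # stand-in for the recognition model
    threshold: float                        # predetermined threshold

    def is_repeated(self, feature_sims: List[float]) -> bool:
        return self.model(feature_sims) > self.threshold


def identify(feature_sims: Dict[str, float],
             judging_unit: JudgingUnit,
             notify: Callable[[str], None]) -> bool:
    """Judge one pair of accounts and notify the user if they are repeated."""
    ordered = [feature_sims[name] for name in ("basic", "product", "behaviour")]
    repeated = judging_unit.is_repeated(ordered)
    if repeated:
        # Communication unit: send the indication information.
        notify("The first account and the second account are repeated accounts.")
    return repeated


messages: List[str] = []
unit = JudgingUnit(model=lambda sims: sum(sims) / len(sims), threshold=0.8)
result = identify({"basic": 0.9, "product": 0.85, "behaviour": 1.0}, unit, messages.append)
```

Passing the model and the notification channel in as callables keeps the judging logic independent of how the model was trained and how the user is informed.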
10. The system according to claim 9, characterized in that the computing unit comprises:
A first acquisition module, configured to obtain a first group of keywords $A_1, A_2, \ldots, A_M$ produced by word segmentation of a first feature parameter, and to obtain a first group of weights $W_{A1}, W_{A2}, \ldots, W_{AM}$ produced by part-of-speech tagging of the first group of keywords and assigning a weight to each keyword in the first group according to its part of speech, wherein the first feature parameter is a feature parameter of a feature in the feature information of the first account;
A second acquisition module, configured to obtain a second group of keywords $B_1, B_2, \ldots, B_N$ produced by word segmentation of a second feature parameter, and to obtain a second group of weights $W_{B1}, W_{B2}, \ldots, W_{BN}$ produced by part-of-speech tagging of the second group of keywords and assigning a weight to each keyword in the second group according to its part of speech, wherein the second feature parameter is a feature parameter of a feature in the feature information of the second account;
A selecting module, configured to select the keywords $C_1, \ldots, C_H$ ($H \geq 1$) that are identical between the first group of keywords and the second group of keywords, together with their corresponding weights $W_{C1}, \ldots, W_{CH}$;
A first computing module, configured to calculate the similarity $df$ between the first feature parameter and the second feature parameter by the following formula:
wherein $d1 = W_{C1} \times W_{C1} + \cdots + W_{CH} \times W_{CH}$;
$da = W_{A1} \times W_{A1} + \cdots + W_{AM} \times W_{AM}$;
$db = W_{B1} \times W_{B1} + \cdots + W_{BN} \times W_{BN}$.
11. The system according to claim 9, characterized in that the computing unit further comprises a second computing module, configured to fit, by the following formula, the similarities between each feature parameter of a first feature of the first account and each feature parameter of the corresponding second feature of the second account:
$d = c_1 \times W_{c1} + c_2 \times W_{c2} + \cdots + c_q \times W_{cq}, \quad q \geq 1$
wherein $d$ is the similarity between the first feature of the first account and the corresponding second feature of the second account;
$c_1, c_2, \ldots, c_q$ are the similarities between each feature parameter of the first feature and the corresponding feature parameter of the second feature;
$W_{c1}, W_{c2}, \ldots, W_{cq}$ are pre-assigned weights.
12. The system according to claim 9, characterized in that the third computing module comprises:
A training submodule, configured to train the predetermined recognition model with a predetermined number of training parameters, wherein each training parameter comprises: as input parameters, the similarities between the features of two accounts, and, as an output parameter, a preset similarity between the two accounts;
A calculating submodule, configured to use the similarity between each feature in the feature information of the first account and the corresponding feature in the feature information of the second account as an input parameter, and to obtain the similarity between the feature information of the first account and the feature information of the second account from the trained predetermined recognition model.
13. The system according to claim 9, characterized in that the judging module comprises:
A judging submodule, configured to judge whether the similarity between the feature information of the first account and the feature information of the second account is greater than a predetermined threshold, and, when that similarity is greater than the predetermined threshold, to judge that the first account and the second account are repeated accounts.
14. The system according to any one of claims 9 to 13, characterized in that the acquiring unit comprises at least one of the following:
A first acquisition module, configured to obtain the basic information of the first account and the second account; to perform word segmentation and part-of-speech tagging on the basic information of the first account and assign a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the first account, to obtain the basic information feature of the first account; and to perform word segmentation and part-of-speech tagging on the basic information of the second account and assign a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the second account, to obtain the basic information feature of the second account;
A second acquisition module, configured to obtain the product information of the first account and the second account; to perform word segmentation and part-of-speech tagging on the product information of the first account, compute percentage statistics, according to the tagged part of speech, for each keyword obtained by segmenting the product information of the first account, and use the statistics as the product information feature of the products released by the first account; and to perform word segmentation and part-of-speech tagging on the product information of the second account, compute percentage statistics, according to the tagged part of speech, for each keyword obtained by segmenting the product information of the second account, and use the statistics as the product information feature of the products released by the second account; or
A third acquisition module, configured to obtain the identification information (Cookie ID) used when the first account and the second account log in to the website, to use the obtained Cookie ID of the first account as the behavioural information feature of the first account, and to use the obtained Cookie ID of the second account as the behavioural information feature of the second account.
CN201110113252.1A 2011-05-03 2011-05-03 Method and system for identifying repeated account Active CN102768659B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201110113252.1A CN102768659B (en) 2011-05-03 2011-05-03 Method and system for identifying repeated account
HK12113367.4A HK1172706A1 (en) 2011-05-03 2012-12-25 Method and system for automatically identifying repeated account

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110113252.1A CN102768659B (en) 2011-05-03 2011-05-03 Method and system for identifying repeated account

Publications (2)

Publication Number Publication Date
CN102768659A CN102768659A (en) 2012-11-07
CN102768659B true CN102768659B (en) 2015-06-24

Family

ID=47096063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110113252.1A Active CN102768659B (en) 2011-05-03 2011-05-03 Method and system for identifying repeated account

Country Status (2)

Country Link
CN (1) CN102768659B (en)
HK (1) HK1172706A1 (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104348871B (en) * 2013-08-05 2019-01-11 深圳市腾讯计算机系统有限公司 A kind of similar account extended method and device
CN105095306B (en) * 2014-05-20 2019-04-09 阿里巴巴集团控股有限公司 The method and device operated based on affiliated partner
CN104077366B (en) * 2014-06-13 2018-03-23 北京百度网讯科技有限公司 A kind of method and apparatus for being used to determine characteristic information in the network device
CN105335390A (en) * 2014-07-09 2016-02-17 阿里巴巴集团控股有限公司 Object classification method, business pushing method and server
CN104239490B (en) * 2014-09-05 2017-05-10 电子科技大学 Multi-account detection method and device for UGC (user generated content) website platform
CN104537118B (en) * 2015-01-26 2017-12-26 苏州大学 A kind of microblog data processing method, apparatus and system
CN104573076B (en) * 2015-01-27 2017-11-03 南京烽火星空通信发展有限公司 A kind of Chinese remark names system recommendation method of social network sites user
CN105991621B (en) * 2015-03-04 2019-12-13 深圳市腾讯计算机系统有限公司 Security detection method and server
CN106034149B (en) * 2015-03-13 2019-06-18 阿里巴巴集团控股有限公司 A kind of account recognition methods and device
CN106156149B (en) * 2015-04-14 2020-01-03 阿里巴巴集团控股有限公司 Data transfer method and device
CN106294429A (en) * 2015-05-26 2017-01-04 阿里巴巴集团控股有限公司 Repeat data identification method and device
CN104899189B (en) * 2015-05-27 2017-11-28 深圳市华傲数据技术有限公司 Object oriented matching process based on comentropy
CN106372977B (en) * 2015-07-23 2019-06-07 阿里巴巴集团控股有限公司 A kind of processing method and equipment of virtual account
CN105207996B (en) * 2015-08-18 2018-11-23 小米科技有限责任公司 Account merging method and device
CN105491444B (en) * 2015-11-25 2018-11-06 珠海多玩信息技术有限公司 A kind of data identifying processing method and device
CN105516282B (en) * 2015-12-01 2019-06-11 深圳市元征科技股份有限公司 A kind of method and wearable device of data synchronization processing
CN105897726A (en) * 2016-05-09 2016-08-24 深圳市永兴元科技有限公司 Associated account data sharing method and device
CN106126654B (en) * 2016-06-27 2019-10-18 中国科学院信息工程研究所 A kind of inter-network station user-association method based on user name similarity
CN107066616B (en) * 2017-05-09 2020-12-22 京东数字科技控股有限公司 Account processing method and device and electronic equipment
CN107451879B (en) * 2017-06-12 2018-11-02 北京小度信息科技有限公司 Information judgment method and device
CN107404408B (en) * 2017-08-30 2020-05-22 北京邮电大学 Virtual identity association identification method and device
CN107730364A (en) * 2017-10-31 2018-02-23 北京麒麟合盛网络技术有限公司 user identification method and device
CN111046894A (en) * 2018-10-15 2020-04-21 北京京东尚科信息技术有限公司 Method and device for identifying vest account
CN111104795A (en) * 2019-11-19 2020-05-05 平安金融管理学院(中国·深圳) Company name matching method and device, computer equipment and storage medium
CN111881304B (en) * 2020-07-21 2024-04-26 百度在线网络技术(北京)有限公司 Author identification method, device, equipment and storage medium
CN113779346A (en) * 2021-01-14 2021-12-10 北京沃东天骏信息技术有限公司 Method and device for identifying one person with multiple accounts
CN113536252B (en) * 2021-07-21 2022-08-09 贝壳找房(北京)科技有限公司 Account identification method and computer-readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101022373B1 (en) * 2004-01-29 2011-03-22 주식회사 케이티 Log-in system allowing duplicated user account and method for registering of user account and method for authentication of user
US7725421B1 (en) * 2006-07-26 2010-05-25 Google Inc. Duplicate account identification and scoring
CN101316262B (en) * 2007-05-31 2011-07-13 中兴通讯股份有限公司 Method for controlling repeated registration of the same account terminal
CN101727487A (en) * 2009-12-04 2010-06-09 中国人民解放军信息工程大学 Network criticism oriented viewpoint subject identifying method and system

Also Published As

Publication number Publication date
HK1172706A1 (en) 2013-04-26
CN102768659A (en) 2012-11-07

Similar Documents

Publication Publication Date Title
CN102768659B (en) Method and system for identifying repeated account
CN103336766B (en) Short text garbage identification and modeling method and device
US20160085871A1 (en) Searching for information based on generic attributes of the query
CN105975453A (en) Method and device for comment label extraction
CN105630827B (en) A kind of information processing method, system and auxiliary system
CN110046298A (en) Query word recommendation method and device, terminal device and computer readable medium
CN103473317A (en) Method and equipment for extracting keywords
CN105653562A (en) Calculation method and apparatus for correlation between text content and query request
CN110610193A (en) Method and device for processing labeled data
CN101957845B (en) On-line application system and implementation method thereof
CN103761254A (en) Method for matching and recommending service themes in various fields
CN107807957A (en) entity library generating method and device
CN112257419A (en) Intelligent retrieval method and device for calculating patent document similarity based on word frequency and semantics, electronic equipment and storage medium thereof
CN111242318B (en) Service model training method and device based on heterogeneous feature library
TW201401088A (en) Search method and apparatus
CN110362601A (en) Mapping method, device, equipment and the storage medium of metadata standard
CN108241649A (en) The searching method and device of knowledge based collection of illustrative plates
CN111127068A (en) Automatic pricing method and device for engineering quantity list
CN105989001A (en) Image searching method and device, and image searching system
CN107679186A (en) The method and device of entity search is carried out based on entity storehouse
CN108228788A (en) Guide of action automatically extracts and associated method and electronic equipment
CN103744929A (en) Target user object determination method
CN112069833B (en) Log analysis method, log analysis device and electronic equipment
CN105095271A (en) Microblog retrieval method and microblog retrieval apparatus
CN111523798A (en) Automatic modeling method, device and system and electronic equipment thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: GR

Ref document number: 1172706

Country of ref document: HK