CN105491444B - Data identification processing method and device - Google Patents

Data identification processing method and device

Info

Publication number
CN105491444B
Authority
CN
China
Prior art keywords
feature vector
target feature
user
information
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510835028.1A
Other languages
Chinese (zh)
Other versions
CN105491444A (en
Inventor
余建兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZHUHAI DUOWAN INFORMATION TECHNOLOGY Ltd
Original Assignee
ZHUHAI DUOWAN INFORMATION TECHNOLOGY Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZHUHAI DUOWAN INFORMATION TECHNOLOGY Ltd
Priority to CN201510835028.1A
Publication of CN105491444A
Application granted
Publication of CN105491444B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N21/441 Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card

Abstract

An embodiment of the invention discloses a data identification processing method and device. The method includes: constructing a target feature vector corresponding to a client from the collected device information, user information, and service feature information of that client; creating, based on the user type labels carried by the feature vectors in a labeled data set, a classification model for classifying those feature vectors, and identifying the user type corresponding to the target feature vector according to the classification model and the feature values in the target feature vector; and setting for the target feature vector a user type label corresponding to its user type, then adding the target feature vector carrying that label to the labeled data set. The invention makes it possible to identify, accurately and at low cost, whether a broadcaster client is cheating by illegal means.

Description

Data identification processing method and device
Technical Field
The present invention relates to the field of Internet technology, and in particular to a data identification processing method and device.
Background
In recent years, comprehensive rich-media clients that integrate online karaoke, online video live streaming, online game live streaming, online education live streaming, and similar functions have grown at an unprecedented pace, letting viewers comfortably watch the content streamed by a broadcaster client through a viewer client. At present, however, some illegitimate users use protocol accounts to help broadcaster clients illegally inflate popularity, inflate prop (virtual gift) counts, and so on, in order to obtain unlawful gains. A protocol account is a cheating program that logs in to the client by sending network packets; such programs are used mostly in game live-streaming services.
At present, viewer clients that are protocol accounts are typically found by hand: staff analyze the relevant features of a viewer client on the basis of business experience, decide whether it is a protocol account client, and handle it accordingly. Because the number of viewer clients is very large, analyzing them one by one by hand incurs a huge labor cost. Moreover, viewer clients whose features are not obvious are hard to analyze manually and are easily misjudged.
Summary of the Invention
Embodiments of the present invention provide a data identification processing method and device that can identify, accurately and at low cost, whether a broadcaster client is cheating by illegal means.
An embodiment of the present invention provides a data identification processing method, including:
collecting device information and user information of a client, and constructing a target feature vector corresponding to the client from the device information, the user information, and service feature information, where the target feature vector includes the feature values corresponding to the device information, the user information, and the service feature information;
creating, based on the user type labels carried by the feature vectors in a labeled data set, a classification model for classifying the feature vectors in the labeled data set, and identifying the user type corresponding to the target feature vector according to the classification model and the feature values in the target feature vector; and
setting for the target feature vector a user type label corresponding to its user type, and adding the target feature vector carrying the user type label to the labeled data set, so that the classification model can subsequently be updated from the new labeled data set to identify new target feature vectors, where the user type labels include a legitimate-user label and an illegitimate-user label.
Correspondingly, an embodiment of the present invention further provides a data identification processing device, including:
a collection and construction module, configured to collect device information and user information of a client, and to construct a target feature vector corresponding to the client from the device information, the user information, and service feature information, where the target feature vector includes the feature values corresponding to the device information, the user information, and the service feature information;
a creation and identification module, configured to create, based on the user type labels carried by the feature vectors in a labeled data set, a classification model for classifying the feature vectors in the labeled data set, and to identify the user type corresponding to the target feature vector according to the classification model and the feature values in the target feature vector; and
a setting and adding module, configured to set for the target feature vector a user type label corresponding to its user type, and to add the target feature vector carrying the user type label to the labeled data set, so that the classification model can subsequently be updated from the new labeled data set to identify new target feature vectors, where the user type labels include a legitimate-user label and an illegitimate-user label.
In the embodiments of the present invention, the target feature vector corresponding to a client can be constructed by collecting the client's device information, user information, and service feature information, and the user type corresponding to the target feature vector can be identified from the created classification model and the feature values in the target feature vector. If the user type is illegitimate user, the client is a protocol account client, so whether a viewer client is a protocol account client can be identified automatically, which reduces labor cost. Further, a user type label corresponding to its user type can be set for the target feature vector, and the labeled target feature vector can be added to the labeled data set, so that the classification model can subsequently be updated from the new labeled data set to identify new target feature vectors. As the number of feature vectors in the labeled data set grows, the classification model becomes more and more accurate, and even target feature vectors whose features are not obvious can be identified accurately, which improves the accuracy of protocol account identification.
Description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a data identification processing method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another data identification processing method provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a data identification processing device provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a collection and construction module provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a creation and identification module provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another data identification processing device provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of yet another data identification processing device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, a schematic flowchart of a data identification processing method provided by an embodiment of the present invention, the method may include:
S101: collect the device information and user information of a client, and construct the target feature vector corresponding to the client from the device information, the user information, and service feature information.
Specifically, a data identification processing device deployed on a back-end server can collect the device information and user information of a client, where the user information may include user identity information and user behavior information. The device information refers to information about the user's device environment, such as the characteristics of the running process, the parent process that invoked it, and the protocol used to send data packets. The user identity information refers to the user's record on the client (for example, a viewer client), such as user name, age, gender, place of registration, registration IP (Internet Protocol) address, level, nickname, profile, and client login status. The user behavior information refers to the user's behavior in each channel as recorded by the game live-streaming platform, including login information, viewing information, consumption information (for example, sending flowers or props), and interaction information (for example, leaving messages). The login information may include the user's cumulative login count, login days, and login duration over the i days preceding the statistics date, together with login time periods, login IPs, and the related frequencies; the viewing information may include the cumulative count, days, duration, and time periods of live-stream viewing; the consumption information may include the count, days, amount, and time periods of consumption; and the interaction information may include the time periods of messages. Here, a time period means the specific time at which a behavior occurs.
The data identification processing device then creates the target feature vector corresponding to the client, using the feature values corresponding to the device information, the user identity information, the user behavior information, and the service feature information as its elements. The service feature information may include: whether the account name is longer than 15 characters; whether the account name mixes letters and digits; whether the account name contains the pinyin of a Chinese personal name (obtained, for example, from a population database); whether the account name contains an English name or common English words; whether other accounts were registered from the account's registration IP; whether other accounts have logged in from the account's login IP; whether the account is bound to a mobile phone and an e-mail address; whether the account has set security questions; whether the account's nickname is identical to its user name; whether the account's personal signature and profile are empty; and the account's level and points.
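As a rough illustration of how a few of these service features could be turned into feature values, the sketch below derives three of them from an account record. The function name, the dictionary keys, and the 0/1 encoding are assumptions made for this example; the patent does not specify them.

```python
def account_name_features(account_name: str, nickname: str) -> dict:
    """Derive three of the service features listed above (hypothetical helper).

    Each feature is encoded as 0 or 1; this encoding is an assumption.
    """
    has_letter = any(ch.isalpha() for ch in account_name)
    has_digit = any(ch.isdigit() for ch in account_name)
    return {
        # Is the account name longer than 15 characters?
        "name_longer_than_15": int(len(account_name) > 15),
        # Does the account name mix letters and digits?
        "name_mixes_letters_and_digits": int(has_letter and has_digit),
        # Is the nickname identical to the user name?
        "nickname_equals_username": int(nickname == account_name),
    }


# A 17-character, machine-looking name reused as the nickname.
features = account_name_features("xk29dj38fh47sl56q", "xk29dj38fh47sl56q")
```

Features that need outside data, such as the pinyin check against a population database or the count of accounts sharing an IP, are left out because they depend on services the text only names.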
The dimension of the target feature vector is the total number of feature values it contains. The raw values of different features span different ranges: for example, login duration may range from 1 to 3600, while login count may range from 1 to 100. Every numeric feature value in the target feature vector is therefore obtained by normalization, for which the formula can be $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$, where $x$ is the raw value of a feature and $x_{\min}$ and $x_{\max}$ are the minimum and maximum of that feature's raw value range, so that normalized feature values lie in [0, 1]. Non-numeric feature values in the target feature vector are obtained by assigning preset values: each category of a non-numeric feature is assigned a value that becomes an element of the target feature vector. For example, the categories "male" and "female" of the gender feature are assigned 0 and 1 respectively.
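The normalization and category assignment described above can be sketched as follows. The ranges 1 to 3600 and 1 to 100 and the 0/1 gender coding come from the text; the function names and the choice of three features are assumptions for illustration only.

```python
def min_max_normalize(value, lo, hi):
    """Map a raw value from [lo, hi] into [0, 1]: (value - lo) / (hi - lo)."""
    if hi == lo:
        return 0.0  # assumed convention for a feature with a single possible value
    return (value - lo) / (hi - lo)


# Preset values for a non-numeric feature, as in the "male"/"female" example.
GENDER_CODES = {"male": 0, "female": 1}


def build_feature_vector(login_seconds, login_count, gender):
    """Build a small three-element feature vector (hypothetical layout)."""
    return [
        min_max_normalize(login_seconds, 1, 3600),  # login duration range from the text
        min_max_normalize(login_count, 1, 100),     # login count range from the text
        float(GENDER_CODES[gender]),
    ]


vector = build_feature_vector(login_seconds=3600, login_count=1, gender="female")
```

Without this step the login duration, whose raw range is far wider, would dominate any distance computed between two vectors.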
S102: based on the user type labels carried by the feature vectors in a labeled data set, create a classification model for classifying the feature vectors in the labeled data set, and identify the user type corresponding to the target feature vector according to the classification model and the feature values in the target feature vector.
Specifically, after obtaining the target feature vector, the data identification processing device can compute the vector's position in the hyperplane of the vector space from its feature values; computing the coordinates of a multi-dimensional vector in the hyperplane is prior art and is not detailed here. The device can then create, in the hyperplane, a classification model for classifying the feature vectors in the labeled data set, based on an SVM (Support Vector Machine) classifier and the user type labels (legitimate-user and illegitimate-user labels) carried by those feature vectors. The classification model comprises a legitimate-user region and an illegitimate-user region in the hyperplane: the legitimate-user region contains the feature vectors carrying the legitimate-user label, and the illegitimate-user region contains those carrying the illegitimate-user label. Both regions may also contain feature vectors that do not carry a user type label. The positions in the hyperplane of the feature vectors in the labeled data set and of the unlabeled feature vectors are all computed in advance by the device from each vector's feature values. The unlabeled feature vectors include at least the target feature vector.
After creating the classification model, the data identification processing device can compute the Euclidean distance between every unlabeled feature vector distributed in the hyperplane and every feature vector in the labeled data set. For example, if there are two unlabeled feature vectors, A and B, and three feature vectors in the labeled data set, C, D, and E, the device computes the Euclidean distances between A and C, A and D, A and E, B and C, B and D, and B and E. The Euclidean distance between two feature vectors is $D = \sqrt{\sum_{i=1}^{n} (X_{i1} - X_{i2})^2}$, where $X_{i1}$ is the value of the $i$-th feature in one feature vector and $X_{i2}$ is the value of the same feature in the other. When the shortest of all the computed Euclidean distances is one of the distances associated with the target feature vector, the region of the classification model in which the target feature vector lies can be determined from its position in the hyperplane, and its user type identified accordingly. If the position of the target feature vector falls in the legitimate-user region of the classification model, its user type is identified as legitimate user, meaning that the corresponding client is not a protocol account client; if it falls in the illegitimate-user region, its user type is identified as illegitimate user, meaning that the corresponding client is a protocol account client. When none of the distances associated with the target feature vector is the shortest of all the computed distances, the target feature vector is not identified for the time being; only the unlabeled feature vector that has the shortest Euclidean distance is identified. For example, suppose the unlabeled feature vectors are A and B (A being the target feature vector) and the labeled data set contains C, D, and E. If, after the Euclidean distances between A and C, A and D, A and E, B and C, B and D, and B and E have been computed, the distance between A and C is found to be the shortest, the user type of A can be identified with the classification model.
The purpose of selecting the shortest Euclidean distance is to pick, among all currently unlabeled feature vectors, the one whose features are most distinct. The shorter the Euclidean distance, the closer the unlabeled feature vector lies to a labeled feature vector, and hence the closer its feature values are to those of the labeled vector; in other words, its features are more distinct. Identifying the vector with the most distinct features ensures that the current identification is as accurate as possible.
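The selection rule can be sketched in a few lines: compute every distance between an unlabeled vector and a labeled vector, then pick the unlabeled vector that owns the shortest one. The vector names A to E follow the example in the text; the coordinates and the function names are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_most_distinct(unlabeled, labeled):
    """Return (name, distance) of the unlabeled vector closest to any labeled vector."""
    best_name, best_dist = None, math.inf
    for name, vec in unlabeled.items():
        for ref in labeled.values():
            d = euclidean(vec, ref)
            if d < best_dist:
                best_name, best_dist = name, d
    return best_name, best_dist


# Hypothetical coordinates: A and B are unlabeled, C, D, and E are labeled.
unlabeled = {"A": [0.1, 0.2], "B": [0.9, 0.5]}
labeled = {"C": [0.1, 0.3], "D": [0.5, 0.9], "E": [0.4, 0.0]}
chosen, distance = pick_most_distinct(unlabeled, labeled)
```

With these coordinates the A to C distance is the shortest of the six, so A is the vector identified in this round and B waits for a later one.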
S103: set for the target feature vector a user type label corresponding to its user type, and add the target feature vector carrying the user type label to the labeled data set, so that the classification model can subsequently be updated from the new labeled data set to identify new target feature vectors.
Specifically, after identifying the user type corresponding to the target feature vector, the data identification processing device can set for the target feature vector a user type label corresponding to that user type and add the labeled target feature vector to the labeled data set, so that the classification model can subsequently be updated from the new labeled data set to identify new target feature vectors. Initially, a small number of feature vectors in the labeled data set can be labeled with their user types by hand. As the large number of unlabeled feature vectors are identified and labeled one by one, the labeled data set grows, so a new classification model built from the new labeled data set is more accurate than the previous one, and the new target feature vector can be identified accurately with it. The new target feature vector is the one selected from the remaining unlabeled feature vectors as having the shortest Euclidean distance. Because the feature vector identified in each round is the one with the most distinct features among the remaining unlabeled feature vectors, the method identifies feature vectors with less distinct features later, when the classification model has become more accurate. Every feature vector can thus be identified accurately; that is, with a small amount of manual labeling, all protocol account clients can be found among a large number of viewer clients. For example, game live streaming has more than 3 million users in total, while the manually labeled feature vectors in the initial labeled data set need only include 100 feature vectors carrying the illegitimate-user label and 100 carrying the legitimate-user label; starting from this initial labeled data set, the device can identify and label the clients of all users one by one.
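The iterative labeling loop of S101 to S103 can be sketched as below. To keep the example dependency-free, a nearest-centroid rule stands in for the classifier; this is an assumption of the sketch and is not the SVM classifier the text describes. The seed vectors, names, and label strings are likewise hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def classify(vec, labeled):
    """Stand-in classifier (nearest centroid), NOT the SVM named in the text."""
    groups = {}
    for v, label in labeled:
        groups.setdefault(label, []).append(v)
    return min(groups, key=lambda lab: euclidean(vec, centroid(groups[lab])))


def self_train(labeled, unlabeled):
    """Label unlabeled vectors one at a time, most distinct first."""
    labeled = list(labeled)
    pending = dict(unlabeled)
    order = []
    while pending:
        # Pick the pending vector with the shortest distance to any labeled vector.
        name = min(
            pending,
            key=lambda n: min(euclidean(pending[n], v) for v, _ in labeled),
        )
        vec = pending.pop(name)
        label = classify(vec, labeled)  # model rebuilt from the current labeled set
        labeled.append((vec, label))    # the labeled set grows each round
        order.append((name, label))
    return order


seed = [([0.1, 0.1], "legitimate"), ([0.9, 0.9], "illegitimate")]
result = self_train(seed, {"u1": [0.2, 0.1], "u2": [0.6, 0.6], "u3": [0.8, 1.0]})
```

In this run the ambiguous vector u2 is labeled last, after u1 and u3 have already enlarged the labeled set, which mirrors the ordering argument made in the paragraph above.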
In the embodiments of the present invention, the target feature vector corresponding to a client can be constructed by collecting the client's device information, user information, and service feature information, and the user type corresponding to the target feature vector can be identified from the created classification model and the feature values in the target feature vector. If the user type is illegitimate user, the client is a protocol account client, so whether a viewer client is a protocol account client can be identified automatically, which reduces labor cost. Further, a user type label corresponding to its user type can be set for the target feature vector, and the labeled target feature vector can be added to the labeled data set, so that the classification model can subsequently be updated from the new labeled data set to identify new target feature vectors. As the number of feature vectors in the labeled data set grows, the classification model becomes more and more accurate, and even target feature vectors whose features are not obvious can be identified accurately, which improves the accuracy of protocol account identification.
Referring to Fig. 2, a schematic flowchart of another data identification processing method provided by an embodiment of the present invention, the method may include:
S201: collect the device information and user information of a client, and construct the target feature vector corresponding to the client from the device information, the user information, and service feature information.
Specifically, a data identification processing device deployed on a back-end server can collect the device information and user information of a client, where the user information may include user identity information and user behavior information. The device information refers to information about the user's device environment, such as the characteristics of the running process, the parent process that invoked it, and the protocol used to send data packets. The user identity information refers to the user's record on the client (for example, a viewer client), such as user name, age, gender, place of registration, registration IP address, level, nickname, profile, and client login status. The user behavior information refers to the user's behavior in each channel as recorded by the game live-streaming platform, including login information, viewing information, consumption information (for example, sending flowers or props), and interaction information (for example, leaving messages). The login information may include the user's cumulative login count, login days, and login duration over the i days preceding the statistics date, together with login time periods, login IPs, and the related frequencies; the viewing information may include the cumulative count, days, duration, and time periods of live-stream viewing; the consumption information may include the count, days, amount, and time periods of consumption; and the interaction information may include the time periods of messages. Here, a time period means the specific time at which a behavior occurs.
The data identification processing device then creates the target feature vector corresponding to the client, using the feature values corresponding to the device information, the user identity information, the user behavior information, and the service feature information as its elements. The service feature information may include: whether the account name is longer than 15 characters; whether the account name mixes letters and digits; whether the account name contains the pinyin of a Chinese personal name (obtained, for example, from a population database); whether the account name contains an English name or common English words; whether other accounts were registered from the account's registration IP; whether other accounts have logged in from the account's login IP; whether the account is bound to a mobile phone and an e-mail address; whether the account has set security questions; whether the account's nickname is identical to its user name; whether the account's personal signature and profile are empty; and the account's level and points.
The dimension of the target feature vector is the total number of feature values it contains. The raw values of different features span different ranges: for example, login duration may range from 1 to 3600, while login count may range from 1 to 100. Every numeric feature value in the target feature vector is therefore obtained by normalization, for which the formula can be $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$, where $x$ is the raw value of a feature and $x_{\min}$ and $x_{\max}$ are the minimum and maximum of that feature's raw value range, so that normalized feature values lie in [0, 1]. Non-numeric feature values in the target feature vector are obtained by assigning preset values: each category of a non-numeric feature is assigned a value that becomes an element of the target feature vector. For example, the categories "male" and "female" of the gender feature are assigned 0 and 1 respectively.
S202, calculate the position of the target feature vector in the hyperplane of the vector space according to the feature values in the target feature vector;
Specifically, after obtaining the target feature vector, the data identification processing device can calculate the position of the target feature vector in the hyperplane of the vector space according to its feature values. Calculating the coordinates of a multidimensional vector in a hyperplane is prior art and is not described here. Optionally, to make the coordinate calculation more efficient, valid feature values can be screened out of the feature values of the target feature vector according to thresholds for judging feature value validity and according to the feature value correlation between different feature vectors, and the position of the target feature vector in the hyperplane is then calculated from the valid feature values only. Because there are fewer valid feature values than feature values in the whole target feature vector, the coordinate calculation in the hyperplane becomes more efficient. Since an important feature carries more information, that is, its feature values differ greatly, screening by threshold can specifically include: (1) if the coefficient of variation of a numeric feature exceeds a predetermined threshold, the feature can be used as a valid feature value; (2) if the standard deviation of a numeric feature exceeds a predetermined threshold, the feature can be used as a valid feature value; (3) if the number of occurrences of a certain class label of a categorical feature is below a predetermined threshold, the feature can be used as a valid feature value; (4) if the number of class labels of a categorical feature is below a predetermined threshold, the feature can be used as a valid feature value. Here, coefficient of variation = standard deviation / mean of the normal distribution. In addition, comparing the target feature vector with the feature vectors in the labeled data set shows that the closer the correlation between the two, the more important the feature, so features with a high feature value correlation can be used as valid feature values. Feature value correlation can be tested in three ways: the Pearson correlation coefficient test, the analysis of variance test, and the chi-square test.
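The first screening rule above (coefficient of variation above a predetermined threshold) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary layout of the feature columns, the use of the population standard deviation, and the threshold value are all hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Coefficient of variation = standard deviation / mean."""
    m = mean(values)
    if m == 0:
        # Assumed handling: a zero mean makes the ratio undefined.
        return 0.0
    return pstdev(values) / m


def select_numeric_features(columns, cv_threshold):
    """Keep the numeric features whose coefficient of variation
    exceeds the predetermined threshold.

    columns maps a feature name to the list of its values across users.
    """
    return [
        name
        for name, values in columns.items()
        if coefficient_of_variation(values) > cv_threshold
    ]
```

A feature that takes the same value for every user has a coefficient of variation of zero and is dropped, which matches the idea that features whose values differ greatly carry more information.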
S203, based on a support vector machine (SVM) classifier and the user type labels respectively carried by multiple feature vectors in the labeled data set, create in the hyperplane a classification model for classifying the multiple feature vectors in the labeled data set;
Specifically, the data identification processing device can, based on the SVM classifier and the user type labels (which include legitimate user labels and illegal user labels) respectively carried by the multiple feature vectors in the labeled data set, create in the hyperplane a classification model for classifying the multiple feature vectors in the labeled data set. The classification model includes a legitimate user region and an illegal user region in the hyperplane. The legitimate user region contains the feature vectors carrying a legitimate user label, and the illegal user region contains the feature vectors carrying an illegal user label; both regions can also contain multiple feature vectors that do not carry a user type label. The positions in the hyperplane of the feature vectors in the labeled data set and of the feature vectors that do not carry a user type label are all calculated in advance by the data identification processing device from the feature values (or valid feature values) of each feature vector. The feature vectors that do not carry a user type label include at least the target feature vector.
S204, calculate the Euclidean distances between all feature vectors distributed in the hyperplane that do not carry a user type label and the multiple feature vectors in the labeled data set;
Specifically, after creating the classification model, the data identification processing device can calculate the Euclidean distances between all feature vectors distributed in the hyperplane that do not carry a user type label and the multiple feature vectors in the labeled data set. For example, if two feature vectors A and B do not carry a user type label and the labeled data set contains three feature vectors C, D, and E, then the Euclidean distances between A and C, A and D, A and E, B and C, B and D, and B and E need to be calculated separately. The Euclidean distance between two feature vectors is calculated as $D = \sqrt{\sum_{i=1}^{n} (X_{i1} - X_{i2})^2}$, where $X_{i1}$ is the value of the i-th feature in one feature vector and $X_{i2}$ is the value of the same feature in the other feature vector.
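The pairwise distance calculation in the A, B versus C, D, E example can be sketched as below. The function names and the dictionary representation of named feature vectors are assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """D = sqrt(sum over i of (X_i1 - X_i2)^2) for two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_unlabeled(unlabeled, labeled):
    """Compute every unlabeled-to-labeled distance and return the shortest
    one as (distance, unlabeled_name, labeled_name)."""
    best = None
    for u_name, u_vec in unlabeled.items():
        for l_name, l_vec in labeled.items():
            d = euclidean(u_vec, l_vec)
            if best is None or d < best[0]:
                best = (d, u_name, l_name)
    return best
```

With two unlabeled and three labeled vectors, the loop evaluates exactly the six pairs listed in the example.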
S205, when the Euclidean distance corresponding to the target feature vector is the shortest of all the calculated Euclidean distances, determine the region of the target feature vector in the classification model according to the position of the target feature vector in the hyperplane, so as to identify the user type corresponding to the target feature vector;
Specifically, when the Euclidean distance corresponding to the target feature vector is the shortest of all the calculated Euclidean distances, this means that at least one of the Euclidean distances associated with the target feature vector is the shortest of all the calculated Euclidean distances. In this case, the region of the target feature vector in the classification model can be determined according to the position of the target feature vector in the hyperplane, so as to identify the user type corresponding to the target feature vector. If the position of the target feature vector in the hyperplane falls within the legitimate user region of the classification model, the user type corresponding to the target feature vector is identified as legitimate user, meaning that the client corresponding to the target feature vector is not a protocol-account client. If the position falls within the illegal user region of the classification model, the user type corresponding to the target feature vector is identified as illegal user, meaning that the client corresponding to the target feature vector is a protocol-account client. Further, when the Euclidean distance corresponding to the target feature vector is not the shortest of all the calculated Euclidean distances, the target feature vector is not identified for the time being; only the feature vector without a user type label that has the shortest Euclidean distance is identified at present. For example, if two feature vectors A and B do not carry a user type label, the labeled data set contains three feature vectors C, D, and E, the Euclidean distances between A and C, A and D, A and E, B and C, B and D, and B and E have been calculated, and the distance between A and C is found to be the shortest of all, then the user type of A is identified first through the classification model.
The purpose of selecting the shortest Euclidean distance is to pick, among all feature vectors that currently do not carry a user type label, the one whose features are most distinct. The shorter the Euclidean distance, the closer the unlabeled feature vector is to a feature vector carrying a user type label, that is, the closer its feature values are to those of the labeled feature vector, and hence the more distinct its features. Identifying the feature vector with the most distinct features first ensures that the current identification is the most accurate one available.
S206, set for the target feature vector a user type label corresponding to the user type of the target feature vector, and add the target feature vector carrying the user type label to the labeled data set;
Specifically, after identifying the user type corresponding to the target feature vector, the data identification processing device can set for the target feature vector a user type label corresponding to that user type and add the target feature vector carrying the user type label to the labeled data set, so that the classification model can subsequently be updated from the new labeled data set in order to identify a new target feature vector. The small number of feature vectors in the initial labeled data set can have their user type labels assigned manually. As the many feature vectors without a user type label are identified and labeled one by one, the labeled data set keeps growing, so the new classification model built from the new labeled data set is more accurate than the previous one and can accurately identify the new target feature vector. The new target feature vector can be the feature vector with the shortest Euclidean distance selected from the remaining feature vectors that do not carry a user type label. Because the feature vector identified in each round is the one with the most distinct features among the remaining unlabeled feature vectors, the data identification processing method provided by the present invention leaves feature vectors with less distinct features until later, by which time the classification model is more accurate. This ensures that every feature vector is identified accurately; in other words, with only a small amount of manual labeling, all protocol-account clients can be found among a large number of viewer clients. For example, if the labeled data set contains feature vectors A, B, and C, and the feature vectors that currently do not carry a user type label are D, E, and F, classification model a1 can first be created from A, B, and C. If D is then found to have the shortest Euclidean distance, D is identified and labeled through classification model a1, and D, now carrying a user type label, is added to the labeled data set. Classification model a2 is then created from A, B, C, and D. If F is now found to have the shortest Euclidean distance, F is identified and labeled through classification model a2 and added to the labeled data set. Finally, classification model a3 is created from A, B, C, D, and F. E now has the shortest Euclidean distance, so E is identified and labeled through classification model a3 and added to the labeled data set, which then contains A, B, C, D, E, and F. As another example, the full user base of a game live-streaming platform exceeds 3 million, while the initial labeled data set needs only 100 manually labeled feature vectors carrying an illegal user label and 100 manually labeled feature vectors carrying a legitimate user label; starting from this initial labeled data set, the data identification processing device can identify and label the clients of all users one by one.
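The iterative procedure in the A, B, C / D, E, F example can be sketched as a loop that rebuilds the model, labels the closest unlabeled vector, and moves it into the labeled set. Note the substitution: to keep the sketch self-contained, a nearest-labeled-neighbour predictor stands in for the SVM classifier of the embodiment. All names, the data layout, and the label strings are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_label_classifier(labeled):
    """Stand-in for the SVM (an assumption, not the embodiment's classifier):
    predicts the label of the nearest labeled vector."""
    def predict(x):
        _, label = min(labeled.values(), key=lambda item: euclidean(x, item[0]))
        return label
    return predict


def self_train(labeled, unlabeled, train=nearest_label_classifier):
    """labeled: name -> (vector, label); unlabeled: name -> vector.
    Returns the enlarged labeled set and the order of identification."""
    labeled, unlabeled, order = dict(labeled), dict(unlabeled), []
    while unlabeled:
        model = train(labeled)  # a1, a2, a3, ... rebuilt each round
        # Pick the unlabeled vector with the shortest distance to any labeled one.
        name = min(
            unlabeled,
            key=lambda n: min(euclidean(unlabeled[n], vec) for vec, _ in labeled.values()),
        )
        vec = unlabeled.pop(name)
        labeled[name] = (vec, model(vec))  # label it and add it to the labeled set
        order.append(name)
    return labeled, order
```

Any classifier with a train-then-predict shape could be passed as `train`; the loop itself does not depend on the stand-in.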
S207, when the user type label of the target feature vector is an illegal user label, calculate the Euclidean distances between the target feature vector and each feature vector carrying an illegal user label in the labeled data set, so as to obtain an average Euclidean distance;
Specifically, after all feature vectors that did not carry a user type label have been identified and labeled, the data identification processing device can apply corresponding penalty measures to the clients corresponding to the feature vectors carrying an illegal user label. Taking the client corresponding to the target feature vector as an example again, when the user type label of the target feature vector is an illegal user label, the Euclidean distances between the target feature vector and each feature vector carrying an illegal user label in the labeled data set are calculated to obtain an average Euclidean distance. Likewise, the corresponding average Euclidean distance needs to be calculated for every other feature vector carrying an illegal user label, in the same way as for the target feature vector. For example, if the labeled data set contains three feature vectors A, B, and C carrying an illegal user label (A being the target feature vector), the Euclidean distances between A and B, B and C, and A and C (denoted AB, BC, and AC) are calculated first; the average Euclidean distance of A is then (AB+AC)/2, that of B is (AB+BC)/2, and that of C is (AC+BC)/2.
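The A, B, C averaging example can be reproduced with a short sketch. The function names and the dictionary representation are illustrative assumptions.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_distances(illegal_vectors):
    """For each vector carrying an illegal user label, average its Euclidean
    distance to every other such vector, e.g. avg(A) = (AB + AC) / 2."""
    result = {}
    for name, vec in illegal_vectors.items():
        others = [
            euclidean(vec, other_vec)
            for other_name, other_vec in illegal_vectors.items()
            if other_name != name
        ]
        # Assumed handling: a lone vector has no neighbours to average over.
        result[name] = sum(others) / len(others) if others else 0.0
    return result
```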
S208, calculate the confidence corresponding to the target feature vector according to the average Euclidean distance, and sort the confidence corresponding to the target feature vector together with the confidences corresponding to the feature vectors carrying an illegal user label in the labeled data set;
Specifically, the data identification processing device then calculates the confidence corresponding to the target feature vector from the average Euclidean distance of the target feature vector; the longer the average Euclidean distance, the lower the confidence. Likewise, the data identification processing device calculates the corresponding confidence for every other feature vector carrying an illegal user label. It then sorts the confidence corresponding to the target feature vector together with the confidences corresponding to the other feature vectors carrying an illegal user label, specifically in descending order of confidence.
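The description only states that a longer average Euclidean distance gives a lower confidence; it does not give a formula. The sketch below therefore assumes one possible monotonically decreasing mapping, 1 / (1 + distance), purely for illustration, and then sorts in descending order of confidence. The function names are hypothetical.

```python
def confidence(avg_distance):
    """Assumed mapping (not specified in the description): decreases as the
    average Euclidean distance grows, and equals 1.0 at distance 0."""
    return 1.0 / (1.0 + avg_distance)


def rank_by_confidence(avg_distances):
    """Return (name, confidence) pairs sorted from highest to lowest confidence."""
    scored = {name: confidence(d) for name, d in avg_distances.items()}
    return sorted(scored.items(), key=lambda pair: pair[1], reverse=True)
```

Any other strictly decreasing function of the average distance would produce the same ordering.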
S209, determine the illegal grade corresponding to the target feature vector according to the position of its confidence in the sorted order, and handle the client according to the policy handling method corresponding to the illegal grade;
Specifically, the data identification processing device can determine the illegal grade corresponding to the target feature vector according to the position of its confidence in the sorted order, and handle the client according to the policy handling method corresponding to that illegal grade. For example, the data identification processing device can preset four illegal grades, namely heavy-risk user, medium-risk user, light-risk user, and suspected user. Feature vectors ranked in the top 10% of the confidence ordering are determined to be heavy-risk users, those ranked from the top 10% to the top 30% medium-risk users, those ranked from the top 30% to the top 60% light-risk users, and those ranked from the top 60% to 100% suspected users. The policy handling method for a suspected user can be: kick the user offline and require a verification code to be entered. The policy handling method for a light-risk user can be: kick the user offline and require verification by mobile phone number, for example by having the user enter a mobile phone number and then the verification code sent to that phone. The policy handling method for a medium-risk user can be: kick the user offline and require the password to be changed via mobile phone. The policy handling method for a heavy-risk user can be: ban the account directly; if the user gives feedback asking for the account to be restored, manual review is required. It can be seen that by calculating the confidence corresponding to each feature vector carrying an illegal user label, the illegal grade of each such feature vector can be determined, so that the client corresponding to each feature vector carrying an illegal user label can be penalized more reasonably.
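The four-grade example can be sketched as a lookup on the rank percentile. The grade names follow the example above; the function name, the integer arithmetic, and the treatment of exact boundaries (a rank falling exactly on 10%, 30%, or 60% is placed in the more severe grade) are assumptions.

```python
# (upper bound of the rank percentile, grade), in descending order of severity.
GRADES = [
    (10, "heavy-risk"),
    (30, "medium-risk"),
    (60, "light-risk"),
    (100, "suspected"),
]


def assign_grades(ranked_names):
    """ranked_names is ordered from highest to lowest confidence.
    Returns a mapping from name to illegal grade."""
    n = len(ranked_names)
    grades = {}
    for index, name in enumerate(ranked_names):
        for upper, grade in GRADES:
            # Integer form of (index + 1) / n <= upper / 100.
            if (index + 1) * 100 <= upper * n:
                grades[name] = grade
                break
    return grades
```

A dispatcher that maps each grade to its handling policy (verification code, phone verification, password change, ban) could then be keyed on the returned grade strings.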
In the embodiment of the present invention, the target feature vector corresponding to a client can be constructed by collecting the client's device information and user information together with business feature information, and the user type corresponding to the target feature vector can be identified according to the created classification model and the feature values in the target feature vector. If the user type is illegal user, the client is a protocol-account client, so whether a viewer client is a protocol-account client can be identified automatically, which reduces labor cost. Further, a user type label corresponding to the user type of the target feature vector can be set for the target feature vector, and the target feature vector carrying the user type label can be added to the labeled data set, so that the classification model can subsequently be updated from the new labeled data set in order to identify a new target feature vector. As the number of feature vectors in the labeled data set increases, the created classification model becomes more and more accurate, so that even target feature vectors with indistinct features can be identified accurately, which improves the accuracy of protocol-account identification. In addition, by calculating the confidence corresponding to each feature vector carrying an illegal user label, the illegal grade of each such feature vector can be determined, so that the client corresponding to each feature vector carrying an illegal user label can be penalized more reasonably.
Refer to Fig. 3, which is a schematic structural diagram of a data identification processing device provided by an embodiment of the present invention. The data identification processing device 1 can be applied to a background server and may include: a collection and construction module 10, a creation and identification module 20, and a setting and adding module 30;
The collection and construction module 10 is configured to collect the device information and user information of a client, and to construct the target feature vector corresponding to the client according to the device information, the user information, and business feature information; the target feature vector includes the feature values respectively corresponding to the device information, the user information, and the business feature information;
Specifically, the collection and construction module 10 can collect the device information and user information of the client, and construct the target feature vector corresponding to the client according to the device information, the user information, and the business feature information. Further, refer also to Fig. 4, which is a schematic structural diagram of the collection and construction module 10. The collection and construction module 10 includes: a collection unit 101 and a vector creation unit 102;
The collection unit 101 is configured to collect the device information and user information of the client; the user information includes user identity information and user behavior information;
The vector creation unit 102 is configured to create the target feature vector corresponding to the client, and to use the feature values respectively corresponding to the device information, the user identity information, the user behavior information, and the business feature information as the elements of the target feature vector;
The user information may include user identity information and user behavior information. The device information can refer to the user's device environment information, specifically including the characteristics of the running processes, the parent process that was invoked, the protocol used to send data packets, and so on. The user identity information can refer to the user's records in the client (such as a viewer client), specifically including the user name, age, gender, place of registration, registration IP, level, nickname, profile, client login status, and similar information. The user behavior information can refer to the user's behavior in each channel as recorded by the game live-streaming platform, specifically including the user's login information, viewing information, consumption information (such as sending flowers or props), and interaction information (such as leaving messages). The login information may include the cumulative number of logins, days, and duration over the i days preceding the statistics date, the login periods, the login IPs, and the related frequencies; the viewing information may include the cumulative number of times, days, duration, and periods of watching live streams; the consumption information may include the number of times, days, amount, and periods of consumption; the interaction information may include the periods in which messages were left, and so on. Here, a period refers to the specific time at which the behavior occurs.
The vector creation unit 102 can create the target feature vector corresponding to the client, and use the feature values respectively corresponding to the device information, the user identity information, the user behavior information, and the business feature information as the elements of the target feature vector. The business feature information may include: whether the account name is longer than 15 characters, whether the account name mixes letters and digits, whether the account name contains the pinyin of a Chinese name (for example, as obtained from a population database), whether the account name contains an English name or common English words, whether other accounts were registered from the account's registration IP, whether other accounts log in from the account's login IP, whether the account is bound to a mobile phone and a mailbox, whether security questions have been set for the account, whether the account's nickname is the same as its user name, whether the account's personal signature and profile are empty, the account's level and points, and so on.
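Two of the business features listed above (account name longer than 15 characters, and account name mixing letters and digits) can be derived directly from the account name, as in the following sketch. The function name, the feature keys, and the 0/1 encoding are illustrative assumptions; the remaining business features would need external data such as IP or registration records and are not covered here.

```python
def account_name_features(name):
    """Derive two business feature values from an account name,
    encoded as 0 (no) or 1 (yes)."""
    has_letter = any(ch.isalpha() for ch in name)
    has_digit = any(ch.isdigit() for ch in name)
    return {
        "longer_than_15": int(len(name) > 15),
        "mixes_letters_and_digits": int(has_letter and has_digit),
    }
```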
The dimension of the target feature vector is the total number of feature values it contains. Because the raw value ranges of different features are not uniform (for example, login duration may range from 1 to 3600, while login count may range from 1 to 100), every feature value of a quantitative type in the target feature vector is obtained by normalization. The normalization formula can be: normalized feature value = (raw feature value - minimum of the feature's raw value range) / (maximum of the feature's raw value range - minimum of the feature's raw value range), so that each normalized feature value lies in [0, 1]. In addition, feature values of non-quantitative types in the target feature vector are obtained by assigning preset values: each category of such a feature is assigned a value that serves as an element of the target feature vector. For example, the categories of the feature "male/female" are assigned 0 and 1 respectively.
The creation and identification module 20 is configured to create, based on the user type labels respectively carried by multiple feature vectors in the labeled data set, a classification model for classifying the multiple feature vectors in the labeled data set, and to identify the user type corresponding to the target feature vector according to the classification model and the feature values in the target feature vector;
Specifically, after the target feature vector is obtained, the creation and identification module 20 can create, based on the user type labels respectively carried by the multiple feature vectors in the labeled data set, a classification model for classifying the multiple feature vectors in the labeled data set, and identify the user type corresponding to the target feature vector according to the classification model and the feature values in the target feature vector. Further, refer also to Fig. 5, which is a schematic structural diagram of the creation and identification module 20. The creation and identification module 20 may include: a position calculation unit 201, a model creation unit 202, a distance calculation unit 203, and an identification unit 204;
The position calculation unit 201 is configured to calculate the position of the target feature vector in the hyperplane of the vector space according to the feature values in the target feature vector;
Specifically, to make the coordinate calculation in the hyperplane more efficient, the position calculation unit 201 can be specifically configured to screen valid feature values out of the feature values of the target feature vector according to thresholds for judging feature value validity and according to the feature value correlation between different feature vectors, and to calculate the position of the target feature vector in the hyperplane from the valid feature values. Because there are fewer valid feature values than feature values in the whole target feature vector, the coordinate calculation in the hyperplane becomes more efficient. Since an important feature carries more information, that is, its feature values differ greatly, the screening of valid feature values by threshold performed by the position calculation unit 201 can specifically include: (1) if the coefficient of variation of a numeric feature exceeds a predetermined threshold, the feature can be used as a valid feature value; (2) if the standard deviation of a numeric feature exceeds a predetermined threshold, the feature can be used as a valid feature value; (3) if the number of occurrences of a certain class label of a categorical feature is below a predetermined threshold, the feature can be used as a valid feature value; (4) if the number of class labels of a categorical feature is below a predetermined threshold, the feature can be used as a valid feature value. Here, coefficient of variation = standard deviation / mean of the normal distribution. In addition, comparing the target feature vector with the feature vectors in the labeled data set shows that the closer the correlation between the two, the more important the feature, so the position calculation unit 201 can use features with a high feature value correlation as valid feature values. Feature value correlation can be tested in three ways: the Pearson correlation coefficient test, the analysis of variance test, and the chi-square test.
The model creation unit 202 is configured to create in the hyperplane, based on a support vector machine (SVM) classifier and the user type labels respectively carried by multiple feature vectors in the labeled data set, a classification model for classifying the multiple feature vectors in the labeled data set; the classification model includes a legitimate user region and an illegal user region in the hyperplane;
Specifically, the model creation unit 202 can, based on the SVM classifier and the user type labels (which include legitimate user labels and illegal user labels) respectively carried by the multiple feature vectors in the labeled data set, create in the hyperplane a classification model for classifying the multiple feature vectors in the labeled data set. The classification model includes a legitimate user region and an illegal user region in the hyperplane. The legitimate user region contains the feature vectors carrying a legitimate user label, and the illegal user region contains the feature vectors carrying an illegal user label; both regions can also contain multiple feature vectors that do not carry a user type label. The positions in the hyperplane of the feature vectors in the labeled data set and of the feature vectors that do not carry a user type label are all calculated in advance by the position calculation unit 201 from the feature values (or valid feature values) of each feature vector. The feature vectors that do not carry a user type label include at least the target feature vector.
The distance calculation unit 203 is configured to calculate the Euclidean distances between all feature vectors distributed in the hyperplane that do not carry a user type label and the multiple feature vectors in the labeled data set;
Specifically, after the model creation unit 202 creates the classification model, the distance calculation unit 203 can calculate the Euclidean distances between all feature vectors distributed in the hyperplane that do not carry a user type label and the multiple feature vectors in the labeled data set. For example, if two feature vectors A and B do not carry a user type label and the labeled data set contains three feature vectors C, D, and E, then the distance calculation unit 203 needs to calculate the Euclidean distances between A and C, A and D, A and E, B and C, B and D, and B and E separately. The Euclidean distance between two feature vectors is calculated as $D = \sqrt{\sum_{i=1}^{n} (X_{i1} - X_{i2})^2}$, where $X_{i1}$ is the value of the i-th feature in one feature vector and $X_{i2}$ is the value of the same feature in the other feature vector.
The identification unit 204 is configured to, when the Euclidean distance corresponding to the target feature vector is the shortest of all the calculated Euclidean distances, determine the region of the target feature vector in the classification model according to the position of the target feature vector in the hyperplane, so as to identify the user type corresponding to the target feature vector;
Specifically, when the Euclidean distance corresponding to the target feature vector is the shortest of all the calculated Euclidean distances, this means that at least one of the Euclidean distances associated with the target feature vector is the shortest of all the calculated Euclidean distances. In this case, the identification unit 204 can determine the region of the target feature vector in the classification model according to the position of the target feature vector in the hyperplane, so as to identify the user type corresponding to the target feature vector. If the position of the target feature vector in the hyperplane falls within the legitimate user region of the classification model, the identification unit 204 identifies the user type corresponding to the target feature vector as legitimate user, meaning that the client corresponding to the target feature vector is not a protocol-account client. If the position falls within the illegal user region of the classification model, the identification unit 204 identifies the user type corresponding to the target feature vector as illegal user, meaning that the client corresponding to the target feature vector is a protocol-account client. Further, when the Euclidean distance corresponding to the target feature vector is not the shortest of all the calculated Euclidean distances, the target feature vector is not identified for the time being; only the feature vector without a user type label that has the shortest Euclidean distance is identified at present. For example, if two feature vectors A and B do not carry a user type label, the labeled data set contains three feature vectors C, D, and E, the Euclidean distances between A and C, A and D, A and E, B and C, B and D, and B and E have been calculated, and the distance between A and C is found to be the shortest of all, then the identification unit 204 identifies the user type of A first through the classification model.
The purpose of selecting the shortest Euclidean distance is to pick, from all the feature vectors that currently do not carry the user type identifier, the feature vector whose features are most distinct. The shorter the Euclidean distance, the closer the feature vector that does not carry the user type identifier is to a feature vector that carries the user type identifier, that is, the closer its feature values are to the feature values of a labeled feature vector, and therefore the more distinct its features are. Identifying the feature vector with the most distinct features ensures that the current identification is as accurate as possible.
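The selection rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `pick_most_distinct`, the dictionary layout, and the two-dimensional sample coordinates are all assumptions chosen so that the A/C pair from the example is the closest one.

```python
import math

def pick_most_distinct(unlabeled, labeled):
    """Return (name, distance) for the unlabeled vector that lies closest
    to any vector in the labeled data set."""
    best_name, best_dist = None, math.inf
    for name, vec in unlabeled.items():
        for ref in labeled.values():
            d = math.dist(vec, ref)  # Euclidean distance
            if d < best_dist:
                best_name, best_dist = name, d
    return best_name, best_dist

# Hypothetical coordinates: A and B are unlabeled; C, D, E are labeled.
unlabeled = {"A": (0.1, 0.2), "B": (0.9, 0.4)}
labeled = {"C": (0.1, 0.3), "D": (0.5, 0.5), "E": (0.2, 0.9)}

name, dist = pick_most_distinct(unlabeled, labeled)
print(name, round(dist, 3))
```

With these sample values the A-to-C distance is the smallest of the six pairs, so A would be identified first and B would wait for a later round.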
The setting and adding module 30 is configured to set, for the target feature vector, a user type identifier corresponding to the user type of the target feature vector, and to add the target feature vector carrying the user type identifier to the labeled data set, so that the classification model can subsequently be updated according to the new labeled data set to identify new target feature vectors;
Specifically, after the user type corresponding to the target feature vector has been identified, the setting and adding module 30 can set, for the target feature vector, a user type identifier corresponding to its user type, and add the target feature vector carrying the user type identifier to the labeled data set, so that the classification model can subsequently be updated according to the new labeled data set to identify new target feature vectors. The small number of feature vectors in the initial labeled data set can have their user type identifiers assigned by manual labeling. As the large number of feature vectors that do not carry the user type identifier are identified and labeled, the labeled data set contains more and more feature vectors, so a new classification model built from the new labeled data set is more accurate than the original classification model, and new target feature vectors can be identified accurately based on the new classification model. A new target feature vector can be the feature vector with the shortest Euclidean distance selected from the remaining feature vectors that do not carry the user type identifier. Because the feature vector identified in each round is the one with the most distinct features among the remaining unlabeled feature vectors, feature vectors with less distinct features are identified later, and the classification model becomes more accurate as rounds proceed. This ensures that each feature vector is identified accurately, that is, a small amount of manual labeling is enough to find all protocol number clients among a large number of audience clients.

For example, suppose the labeled data set contains feature vectors A, B, and C, and the feature vectors that currently do not carry the user type identifier are D, E, and F. The creation and identification module 20 can first create classification model a1 from feature vectors A, B, and C in the labeled data set. If feature vector D is then detected to have the shortest Euclidean distance, feature vector D is identified and labeled by classification model a1, and the setting and adding module 30 adds feature vector D, now carrying the user type identifier, to the labeled data set. The creation and identification module 20 then creates classification model a2 from feature vectors A, B, C, and D in the labeled data set. If feature vector F is then detected to have the shortest Euclidean distance, feature vector F is identified and labeled by classification model a2, and the setting and adding module 30 adds feature vector F, now carrying the user type identifier, to the labeled data set. Finally, the creation and identification module 20 creates classification model a3 from feature vectors A, B, C, D, and F in the labeled data set. Feature vector E, the only remaining unlabeled vector, now has the shortest Euclidean distance, so it is identified and labeled by classification model a3, and the setting and adding module 30 adds feature vector E, now carrying the user type identifier, to the labeled data set, so that the labeled data set contains feature vectors A, B, C, D, E, and F.

As another example, if the full user base of a game live-streaming service exceeds 3,000,000 users, the manually labeled feature vectors in the initial labeled data set need only include 100 feature vectors carrying the illegal user identifier and 100 feature vectors carrying the legal user identifier. Starting from this initial labeled data set, the data identification processing device 1 can identify and label the clients of all users one by one.
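The round-by-round procedure in the example above (build a model, label the closest unlabeled vector, add it to the labeled data set, rebuild) can be sketched as a loop. This is a hedged sketch only: to stay self-contained it uses a nearest-centroid rule as a stand-in for the support vector machine classifier named in the text, and the names `train`, `classify`, `label_all`, and the sample coordinates are all assumptions.

```python
import math

def centroid(vectors):
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))

def train(labeled):
    """Stand-in model (NOT an SVM): one centroid per user type."""
    by_type = {}
    for vec, user_type in labeled:
        by_type.setdefault(user_type, []).append(vec)
    return {t: centroid(vs) for t, vs in by_type.items()}

def classify(model, vec):
    """Return the user type whose centroid is nearest to vec."""
    return min(model, key=lambda t: math.dist(vec, model[t]))

def label_all(seed, unlabeled):
    """Label one vector per round, rebuilding the model each round."""
    labeled, pending, order = list(seed), list(unlabeled), []
    while pending:
        model = train(labeled)  # model a1, a2, a3, ... in the text's example
        # Next target: the pending vector nearest to any labeled vector.
        target = min(pending,
                     key=lambda v: min(math.dist(v, ref) for ref, _ in labeled))
        user_type = classify(model, target)
        labeled.append((target, user_type))  # grow the labeled data set
        pending.remove(target)
        order.append((target, user_type))
    return order

# Hypothetical manually labeled seed and unlabeled vectors.
seed = [((0.0, 0.0), "legal"), ((1.0, 1.0), "illegal")]
order = label_all(seed, [(0.6, 0.7), (0.1, 0.0), (0.8, 1.0)])
for vec, user_type in order:
    print(vec, user_type)
```

The loop processes the most distinct vector first and the least distinct one last, which mirrors the ordering argument made in the text; swapping `train` and `classify` for a real SVM would leave the loop itself unchanged.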
In this embodiment of the present invention, the target feature vector corresponding to a client can be constructed by collecting the client's device information, user information, and service feature information, and the user type corresponding to the target feature vector is identified according to the created classification model and the feature values in the target feature vector. If the user type is the illegal user type, the client is a protocol number client, so whether an audience client is a protocol number client can be identified automatically, which reduces labor cost. Further, a user type identifier corresponding to the user type of the target feature vector can be set for the target feature vector, and the target feature vector carrying the user type identifier can be added to the labeled data set, so that the classification model can subsequently be updated according to the new labeled data set to identify new target feature vectors. It can be seen that, as the number of feature vectors in the labeled data set increases, the created classification model becomes increasingly accurate and can accurately identify even target feature vectors whose features are not distinct, which improves the accuracy of identifying protocol numbers.
Referring to Fig. 6, which is a schematic structural diagram of another data identification processing device provided in an embodiment of the present invention, the data identification processing device 1 can be applied to a background server. The data identification processing device 1 may include the collection and construction module 10, the creation and identification module 20, and the setting and adding module 30 of the embodiment corresponding to Fig. 3 above. Further, the data identification processing device 1 may also include a computing module 40, a sorting module 50, and a policy processing module 60;
The computing module 40 is configured to, when the user type identifier of the target feature vector is the illegal user identifier, calculate the Euclidean distance between the target feature vector and each feature vector carrying the illegal user identifier in the labeled data set, so as to obtain an average Euclidean distance;
Specifically, after all the feature vectors that did not carry the user type identifier have been identified and labeled, the data identification processing device 1 can apply corresponding penalties to the clients corresponding to the feature vectors carrying the illegal user identifier. Taking the client corresponding to the target feature vector as an example again, when the user type identifier of the target feature vector is the illegal user identifier, the computing module 40 can calculate the Euclidean distance between the target feature vector and each feature vector carrying the illegal user identifier in the labeled data set, so as to obtain the average Euclidean distance. Likewise, the average Euclidean distance must also be calculated for every other feature vector carrying the illegal user identifier, by the same process as for the target feature vector. For example, if the labeled data set contains three feature vectors carrying the illegal user identifier, A, B, and C (where A is the target feature vector), the computing module 40 first calculates the Euclidean distances between A and B, B and C, and A and C (denoted AB, BC, and AC respectively), and then calculates the average Euclidean distance of A as (AB+AC)/2, the average Euclidean distance of B as (AB+BC)/2, and the average Euclidean distance of C as (AC+BC)/2.
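The A, B, C calculation above can be sketched in a few lines. The function name `average_distances` and the sample coordinates are assumptions for illustration; the coordinates are chosen so that AB = 5, BC = 5, and AC = 8.

```python
import math

def average_distances(illegal):
    """Map each vector's name to its mean Euclidean distance to every
    other vector carrying the illegal user identifier."""
    result = {}
    for name, vec in illegal.items():
        others = [math.dist(vec, other)
                  for key, other in illegal.items() if key != name]
        # A lone vector has no peers; 0.0 is an arbitrary placeholder here.
        result[name] = sum(others) / len(others) if others else 0.0
    return result

# Hypothetical coordinates for the three illegal-labeled vectors.
illegal = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (0.0, 8.0)}
avg = average_distances(illegal)
print(avg)  # A: (AB+AC)/2, B: (AB+BC)/2, C: (AC+BC)/2
```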
The computing module 40 is further configured to calculate the confidence level corresponding to the target feature vector according to the average Euclidean distance;
The sorting module 50 is configured to sort the confidence level corresponding to the target feature vector together with the confidence levels corresponding to the feature vectors carrying the illegal user identifier in the labeled data set;
Specifically, the computing module 40 calculates the confidence level corresponding to the target feature vector according to the average Euclidean distance of the target feature vector; the longer the average Euclidean distance, the lower the confidence level. Likewise, the computing module 40 also calculates the corresponding confidence level for each of the other feature vectors carrying the illegal user identifier. The sorting module 50 then sorts the confidence level corresponding to the target feature vector together with the confidence levels corresponding to the other feature vectors carrying the illegal user identifier, for example in descending order of confidence level.
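The text only states that a longer average Euclidean distance gives a lower confidence level; it does not give a formula. The sketch below therefore assumes one possible monotonically decreasing mapping, `1 / (1 + d)`, purely for illustration, and then sorts in descending order of confidence. The function names and sample distances are hypothetical.

```python
def confidence(avg_distance):
    """Assumed mapping (not from the source): decreases as distance grows."""
    return 1.0 / (1.0 + avg_distance)

def rank_by_confidence(avg_by_name):
    """Return (name, confidence) pairs sorted from highest to lowest."""
    scored = {name: confidence(d) for name, d in avg_by_name.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Hypothetical average Euclidean distances for three illegal-labeled vectors.
ranking = rank_by_confidence({"A": 6.5, "B": 5.0, "C": 7.0})
for name, score in ranking:
    print(name, round(score, 4))
```

Any other strictly decreasing function of the average distance would produce the same ordering, so the ranking step does not depend on this particular choice.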
The policy processing module 60 is configured to determine the illegal grade corresponding to the target feature vector according to the position of its confidence level in the sorted order, and to process the client according to the policy processing mode corresponding to the illegal grade;
Specifically, the policy processing module 60 can determine the illegal grade corresponding to the target feature vector according to the position of its confidence level in the sorted order, and process the client according to the policy processing mode corresponding to that illegal grade. For example, four illegal grades are preset: high-risk user, medium-risk user, low-risk user, and suspected user. The policy processing module 60 determines the feature vectors ranked in the top 10% of the confidence ordering to be high-risk users, those ranked from 10% to 30% to be medium-risk users, those ranked from 30% to 60% to be low-risk users, and those ranked from 60% to 100% to be suspected users. The policy processing mode corresponding to a suspected user can be: force the user offline and require a verification code to be entered. The policy processing mode corresponding to a low-risk user can be: force the user offline and require verification by mobile phone number, for example the user enters a mobile phone number and then enters the verification code sent to that phone. The policy processing mode corresponding to a medium-risk user can be: force the user offline and require the password to be changed via mobile phone. The policy processing mode corresponding to a high-risk user can be: ban the account directly, with manual review required if the user appeals and asks for the account to be restored. It can be seen that, by calculating the confidence level corresponding to each feature vector carrying the illegal user identifier, the illegal grade of each such feature vector can be determined, so that a more reasonable penalty can be applied to the client corresponding to each feature vector carrying the illegal user identifier.
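The percentile bands in the example above can be sketched as a lookup over the sorted list. The grade labels, the `TIERS` table layout, and the rule that a rank exactly on a boundary falls into the stricter band are assumptions; the 10/30/60/100 cut-offs come from the example.

```python
# (upper bound of the band in percent, grade); bounds follow the text's example.
TIERS = [(10, "high-risk"), (30, "medium-risk"), (60, "low-risk"), (100, "suspected")]

def assign_grades(ranked_names):
    """ranked_names must already be sorted by descending confidence."""
    total = len(ranked_names)
    grades = {}
    for index, name in enumerate(ranked_names):
        rank = index + 1
        for upper, grade in TIERS:
            # Integer comparison avoids floating-point percentile rounding.
            if rank * 100 <= upper * total:
                grades[name] = grade
                break
    return grades

# Ten hypothetical users, already ordered from highest to lowest confidence.
grades = assign_grades([f"u{i}" for i in range(1, 11)])
print(grades)
```

With ten users this yields one high-risk user, two medium-risk users, three low-risk users, and four suspected users, matching the 10%, 20%, 30%, and 40% band widths in the example.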
In this embodiment of the present invention, the target feature vector corresponding to a client can be constructed by collecting the client's device information, user information, and service feature information, and the user type corresponding to the target feature vector is identified according to the created classification model and the feature values in the target feature vector. If the user type is the illegal user type, the client is a protocol number client, so whether an audience client is a protocol number client can be identified automatically, which reduces labor cost. Further, a user type identifier corresponding to the user type of the target feature vector can be set for the target feature vector, and the target feature vector carrying the user type identifier can be added to the labeled data set, so that the classification model can subsequently be updated according to the new labeled data set to identify new target feature vectors. It can be seen that, as the number of feature vectors in the labeled data set increases, the created classification model becomes increasingly accurate and can accurately identify even target feature vectors whose features are not distinct, which improves the accuracy of identifying protocol numbers. In addition, by calculating the confidence level corresponding to each feature vector carrying the illegal user identifier, the illegal grade of each such feature vector can be determined, so that a more reasonable penalty can be applied to the client corresponding to each feature vector carrying the illegal user identifier.
Referring to Fig. 7, which is a schematic structural diagram of yet another data identification processing device provided in an embodiment of the present invention, the data identification processing device 1000 may include a processor 1001, a communication interface 1002, and a memory 1003 (the data identification processing device 1000 may have one or more processors 1001; Fig. 7 takes one processor as an example). In some embodiments of the present invention, the processor 1001, the communication interface 1002, and the memory 1003 may be connected by a communication bus or in other ways; Fig. 7 takes connection by a communication bus as an example.
The communication interface 1002 is configured to communicate with the client;
The memory 1003 is configured to store a program;
The processor 1001 is configured to execute the program so as to implement the following:
Collect device information and user information of a client, and construct the target feature vector corresponding to the client according to the device information, the user information, and service feature information; the target feature vector includes the feature values respectively corresponding to the device information, the user information, and the service feature information;
Based on the user type identifiers respectively carried by multiple feature vectors in a labeled data set, create a classification model for classifying the multiple feature vectors in the labeled data set, and identify the user type corresponding to the target feature vector according to the classification model and the feature values in the target feature vector;
Set, for the target feature vector, a user type identifier corresponding to the user type of the target feature vector, and add the target feature vector carrying the user type identifier to the labeled data set, so that the classification model can subsequently be updated according to the new labeled data set to identify new target feature vectors; the user type identifier includes a legal user identifier and an illegal user identifier.
In one embodiment, the processor 1001 is further configured to:
When the user type identifier of the target feature vector is the illegal user identifier, calculate the Euclidean distance between the target feature vector and each feature vector carrying the illegal user identifier in the labeled data set, so as to obtain an average Euclidean distance;
Calculate the confidence level corresponding to the target feature vector according to the average Euclidean distance, and sort the confidence level corresponding to the target feature vector together with the confidence levels corresponding to the feature vectors carrying the illegal user identifier in the labeled data set;
Determine the illegal grade corresponding to the target feature vector according to the position of its confidence level in the sorted order, and process the client according to the policy processing mode corresponding to the illegal grade.
In one embodiment, when collecting the device information and user information of the client and constructing the target feature vector corresponding to the client according to the device information, the user information, and the service feature information, the processor 1001 is specifically configured to:
Collect the device information and user information of the client; the user information includes user identity information and user behavior information;
Create the target feature vector corresponding to the client, and use the feature values respectively corresponding to the device information, the user identity information, the user behavior information, and the service feature information as the elements of the target feature vector;
The dimension of the target feature vector is the total number of feature values in the target feature vector. Feature values of a numeric type in the target feature vector are obtained by normalization, and feature values of a non-numeric type are obtained by assigning preset specified values.
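The construction rule above (normalize numeric values, assign preset numbers to non-numeric values) can be sketched as follows. The source does not specify the normalization method or any field names, so min-max scaling, the fields `watch_minutes`, `gifts_sent`, and `device_type`, their ranges, and the category codes are all hypothetical.

```python
# Hypothetical value ranges for numeric fields (assumed, not from the source).
NUMERIC_RANGES = {"watch_minutes": (0, 600), "gifts_sent": (0, 50)}
# Hypothetical preset values assigned to non-numeric fields.
CATEGORY_CODES = {"device_type": {"pc": 0.0, "android": 0.5, "ios": 1.0}}

def min_max(value, low, high):
    """Min-max normalization into [0, 1]; one possible normalization choice."""
    if high == low:
        return 0.0
    return (value - low) / (high - low)

def build_vector(record):
    """Build a feature vector: numeric fields first, then categorical fields."""
    vector = []
    for field, (low, high) in NUMERIC_RANGES.items():
        vector.append(min_max(record[field], low, high))
    for field, codes in CATEGORY_CODES.items():
        vector.append(codes[record[field]])
    return vector

vec = build_vector({"watch_minutes": 150, "gifts_sent": 5, "device_type": "android"})
print(vec, "dimension:", len(vec))
```

The dimension of the resulting vector equals the number of feature values placed in it, here three, as stated in the text.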
In one embodiment, when creating, based on the user type identifiers respectively carried by multiple feature vectors in the labeled data set, the classification model for classifying the multiple feature vectors in the labeled data set, and identifying the user type corresponding to the target feature vector according to the classification model and the feature values in the target feature vector, the processor 1001 is specifically configured to:
Calculate the position of the target feature vector in the hyperplane of the vector space according to the feature values in the target feature vector;
Based on a support vector machine classifier and the user type identifiers respectively carried by the multiple feature vectors in the labeled data set, create, in the hyperplane, the classification model for classifying the multiple feature vectors in the labeled data set; the classification model includes a legal user region and an illegal user region in the hyperplane;
Calculate the Euclidean distances between all feature vectors distributed in the hyperplane that do not carry the user type identifier and the multiple feature vectors in the labeled data set; the feature vectors that do not carry the user type identifier include at least the target feature vector;
When the Euclidean distance corresponding to the target feature vector is the shortest of all the calculated Euclidean distances, determine, according to the position of the target feature vector in the hyperplane, the region of the classification model in which the target feature vector lies, so as to identify the user type corresponding to the target feature vector.
In one embodiment, when calculating the position in the hyperplane of the vector space according to the feature values in the target feature vector, the processor 1001 is specifically configured to:
Screen out the valid feature values from the feature values of the target feature vector according to a threshold for judging feature value validity and the feature value correlation between different feature vectors, and calculate the position of the target feature vector in the hyperplane according to the valid feature values.
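The screening step above can be sketched with a Pearson correlation filter. The source does not say exactly which quantities are correlated or how the threshold is applied, so this sketch assumes each feature column is correlated against the known user type labels of already-labeled vectors and kept when the absolute coefficient reaches the threshold. The function names and sample data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def select_features(rows, labels, threshold):
    """Return the indices of feature columns whose |correlation| with the
    labels is at least the threshold (an assumed validity criterion)."""
    columns = list(zip(*rows))
    return [i for i, col in enumerate(columns)
            if abs(pearson(col, labels)) >= threshold]

# Hypothetical labeled vectors; label 0 = legal user, 1 = illegal user.
rows = [(0.1, 0.5, 0.9), (0.2, 0.5, 0.1), (0.8, 0.5, 0.8), (0.9, 0.5, 0.2)]
labels = [0, 0, 1, 1]
kept = select_features(rows, labels, threshold=0.5)
print(kept)
```

In this sample the first column tracks the label closely, the second is constant, and the third is uncorrelated with the label, so only the first column would be treated as a valid feature value.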
In this embodiment of the present invention, the target feature vector corresponding to a client can be constructed by collecting the client's device information, user information, and service feature information, and the user type corresponding to the target feature vector is identified according to the created classification model and the feature values in the target feature vector. If the user type is the illegal user type, the client is a protocol number client, so whether an audience client is a protocol number client can be identified automatically, which reduces labor cost. Further, a user type identifier corresponding to the user type of the target feature vector can be set for the target feature vector, and the target feature vector carrying the user type identifier can be added to the labeled data set, so that the classification model can subsequently be updated according to the new labeled data set to identify new target feature vectors. It can be seen that, as the number of feature vectors in the labeled data set increases, the created classification model becomes increasingly accurate and can accurately identify even target feature vectors whose features are not distinct, which improves the accuracy of identifying protocol numbers. In addition, by calculating the confidence level corresponding to each feature vector carrying the illegal user identifier, the illegal grade of each such feature vector can be determined, so that a more reasonable penalty can be applied to the client corresponding to each feature vector carrying the illegal user identifier.
Those of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The above discloses only preferred embodiments of the present invention, which of course cannot be used to limit the scope of the claims of the present invention; equivalent changes made according to the claims of the present invention therefore still fall within the scope of the present invention.

Claims (10)

1. A data identification processing method, characterized by comprising:
Collecting device information and user information of a client, and constructing the target feature vector corresponding to the client according to the device information, the user information, and service feature information; the target feature vector includes the feature values respectively corresponding to the device information, the user information, and the service feature information;
Based on the user type identifiers respectively carried by multiple feature vectors in a labeled data set, creating a classification model for classifying the multiple feature vectors in the labeled data set, and, when the Euclidean distances between the target feature vector distributed in a hyperplane and the multiple feature vectors in the labeled data set include the shortest Euclidean distance, identifying the user type corresponding to the target feature vector according to the classification model and the feature values in the target feature vector; wherein the position of the target feature vector in the hyperplane is calculated according to the valid feature values in the target feature vector; the valid feature values are screened according to a threshold for judging feature value validity and the feature value correlation between different feature vectors, and the feature value correlation is obtained based on a Pearson correlation coefficient test, an analysis of variance test, and a chi-square test;
Setting, for the target feature vector, a user type identifier corresponding to the user type of the target feature vector, and adding the target feature vector carrying the user type identifier to the labeled data set, so that the classification model can subsequently be updated according to the new labeled data set to identify new target feature vectors; the user type identifier includes a legal user identifier and an illegal user identifier.
2. The method according to claim 1, characterized by further comprising:
When the user type identifier of the target feature vector is the illegal user identifier, calculating the Euclidean distance between the target feature vector and each feature vector carrying the illegal user identifier in the labeled data set, so as to obtain an average Euclidean distance;
Calculating the confidence level corresponding to the target feature vector according to the average Euclidean distance, and sorting the confidence level corresponding to the target feature vector together with the confidence levels corresponding to the feature vectors carrying the illegal user identifier in the labeled data set;
Determining the illegal grade corresponding to the target feature vector according to the position of its confidence level in the sorted order, and processing the client according to the policy processing mode corresponding to the illegal grade.
3. The method according to claim 1, characterized in that the collecting device information and user information of a client, and constructing the target feature vector corresponding to the client according to the device information, the user information, and service feature information, comprises:
Collecting the device information and user information of the client; the user information includes user identity information and user behavior information;
Creating the target feature vector corresponding to the client, and using the feature values respectively corresponding to the device information, the user identity information, the user behavior information, and the service feature information as the elements of the target feature vector;
Wherein the dimension of the target feature vector is the total number of feature values in the target feature vector; feature values of a numeric type in the target feature vector are obtained by normalization, and feature values of a non-numeric type are obtained by assigning preset specified values.
4. The method according to claim 1, characterized in that the creating, based on the user type identifiers respectively carried by multiple feature vectors in a labeled data set, a classification model for classifying the multiple feature vectors in the labeled data set, and identifying the user type corresponding to the target feature vector according to the classification model and the feature values in the target feature vector, comprises:
Calculating the position of the target feature vector in the hyperplane of the vector space according to the feature values in the target feature vector;
Based on a support vector machine classifier and the user type identifiers respectively carried by the multiple feature vectors in the labeled data set, creating, in the hyperplane, the classification model for classifying the multiple feature vectors in the labeled data set; the classification model includes a legal user region and an illegal user region in the hyperplane;
Calculating the Euclidean distances between all feature vectors distributed in the hyperplane that do not carry the user type identifier and the multiple feature vectors in the labeled data set; the feature vectors that do not carry the user type identifier include at least the target feature vector;
When the Euclidean distance corresponding to the target feature vector is the shortest of all the calculated Euclidean distances, determining, according to the position of the target feature vector in the hyperplane, the region of the classification model in which the target feature vector lies, so as to identify the user type corresponding to the target feature vector.
5. The method according to claim 4, characterized in that the calculating the position in the hyperplane of the vector space according to the feature values in the target feature vector specifically comprises:
Screening out the valid feature values from the feature values of the target feature vector according to the threshold for judging feature value validity and the feature value correlation between different feature vectors, and calculating the position of the target feature vector in the hyperplane according to the valid feature values.
6. A data identification processing device, characterized by comprising:
A collection and construction module, configured to collect device information and user information of a client, and to construct the target feature vector corresponding to the client according to the device information, the user information, and service feature information; the target feature vector includes the feature values respectively corresponding to the device information, the user information, and the service feature information;
A creation and identification module, configured to create, based on the user type identifiers respectively carried by multiple feature vectors in a labeled data set, a classification model for classifying the multiple feature vectors in the labeled data set, and, when the Euclidean distances between the target feature vector distributed in a hyperplane and the multiple feature vectors in the labeled data set include the shortest Euclidean distance, to identify the user type corresponding to the target feature vector according to the classification model and the feature values in the target feature vector; wherein the position of the target feature vector in the hyperplane is calculated according to the valid feature values in the target feature vector; the valid feature values are screened according to a threshold for judging feature value validity and the feature value correlation between different feature vectors, and the feature value correlation is obtained based on a Pearson correlation coefficient test, an analysis of variance test, and a chi-square test;
A setting and adding module, configured to set, for the target feature vector, a user type identifier corresponding to the user type of the target feature vector, and to add the target feature vector carrying the user type identifier to the labeled data set, so that the classification model can subsequently be updated according to the new labeled data set to identify new target feature vectors; the user type identifier includes a legal user identifier and an illegal user identifier.
7. The device according to claim 6, characterized by further comprising:
A computing module, configured to, when the user type identifier of the target feature vector is the illegal user identifier, calculate the Euclidean distance between the target feature vector and each feature vector carrying the illegal user identifier in the labeled data set, so as to obtain an average Euclidean distance;
The computing module is further configured to calculate the confidence level corresponding to the target feature vector according to the average Euclidean distance;
A sorting module, configured to sort the confidence level corresponding to the target feature vector together with the confidence levels corresponding to the feature vectors carrying the illegal user identifier in the labeled data set;
A policy processing module, configured to determine the illegal grade corresponding to the target feature vector according to the position of its confidence level in the sorted order, and to process the client according to the policy processing mode corresponding to the illegal grade.
8. The device as claimed in claim 6, characterized in that the collection constructing module includes:
A collecting unit, configured to collect device information and user information of the client; the user information includes user identity information and user behavior information;
A vector creating unit, configured to create the target feature vector corresponding to the client, and to use the feature values respectively corresponding to the device information, the user identity information, the user behavior information, and the service feature information as the elements of the target feature vector;
Wherein the dimension of the target feature vector is the total number of feature values in the target feature vector; feature values of a quantitative type in the target feature vector are obtained by normalization, and feature values of a non-quantitative type are obtained by assignment using preset specified numerical values.
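A minimal Python sketch of this vector construction follows. It assumes min-max normalization for quantitative features and a lookup table of preset numbers for non-quantitative ones; the claim names neither the normalization method nor the preset values, and the field names, ranges, and codes below are all hypothetical.

```python
# Hypothetical value ranges for quantitative features (used for min-max scaling).
NUMERIC_RANGES = {"login_count": (0, 100), "watch_minutes": (0, 600)}

# Hypothetical preset numbers for non-quantitative features.
CATEGORY_CODES = {
    "device_type": {"pc": 0.0, "android": 0.5, "ios": 1.0},
    "is_anonymous": {False: 0.0, True: 1.0},
}

# Fixed element order; the vector dimension equals the number of features.
FEATURE_ORDER = ["device_type", "is_anonymous", "login_count", "watch_minutes"]


def build_feature_vector(record):
    """Turn one client's collected information into a numeric feature vector."""
    vector = []
    for name in FEATURE_ORDER:
        value = record[name]
        if name in NUMERIC_RANGES:
            low, high = NUMERIC_RANGES[name]
            clipped = min(max(value, low), high)  # keep the result in [0, 1]
            vector.append((clipped - low) / (high - low))
        else:
            vector.append(CATEGORY_CODES[name][value])
    return vector


vec = build_feature_vector(
    {"device_type": "android", "is_anonymous": True,
     "login_count": 25, "watch_minutes": 150}
)
print(vec)
```

Scaling every element into a common range keeps a single large-valued feature from dominating the Euclidean distances used elsewhere in the claims.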
9. The device as claimed in claim 6, characterized in that the creation and identification module includes:
A position calculation unit, configured to calculate, according to the feature values in the target feature vector, the position of the target feature vector in the hyperplane of the vector space;
A model creating unit, configured to create in the hyperplane, based on a support vector machine classifier and the user type identifiers respectively carried by the multiple feature vectors in the labeled data set, a classification model for classifying the multiple feature vectors in the labeled data set; the classification model includes a legitimate user region and an illegal user region in the hyperplane;
A distance calculation unit, configured to calculate the Euclidean distances between all feature vectors distributed in the hyperplane that do not carry a user type identifier and the multiple feature vectors in the labeled data set; the feature vectors that do not carry a user type identifier include at least the target feature vector;
A recognition unit, configured to, when the Euclidean distance corresponding to the target feature vector is the shortest of all the calculated Euclidean distances, determine the region of the target feature vector in the classification model according to the position of the target feature vector in the hyperplane, so as to identify the user type corresponding to the target feature vector.
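To make the idea of a legitimate user region and an illegal user region concrete, here is a small Python sketch of a linear support vector machine. It is an illustration under assumptions, not the patented implementation: the model is trained by plain sub-gradient descent on the regularized hinge loss, the hyperparameters `lr`, `lam`, and `epochs` are arbitrary choices, and the function names and sample data are hypothetical.

```python
def train_linear_svm(samples, labels, lr=0.1, lam=0.01, epochs=200):
    """Fit w, b by sub-gradient descent on the regularized hinge loss.

    labels use +1 for legitimate users and -1 for illegal users.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w on every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # The sample is inside the margin: push the boundary away from it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def region(x, w, b):
    """Report which side of the separating boundary the vector falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "legitimate user region" if score >= 0 else "illegal user region"


# Hypothetical labeled data set: two well-separated groups.
samples = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
           (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(samples, labels)
print(region((0.1, 0.1), w, b))
print(region((0.9, 0.9), w, b))
```

A production system would more likely use an existing SVM library, possibly with a non-linear kernel; the linear version is used here only because it fits in a few lines.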
10. The device as claimed in claim 9, characterized in that
The position calculation unit is specifically configured to screen out the valid feature values among the feature values of the target feature vector according to the threshold for judging feature value validity and the feature value correlation between different feature vectors, and to calculate the position of the target feature vector in the hyperplane according to the valid feature values.
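The screening step can be sketched in Python as below. The claim wording leaves the exact correlation target open, so this sketch assumes one common reading: for each feature position, compute the Pearson correlation between that feature's values across the labeled vectors and the user type label, and keep the feature when the absolute correlation reaches the threshold. The threshold of 0.5, the function names, and the sample data are hypothetical; the analysis-of-variance and chi-square tests mentioned in claim 6 are not shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def valid_feature_indices(vectors, labels, threshold=0.5):
    """Indices of features whose |correlation with the label| >= threshold."""
    columns = list(zip(*vectors))
    return [i for i, col in enumerate(columns)
            if abs(pearson(col, labels)) >= threshold]


def project(vector, indices):
    """Keep only the valid feature values of a vector."""
    return [vector[i] for i in indices]


# Hypothetical data: feature 0 tracks the label, feature 1 is constant,
# feature 2 is unrelated to the label.
vectors = [(0.1, 0.5, 0.2), (0.2, 0.5, 0.9), (0.9, 0.5, 0.3), (0.8, 0.5, 0.8)]
labels = [1, 1, -1, -1]

kept = valid_feature_indices(vectors, labels)
print(kept, project((0.7, 0.5, 0.1), kept))
```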
CN201510835028.1A 2015-11-25 2015-11-25 A kind of data identifying processing method and device Active CN105491444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510835028.1A CN105491444B (en) 2015-11-25 2015-11-25 A kind of data identifying processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510835028.1A CN105491444B (en) 2015-11-25 2015-11-25 A kind of data identifying processing method and device

Publications (2)

Publication Number Publication Date
CN105491444A (en) 2016-04-13
CN105491444B (en) 2018-11-06

Family

ID=55678102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510835028.1A Active CN105491444B (en) 2015-11-25 2015-11-25 A kind of data identifying processing method and device

Country Status (1)

Country Link
CN (1) CN105491444B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180089581A1 (en) * 2016-09-27 2018-03-29 Futurewei Technologies, Inc. Apparatus and method for dataset model fitting using a classifying engine
CN108268877A (en) * 2016-12-30 2018-07-10 中国移动通信集团黑龙江有限公司 A kind of method and apparatus for identifying target terminal
CN108399418B (en) * 2018-01-23 2021-09-03 北京奇艺世纪科技有限公司 User classification method and device
CN110166344B (en) * 2018-04-25 2021-08-24 腾讯科技(深圳)有限公司 Identity identification method, device and related equipment
CN110557447B (en) * 2019-08-26 2022-06-10 腾讯科技(武汉)有限公司 User behavior identification method and device, storage medium and server
CN111417021B (en) * 2020-03-16 2022-07-08 广州虎牙科技有限公司 Plug-in identification method and device, computer equipment and readable storage medium
CN111766487A (en) * 2020-07-31 2020-10-13 南京南瑞继保电气有限公司 Cable partial discharge defect type identification method based on multiple quality characteristic quantities
CN113521751B (en) * 2021-07-27 2023-11-14 腾讯科技(深圳)有限公司 Operation test method and device, storage medium and electronic equipment
CN114466358B (en) * 2022-01-30 2023-10-31 全球能源互联网研究院有限公司 User identity continuous authentication method and device based on zero trust

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101600178A (en) * 2009-06-26 2009-12-09 成都市华为赛门铁克科技有限公司 Junk information confirmation method and device, terminal
CN102768659A (en) * 2011-05-03 2012-11-07 阿里巴巴集团控股有限公司 Method and system for identifying repeated account
CN102708186A (en) * 2012-05-11 2012-10-03 上海交通大学 Identification method of phishing sites
CN104471501A (en) * 2012-06-12 2015-03-25 西门子公司 Generalized pattern recognition for fault diagnosis in machine condition monitoring
CN104933082A (en) * 2014-03-21 2015-09-23 华为技术有限公司 Evaluation information processing method and apparatus
CN104579773A (en) * 2014-12-31 2015-04-29 北京奇虎科技有限公司 Domain name system analysis method and device

Also Published As

Publication number Publication date
CN105491444A (en) 2016-04-13

Similar Documents

Publication Publication Date Title
CN105491444B (en) A kind of data identifying processing method and device
CN105447147B (en) A kind of data processing method and device
CN106445796B (en) Automatic detection method and device for cheating channel
CN108399418A (en) A kind of user classification method and device
CN108304426B (en) Identification obtaining method and device
CN106469261A (en) A kind of auth method and device
CN106843941B (en) Information processing method, device and computer equipment
CN107515915A (en) User based on user behavior data identifies correlating method
CN112364202A (en) Video recommendation method and device and electronic equipment
CN105516192B (en) A kind of mail address is safe to identify control method and device
CN106021455A (en) Image characteristic relationship matching method, apparatus and system
CN114297448B (en) License applying method, system and medium based on intelligent epidemic prevention big data identification
CN107529093A (en) A kind of detection method and system of video file playback volume
CN107729924A (en) Picture review probability interval generation method and picture review decision method
CN106301979B (en) Method and system for detecting abnormal channel
CN107622406A (en) Identify the method and system of virtual unit
CN109816004A (en) Source of houses picture classification method, device, equipment and storage medium
CN108804501A (en) A kind of method and device of detection effective information
CN111179023B (en) Order identification method and device
CN109104381A (en) A kind of mobile application recognition methods based on third party's flow HTTP message
EP3882825A1 (en) Learning model application system, learning model application method, and program
CN109062945B (en) Information recommendation method, device and system for social network
CN113362095A (en) Information delivery method and device
CN113065126B (en) Personal information compliance method and device based on distributed data sandbox
CN107977413A (en) Feature selection approach, device, computer equipment and the storage medium of user data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 519000 High-tech Zone, Zhuhai City, Guangdong Province, Unit 1, Fourth Floor C, Building A, Headquarters Base No. 1, Qianwan Third Road, Tangjiawan Town

Patentee after: ZHUHAI DUOWAN INFORMATION TECHNOLOGY LIMITED

Address before: 510000 Nancun Town Wanbo Business Center, Panyu District, Guangzhou City, Guangdong Province, 29 floors of B-1 Building, Wanda Business Plaza North District

Patentee before: ZHUHAI DUOWAN INFORMATION TECHNOLOGY LIMITED