CN109145162A - Method, device, and computer-readable storage medium for determining data similarity


Info

Publication number
CN109145162A
Authority
CN
China
Prior art keywords
data
similarity
user
user behavior
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810957255.5A
Other languages
Chinese (zh)
Other versions
CN109145162B (en)
Inventor
黄铃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hui'an Jinke (beijing) Technology Co Ltd
Original Assignee
Hui'an Jinke (beijing) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hui'an Jinke (Beijing) Technology Co Ltd
Priority to CN201810957255.5A
Publication of CN109145162A
Application granted
Publication of CN109145162B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Abstract

Embodiments of the present disclosure provide a method, a device, and a computer-readable storage medium for determining data similarity. The method comprises: determining a feature vector for each of a plurality of first data; and determining the similarity between the plurality of first data based on the feature vectors. The device comprises: a processor; and a memory configured to store instructions that, when executed by the processor, cause the processor to: determine a feature vector for each of a plurality of first data; and determine the similarity between the plurality of first data based on the feature vectors.

Description

Method, device, and computer-readable storage medium for determining data similarity
Technical field
The present disclosure relates to the field of data processing, and more particularly to a method and device for determining data similarity.
Background
With its growing popularity, the Internet has become an indispensable part of people's work and daily life. According to the latest statistics, with the world population estimated at more than 7.5 billion, the worldwide average proportion of Internet users is approaching 50%; the proportion in China exceeds 50%, and in developed countries it exceeds 80%. With so many Internet users, the study of user behavior has become an important topic.
User behavior research has recently attracted attention as a research direction in the Internet field. Although the behavior of a single user may be hard to predict, studying the behavior patterns of a large number of users can, for example, help merchants promote goods more effectively, help social networking sites match users better, or help prevent and detect malicious users.
Summary
However, current user behavior analysis still relies mainly on manual intervention. For example, offending users of a social networking site or software (for example, users who post prohibited information) are usually identified through reports from other users followed by review by trained staff of the site or software. In addition, zombie users, for example those registered in bulk by machines, can currently be distinguished or blocked only by simple means (for example, by detecting many registered users that share the same IP (Internet Protocol) address, or by means such as verification codes). Such simple means are of little real use against zombie users that employ proxies, jump hosts, and the like, and final manual confirmation is generally still required.
Because the manual intervention described above is difficult to deploy at scale, an automated, multi-dimensional user behavior analysis scheme is needed, one that can help, for example, website or software operators classify massive numbers of users and simplify subsequent processing.
To at least partially solve or mitigate the above problems, a method and device for determining data similarity according to embodiments of the present disclosure are provided. With this method and device at its core, an automated, multi-dimensional user behavior analysis scheme applicable to many fields can be built.
According to a first aspect of the present disclosure, a method for determining data similarity is provided. The method comprises: determining a feature vector for each of a plurality of first data; and determining the similarity between the plurality of first data based on the feature vectors.
In some embodiments, the plurality of first data are user behavior data relating to user behavior. In some embodiments, the user behavior data include at least one of: registration information of a user, operation information of the user, and social information of the user. In some embodiments, determining the feature vector of each of the plurality of first data comprises: for each first datum, computing its k-grams using a k-gram algorithm; applying the djb2 hash function to the computed k-grams and taking the resulting hash values as the corresponding features; and forming the feature vector of that first datum from the resulting features. In some embodiments, the parameter k used in the k-gram algorithm is 5. In some embodiments, after the similarity between the plurality of first data has been determined based on the feature vectors, the method further comprises: determining a feature vector for each of a plurality of second data; and determining, based on the feature vectors, the similarity between the plurality of first data and the plurality of second data and the similarity among the plurality of second data. In some embodiments, the method further comprises: classifying the plurality of first data using a clustering method according to the similarity between them. In some embodiments, the clustering method is a hierarchical clustering method.
According to a second aspect of the present disclosure, a device for determining data similarity is provided. The device comprises: a processor; and a memory configured to store instructions that, when executed by the processor, cause the processor to: determine a feature vector for each of a plurality of first data; and determine the similarity between the plurality of first data based on the feature vectors. In some embodiments, the plurality of first data are user behavior data relating to user behavior. In some embodiments, the user behavior data include at least one of: registration information of a user, operation information of the user, and social information of the user. In some embodiments, the instructions, when executed by the processor, further cause the processor to: for each first datum, compute its k-grams using a k-gram algorithm; apply the djb2 hash function to the computed k-grams and take the resulting hash values as the corresponding features; and form the feature vector of that first datum from the resulting features. In some embodiments, the parameter k used in the k-gram algorithm is 5. In some embodiments, the instructions, when executed by the processor, further cause the processor to: determine a feature vector for each of a plurality of second data; and determine, based on the feature vectors, the similarity between the plurality of first data and the plurality of second data and the similarity among the plurality of second data. In some embodiments, the instructions, when executed by the processor, further cause the processor to: classify the plurality of first data using a clustering method according to the similarity between them. In some embodiments, the clustering method is a hierarchical clustering method.
According to a third aspect of the present disclosure, a computer-readable storage medium storing instructions is provided, the instructions, when executed by a processor, causing the processor to perform the method according to the first aspect of the present disclosure.
Brief description of the drawings
The above and other objects, features, and advantages of the present disclosure will become clearer from the following description of preferred embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart showing an exemplary method for data processing according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram showing an incremental data update according to an embodiment of the present disclosure.
Fig. 3 is a hardware layout diagram showing an exemplary electronic device for data processing according to an embodiment of the present disclosure.
Detailed description
Some embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. Details and functions that are unnecessary for the present disclosure are omitted so as not to obscure its understanding. In this specification, the various embodiments used to describe the principles of the present disclosure are illustrative only and should not be construed as limiting the scope of the disclosure in any way. The following description with reference to the drawings is intended to aid a comprehensive understanding of exemplary embodiments of the present disclosure as defined by the claims and their equivalents. It includes various specific details to aid understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and structures are omitted for clarity and conciseness. Throughout the drawings, the same reference numerals are used for the same or similar functions, devices, and/or operations. Furthermore, the parts in the drawings are not necessarily drawn to scale; in other words, the relative sizes, lengths, and so on of the parts in the drawings do not necessarily correspond to actual proportions.
In the present disclosure, the terms "include" and "comprise" and their derivatives mean inclusion without limitation; the term "or" is inclusive, meaning and/or. In addition, directional terms used in the following description, such as "up", "down", "left", and "right", indicate relative positional relationships to help those skilled in the art understand the embodiments of the present disclosure. Those skilled in the art should therefore understand that "up"/"down" in one direction may become "down"/"up" in the opposite direction, and may become another positional relationship, such as "left"/"right", in yet another direction.
In addition, the present disclosure is not limited to any specific communication protocol of the devices involved, which include but are not limited to 2G, 3G, 4G, and 5G networks and WCDMA, CDMA2000, and TD-SCDMA systems; different devices may use the same communication protocol or different ones. Nor is the present disclosure limited to any specific operating system of a device, which may include (but is not limited to) iOS, Windows Phone, Symbian, Android, Linux, Unix, Windows, and MacOS; different devices may use the same operating system or different ones.
Although the scheme for determining data similarity according to embodiments of the present disclosure is described below mainly in connection with user behavior data, the present disclosure is not limited to this. With appropriate adjustment and modification, embodiments of the present disclosure are also applicable to fields such as code reuse detection, malicious application detection, and piracy detection. In other words, the scheme according to embodiments of the present disclosure can be used in any scenario that requires determining the similarity between data.
As mentioned above, embodiments of the present disclosure provide a scheme for determining data similarity. With this scheme at its core, a user behavior analysis scheme applicable to many fields can be built. Specifically, a social networking site, for example, can determine, for all of its registered users, the similarity of their user behavior data according to their respective behavior and build a user behavior data similarity matrix accordingly. Cluster analysis can then be performed on the matrix to group users with similar behavior, so that several groups of users with similar behavior patterns can be identified. Different measures can then be taken for each group. For example, to a group of users with common hobbies or interests, the site can recommend related goods or discussion groups, or introduce the users to one another; for zombie users registered by bots, the site may consider sending them legal notices warning that their accounts risk deletion, or restricting their usage rights.
In some embodiments, user behavior data may include (but are not limited to): the user's registration information (for example, user name, nickname, avatar, signature, address, telephone number, email), the user's operation information (for example, login time, location, IP address, frequency, name and version of the software used, spending), and social data (for example, forum post information, friend information).
The above scheme for determining data similarity can quickly compare user behavior data and measure their similarity, and it scales to massive amounts of user behavior data (for example, millions or tens of millions of users). Given the unpredictability of user behavior over time and its potential inconsistency (for example, a particular event causing a user's behavior pattern to change), the scheme can also tolerate a certain degree of difference or variation in user behavior. This requires the scheme to compare user behavior data in a meaningful and accurate way, so as to find small differences in user behavior data, like a needle in a haystack, while keeping the false positive and false negative rates low.
In general, the scheme for determining data similarity according to embodiments of the present disclosure applies, for example, a k-gram algorithm (described in detail below) to the user behavior data and performs feature hashing, so as to handle large-scale user behavior data effectively. The k-grams of user behavior data have been shown to tolerate, to some degree, the unpredictability and potential inconsistency of user behavior, and they can be extracted easily from user behavior data. Feature hashing has likewise been shown to work well for reducing data dimensionality and for classification. Combining the two is therefore well suited to the fields mentioned above.
The k-gram algorithm is introduced first. A k-gram is a sequence of k consecutive items extracted from, for example, a given text or speech sample (or, more generally, from data). Depending on the application, an item may be, for example, a phoneme, syllable, letter, word, or base pair. For example, the 2-grams of the sentence "to be or not to be" are: "to be", "be or", "or not", "not to", "to be". If this sentence is a user's signature, the k-gram algorithm can be applied to it as part of the user behavior data.
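The 2-gram example above can be reproduced with a minimal sketch. It assumes that the items are whitespace-separated words; the function name `word_kgrams` is illustrative and does not come from the disclosure.

```python
def word_kgrams(text: str, k: int) -> list[str]:
    """Return every run of k consecutive words in text, in order of appearance."""
    words = text.split()
    # A window of k words can start at positions 0 .. len(words) - k.
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


if __name__ == "__main__":
    print(word_kgrams("to be or not to be", 2))
```

A text with fewer than k words yields an empty list, because no full window fits.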
In addition, user behavior data can contain several different parts or fields, for example the user's user name, nickname, signature, and avatar, and in actual processing these may be concatenated in order into a single data segment. In some embodiments, therefore, the k-grams of the data segment can exclude sequences that span different fields. For example, if the user's name is "William Shakespear" and the signature is "to be or not to be", the 2-grams of the resulting data segment "William Shakespear to be or not to be" are: "William Shakespear", "to be", "be or", "or not", "not to", "to be", and do not include "Shakespear to". This is because, compared with sequences inside a field, such cross-field sequences usually carry little practical meaning when determining similarity. This is not mandatory, however. In other embodiments, such cross-field sequences can also be taken into account, especially when the fields being spanned are highly correlated.
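One way to keep k-grams from crossing field boundaries is to extract them field by field rather than from the concatenated segment. The sketch below assumes word-level items and a plain dictionary of fields; the names `field_kgrams` and `record` are illustrative only.

```python
def word_kgrams(text: str, k: int) -> list[str]:
    """Return every run of k consecutive words in text."""
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def field_kgrams(fields: dict[str, str], k: int) -> list[str]:
    """Collect k-grams inside each field; no window ever spans two fields."""
    grams: list[str] = []
    for value in fields.values():
        grams.extend(word_kgrams(value, k))
    return grams


if __name__ == "__main__":
    record = {"username": "William Shakespear", "signature": "to be or not to be"}
    print(field_kgrams(record, 2))
```

Because each field is windowed separately, "Shakespear to" never appears in the result.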
Feature hashing is a powerful tool for reducing the dimensionality of the data being analyzed. Using a single hash function, an originally large data space can be compressed into a smaller, randomized feature space in which pairwise comparison of features is very convenient and efficient. Note that this efficiency comes at the cost of possible collisions during hashing. However, research in machine learning shows that even with collisions, pairwise similarity is still preserved with high accuracy, so that subsequent algorithms operating on this feature space (for example, hierarchical clustering) will also be close to their exact solutions.
The resulting representation of the user behavior data can be encoded as a compact bit vector that indicates which features are present in the data.
In addition, a similarity can be defined over the features determined in this way. For example, in some embodiments, the Jaccard similarity can be used, defined as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A and B are the k-gram feature sets of two user behavior data. Because each k-gram feature set is hashed into a Boolean vector in which each entry indicates whether a feature is present, the similarity can be approximated more efficiently with bitwise operations as:
$$J(\vec{a}, \vec{b}) = \frac{S(\vec{a} \wedge \vec{b})}{S(\vec{a} \vee \vec{b})}$$
where $\vec{a}$ and $\vec{b}$ are the bit-vector representations of the k-gram feature sets A and B, $\wedge$ and $\vee$ are the bitwise AND and OR operations, and $S(\cdot)$ denotes the number of bits set to 1. Note that if the bit vectors are large enough, $J(\vec{a}, \vec{b})$ is very close to the similarity J(A, B) between the k-gram feature sets of the two user behavior data.
Correspondingly, the Jaccard distance D(A, B), which measures the dissimilarity between two feature vectors, can be defined as 1 minus the Jaccard similarity, that is:
$$D(A, B) = 1 - J(A, B)$$
Both the Jaccard distance and the Jaccard similarity take values in the interval [0, 1].
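The set form and the bitwise form of the Jaccard similarity can be compared in a short sketch. It stores a bit vector in a Python integer, and it assumes a similarity of 1.0 when both inputs are empty; that convention and all function names are illustrative choices, not part of the disclosure.

```python
def jaccard_sets(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B| of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0  # assumed convention for empty sets


def popcount(x: int) -> int:
    """Number of bits set to 1 in a non-negative integer."""
    return bin(x).count("1")


def jaccard_bits(a: int, b: int) -> float:
    """Jaccard similarity of two bit vectors using bitwise AND and OR."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0  # assumed convention for empty vectors


def jaccard_distance(a: int, b: int) -> float:
    """Jaccard distance, defined as 1 minus the Jaccard similarity."""
    return 1.0 - jaccard_bits(a, b)


if __name__ == "__main__":
    print(jaccard_sets({1, 2, 3}, {2, 3, 4}))
    print(jaccard_bits(0b0111, 0b1110))
    print(jaccard_distance(0b0111, 0b1110))
```

When no two features collide in the bit vector, the bitwise result equals the set result, which is the situation the text describes for sufficiently large bit vectors.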
Next, the scheme for determining data similarity according to embodiments of the present disclosure will be described in detail with reference to Fig. 1.
Fig. 1 is a flowchart showing an exemplary method 100 for determining data similarity according to an embodiment of the present disclosure. In optional step S110, data preprocessing can first be performed on the user behavior data. For example, in the embodiment shown in Fig. 1, the raw data fields contained in the user behavior data can first be consolidated into an XML (Extensible Markup Language) file. For example, for user behavior data that include a user name, nickname, signature, avatar, and/or posts, the fields can be concatenated one after another and separated by XML tags such as <username></username> and <signature></signature>. During preprocessing, fields such as common attributes that all users are likely to share can also be excluded. For example, some social networking sites may require a user to publish a self-introduction post after first registering, and such posts are usually highly similar and without substantive differences in content. In that case, excluding the post can be considered.
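The preprocessing step can be sketched with Python's standard `xml.etree.ElementTree` module. The root element name `user`, the field names, and the `exclude` parameter are assumptions made for illustration; the disclosure only states that fields are separated by XML tags and that shared fields may be excluded.

```python
import xml.etree.ElementTree as ET


def to_xml(record: dict[str, str], exclude: frozenset[str] = frozenset()) -> str:
    """Wrap each field of a user record in its own XML element.

    Fields listed in `exclude` (for example, a mandatory self-introduction
    post that is nearly identical for every user) are left out.
    """
    root = ET.Element("user")  # hypothetical root element name
    for name, value in record.items():
        if name in exclude:
            continue
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    record = {
        "username": "William Shakespear",
        "signature": "to be or not to be",
        "intro_post": "Hello everyone, I am new here.",
    }
    print(to_xml(record, exclude=frozenset({"intro_post"})))
```

Keeping each field in its own element preserves the field boundaries that later stages can use to skip cross-field k-grams.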
Next, in step S120, k-gram processing and feature hashing can be applied to the preprocessed XML data to determine the feature vectors of the user behavior data. In some embodiments, the djb2 hash function can be used for feature hashing, although the present disclosure is not limited to it. The djb2 hash function, proposed by Dan Bernstein, is a very simple hash function defined recursively as follows: hash(i) = hash(i − 1) × 33 + c(i), with hash(0) = 5381, where hash(i) is the hash value after the i-th input datum and c(i) is the value of the i-th input datum. This hash function has a very good random distribution and can spread the original data set roughly uniformly over the feature space.
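The djb2 recurrence above translates directly into a loop. The sketch below assumes byte-valued inputs and truncates the result to 32 bits, as is common in C implementations; the truncation width is an assumption, since the text does not specify it.

```python
def djb2(data: bytes) -> int:
    """djb2 hash: hash(0) = 5381, hash(i) = hash(i - 1) * 33 + c(i)."""
    h = 5381
    for c in data:
        # Keep only the low 32 bits (assumed width, mimicking a 32-bit unsigned integer).
        h = (h * 33 + c) & 0xFFFFFFFF
    return h


if __name__ == "__main__":
    print(djb2(b"to be"))
```

For the empty input the function returns the initial value 5381, and each additional byte multiplies the running value by 33 before adding the byte.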
More specifically, k-grams can be extracted from the XML data using a moving window of appropriate size k, and these k-grams can then be hashed with djb2. As mentioned above, k-grams of sequences that span fields can be ignored during hashing. Then, for each hash value, the corresponding bit can be set in the bit vector associated with the user behavior data, to indicate the presence of the corresponding k-gram.
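Putting the pieces together, a feature bit vector can be built roughly as follows. This sketch assumes character-level k-grams, UTF-8 encoding, a hash value reduced modulo m to pick the bit position, and a Python integer as the bit vector; these choices and the function names are illustrative assumptions, while k = 5 and m = 240007 are the example values given in the text.

```python
K = 5        # k-gram length (example value from the text)
M = 240007   # bit-vector size (example value from the text)


def djb2(data: bytes) -> int:
    """djb2 hash truncated to 32 bits (truncation width is an assumption)."""
    h = 5381
    for c in data:
        h = (h * 33 + c) & 0xFFFFFFFF
    return h


def char_kgrams(text: str, k: int) -> set[str]:
    """All distinct windows of k consecutive characters in text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def feature_bits(fields: list[str], k: int = K, m: int = M) -> int:
    """Set one bit per hashed k-gram; windows never span two fields."""
    bits = 0
    for field in fields:
        for gram in char_kgrams(field, k):
            bits |= 1 << (djb2(gram.encode("utf-8")) % m)
    return bits


if __name__ == "__main__":
    vec = feature_bits(["William Shakespear", "to be or not to be"])
    print(bin(vec).count("1"), "bits set out of", M)
```

Fields shorter than k contribute no features, which is one reason a very large k leaves little to compare.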
To perform feature hashing, two parameters must be determined: the k-gram length k and the bit-vector size m. These two values can be determined from actual experiments on, for example, a small number of samples, or by other heuristic or deterministic methods. In some embodiments of the present disclosure, the configuration k = 5 and m = 240007 can be used. Note, however, that this is only a specific configuration used to illustrate the embodiments; the present disclosure is not limited to it, and any other suitable configuration can be used. For example, k can be 3, 5, 7, 9, and so on, and m can be determined accordingly by the method described below.
k is the parameter that determines the dimensionality of the underlying feature space representing the user behavior data, and it defines the number of features that can be extracted from each user behavior datum. k is therefore a key parameter in determining similarity. If k is too small (for example, k = 1), there will be only a small number of distinct features across all user behavior data, resulting in an oversimplified, low-dimensional representation. Under such a representation, "overmatching" may occur, that is, many user behavior data will be wrongly classified as similar.
On the other hand, if k is too large (for example, larger than most user behavior data fields), the representation of the user behavior data becomes very high-dimensional. In that case, because only a few features can be extracted from each user behavior datum, a large k limits the ability to make meaningful and robust comparisons between user behavior data.
In summary, a reasonable k should be a small value beyond which further increases bring only a negligible improvement in the quality of similarity comparison. For example, Table 1 below shows several values of k and the corresponding average Jaccard distance.
Table 1
k    Average distance
3    0.939
5    0.969
7    0.980
9    0.984
It can be seen that once k > 5, the benefit gained from increasing k begins to diminish.
Next, how to choose the bit-vector size m is discussed. The size m strikes a balance between the computational complexity of the similarity calculation and the approximation error of the bit-vector representation of the k-gram features. Ideally, m should be large enough that few collisions occur when k-grams are hashed into the bit vector. In practice, however, m needs to be small enough that the pairwise similarities between millions of user behavior data can be computed efficiently. In other words, the larger m is, the more accurately the bit vector represents the user behavior data, and the smaller m is, the lower the time cost of computing the similarities between all user behavior data; the two must be weighed against each other.
As mentioned above, if m >> N (where N is the number of k-grams extracted from the user behavior data), the Jaccard similarity between two bit vectors closely approximates that between the two k-gram feature sets. That is, in practice, as long as m is large enough, the Jaccard similarity is close to the exact value. Here, "large enough" means that the resulting Jaccard similarity is sufficiently accurate while the pairwise similarities between all user behavior data can still be computed efficiently.
As described above, by choosing appropriate k and m, a suitable bit-vector representation of the features of the user behavior data can be obtained.
Next, in step S130, the similarity between the user behavior data can be determined from their feature bit-vector representations. The similarity can be determined using the Jaccard similarity described above. Other suitable similarity measures can also be used; the present disclosure places no restriction on this. In addition, once the similarities between the user behavior data have been determined, a corresponding similarity matrix can be built for use in any subsequent analysis.
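Building the similarity matrix from a list of bit vectors amounts to one Jaccard computation per unordered pair. The sketch below uses Python integers as bit vectors and a nested list as the matrix; these representations and the names are illustrative, and the quadratic loop is only meant to show the structure, not a tuned implementation.

```python
def popcount(x: int) -> int:
    """Number of bits set to 1."""
    return bin(x).count("1")


def jaccard_bits(a: int, b: int) -> float:
    """Bitwise Jaccard similarity; two empty vectors are treated as identical (assumed)."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def similarity_matrix(vectors: list[int]) -> list[list[float]]:
    """Symmetric n x n matrix of pairwise Jaccard similarities."""
    n = len(vectors)
    sim = [[1.0] * n for _ in range(n)]  # every vector is fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            # Compute each unordered pair once and mirror it.
            sim[i][j] = sim[j][i] = jaccard_bits(vectors[i], vectors[j])
    return sim


if __name__ == "__main__":
    for row in similarity_matrix([0b0111, 0b1110, 0b0001]):
        print(row)
```

Only the upper triangle is computed, because the Jaccard similarity is symmetric.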
Next, in optional step S140, feature analysis can be performed on the similarities determined between the user behavior data. This feature analysis may include (but is not limited to) one or more of the following: cluster analysis, containment comparison, and so on.
Cluster analysis: to discover the patterns inherent in user behavior data, an agglomerative hierarchical clustering algorithm can be run on the similarities between the user behavior data. The basic idea is that the set of feature bit vectors, together with a well-defined distance metric (for example, the Jaccard distance), describes the user behavior data in a high-dimensional space. Using this distance metric, bit vectors that are close to each other can be grouped into one class, so that similar user behavior data end up in the same class.
The hierarchical clustering algorithm does not need the number of classes to be specified in advance. Its input is a threshold t (for example, 90%) and a list (or matrix) of the Jaccard similarity values between each pair of user behavior data; after clustering, all user behavior data in each class have a similarity greater than or equal to the threshold t. The threshold t is given by the desired balance between the number of user behavior data in a cluster and the "tightness" of the user behavior data within each class. In other words, the smaller t is, the fewer classes the data are divided into and the more data each class contains; the larger t is, the more classes the data are divided into and the fewer data each class contains.
Hierarchical clustering starts with each user behavior datum in a class of its own. The closest pair is then selected and merged into a common class. This compare-and-merge process continues until no pair remains whose similarity exceeds the input threshold t. For classes containing several user behavior data, single linkage can be used to define the similarity between them. That is, the similarity between class S_a and class S_b is the maximum similarity over all possible pairs, i.e. J(S_a, S_b) = max{ J(A, B) | A ∈ S_a, B ∈ S_b }.
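The merge loop described above can be sketched as a naive single-linkage procedure over a precomputed similarity matrix. The sketch merges while the best pair has similarity greater than or equal to t, which is one reading of the stopping rule; the function name and the brute-force search are illustrative and not tuned for millions of items.

```python
def single_linkage(sim: list[list[float]], t: float) -> list[set[int]]:
    """Agglomerative clustering with single linkage and a similarity threshold t.

    sim is a symmetric matrix of pairwise similarities; the result is a list of
    clusters, each a set of row indices.
    """
    clusters = [{i} for i in range(len(sim))]  # every item starts in its own class
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage: class similarity is the maximum over all member pairs.
                s = max(sim[a][b] for a in clusters[x] for b in clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < t:  # no remaining pair reaches the threshold
            break
        x, y = pair
        clusters[x] |= clusters[y]
        del clusters[y]
    return clusters


if __name__ == "__main__":
    sim = [
        [1.00, 0.95, 0.50, 0.10],
        [0.95, 1.00, 0.92, 0.10],
        [0.50, 0.92, 1.00, 0.10],
        [0.10, 0.10, 0.10, 1.00],
    ]
    print(single_linkage(sim, 0.90))
```

With single linkage, items 0 and 2 end up in the same class through item 1 even though their direct similarity is only 0.50, which illustrates the chaining behavior of this linkage choice.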
In this way, user behavior data can be classified so that, for example, users with similar behavior are placed in one group.
Containment comparison: containment analysis is an effective tool for investigating, for example, whether the same user has registered multiple accounts. Suppose that user behavior data A is known to be the minimal behavior data of some user, which may include the information required when a user registers, such as an ID card number, telephone number, and/or email address. The question is then whether other user behavior data (for example, B) are the user behavior data of another account registered by the user of A. To answer it, the number of features common to the feature bit vectors of the two user behavior data can be divided by the total number of features of A to give a percentage. Specifically, this can be defined as:
$$C(A, B) = \frac{S(\vec{a} \wedge \vec{b})}{S(\vec{a})}$$
where $\vec{a}$ and $\vec{b}$ are the feature bit vectors of A and B, $\wedge$ is the bitwise AND operation, and $S(\cdot)$ denotes the number of bits set to 1.
If B contains all the features of A, this value is 100%. If B contains most of the features of A, for example if more than 70% (or any other suitable percentage) of the features of A appear in B, user B can be judged very likely (or certain) to be another account of user A. Not only the deterministic data mentioned above (for example, ID card number, email address) can be used for containment analysis; the user's operation behavior, social behavior, and so on can be used as well. For example, when the feature bit vector of a user's behavior data contains most of the features of a known malicious user, that user can be considered malicious. In that case, the user can even be identified before carrying out any actual malicious operation, and preventive measures can be taken, for example reducing the user's operating privileges, temporarily suspending the account, or issuing a warning.
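The containment percentage can be computed with one bitwise AND and two bit counts. In this sketch a bit vector is a Python integer, the function names are illustrative, and an empty A is assumed to give 0.0; the 70% cut-off in the usage lines is the example value from the text.

```python
def popcount(x: int) -> int:
    """Number of bits set to 1."""
    return bin(x).count("1")


def containment(a: int, b: int) -> float:
    """Fraction of A's features (set bits) that also appear in B."""
    total = popcount(a)
    return popcount(a & b) / total if total else 0.0  # assumed value when A has no features


if __name__ == "__main__":
    known = 0b0000_1111      # features of the known account A
    candidate = 0b1010_0111  # features of another account B
    score = containment(known, candidate)
    print(score, score > 0.70)
```

Unlike the Jaccard similarity, containment is not symmetric: extra features in B do not lower the score, which suits the case where A is a minimal record.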
Since the above method flow can be fully automated once parameters such as k, m, and t have been determined, and these parameters can be preset before execution, the above method flow can considerably reduce manual workload. In addition, because the user behavior data may include a large amount of information of different categories, rather than only a single piece of information such as the user's registration IP, the similarity between users can be investigated in every dimension, making user classification, identification, and the like more accurate and effective.
In addition, the stateless nature of the above scheme in many of its stages makes incremental updates easy to implement. In other words, the overall data can be updated by performing only the processing related to the newly added user behavior data, without reprocessing the existing user behavior data. Specifically, the preceding preprocessing step S110 and feature vector determination step S120 inherently support incremental updates, because in those steps the processing of each user behavior data/tag bit vector does not involve other user behavior data/tag bit vectors or any other data to be processed.
Fig. 2 is a schematic diagram showing the updating of the similarity matrix of the features of existing user behavior data according to an embodiment of the present disclosure. As shown in Fig. 2, similarity matrix A 210 denotes the similarity matrix between the n tag bit vectors of the existing user behavior data, and therefore has size n*n. In addition, matrices C 220 and C^T 230 respectively represent the similarity matrix between the m tag bit vectors of the newly added user behavior data and the existing n tag bit vectors, and its transpose, while matrix B 240 represents the similarity matrix between the m tag bit vectors of the newly added user behavior data. As can be seen from Fig. 2, when m tag bit vectors are added, that is, when m pieces of user behavior data are newly added, the previously processed matrix A does not need any further processing; only matrices B, C, and C^T need to be processed.
Therefore, the incremental update can be carried out, for example, as follows: (1) calculate the similarity matrix B 240 between the m newly added tag bit vectors; (2) calculate the similarity matrix C 220 between the m newly added tag bit vectors and the existing n tag bit vectors (its transpose C^T 230 can be determined at the same time); and (3) combine the foregoing results with the original similarity matrix A 210 (for example, by inserting the corresponding row/column data into the original similarity matrix A 210) to obtain the updated (n+m)*(n+m) similarity matrix.
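The three-step incremental update can be sketched as follows. This is only an illustration under stated assumptions: Jaccard similarity is used as a stand-in for whatever similarity measure the embodiment applies, matrices are plain nested lists, C is laid out with m rows and n columns, and the names `jaccard` and `extend_similarity` are hypothetical.

```python
def jaccard(u, v):
    """Jaccard similarity of two equal-length bit vectors (assumed measure)."""
    union = sum(1 for a, b in zip(u, v) if a or b)
    if union == 0:
        return 1.0  # two empty vectors are treated as identical
    inter = sum(1 for a, b in zip(u, v) if a and b)
    return inter / union


def extend_similarity(sim_a, old_vecs, new_vecs, sim=jaccard):
    """Grow an n*n similarity matrix to (n+m)*(n+m) without recomputing A."""
    n, m = len(old_vecs), len(new_vecs)
    # (1) B: similarities among the m newly added vectors (m x m)
    b = [[sim(x, y) for y in new_vecs] for x in new_vecs]
    # (2) C: similarities between new and existing vectors (m x n)
    c = [[sim(x, y) for y in old_vecs] for x in new_vecs]
    # (3) assemble the block matrix [[A, C^T], [C, B]]
    top = [list(sim_a[i]) + [c[j][i] for j in range(m)] for i in range(n)]
    bottom = [c[j] + b[j] for j in range(m)]
    return top + bottom


old = [[1, 1, 0, 0], [1, 0, 1, 0]]  # n = 2 existing tag bit vectors
new = [[1, 1, 1, 0]]                # m = 1 newly added tag bit vector
A = [[jaccard(x, y) for y in old] for x in old]
updated = extend_similarity(A, old, new)
full = [[jaccard(x, y) for y in old + new] for x in old + new]
print(updated == full)  # True
```

Only m*m + m*n similarities are computed per update, compared with (n+m)*(n+m) for a full recomputation, which is the saving the incremental scheme relies on.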
In addition, for subsequent feature analysis such as hierarchical clustering, new clustering results can likewise be obtained incrementally by processing the new similarities in similarity matrices B and C.
In addition, as mentioned above, the processing in step S110 and step S120 for each user behavior data/tag bit vector does not involve other user behavior data/tag bit vectors or any other data to be processed, so these steps can easily be implemented in a distributed/parallel manner. For example, Google's MapReduce framework can be used to convert user behavior data to XML format in parallel, hash the XML-format data into the feature space in parallel, and generate the corresponding tag bit vectors. This parallelizable/distributable property also makes the scheme for determining data similarity according to the embodiments of the present disclosure better suited to large-scale data processing.
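Because each record is processed independently, the per-record feature extraction maps naturally onto a parallel "map" phase. The sketch below is a simplified stand-in rather than a MapReduce job: it uses a thread pool from the Python standard library, skips the XML conversion, and assumes a bit vector size of 1024 and a 32-bit djb2 variant. The k-gram length of 5 and the djb2 hash follow the claims; all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def djb2(text):
    """Classic djb2 string hash (hash * 33 + byte), truncated to 32 bits."""
    h = 5381
    for byte in text.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def to_bit_vector(record, k=5, size=1024):
    """Hash every k-gram of one record into a fixed-size bit vector."""
    bits = [0] * size
    for i in range(len(record) - k + 1):
        bits[djb2(record[i:i + k]) % size] = 1
    return bits


def vectorize_all(records, workers=4):
    """Process records independently; each call touches only its own record."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(to_bit_vector, records))


records = ["ip=10.0.0.1;action=login", "ip=10.0.0.2;action=login"]
vectors = vectorize_all(records)
print(len(vectors), len(vectors[0]))  # 2 1024
```

Since `to_bit_vector` keeps no shared state, the same function could be handed to a process pool or a distributed map step without modification; the thread pool here only demonstrates the structure.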
Fig. 3 is a diagram showing an exemplary hardware arrangement of a device 300 for determining data similarity according to an embodiment of the present disclosure. As shown in Fig. 3, electronic device 300 may include: a processor 310, a memory 320, an input/output module 330, a communication module 340, and other modules 350. It should be noted that the embodiment shown in Fig. 3 is only intended to illustrate the present disclosure and therefore does not limit it in any way. In fact, electronic device 300 may include more, fewer, or different modules, and may be a single device or a distributed device spread over multiple locations. For example, electronic device 300 may include (but is not limited to): a personal computer (PC), a server, a server cluster, a computing cloud, a workstation, a terminal, a tablet computer, a laptop computer, a smartphone, a media player, a wearable device, and/or a household appliance (such as a TV, set-top box, or DVD player), etc.
Processor 310 may be the component responsible for the overall operation of electronic device 300. It may be communicatively connected with other modules/components to receive data to be processed and/or instructions from them and to send processed data and/or instructions to them. Processor 310 may be, for example, a general-purpose processor, such as a central processing unit (CPU), a digital signal processor (DSP), or an application processor (AP). In this case, it may execute, under the direction of instructions/programs/code stored in memory 320, one or more of the steps of the method for determining data similarity according to the embodiments of the present disclosure described above. In addition, processor 310 may also be, for example, a special-purpose processor, such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA). In this case, it may execute, according to its circuit design, one or more of the steps of the method for determining data similarity according to the embodiments of the present disclosure described above. Processor 310 may also be any combination of hardware, software, and/or firmware. Moreover, although only one processor 310 is shown in Fig. 3, processor 310 may in practice include multiple processing units distributed over multiple locations.
Memory 320 may be configured to temporarily or persistently store computer-executable instructions which, when executed by processor 310, cause processor 310 to perform one or more of the steps of the methods described in the present disclosure. In addition, memory 320 may also be configured to temporarily or persistently store data related to these steps, such as user behavior data to be processed, user feature vectors, and similarity data. Memory 320 may include volatile memory and/or non-volatile memory. Volatile memory may include, for example (but not limited to): dynamic random access memory (DRAM), static RAM (SRAM), synchronous DRAM (SDRAM), cache, etc. Non-volatile memory may include, for example (but not limited to): one-time programmable read-only memory (OTPROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), mask ROM, flash ROM, flash memory (for example, NAND flash, NOR flash, etc.), hard disk drive or solid state drive (SSD), CompactFlash (CF), Secure Digital (SD), micro SD, mini SD, extreme digital (xD), MultiMediaCard (MMC), memory stick, etc. In addition, memory 320 may also be a remote storage device, such as network-attached storage (NAS). Memory 320 may also include distributed storage devices spread over multiple locations, such as cloud storage.
Input/output module 330 may be configured to receive input from outside and/or provide output to outside. Although input/output module 330 is shown as a single module in the embodiment of Fig. 3, it may in practice be a module dedicated to input, a module dedicated to output, or a combination thereof. For example, input/output module 330 may include (but is not limited to): a keyboard, mouse, microphone, camera, display, touch-screen display, printer, loudspeaker, earphone, or any other device that can be used for input/output. In addition, input/output module 330 may also be an interface configured to connect to the above devices, such as an earphone interface, microphone interface, keyboard interface, or mouse interface. In this case, electronic device 300 may connect to an external input/output device through the interface and thereby realize the input/output function.
Communication module 340 may be configured to enable electronic device 300 to communicate with other electronic devices and exchange various data. Communication module 340 may be, for example: an Ethernet interface card, a USB module, a serial line network card, an optical fiber interface card, a telephone-line modem, an xDSL modem, a Wi-Fi module, a Bluetooth module, a 2G/3G/4G/5G communication module, etc. In the sense of data input/output, communication module 340 may also be regarded as a part of input/output module 330.
In addition, electronic device 300 may also include other modules 350, including but not limited to: a power module, a GPS module, sensor modules (for example, a proximity sensor, illuminance sensor, acceleration sensor, fingerprint sensor, etc.), and so on.
It should be noted, however, that the above modules are only some examples of the modules that electronic device 300 may include, and the electronic device according to the embodiments of the present disclosure is not limited thereto. In other words, electronic devices according to other embodiments of the present disclosure may include more modules, fewer modules, or different modules.
The present disclosure has thus far been described in connection with preferred embodiments. It should be understood that those skilled in the art can make various other changes, substitutions, and additions without departing from the spirit and scope of the present disclosure. Therefore, the scope of the present disclosure is not limited to the specific embodiments described above, but should be defined by the appended claims.
In addition, functions described herein as being realized by pure hardware, pure software, and/or firmware can also be realized by a combination of special-purpose hardware, general-purpose hardware, and software. For example, functions described as being realized by special-purpose hardware (for example, a field-programmable gate array (FPGA) or application-specific integrated circuit (ASIC)) can be realized by a combination of general-purpose hardware (for example, a central processing unit (CPU) or digital signal processor (DSP)) and software, and vice versa.

Claims (17)

1. A method for determining data similarity, comprising:
determining a respective feature vector for each of a plurality of first data; and
determining similarities between the plurality of first data based on the feature vectors.
2. The method according to claim 1, wherein the plurality of first data are user behavior data relating to user behavior.
3. The method according to claim 2, wherein the user behavior data includes at least one of the following: registration information of a user, operation information of the user, and social information of the user.
4. The method according to claim 1, wherein the step of determining a respective feature vector for each of the plurality of first data comprises:
for each first data of the plurality of first data, computing the k-grams of that first data using a k-gram algorithm;
applying the djb2 hash function to the computed k-grams and taking the resulting hash values as corresponding features; and
forming the respective feature vector of each first data from the obtained features.
5. The method according to claim 4, wherein the coefficient k used in the k-gram algorithm is 5.
6. The method according to claim 1, wherein, after determining the similarities between the plurality of first data based on the feature vectors, the method further comprises:
determining a respective feature vector for each of a plurality of second data; and
determining, based on the feature vectors, similarities between the plurality of first data and the plurality of second data and similarities among the plurality of second data.
7. The method according to claim 1, wherein the method further comprises:
classifying the plurality of first data using a clustering method, based on the similarities between the plurality of first data.
8. The method according to claim 7, wherein the clustering method is a hierarchical clustering method.
9. A device for determining data similarity, comprising:
a processor; and
a memory configured to store instructions which, when executed by the processor, cause the processor to:
determine a respective feature vector for each of a plurality of first data; and
determine similarities between the plurality of first data based on the feature vectors.
10. The device according to claim 9, wherein the plurality of first data are user behavior data relating to user behavior.
11. The device according to claim 10, wherein the user behavior data includes at least one of the following: registration information of a user, operation information of the user, and social information of the user.
12. The device according to claim 9, wherein the instructions, when executed by the processor, further cause the processor to:
for each first data of the plurality of first data, compute the k-grams of that first data using a k-gram algorithm;
apply the djb2 hash function to the computed k-grams and take the resulting hash values as corresponding features; and
form the respective feature vector of each first data from the obtained features.
13. The device according to claim 12, wherein the coefficient k used in the k-gram algorithm is 5.
14. The device according to claim 9, wherein the instructions, when executed by the processor, further cause the processor to:
determine a respective feature vector for each of a plurality of second data; and
determine, based on the feature vectors, similarities between the plurality of first data and the plurality of second data and similarities among the plurality of second data.
15. The device according to claim 9, wherein the instructions, when executed by the processor, further cause the processor to:
classify the plurality of first data using a clustering method, based on the similarities between the plurality of first data.
16. The device according to claim 15, wherein the clustering method is a hierarchical clustering method.
17. A computer-readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 8.
CN201810957255.5A 2018-08-21 2018-08-21 Method, apparatus, and computer-readable storage medium for determining data similarity Active CN109145162B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810957255.5A CN109145162B (en) 2018-08-21 2018-08-21 Method, apparatus, and computer-readable storage medium for determining data similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810957255.5A CN109145162B (en) 2018-08-21 2018-08-21 Method, apparatus, and computer-readable storage medium for determining data similarity

Publications (2)

Publication Number Publication Date
CN109145162A (en) 2019-01-04
CN109145162B (en) 2021-06-15

Family

ID=64790960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810957255.5A Active CN109145162B (en) 2018-08-21 2018-08-21 Method, apparatus, and computer-readable storage medium for determining data similarity

Country Status (1)

Country Link
CN (1) CN109145162B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102706A (en) * 2014-07-10 2014-10-15 西安交通大学 Hierarchical clustering-based suspicious taxpayer detection method
CN106796640A (en) * 2014-09-26 2017-05-31 迈克菲股份有限公司 Classification malware detection and suppression
CN106126649A (en) * 2016-06-24 2016-11-16 北京千安哲信息技术有限公司 A kind of similar Chinese crude drug method for digging and device
CN106296422A (en) * 2016-07-29 2017-01-04 重庆邮电大学 A kind of social networks junk user detection method merging many algorithms
CN106557565A (en) * 2016-11-22 2017-04-05 福州大学 A kind of text message extracting method based on website construction
CN107391760A (en) * 2017-08-25 2017-11-24 平安科技(深圳)有限公司 User interest recognition methods, device and computer-readable recording medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523012A (en) * 2019-02-01 2020-08-11 慧安金科(北京)科技有限公司 Method, apparatus, and computer-readable storage medium for detecting abnormal data
CN111523012B (en) * 2019-02-01 2024-01-09 慧安金科(北京)科技有限公司 Method, apparatus and computer readable storage medium for detecting abnormal data
CN110033036A (en) * 2019-04-04 2019-07-19 厦门小圈网络科技有限公司 A kind of social networks classification method based on circle
CN112016927A (en) * 2019-05-31 2020-12-01 慧安金科(北京)科技有限公司 Method, apparatus, and computer-readable storage medium for detecting abnormal data
CN112016934A (en) * 2019-05-31 2020-12-01 慧安金科(北京)科技有限公司 Method, apparatus, and computer-readable storage medium for detecting abnormal data
CN112016927B (en) * 2019-05-31 2023-10-27 慧安金科(北京)科技有限公司 Method, apparatus and computer readable storage medium for detecting abnormal data
CN112016934B (en) * 2019-05-31 2023-12-29 慧安金科(北京)科技有限公司 Method, apparatus and computer readable storage medium for detecting abnormal data
CN113888203A (en) * 2021-09-04 2022-01-04 北京优全智汇信息技术有限公司 Online insurance product sale system and sale method
CN113888203B (en) * 2021-09-04 2022-09-13 北京优全智汇信息技术有限公司 Online insurance product sale system and sale method

Also Published As

Publication number Publication date
CN109145162B (en) 2021-06-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
Denomination of invention: Method, device, and computer-readable storage medium for determining data similarity
Effective date of registration: 20230626
Granted publication date: 20210615
Pledgee: Zhongguancun Branch of Bank of Beijing Co.,Ltd.
Pledgor: HUIANJINKE (BEIJING) TECHNOLOGY Co.,Ltd.
Registration number: Y2023110000253