Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The embodiment of the invention analyzes the similarity of devices by analyzing the collected JS device fingerprint metrics; if a user performs a large number of similar operations on similar devices, the clustering of those devices reveals it. To address the poor stability of JS device fingerprints, the embodiment of the invention provides a method for identifying risk devices by using the similarity of the collected device metrics.
Fig. 1 is a flowchart of an embodiment of a method for determining device fingerprint homology according to the present invention, which can be applied to search massive device fingerprint data to discover device aggregations and thereby identify risk devices.
As shown in fig. 1, the method includes:
step S101: acquiring a first device fingerprint and a second device fingerprint;
For example, JS device fingerprinting can be used to obtain a device fingerprint (a device fingerprint is a device feature or unique device identifier that can be used to uniquely identify the device). JavaScript code is embedded in a front-end page; when a user accesses the page with a browser, the JavaScript code collects various information about the user's device and, once collection is complete, reports the information to a server.
Wherein the first device fingerprint and the second device fingerprint each comprise at least one of: operating system name, operating system version, font list, and plug-in list of the corresponding device.
In an alternative embodiment, the first device fingerprint and the second device fingerprint each include hundreds of device metrics for the respective device, including an operating system name, an operating system version, a font list, a plug-in list, and the like.
Step S102: determining a first similarity hash value according to the first device fingerprint;
illustratively, determining the first similarity hash value from the first device fingerprint comprises:
performing word segmentation processing on the first device fingerprint to obtain a plurality of keywords;
performing hash calculation on the plurality of keywords to obtain a plurality of keyword hash values;
and determining the first similarity hash value according to the plurality of keyword hash values.
Step S103: determining a second similarity hash value according to the second device fingerprint;
illustratively, determining the second similarity hash value from the second device fingerprint comprises: performing word segmentation processing on the second device fingerprint to obtain a plurality of keywords;
performing hash calculation on the plurality of keywords to obtain a plurality of keyword hash values;
and determining the second similarity hash value according to the plurality of keyword hash values.
Step S104: determining a similarity between the first similarity hash value and the second similarity hash value.
For example, the similarity between the first similarity hash value and the second similarity hash value may be represented by the Hamming distance between them. In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ. In other words, it is the number of characters that must be replaced to convert one string into the other.
Step S105: determining that the first device fingerprint and the second device fingerprint are homologous when the similarity is less than a preset threshold. The preset threshold may be flexibly set according to an application scenario, and the present invention is not limited herein.
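Steps S104 and S105 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names `hamming_distance` and `is_homologous` and the default threshold of 4 are assumptions introduced here, and the hash values are held as Python integers.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two hash values differ."""
    return bin(a ^ b).count("1")


def is_homologous(hash_a: int, hash_b: int, threshold: int = 4) -> bool:
    """Step S105: homologous when the distance is below the preset threshold.

    The default threshold of 4 is an assumption chosen for illustration.
    """
    return hamming_distance(hash_a, hash_b) < threshold


first = 0b10110100
second = 0b10010110  # differs from `first` in two bit positions
print(hamming_distance(first, second))  # prints 2
print(is_homologous(first, second))     # prints True
```

Because the similarity is expressed as a distance, a smaller value means the two fingerprints are more alike, which is why the comparison in step S105 is "less than".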
The method of the embodiment of the invention calculates the device fingerprints through the similarity hash algorithm to obtain the corresponding similarity hash values, and determines whether the different device fingerprints are homologous according to the similarity between the similarity hash values corresponding to the different device fingerprints, so that the method can be used for searching and finding the device aggregation from the acquired massive device fingerprint data to find abnormal devices or abnormal user operation behaviors.
In an alternative embodiment, determining the first similarity hash value from the plurality of keyword hash values comprises: performing a weighted summation of the plurality of keyword hash values to determine the first similarity hash value, wherein the weight of each keyword hash value is determined according to the inverse document frequency (IDF) of the keyword.
In some embodiments of the present invention, a similarity hash (SimHash) algorithm is used to calculate the similarity between device fingerprints, thereby determining the aggregation of devices and improving the stability of the device fingerprints. SimHash converts a device fingerprint into a 64-bit hash value that reflects the similarity of device fingerprints, so the similarity of two device fingerprints can be judged by calculating the Hamming distance between their hash values.
Fig. 2 is a schematic diagram of an embodiment of calculating a similarity hash value according to the present invention, which is divided into 5 steps, i.e., word segmentation, hash, weighting, merging, and dimension reduction. The following describes in detail an embodiment in the device fingerprint similarity algorithm.
Word segmentation splits the collected device fingerprint metrics into keywords, where each keyword is a value corresponding to one device fingerprint metric item. Because the metrics of a device fingerprint differ in form, different keyword splitting schemes can be provided for different metrics. For a simple metric such as the operating system name or the operating system version, the value of the field may be used directly as a keyword. Some metrics are reported as lists, such as font lists and plug-in lists, and each single value in the list needs to be split out as a keyword. Other metrics are long and carry rich information; for example, the browser name and browser version can be parsed from the user agent. Such complex metrics make up a small share of all collected JS device metrics and require customized splitting.
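The splitting rules above can be sketched as follows. The metric names, the sample values, and the choice to prefix each keyword with its metric name are assumptions made for illustration; customized parsing of complex metrics such as the user agent is left out.

```python
from typing import Dict, List, Union

MetricValue = Union[str, List[str]]


def segment_fingerprint(fingerprint: Dict[str, MetricValue]) -> List[str]:
    """Split a device fingerprint into keywords.

    Simple metrics contribute their value as one keyword; list metrics
    contribute one keyword per list element. Prefixing with the metric
    name (an assumption) keeps equal values of different metrics apart.
    """
    keywords: List[str] = []
    for name, value in fingerprint.items():
        if isinstance(value, list):
            keywords.extend(f"{name}={item}" for item in value)
        else:
            keywords.append(f"{name}={value}")
    return keywords


# Hypothetical fingerprint with two simple metrics and two list metrics.
sample = {
    "os_name": "Windows",
    "os_version": "10",
    "fonts": ["Arial", "Calibri"],
    "plugins": ["PDF Viewer"],
}
print(segment_fingerprint(sample))
```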
All the metrics together constitute a set of device fingerprint keywords. Let all the split keywords be:
$$f_1, f_2, \ldots, f_N$$
where N is the number of keywords; the original information is thus represented as a vector of keywords.
Hashing: after the keyword vector is obtained, each keyword is hashed; this is the second step of calculating the similarity hash value. Any hash algorithm can be used in this application, for example the five algorithms of the SHA (Secure Hash Algorithm) family, namely SHA-1, SHA-224, SHA-256, SHA-384, and SHA-512, or MurmurHash, which is used in the implementation of the present application. The selected hash algorithm converts each keyword in the keyword vector into a 64-bit hash value (another $2^n$-bit length, such as 32-bit or 128-bit, may also be used). The conversion formula is as follows:
$$\mathrm{Hash}(f_i) = b_{i1} b_{i2} \ldots b_{i64}, \quad i = 1, 2, \ldots, N$$
where $b_{im}$ is 0 or 1 and m takes values from 1 to 64. The original keyword vector of length N is thus converted into a vector of N 64-bit hash values.
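The hashing step can be sketched as follows. The text names MurmurHash as the algorithm used in the implementation; because MurmurHash is not part of the Python standard library, this sketch swaps in SHA-256 truncated to its first 8 bytes, which is an assumption made only to keep the example self-contained. The helper names `hash64` and `to_bits` are likewise illustrative.

```python
import hashlib
from typing import List


def hash64(keyword: str) -> int:
    """Map a keyword to a 64-bit integer (first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def to_bits(value: int, width: int = 64) -> List[int]:
    """Return the bits b_1 ... b_width of `value`, most significant first."""
    return [(value >> (width - 1 - m)) & 1 for m in range(width)]


keywords = ["os_name=Windows", "os_version=10", "fonts=Arial"]
hash_vector = [hash64(k) for k in keywords]  # N hash values of 64 bits each
print(len(hash_vector), len(to_bits(hash_vector[0])))  # prints 3 64
```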
Weighting: different keywords in the collected device fingerprint information differ in how well they discriminate between devices, so a different weight should be assigned to each keyword. Because a keyword generally does not repeat within the metrics collected for a single device fingerprint, the inverse document frequency (IDF) is used as the keyword weight. The formula for the IDF is as follows:
$$\omega_i = \log \frac{D}{k_i}$$
where D is the total number of new devices that accessed the server over a period of time, and $k_i$ is the number of those D devices from which the keyword $f_i$ was collected. The logarithm may use base 10 or any other integer base greater than 1. Calculating the weight of each keyword gives the following weight vector:
$$\omega_1, \omega_2, \ldots, \omega_N$$
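The weight calculation can be sketched as follows, assuming the conventional IDF form log(D / k_i) with a base-10 logarithm, where D is the number of devices and k_i is the number of devices on which keyword f_i was collected. The function name `idf_weights` and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def idf_weights(device_keywords: Iterable[List[str]]) -> Dict[str, float]:
    """Compute log10(D / k_i) for every keyword seen across the devices."""
    devices = [set(kws) for kws in device_keywords]  # count each device once
    total = len(devices)                             # D
    counts = Counter(kw for kws in devices for kw in kws)  # k_i per keyword
    return {kw: math.log10(total / k) for kw, k in counts.items()}


# Hypothetical keyword lists collected from four devices.
observed = [
    ["os_name=Windows", "fonts=Arial"],
    ["os_name=Windows", "fonts=Arial"],
    ["os_name=Windows", "fonts=Calibri"],
    ["os_name=Windows", "fonts=RareFont"],
]
weights = idf_weights(observed)
print(weights["os_name=Windows"])  # prints 0.0
```

A keyword present on every device gets weight 0 and so contributes nothing to the similarity hash, while a rare keyword such as the hypothetical `fonts=RareFont` gets the largest weight.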
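Let
$$g(b) = \begin{cases} 1, & b = 1 \\ -1, & b = 0 \end{cases}$$
Then, for the i-th keyword $f_i$, weighting converts its 64-bit hash value into a 64-dimensional vector as follows:
$$h_i = \big(\omega_i g(b_{i1}), \omega_i g(b_{i2}), \ldots, \omega_i g(b_{i64})\big), \quad i = 1, 2, \ldots, N$$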
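Merging: the weighting operation is applied to each of the N keywords to obtain the weighted hash vector of each keyword. Finally, the weighted hash vectors of all keywords are added together to obtain the weighted hash vector sum of the device:
$$H = \sum_{i=1}^{N} h_i = \Big(\sum_{i=1}^{N} \omega_i g(b_{i1}), \sum_{i=1}^{N} \omega_i g(b_{i2}), \ldots, \sum_{i=1}^{N} \omega_i g(b_{i64})\Big)$$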
Dimension reduction: the weighted hash vector obtained by weighting and merging the hash values of the N keywords is converted into a 64-bit hash value. Writing $H_m = \sum_{i=1}^{N} \omega_i g(b_{im})$ for the m-th component of the merged vector, the following rule is applied to each component:
$$s_m = \begin{cases} 1, & H_m > 0 \\ 0, & H_m \le 0 \end{cases}, \quad m = 1, 2, \ldots, 64$$
For example, if the weighted hash vector is (10, -20, 100, 50, …, -1, 34, 23), then after dimension reduction it becomes (1, 0, 1, 1, …, 0, 1, 1), that is, 1 for a component greater than 0 and 0 for a component less than or equal to 0.
The expression of the final similarity hash value is:
$$\mathrm{SimHash} = s_1 s_2 \ldots s_{64}$$
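The hash, weighting, merging, and dimension reduction steps can be put together in one short sketch. It is illustrative only: `hash64` uses truncated SHA-256 rather than the MurmurHash named in the text, keywords missing from the weight table fall back to a weight of 1.0, and all names are assumptions introduced here.

```python
import hashlib
from typing import Dict, List


def hash64(keyword: str) -> int:
    """64-bit keyword hash (first 8 bytes of SHA-256, an assumed stand-in)."""
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(keywords: List[str], weights: Dict[str, float], bits: int = 64) -> int:
    """Compute the similarity hash value of a list of keywords."""
    sums = [0.0] * bits                      # merged weighted hash vector
    for kw in keywords:
        h = hash64(kw)
        w = weights.get(kw, 1.0)             # assumed default weight
        for m in range(bits):
            bit = (h >> (bits - 1 - m)) & 1
            sums[m] += w if bit else -w      # weight times +1 or -1
    value = 0
    for s in sums:                           # dimension reduction
        value = (value << 1) | (1 if s > 0 else 0)
    return value


fp_a = ["os_name=Windows", "os_version=10", "fonts=Arial", "fonts=Calibri"]
fp_b = ["os_name=Windows", "os_version=10", "fonts=Arial", "fonts=Verdana"]
uniform: Dict[str, float] = {}  # no IDF table: every keyword weighs 1.0
distance = bin(simhash(fp_a, uniform) ^ simhash(fp_b, uniform)).count("1")
print(distance)
```

With a single keyword of positive weight, the result equals that keyword's own hash value, which gives a quick sanity check of the sign convention.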
Fig. 3 is a flowchart of an embodiment of an abnormal device discovery method according to the present invention, where the method includes:
Step S201: acquiring the device fingerprint of the current user, calculating the similarity hash value of the device fingerprint of the current user, and taking the calculated similarity hash value as the current similarity hash value data. The similarity hash value of the device fingerprint of the current user can be calculated according to the method shown in fig. 1.
Step S202: device fingerprints of a plurality of devices are collected within a set time and stored as a device fingerprint library.
By setting the time limit, if abnormal behaviors such as too frequent accesses, registrations, orders, and the like from the same device exist in a short time, it can be determined that the device is an abnormal device and the corresponding current operation behavior on the device is an abnormal behavior.
The setting time may be several seconds, several minutes, or several hours, which is not limited by the present invention.
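One way to keep a device fingerprint library limited to a set time is a sliding window, sketched below. The class name `FingerprintLibrary`, the use of a sliding window, and the assumption that fingerprints arrive with non-decreasing timestamps are all illustrative choices, not details given in the embodiment.

```python
import time
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

Fingerprint = Dict[str, object]


class FingerprintLibrary:
    """Keep only the fingerprints collected within the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._items: Deque[Tuple[float, Fingerprint]] = deque()

    def _evict(self, now: float) -> None:
        # Drop entries older than the window (oldest entries sit on the left).
        while self._items and now - self._items[0][0] > self.window_seconds:
            self._items.popleft()

    def add(self, fingerprint: Fingerprint, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._items.append((now, fingerprint))
        self._evict(now)

    def snapshot(self, now: Optional[float] = None) -> List[Fingerprint]:
        now = time.time() if now is None else now
        self._evict(now)
        return [fp for _, fp in self._items]


library = FingerprintLibrary(window_seconds=60.0)
library.add({"os_name": "Windows"}, now=0.0)
library.add({"os_name": "macOS"}, now=30.0)
library.add({"os_name": "Linux"}, now=100.0)
print(len(library.snapshot(now=100.0)))  # prints 1
```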
Step S203: calculating a plurality of similarity hash values of a plurality of device fingerprints in the device fingerprint library, and storing the similarity hash values as a similarity hash value library according to a preset format.
Specifically, the following processing is performed on each of the plurality of similarity hash values:
dividing the similarity hash value into a plurality of sub-parts;
and storing each of the plurality of sub-parts as an index in association with the corresponding similarity hash value.
In the similarity hash value library, the data obtained above may be stored in the form of a table, for example, as shown in fig. 4.
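The storage format can be sketched with an in-memory dictionary standing in for the table. The layout below (4 sub-parts of 16 bits, keyed by the pair of sub-part position and sub-part value) is an assumption for illustration; the actual table layout is the one shown in fig. 4.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

SEGMENTS = 4        # number of sub-parts per 64-bit hash value
SEGMENT_BITS = 16   # bits per sub-part

Index = DefaultDict[Tuple[int, int], Set[int]]


def split_segments(value: int) -> List[int]:
    """Split a 64-bit value into 4 sub-parts, most significant first."""
    mask = (1 << SEGMENT_BITS) - 1
    return [
        (value >> (SEGMENT_BITS * (SEGMENTS - 1 - i))) & mask
        for i in range(SEGMENTS)
    ]


def add_to_index(index: Index, value: int) -> None:
    """Store the full hash value once under each of its sub-parts."""
    for position, segment in enumerate(split_segments(value)):
        index[(position, segment)].add(value)


index: Index = defaultdict(set)
add_to_index(index, 0x0123456789ABCDEF)
print([hex(s) for s in split_segments(0x0123456789ABCDEF)])
# prints ['0x123', '0x4567', '0x89ab', '0xcdef']
```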
Step S204: querying the similarity hash value library and determining the number of pieces of similarity hash value data whose similarity to the current similarity hash value data is smaller than a preset threshold;
Specifically, the current similarity hash value data includes the current similarity hash value and a plurality of sub-parts corresponding to the current similarity hash value;
in this case, querying the similarity hash value library and determining the number of pieces of similarity hash value data whose similarity to the current similarity hash value data is smaller than the preset threshold includes:
taking any one of the subsections of the current similarity hash value as an index, inquiring the subsection which is the same as the index in a similarity hash value library, and taking the similarity hash value which corresponds to the subsection which is the same as the index as a candidate hash value;
calculating a similarity between each of the candidate hash values and the current similarity hash value respectively to obtain a plurality of similarities (e.g., a hamming distance between each of the candidate hash values and the current similarity hash value may be calculated);
and determining the number of the similarity degrees smaller than a preset threshold value from the plurality of similarity degrees.
In this embodiment, the similarity hash value is divided into a plurality of sub-parts, when the similarity hash value data with the similarity degree smaller than the preset threshold value is queried, any one of the sub-parts is used as an index, the sub-part same as the index is queried in the similarity hash value library, and the similarity hash value corresponding to the sub-part same as the index is used as a candidate hash value, so that a fast search method is provided, which can perform real-time calculation and query in case of an excessively large data size.
Step S205: when the number is larger than a set value, determining that the device of the current user is an abnormal device. The set value can be flexibly chosen according to the application scenario, and the invention is not limited herein.
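Steps S204 and S205 can be sketched as follows, again with an in-memory dictionary standing in for the similarity hash value library. The names, the threshold of 4, the set value of 1, and the sample hash values are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

SEGMENTS, SEGMENT_BITS = 4, 16
Index = Dict[Tuple[int, int], Set[int]]


def split_segments(value: int) -> List[int]:
    mask = (1 << SEGMENT_BITS) - 1
    return [(value >> (SEGMENT_BITS * (SEGMENTS - 1 - i))) & mask
            for i in range(SEGMENTS)]


def build_index(values: Iterable[int]) -> Index:
    index: Index = defaultdict(set)
    for value in values:
        for position, segment in enumerate(split_segments(value)):
            index[(position, segment)].add(value)
    return index


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def count_similar(index: Index, current: int, threshold: int = 4) -> int:
    """Step S204: count stored hashes whose distance is below the threshold."""
    candidates: Set[int] = set()
    for position, segment in enumerate(split_segments(current)):
        candidates |= index.get((position, segment), set())
    return sum(1 for c in candidates if hamming_distance(c, current) < threshold)


def is_abnormal(index: Index, current: int, set_value: int = 1) -> bool:
    """Step S205: abnormal when the count exceeds the set value."""
    return count_similar(index, current) > set_value


current = 0x0123456789ABCDEF
near_a = current ^ 0b111                                # distance 3
near_b = current ^ (1 << 63) ^ (1 << 40) ^ (1 << 20)    # distance 3
far = current ^ (2**64 - 1)                             # distance 64
library = build_index([near_a, near_b, far])
print(count_similar(library, current))  # prints 2
print(is_abnormal(library, current))    # prints True
```

The value `far` shares no sub-part with `current`, so it never becomes a candidate and its Hamming distance is never computed.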
According to the method and the device, the corresponding similarity hash value of the device fingerprint is calculated by adopting a similarity hash algorithm, and whether the different device fingerprints are homologous or not is determined according to the similarity between the similarity hash values corresponding to the different device fingerprints, so that the method and the device can be used for searching and finding the aggregation of the devices from the acquired massive device fingerprint data to find abnormal devices or abnormal user operation behaviors.
Calculating the similarity hash of device fingerprints solves the device similarity problem, and the method provided by the invention further solves the problem of how to perform similarity search over massive data. A direct search for devices similar to the current similarity hash value has time complexity O(n), where the function in parentheses after "O" expresses how the time or space consumption of an algorithm grows with the amount of data and n is the amount of input data; every stored hash value would have to be compared, which is impractical for massive data.
Therefore, the fast search of similarity hashes in the embodiment of the present invention is based on the pigeonhole principle (also called the drawer principle): if n+1 elements are put into n sets, at least one set must contain at least two elements. The goal is to find devices whose similarity hash value is within a Hamming distance of 3 of the current device's similarity hash value. The 64-bit hash can be split into 4 segments of 16 bits each. If two similarity hash values differ in at most 3 bits, the differing bits fall in at most 3 of the 4 segments, so at least one segment must be identical.
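The segment argument can be checked empirically with a short sketch: flipping at most 3 bits of a random 64-bit value always leaves at least one of the four 16-bit segments unchanged, while 4 flips placed one per segment leave none. The helper names and the fixed random seed are illustrative.

```python
import random
from typing import List


def split_segments(value: int, segments: int = 4, bits: int = 16) -> List[int]:
    mask = (1 << bits) - 1
    return [(value >> (bits * (segments - 1 - i))) & mask for i in range(segments)]


def shares_segment(a: int, b: int) -> bool:
    """True when at least one segment is identical at the same position."""
    return any(x == y for x, y in zip(split_segments(a), split_segments(b)))


rng = random.Random(0)
for _ in range(10_000):
    a = rng.getrandbits(64)
    b = a
    for bit in rng.sample(range(64), rng.randint(0, 3)):  # flip up to 3 bits
        b ^= 1 << bit
    assert shares_segment(a, b)

# Counter-example at distance 4: one flipped bit in every segment.
spread = (1 << 0) | (1 << 16) | (1 << 32) | (1 << 48)
print(shares_segment(0, spread))  # prints False
```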
Based on this principle, for each device, the device fingerprint is acquired, its similarity hash value is calculated and divided into 4 segments, and the complete similarity hash value is stored under each segment, which serves as an index, as shown in fig. 4; the similarity hash value of one device is therefore stored in 4 copies. During a query, the similarity hash value to be queried (namely, the similarity hash value of the device fingerprint of the current user) is also divided into 4 segments, the 16 bits of each segment are used as a key, and all similarity hash values stored under that segment are retrieved from the database as candidate similarity hash values.
The candidate values are then traversed, the distance between each candidate and the similarity hash value to be queried is calculated, and the similarity hash values whose actual distance is within 3 are found. If the hash values are sufficiently uniform, the search for similar devices changes from a global search into a search over only about 4/2^16 ≈ 1/16000 of the original data, which greatly reduces the search range. If the range is still too large after being reduced 16000-fold, the method can be applied recursively: the remaining 48 bits are again divided into 4 segments, and segmented storage and search continue. The method trades space for time to accelerate the search, so that real-time calculation and query can be achieved even when the data volume is very large, for example billions of records.
Illustratively, assume that there are 3 hash values:
a1 = (a11, a12, a13, a14); a2 = (a21, a22, a23, a24); a3 = (a31, a32, a33, a34).
Without segmentation, determining the devices similar to a1 requires calculating the Hamming distances between a1 and a2 and between a1 and a3.
After the segmentation, whether a21 and a31 are the same as a11 or not is judged, if a21 and a31 are both the same as a11, a2 and a3 are both used as candidate values, and the Hamming distance between a1 and a2 and the Hamming distance between a1 and a3 are calculated.
In embodiments of the present invention, the actual situation is a lookup problem, not a comparison problem.
Without segmentation, if the database holds 100 million records, finding the data similar to a1 requires comparing every record with a1 one by one until the similar data a2 and a3 are found. The cost is a scan of all the data in the database and 100 million Hamming distance calculations.
With segmentation, the query first retrieves the data whose first 16 bits are the same as those of a1, that is, it queries by a11, which returns about 10^8/2^16 ≈ 1500 records, including a2 and a3. A query by a12 then returns about another 1500 records. By analogy, four sets of query results are obtained, and merging them yields about 6000 candidate records. The Hamming distance between a1 and each of these roughly six thousand records is then calculated, and the final result is again a2 and a3. Compared with the original 100 million Hamming distance calculations, the work is reduced by a factor of more than ten thousand.
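The figures quoted above can be reproduced with a few lines of arithmetic, under the same assumption the text makes that hash values are spread uniformly over the 2^16 possible values of each segment. The variable names are illustrative.

```python
total_records = 10**8   # 100 million stored similarity hash values
segment_bits = 16
segments = 4

# Expected number of records sharing one given 16-bit segment.
per_segment = total_records / 2**segment_bits
# Expected candidates after merging the four segment queries
# (an upper estimate, since the four result sets may overlap).
candidates = segments * per_segment
# Reduction in Hamming distance calculations versus a full scan.
reduction = total_records / candidates

print(round(per_segment))  # prints 1526
print(round(candidates))   # prints 6104
print(reduction)           # prints 16384.0
```

The reduction factor equals 2^16 / 4 regardless of the number of records, which matches the 4/2^16 ≈ 1/16000 search range stated earlier.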
Ideally, each bit 0 and 1 of all 64-bit hashes has the same probability of occurrence, so that the performance of the method for searching similar devices can reach the theoretical maximum. In the worst case, the first 48 bits of the hash values generated by all the devices are the same, so the above-mentioned segmentation method cannot accelerate the calculation, because all the data are searched by any segment, and the candidate similar data are all the data of the database. Normally, the hash algorithm will make the probability of each bit occurring substantially the same.
In practical application, the search is accelerated by using 4 segments, and all devices whose similarity hash value is within a distance of 3 of the current similarity hash value can be returned in real time, so that devices similar to the currently accessing device can be queried effectively, providing a basis for risk control decisions.
Fig. 5 is a schematic diagram of main blocks of an apparatus 500 for determining device fingerprint homology according to an embodiment of the present invention, the apparatus 500 including:
an information obtaining module 510, configured to obtain a first device fingerprint and a second device fingerprint;
a first hash value determination module 520, configured to determine a first similarity hash value according to the first device fingerprint;
a second hash value determination module 530, configured to determine a second similarity hash value according to the second device fingerprint;
a similarity calculation module 540, configured to determine a similarity between the first similarity hash value and the second similarity hash value;
a homology determining module 550, configured to determine that the first device fingerprint and the second device fingerprint are homologous when the similarity is smaller than a preset threshold.
As shown in fig. 6, in an alternative embodiment, the first hash value determining module 520 includes:
a word segmentation processing unit 521, configured to perform word segmentation processing on the first device fingerprint to obtain a plurality of keywords;
a word hash value determining unit 522, configured to perform a hash calculation on the multiple keywords to obtain multiple keyword hash values;
a first hash value determining unit 523 configured to determine the first similarity hash value according to the plurality of keyword hash values.
In an optional embodiment, the first hash value determining unit is further configured to: perform a weighted summation of the plurality of keyword hash values to determine the first similarity hash value, wherein the weight of each keyword hash value is determined according to the inverse document frequency (IDF) of the keyword.
In an optional embodiment, the first device fingerprint and the second device fingerprint each comprise at least one of: operating system name, operating system version, font list, and plug-in list of the corresponding device.
Fig. 7 is a schematic diagram of main blocks of an abnormal device discovery apparatus 700 according to an embodiment of the present invention, where the apparatus 700 includes:
a current hash value determining module 710, configured to obtain a device fingerprint of a current user, calculate a similarity hash value of the device fingerprint of the current user, and take the calculated similarity hash value as the current similarity hash value data;
a calculating module 720, configured to calculate a plurality of similarity hash values of a plurality of device fingerprints in the device fingerprint library;
a query determining module 730, configured to query the similarity hash value library and determine the number of pieces of similarity hash value data whose similarity to the current similarity hash value data is smaller than a preset threshold;
and an abnormal device determining module 740, configured to determine that the device of the current user is an abnormal device when the number is greater than the set value.
In an alternative embodiment, the calculating module 720 is further configured to: divide the similarity hash value into a plurality of sub-parts; and store each of the plurality of sub-parts as an index in association with the corresponding similarity hash value.
In an optional embodiment, the current similarity hash value data includes the current similarity hash value and a plurality of sub-parts corresponding to the current similarity hash value; in this case, the query determining module 730 is further configured to:
taking any one of the subsections of the current similarity hash value as an index, inquiring the subsection which is the same as the index in a similarity hash value library, and taking the similarity hash value which corresponds to the subsection which is the same as the index as a candidate hash value;
respectively calculating the similarity between each similarity hash value in the candidate hash values and the current similarity hash value to obtain a plurality of similarities;
and determining the number of the similarity degrees smaller than a preset threshold value from the plurality of similarity degrees.
The device can execute the method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
Fig. 8 shows an exemplary system architecture 800 of a method for determining device fingerprint homology or an apparatus for determining device fingerprint homology to which embodiments of the present invention may be applied.
As shown in fig. 8, the system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves to provide a medium for communication links between the terminal devices 801, 802, 803 and the server 805. Network 804 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 801, 802, 803 to interact with a server 805 over a network 804 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 801, 802, 803.
The terminal devices 801, 802, 803 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 805 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 801, 802, 803. The background management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (e.g., target push information and product information) to the terminal device.
It should be noted that the method for determining device fingerprint homology provided by the embodiment of the present invention is generally performed by the server 805, and accordingly, the apparatus for determining device fingerprint homology is generally disposed in the server 805.
It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 9, shown is a block diagram of a computer system 900 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU)901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The above-described functions defined in the system of the present invention are executed when the computer program is executed by a Central Processing Unit (CPU) 901.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes an information obtaining module, a first hash value determination module, a second hash value determination module, a similarity calculation module, and a homology determining module. The names of these modules do not in some cases constitute a limitation on the module itself; for example, the information obtaining module may also be described as a "module that obtains a first device fingerprint and a second device fingerprint".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform:
acquiring a first device fingerprint and a second device fingerprint;
determining a first similarity hash value according to the first device fingerprint;
determining a second similarity hash value according to the second device fingerprint;
determining a similarity between the first similarity hash value and the second similarity hash value;
determining that the first device fingerprint and the second device fingerprint are homologous when the similarity is less than a preset threshold.
According to the technical scheme, the similarity hash value of each device fingerprint is calculated by a similarity hash algorithm, and whether different device fingerprints are homologous (namely, whether they come from operations on the same device) is determined according to the similarity (such as the Hamming distance) between their similarity hash values. Device aggregations can therefore be effectively searched for and discovered in the collected massive device fingerprint data, so as to find abnormal devices or abnormal user operation behaviors and to provide a basis for risk control decisions.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.