WO2020258101A1 - User similarity calculation method, device, server, and storage medium - Google Patents

User similarity calculation method, device, server, and storage medium

Info

Publication number
WO2020258101A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
similarity
ids
target
user ids
Prior art date
Application number
PCT/CN2019/093109
Other languages
English (en)
French (fr)
Inventor
郭子亮
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市欢太科技有限公司, Oppo广东移动通信有限公司
Priority to PCT/CN2019/093109 (WO2020258101A1)
Priority to CN201980091291.0A (CN113383314B)
Publication of WO2020258101A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • This application relates to the field of communication technology, and in particular to a method, device, server and storage medium for calculating user similarity.
  • the embodiments of the present application provide a user similarity calculation method, device, server, and storage medium, which can reduce the complexity of user similarity calculation.
  • an embodiment of the present application provides a method for calculating user similarity, including:
  • extracting at least two user features of a first user ID to obtain a feature set of the first user ID, where the feature set includes at least the at least two user features, the first user ID is any one of N user IDs, and N is an integer greater than or equal to 2;
  • an embodiment of the present application provides a user similarity calculation device.
  • the user similarity calculation device includes an extraction unit, a selection unit, a calculation unit, and a classification unit, wherein:
  • the extraction unit is configured to extract at least two user characteristics of the first user ID to obtain a characteristic set of the first user ID, where the characteristic set includes at least the at least two user characteristics, the first user ID is any one of N user IDs, and N is an integer greater than or equal to 2;
  • the selection unit is used to select a target hash function;
  • the calculation unit is configured to calculate the similarity between the target user characteristics of the N user IDs by using the target hash function to obtain the initial similarity between the N user IDs;
  • the classification unit is configured to divide the N user IDs into M hash buckets according to the initial similarity between the N user IDs, where M is an integer greater than or equal to 2;
  • the calculation unit is further configured to calculate the similarity between any two user IDs in the first hash bucket, where the first hash bucket is any one of the M hash buckets.
  • an embodiment of the present application provides a server, including a processor and a memory, the memory is used to store one or more programs, and the one or more programs are configured to be executed by the processor.
  • the program includes instructions for executing the steps in the first aspect of the embodiments of the present application.
  • an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program enables a computer to execute some or all of the steps described in the first aspect.
  • embodiments of the present application provide a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps described in the first aspect.
  • the computer program product may be a software installation package.
  • The user similarity calculation method described in the embodiments of this application includes the following steps: extracting at least two user characteristics of a first user ID to obtain a feature set of the first user ID, where the feature set includes at least the at least two user characteristics, the first user ID is any one of N user IDs, and N is an integer greater than or equal to 2; selecting a target hash function and using it to calculate the similarity between the target user characteristics of the N user IDs to obtain the initial similarity between the N user IDs; dividing the N user IDs into M hash buckets according to the initial similarity between the N user IDs, where M is an integer greater than or equal to 2; and calculating the similarity between any two user IDs in a first hash bucket, where the first hash bucket is any one of the M hash buckets.
  • In the embodiments of this application, an appropriate hash function can be selected to divide the N user IDs into M hash buckets according to the initial similarity, and similarity is then calculated only between user IDs within the same hash bucket. This avoids calculating the similarity between each user ID and all other user IDs, which reduces the amount of computation, lowers the complexity of the user similarity calculation, and increases its speed.
  • FIG. 1 is a schematic flowchart of a method for calculating user similarity disclosed in an embodiment of the present application
  • FIG. 2 is a schematic flowchart of another user similarity calculation method disclosed in an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a user similarity calculation device disclosed in an embodiment of the present application.
  • Fig. 4 is a schematic structural diagram of a server disclosed in an embodiment of the present application.
  • The mobile terminals involved in the embodiments of this application may include various handheld devices with wireless communication functions, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to wireless modems, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices, etc.
  • FIG. 1 is a schematic flowchart of a user similarity calculation method disclosed in an embodiment of the present application. As shown in FIG. 1, the user similarity calculation method includes the following steps.
  • the server extracts at least two user features of the first user ID to obtain a feature set of the first user ID. The feature set includes at least the at least two user features. The first user ID is any one of N user IDs, and N is an integer greater than or equal to 2.
  • the server serves the client, and the content of the service includes providing resources to the client and storing client data.
  • the server is a targeted service program, and the device running the server can be called a server.
  • the server can establish connections with multiple clients at the same time, and can provide services to multiple clients at the same time.
  • the service provided by the server for the client in the embodiment of the present application may include a content push service.
  • the content push service may include: browser content push service, application download push service, game content push service, etc.
  • the server can include application server, browser server, game server, etc.
  • A user ID can include any one or more of the following types: single sign-on identity (SSOID), OpenID, integrated circuit card identity (ICCID), International Mobile Equipment Identity (IMEI), telephone number (TEL), Globally Unique Identifier (GUID), etc.
  • With single sign-on (SSO), a user only needs to log in once to access all mutually trusted application systems among multiple application systems.
  • the server extracts at least two user characteristics of the first user ID based on the historical user behavior data of the first user ID.
  • User behavior data may include: device characteristics, positioning characteristics, and application (Application, APP) characteristics.
  • Device characteristics can include the model of the device, the identification of the device, the media access control (MAC) address of the device, and the usage habits of the device (for example, the backlight brightness of the device, the volume of the device, the holding posture of the device, the average use time of the device, the power-on time of the device, the shutdown time of the device, etc.).
  • the positioning feature may include Global Positioning System (GPS) positioning information (for example, latitude and longitude information), location-based service (LBS) location trajectory, etc.
  • Application features can include application setting parameters (for example, application brightness, application volume, application refresh frequency), application opening time, application closing time, application function usage, continuous application running time, cumulative application running time, application installation data, application uninstallation data, etc.
  • the server can extract device features, location features, and APP features from the user behavior data of the first user ID, and compose the device features, location features, and APP features of the first user ID into a feature set of the first user ID.
  • the server selects a target hash function, uses the target hash function to calculate the similarity between the target user characteristics of the N user IDs, and obtains the initial similarity between the N user IDs.
  • the server may select the target hash function according to at least two user characteristics included in the characteristic set of the first user ID.
  • the server selects the target hash function, specifically:
  • the server determines the types of the at least two user characteristics
  • the server determines the target hash function corresponding to the at least two user characteristics according to the correspondence between the type and the hash function.
  • the types of at least two user characteristics may include a first type and a second type.
  • the first type may include a large data volume feature type
  • the second type may further include a small data volume feature type.
  • the large data volume feature type refers to the type with a large amount of user feature data
  • the small data volume feature type refers to the type with a small amount of user feature data.
  • the server can determine the type of user feature according to the number of bytes of data contained in the user feature.
  • the server can determine that a user feature containing more bytes of data than a preset number of bytes is of the large data volume feature type, and that a user feature containing a number of bytes less than or equal to the preset number of bytes is of the small data volume feature type.
  • For example, the wireless MAC address in the user features is of the large data volume feature type, and the latitude and longitude information in the user features is of the small data volume feature type.
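  • The type determination and hash function selection described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the threshold value `PRESET_BYTES`, the type labels, and the function names are all hypothetical, and the mapping from type to distance measure follows the description (first type to Hamming distance, second type to Euclidean distance).

```python
# Hypothetical threshold; the description only says "preset number of bytes".
PRESET_BYTES = 16

# Assumed correspondence between feature type and distance measure:
# large data volume (first type) -> Hamming, small (second type) -> Euclidean.
DISTANCE_FOR_TYPE = {"large": "hamming", "small": "euclidean"}


def feature_type(value: bytes, preset_bytes: int = PRESET_BYTES) -> str:
    """Classify a user feature by the number of bytes it contains."""
    return "large" if len(value) > preset_bytes else "small"


def select_distance(value: bytes, preset_bytes: int = PRESET_BYTES) -> str:
    """Look up the distance measure that corresponds to the feature's type."""
    return DISTANCE_FOR_TYPE[feature_type(value, preset_bytes)]
```

  • A feature of exactly `PRESET_BYTES` bytes falls in the small data volume type, matching the "less than or equal to" wording above.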
  • If the types of the at least two user characteristics are the first type, the server uses the target hash function to calculate the similarity between the target user characteristics of the N user IDs, which may specifically be:
  • the server uses a Hamming distance calculation formula to calculate the similarity between the target user characteristics of the N user IDs.
  • When calculating the similarity between the target user characteristic of the first user ID and that of the second user ID, the server acquires a first target vector of the target user characteristic of the first user ID and a second target vector of the target user characteristic of the second user ID, and compares the corresponding bits of the two vectors. If two corresponding bits are the same, the Hamming distance for that bit is 0; if they are different, the Hamming distance for that bit is 1. The Hamming distances of all bits are added to obtain the final Hamming distance between the first target vector and the second target vector.
  • the first user ID and the second user ID are two different user IDs among the N user IDs.
  • the target user characteristic of the first user ID is the first wireless MAC address
  • the target user characteristic of the second user ID is the second wireless MAC address.
  • For example, the first target vector and the second target vector are both 10 bits. Each bit of the first target vector is compared with the corresponding bit of the second target vector: if they are the same, the Hamming distance for that bit is 0, and if they are different, the Hamming distance for that bit is 1.
  • The Hamming distances of all bits are added to obtain the final Hamming distance between the first target vector and the second target vector, which lies between 0 and 10.
  • If the Hamming distance is 0 to 3, the similarity between the target user feature of the first user ID and the target user feature of the second user ID is considered greater than the first preset similarity threshold, and the first user ID and the second user ID are put into the same hash bucket. If the Hamming distance is 4 to 10, the similarity is considered less than or equal to the first preset similarity threshold, and the first user ID and the second user ID are determined not to belong to the same hash bucket.
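  • The 10-bit example can be sketched in a few lines. The bit strings below are made up for illustration, and representing a target vector as a string of "0" and "1" characters is an assumption; the description does not say how a MAC address is turned into a target vector.

```python
def hamming_distance(first: str, second: str) -> int:
    """Count the bit positions at which two equal-length bit strings differ."""
    if len(first) != len(second):
        raise ValueError("target vectors must have the same number of bits")
    return sum(a != b for a, b in zip(first, second))


def same_bucket(first: str, second: str, max_distance: int = 3) -> bool:
    """In the 10-bit example, a distance of 0 to 3 means 'same hash bucket'."""
    return hamming_distance(first, second) <= max_distance


# Hypothetical 10-bit target vectors for two user IDs.
first_vector = "1011001110"
second_vector = "1001001010"
print(hamming_distance(first_vector, second_vector))  # 2
print(same_bucket(first_vector, second_vector))       # True
```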
  • If the types of the at least two user characteristics are the second type, the server uses the target hash function to calculate the similarity between the target user characteristics of the N user IDs, which may specifically be:
  • the server uses the Euclidean distance calculation formula to calculate the similarity between the target user characteristics of the N user IDs.
  • When calculating the similarity between the target user feature of the first user ID and the target user feature of the second user ID, the server obtains the longitude and latitude parameters of the first user ID and the longitude and latitude parameters of the second user ID. The first user ID and the second user ID are two different user IDs among the N user IDs.
  • For example, when calculating the similarity between the target user characteristic of the first user ID and that of the second user ID, the server can obtain the longitude and latitude information of the first user ID (longitude parameter x_1, latitude parameter y_1) and of the second user ID (longitude parameter x_2, latitude parameter y_2), and can use the following Euclidean distance formula:
    $$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
  • If d is less than or equal to the preset threshold, the similarity between the target user feature of the first user ID and the target user feature of the second user ID is greater than the first preset similarity threshold, and the first user ID and the second user ID are put into the same hash bucket. If d is greater than the preset threshold, the similarity is less than or equal to the first preset similarity threshold, and the first user ID and the second user ID are determined not to belong to the same hash bucket.
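  • The Euclidean comparison of longitude and latitude parameters can be sketched as below. The coordinate tuples and the threshold value are hypothetical; the description leaves the preset threshold unspecified.

```python
import math


def euclidean_distance(first: tuple[float, float], second: tuple[float, float]) -> float:
    """Distance d between (x1, y1) and (x2, y2), i.e. (longitude, latitude) pairs."""
    (x1, y1), (x2, y2) = first, second
    return math.hypot(x1 - x2, y1 - y2)


def same_bucket(first: tuple[float, float], second: tuple[float, float],
                preset_threshold: float) -> bool:
    """d <= preset threshold means the two user IDs share a hash bucket."""
    return euclidean_distance(first, second) <= preset_threshold
```

  • For instance, `same_bucket((113.0, 22.0), (113.0, 22.5), preset_threshold=1.0)` evaluates to `True`, because d is 0.5.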
  • the server divides the N user IDs into M hash buckets according to the initial similarity between the N user IDs, where M is an integer greater than or equal to 2.
  • The greater the initial similarity between any two user IDs, the greater the probability that they will be classified into the same hash bucket; the smaller the initial similarity between any two user IDs, the smaller the probability that they will be classified into the same hash bucket.
  • step 103 may specifically include the following steps:
  • the server determines whether there is a user ID whose initial similarity with the first user ID is greater than a first preset similarity threshold among the N user IDs;
  • if so, the server classifies the first user ID and the user IDs among the N user IDs whose initial similarity with the first user ID is greater than the first preset similarity threshold into the same hash bucket.
  • The server can randomly select a user ID from the N user IDs, such as the first user ID, and compare the initial similarity between the first user ID and the other user IDs among the N user IDs. The server determines whether there is a user ID among the N user IDs whose initial similarity with the first user ID is greater than a first preset similarity threshold. If so, the server divides the first user ID and the user IDs among the N user IDs whose initial similarity with the first user ID is greater than the first preset similarity threshold into the same hash bucket (for example, the first hash bucket).
  • The server then randomly selects a user ID, such as the second user ID, from the user IDs to be allocated, that is, the N user IDs other than those already in the first hash bucket, and compares the initial similarity between the second user ID and the other user IDs to be allocated. The server determines whether there is a user ID among the user IDs to be allocated whose initial similarity with the second user ID is greater than the first preset similarity threshold. If so, the server divides the second user ID and the user IDs to be allocated whose initial similarity with the second user ID is greater than the first preset similarity threshold into the same hash bucket (for example, the second hash bucket), and so on, until the N user IDs are divided into M hash buckets.
  • a user ID can only be divided into one of M hash buckets, and the number of all user IDs in M hash buckets is equal to N.
  • the number of user IDs in each hash bucket can be the same or different.
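  • The bucket division procedure described above can be sketched as a simple loop. This is an illustrative reading, not the application's code: the function name and the `similarity` callable are hypothetical, and the seed user ID is taken from the front of the list rather than chosen at random so that the result is reproducible.

```python
from typing import Callable, Hashable, Sequence


def divide_into_buckets(
    user_ids: Sequence[Hashable],
    similarity: Callable[[Hashable, Hashable], float],
    threshold: float,
) -> list[list[Hashable]]:
    """Repeatedly pick a seed ID and group every unallocated ID similar to it."""
    to_allocate = list(user_ids)
    buckets = []
    while to_allocate:
        seed = to_allocate.pop(0)  # the description selects the seed at random
        bucket, remaining = [seed], []
        for uid in to_allocate:
            # Initial similarity above the first preset threshold: same bucket.
            if similarity(seed, uid) > threshold:
                bucket.append(uid)
            else:
                remaining.append(uid)
        buckets.append(bucket)
        to_allocate = remaining
    return buckets
```

  • Each user ID ends up in exactly one bucket, so the bucket sizes sum to N, and a seed with no similar user IDs forms a bucket of its own.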
  • the server calculates the similarity between any two user IDs in the first hash bucket, where the first hash bucket is any one of the M hash buckets.
  • the server can obtain any two user IDs in the first hash bucket, such as the first user ID and the second user ID.
  • The server obtains at least two user characteristics of the first user ID and at least two user characteristics of the second user ID, and calculates the similarity between the first user ID and the second user ID based on them.
  • In the embodiment of this application, only the similarity between user IDs in the same hash bucket needs to be calculated. Compared with calculating the similarity between each user ID and all other user IDs, the amount of similarity calculation within each hash bucket is much smaller, which reduces the overall amount of computation, lowers the complexity of the user similarity calculation, and increases its speed.
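  • As a rough illustration of the saving (an added estimate, not a figure from the application), let n_i be the number of user IDs in hash bucket i, so that the n_i sum to N. Comparing every pair among all N user IDs takes N(N-1)/2 similarity calculations, whereas comparing only within buckets takes the sum of n_i(n_i-1)/2 over the buckets. The numbers below assume N = 1000 and M = 10 equally sized buckets, both hypothetical, and do not count the cost of the initial bucketing step.

```latex
\[
\frac{N(N-1)}{2} = \frac{1000 \cdot 999}{2} = 499500,
\qquad
\sum_{i=1}^{M} \frac{n_i(n_i-1)}{2} = 10 \cdot \frac{100 \cdot 99}{2} = 49500 .
\]
```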
  • step 104 may also include the following steps:
  • the server obtains the feature set of each user ID in the first hash bucket
  • the server determines the feature vector of each user ID in the first hash bucket based on the feature set of each user ID in the first hash bucket;
  • the server uses the Hamming distance calculation formula to calculate the distance between the feature vectors of any two user IDs in the first hash bucket based on the feature vector of each user ID in the first hash bucket;
  • the server determines the similarity between any two user IDs in the first hash bucket according to the distance between the feature vectors of those two user IDs.
  • the Hamming distance calculation formula can be used to greatly simplify the calculation amount of similarity calculation and increase the speed of similarity calculation.
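  • The within-bucket steps above can be sketched as follows. The description does not say how a distance is converted into a similarity, so the conversion used here (one minus the fraction of differing bits) is an assumption, as are the function name and the bit-string representation of feature vectors.

```python
from itertools import combinations


def pairwise_similarity(bucket: dict[str, str]) -> dict[tuple[str, str], float]:
    """Map each pair of user IDs in one bucket to a Hamming-based similarity.

    `bucket` maps a user ID to its feature vector, given as a bit string.
    """
    result = {}
    for (id_a, vec_a), (id_b, vec_b) in combinations(bucket.items(), 2):
        if len(vec_a) != len(vec_b):
            raise ValueError("feature vectors must have the same length")
        distance = sum(x != y for x, y in zip(vec_a, vec_b))
        # Assumed conversion: identical vectors give 1.0, fully different give 0.0.
        result[(id_a, id_b)] = 1 - distance / len(vec_a)
    return result
```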
  • In the embodiment of this application, an appropriate hash function can be selected to divide the N user IDs into M hash buckets according to the initial similarity, and similarity is then calculated only between user IDs within the same hash bucket. This avoids calculating the similarity between each user ID and all other user IDs, which reduces the amount of computation, lowers the complexity of the user similarity calculation, and increases its speed.
  • FIG. 2 is a flowchart of another user similarity calculation method disclosed in an embodiment of the present application.
  • FIG. 2 is further optimized on the basis of FIG. 1. As shown in FIG. 2, the user similarity calculation method includes the following steps.
  • the server extracts at least two user features of the first user ID to obtain a feature set of the first user ID. The feature set includes at least the at least two user features. The first user ID is any one of N user IDs, and N is an integer greater than or equal to 2.
  • the server selects a target hash function, uses the target hash function to calculate the similarity between the target user characteristics of the N user IDs, and obtains the initial similarity between the N user IDs.
  • the server divides the N user IDs into M hash buckets according to the initial similarity between the N user IDs, where M is an integer greater than or equal to 2.
  • the server calculates the similarity between any two user IDs in the first hash bucket, where the first hash bucket is any one of the M hash buckets.
  • For step 201 to step 204 in the embodiment of the present application, refer to step 101 to step 104 shown in FIG. 1; details are not repeated here.
  • the server determines whether there are P user IDs with mutual similarity greater than a second preset similarity threshold in the first hash bucket, and P is an integer greater than or equal to 2.
  • the second preset similarity threshold may be preset and stored in a memory (for example, a non-volatile memory) of the server.
  • the server establishes a correspondence relationship between P user IDs and target natural person IDs.
  • the server establishes the corresponding relationship between the user ID and the target natural person ID, and can associate multiple user IDs with one natural person.
  • the natural person ID in the embodiment of this application corresponds to a natural person.
  • This natural person may correspond to a mobile terminal (for example, a mobile phone), at least one phone number, at least one application account, at least one OpenID, one SSOID, at least one ICCID, and at least one IMEI.
  • For example, the IMEI, the phone number, and 5 application accounts of a mobile phone are labeled with one natural person ID.
  • the user behavior data corresponding to these 5 application accounts all belong to the user behavior data of this natural person ID.
  • A real natural person can have many user IDs (for example, the IMEI of a mobile phone, a phone number, and 5 application accounts), but corresponds to only one unique natural person ID.
  • the specific presentation form of the natural person ID can be a string of characters.
  • the natural person ID may correspond to an identification of a mobile terminal.
  • the server can establish a correspondence table of the user ID and the natural person ID after constructing the correspondence between the user ID and the target natural person ID.
  • one natural person ID can correspond to multiple user IDs.
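  • One way to picture the correspondence table is sketched below. Everything here is hypothetical: the function names, the plain dictionary used as the table, and the greedy way of collecting a mutually similar group (which starts from the first user ID in the bucket and may miss a larger group that excludes it). The description only requires P user IDs, with P at least 2, whose mutual similarity exceeds the second preset similarity threshold.

```python
from typing import Callable


def mutually_similar_group(
    user_ids: list[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> list[str]:
    """Greedily collect IDs that are each similar to every ID already collected."""
    group = []
    for uid in user_ids:
        if all(similarity(uid, member) > threshold for member in group):
            group.append(uid)
    # Fewer than two IDs means no group of P >= 2 was found.
    return group if len(group) >= 2 else []


def link_to_natural_person(group: list[str], natural_person_id: str,
                           table: dict[str, str]) -> None:
    """Record that every user ID in the group belongs to one natural person ID."""
    for uid in group:
        table[uid] = natural_person_id
```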
  • The server can push content (for example, various types of push messages) to the natural person ID.
  • The server can push the content to the mobile terminal corresponding to the natural person ID without sending the pushed content to each application account separately, thereby improving push efficiency.
  • After the server establishes the correspondence between the user ID and the target natural person ID, the correspondence can be stored in the server's database.
  • For a newly registered user ID, the server can analyze its user behavior data and compare it with the user behavior data of all stored natural person IDs. If the similarity between the newly registered user ID and the natural person ID most similar to it among all stored natural person IDs is greater than the preset similarity threshold, the server establishes a correspondence between that natural person ID and the newly registered user ID.
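  • Handling a newly registered user ID can be sketched as a best-match lookup. The function name, the feature representation, and the `similarity` callable are assumptions made for illustration; the description does not specify how behavior data is compared.

```python
from typing import Any, Callable, Optional


def assign_new_user(
    new_user_id: str,
    new_features: Any,
    person_features: dict[str, Any],
    similarity: Callable[[Any, Any], float],
    threshold: float,
    table: dict[str, str],
) -> Optional[str]:
    """Link the new user ID to the most similar stored natural person ID, if any."""
    if not person_features:
        return None
    # Find the stored natural person ID with the greatest similarity.
    best = max(person_features,
               key=lambda pid: similarity(new_features, person_features[pid]))
    if similarity(new_features, person_features[best]) > threshold:
        table[new_user_id] = best
        return best
    return None  # no stored natural person ID is similar enough
```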
  • the server includes hardware structures and/or software modules corresponding to each function.
  • the present invention can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or computer software-driven hardware depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered as going beyond the scope of the present invention.
  • the embodiment of the present application may divide the server side into functional units according to the foregoing method examples.
  • each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit. It should be noted that the division of units in the embodiments of the present application is illustrative, and is only a logical function division, and there may be other division methods in actual implementation.
  • FIG. 3 is a schematic structural diagram of a user similarity calculation device disclosed in an embodiment of the present application.
  • the user similarity calculation device 300 includes an extraction unit 301, a selection unit 302, a calculation unit 303, and a classification unit 304, wherein:
  • the extraction unit 301 is configured to extract at least two user characteristics of a first user ID to obtain a characteristic set of the first user ID, and the characteristic set includes at least the at least two user characteristics, and the first user ID is any one of N user IDs, and N is an integer greater than or equal to 2;
  • the selecting unit 302 is used to select a target hash function
  • the calculation unit 303 is configured to calculate the similarity between the target user characteristics of the N user IDs by using the target hash function to obtain the initial similarity between the N user IDs;
  • the classification unit 304 is configured to divide the N user IDs into M hash buckets according to the initial similarity between the N user IDs, where M is an integer greater than or equal to 2;
  • the calculation unit 303 is further configured to calculate the similarity between any two user IDs in the first hash bucket, where the first hash bucket is any one of the M hash buckets.
  • the selecting unit 302 selects the target hash function specifically by: determining the types of the at least two user characteristics; and determining the target hash function corresponding to the at least two user characteristics according to the correspondence between types and hash functions.
  • the calculation unit 303 uses the target hash function to calculate the similarity between the target user characteristics of the N user IDs specifically by: if the types of the at least two user characteristics are the first type, calculating the similarity between the target user characteristics of the N user IDs using a Hamming distance calculation formula.
  • the calculation unit 303 uses the target hash function to calculate the similarity between the target user characteristics of the N user IDs specifically by: if the types of the at least two user characteristics are the second type, calculating the similarity between the target user characteristics of the N user IDs using the Euclidean distance calculation formula.
  • the classification unit 304 divides the N user IDs into M hash buckets according to the initial similarity between the N user IDs specifically by: determining whether there is, among the N user IDs, a user ID whose initial similarity with the first user ID is greater than a first preset similarity threshold; and if so, dividing the first user ID and the user IDs among the N user IDs whose initial similarity with the first user ID is greater than the first preset similarity threshold into the same hash bucket.
  • the calculation unit 303 calculates the similarity between any two user IDs in the first hash bucket specifically by: acquiring a feature set of each user ID in the first hash bucket; determining the feature vector of each user ID in the first hash bucket based on the feature set of each user ID in the first hash bucket; calculating, based on the feature vector of each user ID in the first hash bucket, the distance between the feature vectors of any two user IDs in the first hash bucket using the Hamming distance calculation formula; and determining the similarity between any two user IDs in the first hash bucket according to the distance between their feature vectors.
  • the user similarity calculation device 300 further includes a determining unit 305 and an establishing unit 306.
  • the determining unit 305 is configured to, after the calculating unit 303 calculates the similarity between any two user IDs in the first hash bucket, where the first hash bucket is any one of the M hash buckets, determine whether there are P user IDs in the first hash bucket whose mutual similarity is greater than a second preset similarity threshold, where P is an integer greater than or equal to 2;
  • the establishing unit 306 is configured to, when the determining unit 305 determines that there are P user IDs in the first hash bucket whose mutual similarity is greater than the second preset similarity threshold, establish the correspondence between the P user IDs and a target natural person ID.
  • the extraction unit 301, the selection unit 302, the calculation unit 303, the classification unit 304, the determination unit 305, and the establishment unit 306 in FIG. 3 may be processors.
  • With the user similarity calculation device shown in FIG. 3, an appropriate hash function can be selected to divide the N user IDs into M hash buckets according to the initial similarity, and similarity is calculated only between user IDs within the same hash bucket. This avoids calculating the similarity between each user ID and all other user IDs, which reduces the amount of computation, lowers the complexity of the user similarity calculation, and increases its speed.
  • FIG. 4 is a schematic structural diagram of a server disclosed in an embodiment of the present application.
  • the server 400 includes a processor 401 and a memory 402.
  • the server 400 may also include a bus 403.
  • the processor 401 and the memory 402 may be connected to each other through the bus 403.
  • the bus 403 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus 403 can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 4, but this does not mean that there is only one bus or one type of bus.
  • the server 400 may further include a communication interface 404, and the server 400 may communicate with the client through the communication interface 404.
  • the memory 402 is used to store one or more programs containing instructions; the processor 401 is used to call the instructions stored in the memory 402 to execute some or all of the method steps in FIGS. 1 to 2.
  • a suitable hash function can be selected to divide the N user IDs into M hash buckets according to the initial similarity, and similarity is calculated only among the user IDs within each hash bucket. This avoids calculating the similarity between each user ID and all other user IDs, which reduces the amount of calculation, thereby reducing the complexity of the user similarity calculation and improving its speed.
  • An embodiment of the present application also provides a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute some or all of the steps of any of the user similarity calculation methods described in the above method embodiments.
  • the embodiments of the present application also provide a computer program product.
  • the computer program product includes a non-transitory computer-readable storage medium storing a computer program.
  • the computer program is operable to cause a computer to execute some or all of the steps of any of the user similarity calculation methods described in the foregoing method embodiments.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable memory.
  • the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present invention.
  • the aforementioned memory includes: USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), removable hard disk, magnetic disk, optical disc, and other media that can store program code.
  • the program can be stored in a computer-readable memory, and the memory can include: flash disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disc, etc.

Abstract

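A user similarity calculation method, apparatus, server, and storage medium. The method includes: extracting at least two user features of a first user ID to obtain a feature set of the first user ID, where the feature set includes at least the at least two user features, the first user ID is any one of N user IDs, and N is an integer greater than or equal to 2 (101); selecting, by the server, a target hash function, and using the target hash function to calculate the similarity between the target user features of the N user IDs to obtain the initial similarity between the N user IDs (102); dividing, by the server, the N user IDs into M hash buckets according to the magnitude of the initial similarity between the N user IDs, where M is an integer greater than or equal to 2 (103); and calculating, by the server, the similarity between any two user IDs in a first hash bucket, where the first hash bucket is any one of the M hash buckets (104). The method can reduce the complexity of user similarity calculation.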

Description

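User similarity calculation method, apparatus, server, and storage medium
Technical Field
This application relates to the field of communication technology, and in particular to a user similarity calculation method, apparatus, server, and storage medium.
Background
At present, data mining applications such as content push and advertisement recommendation share a very common problem: similarity calculation. For example, to calculate the similarity of two sets, one may loop over both to find their common elements, or calculate the Euclidean distance (or some other distance) pair by pair. After all calculations are complete, the results are sorted to find a certain number of similar data items.
For small, low-dimensional data sets, this approach is relatively simple and practical. However, when the data set is very large and high-dimensional, calculating directly over every pair of data items in the whole data set is very time-consuming, so the complexity of the similarity calculation is high.
Summary
The embodiments of this application provide a user similarity calculation method, apparatus, server, and storage medium, which can reduce the complexity of user similarity calculation.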
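In a first aspect, an embodiment of this application provides a user similarity calculation method, including:
extracting at least two user features of a first user ID to obtain a feature set of the first user ID, where the feature set includes at least the at least two user features, the first user ID is any one of N user IDs, and N is an integer greater than or equal to 2;
selecting a target hash function, and using the target hash function to calculate the similarity between the target user features of the N user IDs to obtain the initial similarity between the N user IDs;
dividing the N user IDs into M hash buckets according to the magnitude of the initial similarity between the N user IDs, where M is an integer greater than or equal to 2;
calculating the similarity between any two user IDs in a first hash bucket, where the first hash bucket is any one of the M hash buckets.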
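In a second aspect, an embodiment of this application provides a user similarity calculation apparatus, which includes an extraction unit, a selection unit, a calculation unit, and a classification unit, where:
the extraction unit is configured to extract at least two user features of a first user ID to obtain a feature set of the first user ID, where the feature set includes at least the at least two user features, the first user ID is any one of N user IDs, and N is an integer greater than or equal to 2;
the selection unit is configured to select a target hash function;
the calculation unit is configured to use the target hash function to calculate the similarity between the target user features of the N user IDs to obtain the initial similarity between the N user IDs;
the classification unit is configured to divide the N user IDs into M hash buckets according to the magnitude of the initial similarity between the N user IDs, where M is an integer greater than or equal to 2;
the calculation unit is further configured to calculate the similarity between any two user IDs in a first hash bucket, where the first hash bucket is any one of the M hash buckets.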
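In a third aspect, an embodiment of this application provides a server, including a processor and a memory. The memory is used to store one or more programs, the one or more programs are configured to be executed by the processor, and the programs include instructions for executing the steps in the first aspect of the embodiments of this application.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute some or all of the steps described in the first aspect of the embodiments of this application.
In a fifth aspect, an embodiment of this application provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps described in the first aspect of the embodiments of this application. The computer program product may be a software installation package.
It can be seen that the user similarity calculation method described in the embodiments of this application includes the following steps: extracting at least two user features of a first user ID to obtain a feature set of the first user ID, where the feature set includes at least the at least two user features, the first user ID is any one of N user IDs, and N is an integer greater than or equal to 2; selecting a target hash function, and using the target hash function to calculate the similarity between the target user features of the N user IDs to obtain the initial similarity between the N user IDs; dividing the N user IDs into M hash buckets according to the magnitude of the initial similarity between the N user IDs, where M is an integer greater than or equal to 2; and calculating the similarity between any two user IDs in a first hash bucket, where the first hash bucket is any one of the M hash buckets. When user similarity is calculated according to the embodiments of this application, a suitable hash function can be selected to divide the N user IDs into M hash buckets according to the initial similarity, and similarity is calculated only among the user IDs within each hash bucket. This avoids calculating the similarity between each user ID and all other user IDs, which reduces the amount of calculation, thereby reducing the complexity of the user similarity calculation and improving its speed.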
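Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a user similarity calculation method disclosed in an embodiment of this application;
FIG. 2 is a schematic flowchart of another user similarity calculation method disclosed in an embodiment of this application;
FIG. 3 is a schematic structural diagram of a user similarity calculation apparatus disclosed in an embodiment of this application;
FIG. 4 is a schematic structural diagram of a server disclosed in an embodiment of this application.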
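Detailed Description
To help those skilled in the art better understand the solutions of the present invention, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The terms "first", "second", and the like in the specification, the claims, and the above drawings of the present invention are used to distinguish different objects, not to describe a specific order. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present invention. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The mobile terminals involved in the embodiments of this application may include various handheld devices, in-vehicle devices, wearable devices, and computing devices with wireless communication functions, or other processing devices connected to a wireless modem, as well as various forms of user equipment (User Equipment, UE), mobile stations (Mobile Station, MS), terminal devices, and so on. For ease of description, the devices mentioned above are collectively referred to as mobile terminals.
The embodiments of this application are described in detail below.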
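Please refer to FIG. 1, which is a schematic flowchart of a user similarity calculation method disclosed in an embodiment of this application. As shown in FIG. 1, the user similarity calculation method includes the following steps.
101. The server extracts at least two user features of a first user ID to obtain a feature set of the first user ID. The feature set includes at least the at least two user features, the first user ID is any one of N user IDs, and N is an integer greater than or equal to 2.
In the embodiments of this application, the server serves clients, for example by providing resources to a client and storing client data. The server is a purpose-built service program, and the device that runs it may be called a server machine. The server can establish connections with multiple clients at the same time and serve them simultaneously. The services that the server provides to clients in the embodiments of this application may include a content push service, which may include a browser content push service, an application download push service, a game content push service, and so on. The server may include an application server, a browser server, a game server, and so on.
A user ID may be of any one or more of the following types: single sign-on identity (SSOID), OpenID, integrated circuit card identity (ICCID), International Mobile Equipment Identity (IMEI), telephone number (TEL), Globally Unique Identifier (GUID), and so on. With single sign-on (SSO), a user logs in once and can then access all mutually trusted application systems.
The server extracting at least two user features of the first user ID may include:
the server extracts at least two user features of the first user ID based on the historical user behavior data of the first user ID.
User behavior data may include device features, location features, and application (APP) features. Device features may include the device model, the device identifier, the device's media access control address (MAC address), and device usage habits (for example, backlight brightness, volume, holding posture, average usage time, power-on time, and power-off time). Location features may include Global Positioning System (GPS) positioning information (for example, latitude and longitude) and location tracks from a location based service (LBS). Application features may include application settings (for example, application brightness, volume, and refresh rate), the times at which an application is opened and closed, application feature usage, continuous running time, cumulative running time, installation data, uninstallation data, and so on.
The server can extract the device features, location features, and APP features from the user behavior data of the first user ID and combine them into the feature set of the first user ID.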
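Step 101 can be sketched as grouping raw behavior records into the three feature families named above. This is a minimal illustration, not the applicant's implementation: the `(kind, name, value)` record layout, the `FeatureSet` class, and the rule of ignoring unknown record kinds are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSet:
    """Feature set of one user ID: device, location and APP features."""
    device: dict = field(default_factory=dict)
    location: dict = field(default_factory=dict)
    app: dict = field(default_factory=dict)

    def feature_count(self) -> int:
        return len(self.device) + len(self.location) + len(self.app)


def extract_feature_set(behavior_records):
    """Build a FeatureSet from hypothetical (kind, name, value) records."""
    feature_set = FeatureSet()
    groups = {
        "device": feature_set.device,
        "location": feature_set.location,
        "app": feature_set.app,
    }
    for kind, name, value in behavior_records:
        if kind not in groups:
            continue  # assumption: record kinds outside the three families are ignored
        groups[kind][name] = value
    # The method requires at least two user features per user ID.
    if feature_set.feature_count() < 2:
        raise ValueError("at least two user features are required")
    return feature_set
```

Keeping the three families separate makes it easy to pick a single target user feature (for example, the MAC address) for the coarse bucketing step while retaining the full set for the later in-bucket comparison.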
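102. The server selects a target hash function and uses it to calculate the similarity between the target user features of the N user IDs to obtain the initial similarity between the N user IDs.
In the embodiments of this application, the server can select the target hash function according to the at least two user features included in the feature set of the first user ID.
Optionally, the server selects the target hash function as follows:
(11) the server determines the type of the at least two user features;
(12) the server determines the target hash function corresponding to the at least two user features according to the correspondence between types and hash functions.
In the embodiments of this application, the types of the at least two user features may include a first type and a second type. The first type may include a large-data-volume feature type, and the second type may include a small-data-volume feature type. The large-data-volume feature type refers to user features whose data volume is large, and the small-data-volume feature type refers to user features whose data volume is small. Specifically, the server can determine the type of a user feature according to the number of bytes of data it contains. For example, the server can determine that a user feature whose data contains more bytes than a preset byte count is of the large-data-volume feature type, and that a user feature whose data contains a number of bytes less than or equal to the preset byte count is of the small-data-volume feature type. For example, a wireless MAC address among the user features is of the large-data-volume feature type, and latitude and longitude information among the user features is of the small-data-volume feature type.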
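The type check and the type-to-function lookup in steps (11) and (12) can be sketched as follows. The preset byte count of 16, the textual MAC encoding, the packed latitude/longitude encoding, and the choice to reject mixed types are assumptions made for illustration; the source only states that the byte count is compared with a preset value and that the first type maps to the Hamming distance formula and the second type to the Euclidean distance formula.

```python
import struct

PRESET_BYTE_COUNT = 16  # hypothetical preset; the source gives no value
FIRST_TYPE = "large-data-volume"
SECOND_TYPE = "small-data-volume"

# Correspondence between feature type and distance formula, as described in step 102.
HASH_FUNCTION_BY_TYPE = {FIRST_TYPE: "hamming", SECOND_TYPE: "euclidean"}


def encode_lat_lon(latitude: float, longitude: float) -> bytes:
    """Assumed encoding: two little-endian doubles (16 bytes)."""
    return struct.pack("<dd", latitude, longitude)


def feature_type(data: bytes, preset: int = PRESET_BYTE_COUNT) -> str:
    """More bytes than the preset means first type; otherwise second type."""
    return FIRST_TYPE if len(data) > preset else SECOND_TYPE


def select_target_hash_function(features) -> str:
    """Pick the distance formula when all features share one type."""
    types = {feature_type(data) for data in features}
    if len(types) != 1:
        # Assumption: the source only covers the all-first and all-second cases.
        raise ValueError("mixed feature types are not covered by the description")
    return HASH_FUNCTION_BY_TYPE[types.pop()]
```

With these assumed encodings, a 17-byte textual MAC address such as `b"aa:bb:cc:dd:ee:ff"` falls into the first type and a packed latitude/longitude pair into the second, which matches the example given in the description.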
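Optionally, in step 102, the server uses the target hash function to calculate the similarity between the target user features of the N user IDs as follows:
if the at least two user features are all of the first type, the server uses the Hamming distance formula to calculate the similarity between the target user features of the N user IDs.
For example, to calculate the similarity between the target user feature of the first user ID and the target user feature of the second user ID, the server obtains a first target vector for the target user feature of the first user ID and a second target vector for the target user feature of the second user ID, and compares the two vectors position by position. If the values at a position are the same, the Hamming distance for that position is 0; if they differ, it is 1. The per-position distances are summed to obtain the final Hamming distance between the first target vector and the second target vector. Here, the first user ID and the second user ID are two different user IDs among the N user IDs.
The smaller the calculated Hamming distance, the higher the similarity; the larger the calculated Hamming distance, the lower the similarity.
For example, suppose the target user feature of the first user ID is a first wireless MAC address, the target user feature of the second user ID is a second wireless MAC address, and the first and second target vectors both have 10 positions. Comparing the two vectors position by position and summing the per-position distances as described above gives a final Hamming distance between 0 and 10. The larger the final Hamming distance, the lower the similarity between the target user features of the first and second user IDs; the smaller the final Hamming distance, the higher the similarity. For example, if the Hamming distance is between 0 and 3, the similarity between the target user feature of the first user ID and that of the second user ID is considered greater than the first preset similarity threshold, and the first user ID and the second user ID are placed in the same hash bucket. If the Hamming distance is between 4 and 10, the similarity is considered less than or equal to the first preset similarity threshold, and the first user ID and the second user ID are determined not to belong to the same hash bucket.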
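The 10-position example above can be sketched directly. The function names and the representation of a target vector as a string of `0`/`1` characters are assumptions for illustration; the cut-off of 3 is the one used in the example.

```python
def hamming_distance(first_vector: str, second_vector: str) -> int:
    """Count the positions at which two equal-length vectors differ."""
    if len(first_vector) != len(second_vector):
        raise ValueError("target vectors must have the same length")
    return sum(a != b for a, b in zip(first_vector, second_vector))


def same_bucket_by_hamming(first_vector: str, second_vector: str,
                           max_distance: int = 3) -> bool:
    """A distance of 0..max_distance counts as 'similar enough' for one bucket."""
    return hamming_distance(first_vector, second_vector) <= max_distance
```

For instance, `hamming_distance("1011010011", "1001010111")` is 2, so under the example's cut-off the two user IDs would be placed in the same hash bucket.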
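Optionally, in step 102, the server uses the target hash function to calculate the similarity between the target user features of the N user IDs as follows:
if the at least two user features are all of the second type, the server uses the Euclidean distance formula to calculate the similarity between the target user features of the N user IDs.
For example, if the target user feature is latitude and longitude information, then to calculate the similarity between the target user feature of the first user ID and that of the second user ID, the server obtains the longitude and latitude parameters of the first user ID and the longitude and latitude parameters of the second user ID. Here, the first user ID and the second user ID are two different user IDs among the N user IDs.
For example, the server can obtain the latitude and longitude information of the first user ID (longitude x 1, latitude y 1) and of the second user ID (longitude x 2, latitude y 2), and use the following Euclidean distance formula to calculate the similarity between the target user feature of the first user ID and that of the second user ID: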
$$ d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} $$
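If d is less than or equal to a preset threshold, the similarity between the target user feature of the first user ID and that of the second user ID is greater than the first preset similarity threshold, and the first user ID and the second user ID are placed in the same hash bucket. If d is greater than the preset threshold, the similarity is less than or equal to the first preset similarity threshold, and the first user ID and the second user ID are determined not to belong to the same hash bucket.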
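The Euclidean case can be sketched in the same way. The function names and the tuple layout `(longitude, latitude)` are assumptions; the threshold value is left to the caller because the source does not give one.

```python
import math


def euclidean_distance(first_point, second_point) -> float:
    """Planar distance between two (longitude, latitude) pairs."""
    x1, y1 = first_point
    x2, y2 = second_point
    return math.hypot(x1 - x2, y1 - y2)


def same_bucket_by_distance(first_point, second_point,
                            preset_threshold: float) -> bool:
    """d <= threshold means the initial similarity exceeds the first threshold."""
    return euclidean_distance(first_point, second_point) <= preset_threshold
```

Note that this follows the description in treating longitude and latitude as planar coordinates. That is a reasonable coarse filter for nearby points, but it is not a geodesic distance.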
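103. The server divides the N user IDs into M hash buckets according to the magnitude of the initial similarity between the N user IDs, where M is an integer greater than or equal to 2.
In the embodiments of this application, among the N user IDs, the greater the initial similarity between any two user IDs, the more likely they are to be placed in the same hash bucket; the smaller the initial similarity between any two user IDs, the less likely they are to be placed in the same hash bucket.
Optionally, step 103 may include the following steps:
(21) the server determines whether the N user IDs include a user ID whose initial similarity to the first user ID is greater than the first preset similarity threshold;
(22) if so, the server places the first user ID, together with the user IDs among the N user IDs whose initial similarity to the first user ID is greater than the first preset similarity threshold, in the same hash bucket.
In the embodiments of this application, the server first randomly selects one user ID from the N user IDs, for example the first user ID, and compares its initial similarity with each of the other user IDs. The server determines whether the N user IDs include a user ID whose initial similarity to the first user ID is greater than the first preset similarity threshold. If so, the server places the first user ID and all such user IDs in the same hash bucket (for example, the first hash bucket). Next, from the user IDs still to be assigned (the N user IDs minus those in the first hash bucket), the server randomly selects another user ID, for example the second user ID, and compares its initial similarity with each of the other user IDs still to be assigned. The server determines whether the user IDs still to be assigned include a user ID whose initial similarity to the second user ID is greater than the first preset similarity threshold. If so, the server places the second user ID and all such user IDs in the same hash bucket (for example, the second hash bucket). This continues until the N user IDs have been divided into M hash buckets. Each user ID can be placed in only one of the M hash buckets, and the total number of user IDs across the M hash buckets equals N. The number of user IDs in each hash bucket may be the same or different.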
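The bucketing procedure described above can be sketched as a greedy loop. The function name, the `initial_similarity` callback, and the seeded random generator are assumptions. The source does not say what happens when no other user ID exceeds the threshold for the selected user ID; this sketch assumes that user ID then forms a bucket of its own, so that every user ID ends up in exactly one bucket.

```python
import random


def partition_into_buckets(user_ids, initial_similarity, threshold, rng=None):
    """Greedily split user IDs into buckets around randomly chosen seeds.

    initial_similarity(a, b) returns a number; larger means more similar.
    """
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch repeatable
    pending = list(user_ids)
    buckets = []
    while pending:
        # Randomly select one user ID from those still to be assigned.
        seed = pending.pop(rng.randrange(len(pending)))
        bucket, remaining = [seed], []
        for uid in pending:
            if initial_similarity(seed, uid) > threshold:
                bucket.append(uid)      # similar enough: same bucket as the seed
            else:
                remaining.append(uid)   # stays in the pool for a later seed
        pending = remaining
        buckets.append(bucket)          # assumption: a lone seed is its own bucket
    return buckets
```

Because membership is decided only by similarity to the seed, two user IDs in one bucket are not guaranteed to be similar to each other, which is one reason the finer pairwise calculation of step 104 is still needed inside each bucket.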
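104. The server calculates the similarity between any two user IDs in the first hash bucket, where the first hash bucket is any one of the M hash buckets.
In the embodiments of this application, the server can take any two user IDs in the first hash bucket, for example the first user ID and the second user ID. The server obtains the at least two user features of the first user ID and the at least two user features of the second user ID, and calculates the similarity between the first user ID and the second user ID based on these features. Only the similarity between user IDs in the same hash bucket needs to be calculated. Because the number of user IDs in each hash bucket is limited, when N is large the amount of calculation needed within each hash bucket is much smaller than that needed across all N user IDs. This reduces the amount of calculation, thereby reducing the complexity of the user similarity calculation and improving its speed.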
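The saving can be made concrete by counting pairwise comparisons. The helper names below are assumptions, and the count covers only the fine-grained similarity of step 104; the cost of the coarse bucketing pass in step 103 is not included.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n user IDs: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def bucketed_pair_count(bucket_sizes) -> int:
    """Pairs compared when similarity is calculated only inside each bucket."""
    return sum(pair_count(size) for size in bucket_sizes)
```

As an illustration with made-up sizes, 1000 user IDs require 499500 comparisons when every pair is compared, but only 49500 when they are split evenly into 10 buckets of 100.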
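Optionally, step 104 may include the following steps:
(31) the server obtains the feature set of each user ID in the first hash bucket;
(32) the server determines the feature vector of each user ID in the first hash bucket based on the feature set of that user ID;
(33) based on the feature vector of each user ID in the first hash bucket, the server uses the Hamming distance formula to calculate the distance between the feature vectors of any two user IDs in the first hash bucket;
(34) the server determines the similarity between any two user IDs in the first hash bucket according to the distance between their feature vectors.
In the embodiments of this application, because the number of user features in a feature set is relatively large, using the Hamming distance formula can greatly reduce the amount of calculation and improve the speed of the similarity calculation.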
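Steps (31) to (34) can be sketched as follows. The source does not say how a feature set becomes a feature vector or how a distance becomes a similarity, so two assumptions are made here: the vector is a 0/1 presence vector over a hypothetical fixed `vocabulary` of feature items, and the similarity is `1 - distance / vector_length`.

```python
from itertools import combinations


def to_feature_vector(feature_set, vocabulary):
    """Assumed encoding: 1 if the feature item is present, else 0."""
    return [1 if item in feature_set else 0 for item in vocabulary]


def bucket_similarities(bucket, vocabulary):
    """Return {(uid_a, uid_b): similarity} for every pair in one hash bucket.

    bucket maps each user ID to its feature set (a set of feature items).
    """
    vectors = {uid: to_feature_vector(features, vocabulary)
               for uid, features in bucket.items()}
    similarities = {}
    for uid_a, uid_b in combinations(sorted(vectors), 2):
        distance = sum(a != b for a, b in zip(vectors[uid_a], vectors[uid_b]))
        # Assumed mapping: smaller Hamming distance gives higher similarity.
        similarities[(uid_a, uid_b)] = 1 - distance / len(vocabulary)
    return similarities
```

Any monotonically decreasing mapping from distance to similarity would respect the rule stated earlier that a smaller Hamming distance means a higher similarity; the linear one is used only because it is easy to read.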
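When user similarity is calculated according to the embodiments of this application, a suitable hash function can be selected to divide the N user IDs into M hash buckets according to the initial similarity, and similarity is calculated only among the user IDs within each hash bucket. This avoids calculating the similarity between each user ID and all other user IDs, which reduces the amount of calculation, thereby reducing the complexity of the user similarity calculation and improving its speed.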
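Please refer to FIG. 2, which is a schematic flowchart of another user similarity calculation method disclosed in an embodiment of this application. FIG. 2 is a further refinement of FIG. 1. As shown in FIG. 2, the user similarity calculation method includes the following steps.
201. The server extracts at least two user features of a first user ID to obtain a feature set of the first user ID. The feature set includes at least the at least two user features, the first user ID is any one of N user IDs, and N is an integer greater than or equal to 2.
202. The server selects a target hash function and uses it to calculate the similarity between the target user features of the N user IDs to obtain the initial similarity between the N user IDs.
203. The server divides the N user IDs into M hash buckets according to the magnitude of the initial similarity between the N user IDs, where M is an integer greater than or equal to 2.
204. The server calculates the similarity between any two user IDs in the first hash bucket, where the first hash bucket is any one of the M hash buckets.
For the specific implementation of steps 201 to 204 in this embodiment, refer to steps 101 to 104 shown in FIG. 1; details are not repeated here.
205. The server determines whether the first hash bucket contains P user IDs whose mutual similarity is greater than a second preset similarity threshold, where P is an integer greater than or equal to 2.
The second preset similarity threshold can be set in advance and stored in the server's memory (for example, non-volatile memory).
206. If so, the server establishes the correspondence between the P user IDs and a target natural person ID.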
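In the embodiments of this application, by establishing the correspondence between user IDs and the target natural person ID, the server can associate multiple user IDs with one natural person.
A natural person ID in the embodiments of this application corresponds to one natural person. That natural person may correspond to one mobile terminal (for example, a mobile phone), at least one telephone number, at least one application account, at least one OpenID, one SSOID, at least one ICCID, and at least one IMEI. For example, if a natural person owns one mobile phone, one telephone number, and 5 application accounts, then the phone's IMEI, the telephone number, and the 5 application accounts are all labeled with one natural person ID, and the user behavior data of the 5 application accounts all belongs to that natural person ID. In this way, a real natural person can have many user IDs (for example, one phone IMEI, one telephone number, and 5 application accounts) but corresponds to only one unique natural person ID. A natural person ID may take the form of a character string, and may correspond to the identifier of a mobile terminal.
After performing step 206 and constructing the correspondence between user IDs and the target natural person ID, the server can build a correspondence table between user IDs and natural person IDs. In this table, one natural person ID can correspond to multiple user IDs.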
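Steps 205 and 206 can be sketched as finding a group of user IDs that are all pairwise similar and then recording them under one natural person ID. The source does not say how the P user IDs are found; the greedy search below, the `similarity` callback, the dictionary used as the correspondence table, and the use of `uuid` to mint a natural person ID are all assumptions for illustration.

```python
import uuid


def mutually_similar_group(user_ids, similarity, threshold):
    """Greedy search for the largest group whose pairwise similarity > threshold."""
    best = []
    for start in user_ids:
        group = [start]
        for uid in user_ids:
            # Add uid only if it is similar to every member already in the group.
            if uid != start and all(similarity(uid, member) > threshold
                                    for member in group):
                group.append(uid)
        if len(group) > len(best):
            best = group
    return best if len(best) >= 2 else []  # P must be at least 2


def link_to_natural_person(group, table, new_id=lambda: uuid.uuid4().hex):
    """Record every user ID in the group under one natural person ID."""
    if not group:
        return None
    person_id = new_id()  # assumption: the ID is just an opaque character string
    for uid in group:
        table[uid] = person_id
    return person_id
```

The greedy search is not guaranteed to find the largest possible group in every case; it is only meant to show the "mutual" requirement, namely that each member must exceed the threshold against every other member, not just against one of them.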
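Optionally, the server can push content (for example, various types of push messages) to a natural person ID. For example, when the server pushes content to the target natural person ID, it can push the content to the mobile terminal corresponding to that natural person ID without sending the content separately to each application account, which improves push efficiency.
After establishing the correspondence between user IDs and the target natural person ID, the server can store the correspondence in the server's database.
Optionally, the server can analyze the user behavior data of a newly registered user ID against the user behavior data of all stored natural person IDs. If, among all stored natural person IDs, the one most similar to the newly registered user ID has a similarity greater than a preset similarity threshold, the server establishes the correspondence between that natural person ID and the newly registered user ID.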
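The handling of a newly registered user ID can be sketched as a best-match lookup. The function name, the `similarity(new_uid, person_id)` callback, and the dictionary used as the correspondence table are assumptions; the source only states that the most similar stored natural person ID is linked when its similarity exceeds a preset threshold.

```python
def assign_new_user(new_uid, person_ids, similarity, threshold, table):
    """Link new_uid to the most similar natural person ID, if similar enough.

    Returns the chosen natural person ID, or None when no link is made.
    """
    if not person_ids:
        return None
    # Find the stored natural person ID with the highest similarity.
    best = max(person_ids, key=lambda pid: similarity(new_uid, pid))
    if similarity(new_uid, best) > threshold:
        table[new_uid] = best
        return best
    return None  # assumption: below the threshold, the new user ID stays unlinked
```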
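The above mainly introduces the solutions of the embodiments of this application from the perspective of the method-side execution process. It can be understood that, to implement the above functions, the server includes hardware structures and/or software modules corresponding to each function. Those skilled in the art should readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the present invention can be implemented in hardware or in a combination of hardware and computer software. Whether a function is executed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. Skilled practitioners can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present invention.
The embodiments of this application can divide the server into functional units according to the above method examples. For example, each functional unit can correspond to one function, or two or more functions can be integrated into one processing unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in the embodiments of this application is illustrative and is only a logical function division; other divisions are possible in actual implementation.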
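Please refer to FIG. 3, which is a schematic structural diagram of a user similarity calculation apparatus disclosed in an embodiment of this application. As shown in FIG. 3, the user similarity calculation apparatus 300 includes an extraction unit 301, a selection unit 302, a calculation unit 303, and a classification unit 304, where:
the extraction unit 301 is configured to extract at least two user features of a first user ID to obtain a feature set of the first user ID, where the feature set includes at least the at least two user features, the first user ID is any one of N user IDs, and N is an integer greater than or equal to 2;
the selection unit 302 is configured to select a target hash function;
the calculation unit 303 is configured to use the target hash function to calculate the similarity between the target user features of the N user IDs to obtain the initial similarity between the N user IDs;
the classification unit 304 is configured to divide the N user IDs into M hash buckets according to the magnitude of the initial similarity between the N user IDs, where M is an integer greater than or equal to 2;
the calculation unit 303 is further configured to calculate the similarity between any two user IDs in a first hash bucket, where the first hash bucket is any one of the M hash buckets.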
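Optionally, the selection unit 302 selects the target hash function by: determining the type of the at least two user features; and determining the target hash function corresponding to the at least two user features according to the correspondence between types and hash functions.
Optionally, the calculation unit 303 uses the target hash function to calculate the similarity between the target user features of the N user IDs as follows: if the at least two user features are of the first type, it uses the Hamming distance formula to calculate the similarity between the target user features of the N user IDs.
Optionally, the calculation unit 303 uses the target hash function to calculate the similarity between the target user features of the N user IDs as follows: if the at least two user features are of the second type, it uses the Euclidean distance formula to calculate the similarity between the target user features of the N user IDs.
Optionally, the classification unit 304 divides the N user IDs into M hash buckets according to the magnitude of the initial similarity between the N user IDs by: determining whether the N user IDs include a user ID whose initial similarity to the first user ID is greater than the first preset similarity threshold; and, if so, placing the first user ID, together with the user IDs among the N user IDs whose initial similarity to the first user ID is greater than the first preset similarity threshold, in the same hash bucket.
Optionally, the calculation unit 303 calculates the similarity between any two user IDs in the first hash bucket by: obtaining the feature set of each user ID in the first hash bucket; determining the feature vector of each user ID in the first hash bucket based on its feature set; calculating, based on these feature vectors and using the Hamming distance formula, the distance between the feature vectors of any two user IDs in the first hash bucket; and determining the similarity between any two user IDs in the first hash bucket according to the distance between their feature vectors.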
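Optionally, the user similarity calculation apparatus 300 includes a determining unit 305 and an establishing unit 306.
The determining unit 305 is configured to, after the calculation unit 303 calculates the similarity between any two user IDs in the first hash bucket (the first hash bucket being any one of the M hash buckets), determine whether the first hash bucket contains P user IDs whose mutual similarity is greater than a second preset similarity threshold, where P is an integer greater than or equal to 2;
the establishing unit 306 is configured to, when the determining unit 305 determines that the first hash bucket contains P user IDs whose mutual similarity is greater than the second preset similarity threshold, establish the correspondence between the P user IDs and a target natural person ID.
The extraction unit 301, selection unit 302, calculation unit 303, classification unit 304, determining unit 305, and establishing unit 306 in FIG. 3 may be processors.
With the user similarity calculation apparatus shown in FIG. 3, when user similarity is calculated, a suitable hash function can be selected to divide the N user IDs into M hash buckets according to the initial similarity, and similarity is calculated only among the user IDs within each hash bucket. This avoids calculating the similarity between each user ID and all other user IDs, which reduces the amount of calculation, thereby reducing the complexity of the user similarity calculation and improving its speed.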
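Please refer to FIG. 4, which is a schematic structural diagram of a server disclosed in an embodiment of this application. As shown in FIG. 4, the server 400 includes a processor 401 and a memory 402. The server 400 may also include a bus 403, through which the processor 401 and the memory 402 can be connected to each other. The bus 403 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 403 can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 4, but this does not mean that there is only one bus or one type of bus. The server 400 may further include a communication interface 404, through which the server 400 can communicate with clients. The memory 402 is used to store one or more programs containing instructions; the processor 401 is used to call the instructions stored in the memory 402 to execute some or all of the method steps in FIGS. 1 to 2.
With the server shown in FIG. 4, when user similarity is calculated, a suitable hash function can be selected to divide the N user IDs into M hash buckets according to the initial similarity, and similarity is calculated only among the user IDs within each hash bucket. This avoids calculating the similarity between each user ID and all other user IDs, which reduces the amount of calculation, thereby reducing the complexity of the user similarity calculation and improving its speed.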
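An embodiment of this application also provides a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute some or all of the steps of any of the user similarity calculation methods described in the above method embodiments.
An embodiment of this application also provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program. The computer program is operable to cause a computer to execute some or all of the steps of any of the user similarity calculation methods described in the above method embodiments.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combined actions. However, those skilled in the art should know that the present invention is not limited by the described order of actions, because according to the present invention some steps can be performed in a different order or simultaneously. Those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.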
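In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative. The division of the units is only a logical function division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection displayed or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable memory. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned memory includes: USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), removable hard disk, magnetic disk, optical disc, and other media that can store program code.
A person of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable memory, and the memory may include: flash disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disc, etc.
The embodiments of this application are described in detail above. Specific examples are used herein to explain the principles and implementation of the present invention, and the descriptions of the above embodiments are only intended to help understand the method and core idea of the present invention. Meanwhile, a person of ordinary skill in the art may, based on the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.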

Claims (10)

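  1. A user similarity calculation method, characterized by comprising:
    extracting at least two user features of a first user ID to obtain a feature set of the first user ID, where the feature set includes at least the at least two user features, the first user ID is any one of N user IDs, and N is an integer greater than or equal to 2;
    selecting a target hash function, and using the target hash function to calculate the similarity between the target user features of the N user IDs to obtain the initial similarity between the N user IDs;
    dividing the N user IDs into M hash buckets according to the magnitude of the initial similarity between the N user IDs, where M is an integer greater than or equal to 2;
    calculating the similarity between any two user IDs in a first hash bucket, where the first hash bucket is any one of the M hash buckets.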
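  2. The method according to claim 1, characterized in that selecting the target hash function comprises:
    determining the type of the at least two user features;
    determining the target hash function corresponding to the at least two user features according to the correspondence between types and hash functions.
  3. The method according to claim 2, characterized in that using the target hash function to calculate the similarity between the target user features of the N user IDs comprises:
    if the at least two user features are of the first type, using the Hamming distance formula to calculate the similarity between the target user features of the N user IDs.
  4. The method according to claim 2, characterized in that using the target hash function to calculate the similarity between the target user features of the N user IDs comprises:
    if the at least two user features are of the second type, using the Euclidean distance formula to calculate the similarity between the target user features of the N user IDs.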
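  5. The method according to claim 1, characterized in that dividing the N user IDs into M hash buckets according to the magnitude of the initial similarity between the N user IDs comprises:
    determining whether the N user IDs include a user ID whose initial similarity to the first user ID is greater than a first preset similarity threshold;
    if so, placing the first user ID, together with the user IDs among the N user IDs whose initial similarity to the first user ID is greater than the first preset similarity threshold, in the same hash bucket.
  6. The method according to claim 1, characterized in that calculating the similarity between any two user IDs in the first hash bucket comprises:
    obtaining the feature set of each user ID in the first hash bucket;
    determining the feature vector of each user ID in the first hash bucket based on the feature set of that user ID;
    calculating, based on the feature vector of each user ID in the first hash bucket and using the Hamming distance formula, the distance between the feature vectors of any two user IDs in the first hash bucket;
    determining the similarity between any two user IDs in the first hash bucket according to the distance between their feature vectors.
  7. The method according to any one of claims 1 to 6, characterized in that, after calculating the similarity between any two user IDs in the first hash bucket, the first hash bucket being any one of the M hash buckets, the method further comprises:
    determining whether the first hash bucket contains P user IDs whose mutual similarity is greater than a second preset similarity threshold, where P is an integer greater than or equal to 2;
    if so, establishing the correspondence between the P user IDs and a target natural person ID.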
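  8. A user similarity calculation apparatus, characterized in that the user similarity calculation apparatus comprises an extraction unit, a selection unit, a calculation unit, and a classification unit, where:
    the extraction unit is configured to extract at least two user features of a first user ID to obtain a feature set of the first user ID, where the feature set includes at least the at least two user features, the first user ID is any one of N user IDs, and N is an integer greater than or equal to 2;
    the selection unit is configured to select a target hash function;
    the calculation unit is configured to use the target hash function to calculate the similarity between the target user features of the N user IDs to obtain the initial similarity between the N user IDs;
    the classification unit is configured to divide the N user IDs into M hash buckets according to the magnitude of the initial similarity between the N user IDs, where M is an integer greater than or equal to 2;
    the calculation unit is further configured to calculate the similarity between any two user IDs in a first hash bucket, where the first hash bucket is any one of the M hash buckets.
  9. A server, characterized by comprising a processor and a memory, where the memory is used to store one or more programs, the one or more programs are configured to be executed by the processor, and the programs include instructions for executing the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program for electronic data exchange, where the computer program causes a computer to execute the method according to any one of claims 1 to 7.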
PCT/CN2019/093109 2019-06-26 2019-06-26 User similarity calculation method, apparatus, server, and storage medium WO2020258101A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/093109 WO2020258101A1 (zh) 2019-06-26 2019-06-26 User similarity calculation method, apparatus, server, and storage medium
CN201980091291.0A CN113383314B (zh) 2019-06-26 2019-06-26 User similarity calculation method, apparatus, server, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/093109 WO2020258101A1 (zh) 2019-06-26 2019-06-26 User similarity calculation method, apparatus, server, and storage medium

Publications (1)

Publication Number Publication Date
WO2020258101A1 true WO2020258101A1 (zh) 2020-12-30

Family

ID=74061169

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/093109 WO2020258101A1 (zh) 2019-06-26 2019-06-26 User similarity calculation method, apparatus, server, and storage medium

Country Status (2)

Country Link
CN (1) CN113383314B (zh)
WO (1) WO2020258101A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8515964B2 (en) * 2011-07-25 2013-08-20 Yahoo! Inc. Method and system for fast similarity computation in high dimensional space
CN105608219A (zh) * 2016-01-07 2016-05-25 上海通创信息技术有限公司 Clustering-based streaming recommendation engine, recommendation system, and recommendation method
CN109255640A (zh) * 2017-07-13 2019-01-22 阿里健康信息技术有限公司 Method, apparatus, and system for determining user groups
CN109815406A (zh) * 2019-01-31 2019-05-28 腾讯科技(深圳)有限公司 Data processing and information recommendation method and apparatus

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622366B (zh) * 2011-01-28 2014-07-30 阿里巴巴集团控股有限公司 Method and apparatus for identifying similar images
WO2013178286A1 (en) * 2012-06-01 2013-12-05 Qatar Foundation A method for processing a large-scale data set, and associated apparatus
CN106570141B (zh) * 2016-11-04 2020-05-19 中国科学院自动化研究所 Near-duplicate image detection method
CN109697641A (zh) * 2017-10-20 2019-04-30 北京京东尚科信息技术有限公司 Method and apparatus for calculating commodity similarity
CN109800325B (zh) * 2018-12-26 2021-10-26 北京达佳互联信息技术有限公司 Video recommendation method, apparatus, and computer-readable storage medium
CN109558512B (zh) * 2019-01-24 2020-07-14 广州荔支网络技术有限公司 Audio-based personalized recommendation method, apparatus, and mobile terminal

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117061254A (zh) * 2023-10-12 2023-11-14 之江实验室 Abnormal traffic detection method, apparatus, and computer device
CN117061254B (zh) * 2023-10-12 2024-01-23 之江实验室 Abnormal traffic detection method, apparatus, and computer device

Also Published As

Publication number Publication date
CN113383314B (zh) 2023-01-10
CN113383314A (zh) 2021-09-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19935166

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19935166

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.05.2022)