CN108600792B - Similarity measurement method, device, equipment and storage medium - Google Patents

Info

Publication number
CN108600792B
Authority
CN
China
Prior art keywords
user
similarity
determining
articles
entropy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810284500.0A
Other languages
Chinese (zh)
Other versions
CN108600792A (en
Inventor
王璐
陈少杰
张文明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Douyu Network Technology Co Ltd
Original Assignee
Wuhan Douyu Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Douyu Network Technology Co Ltd
Priority to CN201810284500.0A
Publication of CN108600792A
Application granted
Publication of CN108600792B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/258 Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
    • H04N21/25866 Management of end-user data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • H04N21/251 Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/252 Processing of multiple end-users' preferences to derive collaborative data
    • H04N21/25891 Management of end-user data being end-user preferences
    • H04N21/262 Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists
    • H04N21/26258 Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists for generating a list of items to be played back in a given order, e.g. playlist, or scheduling item distribution according to such list
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45 Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466 Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4667 Processing of monitored end-user data, e.g. trend analysis based on the log file of viewer selections
    • H04N21/4668 Learning process for intelligent management, e.g. learning user preferences for recommending movies for recommending content, e.g. movies
    • H04N21/47 End-user applications
    • H04N21/482 End-user interface for program selection
    • H04N21/4826 End-user interface for program selection using recommendation lists, e.g. of programs or channels sorted out according to their score

Abstract

The embodiment of the invention discloses a similarity measurement method, a similarity measurement device, similarity measurement equipment and a storage medium. The method comprises the following steps: determining a user set between two articles according to user data corresponding to the articles with similarity to be measured, wherein the user set comprises a user intersection set, a user relative complement set and an absolute complement set of a user union set; and determining the similarity between the two articles according to the Shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test. By the technical scheme, the problem of similarity measurement one-sidedness in the recommendation algorithm based on the articles is solved, and the similarity measurement data is more comprehensively and reasonably utilized, so that the similarity between the articles which is more in line with the reality is obtained.

Description

Similarity measurement method, device, equipment and storage medium
Technical Field
The present invention relates to computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for similarity measurement.
Background
In the application field of big data, an important direction is to make personalized recommendations to users based on massive data. For an internet live broadcast platform, personalized recommendation specifically means accurately recommending live broadcast rooms in which the current user is interested.
At present, in a plurality of big data algorithm solutions recommended by a live broadcast room, a simple and feasible scheme is to recommend a live broadcast room similar to a historical live broadcast room watched recently to a target user, and the difficulty of the scheme is how to accurately calculate the similarity between every two live broadcast rooms.
In existing live broadcast room recommendation schemes, the Jaccard coefficient (Jaccard's Coefficient) algorithm, which is used to calculate the similarity of articles in article-based recommendation algorithms, is one of the common live broadcast room similarity measurement methods. The algorithm is set-based: the similarity between two live broadcast rooms equals the number of users who have watched both live broadcast rooms divided by the number of users who have watched at least one of them. Its defect for measuring live broadcast room similarity is that it considers only the users who watched the two live broadcast rooms and ignores how those users watched other live broadcast rooms, so only part of the information relevant to the similarity measurement is used and the obtained similarity is one-sided. For example, some of these users watched at least one of the two live broadcast rooms only by accident, while their watching behavior is concentrated on other live broadcast rooms. Simply treating such users as interested in at least one of the two live broadcast rooms and including them directly in the similarity measurement data, without considering their other watching behavior, makes the similarity measurement one-sided and distorts the obtained similarity.
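As a point of reference, the set-based Jaccard calculation described above can be sketched as follows. This is a minimal illustration only; the function name and the viewer identifiers are hypothetical and do not come from the original text.

```python
def jaccard_similarity(viewers_a: set, viewers_b: set) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    union = viewers_a | viewers_b
    if not union:
        return 0.0
    return len(viewers_a & viewers_b) / len(union)


# Hypothetical viewers of two live broadcast rooms.
room_a = {"u1", "u2", "u3"}
room_b = {"u2", "u3", "u4", "u5"}

# 2 users watched both rooms, 5 users watched at least one of them.
print(jaccard_similarity(room_a, room_b))
```

Note that users who watched neither room never enter the calculation, which is the one-sidedness discussed above.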
Disclosure of Invention
Embodiments of the present invention provide a method, an apparatus, a device, and a storage medium for similarity measurement, so as to achieve more comprehensive and reasonable utilization of similarity measurement data, thereby obtaining a similarity between articles that better conforms to reality.
In a first aspect, an embodiment of the present invention provides a similarity measurement method, including:
determining a user set between two articles according to user data corresponding to the articles with similarity to be measured, wherein the user set comprises a user intersection set, a user relative complement set and an absolute complement set of a user union set;
and determining the similarity between the two articles according to the Shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test.
In a second aspect, an embodiment of the present invention further provides a similarity measurement apparatus, where the apparatus includes:
a user set determining module, configured to determine a user set between two articles according to user data corresponding to the articles with similarity to be measured, wherein the user set comprises a user intersection set, a user relative complement set and an absolute complement set of a user union set;
and a similarity measurement module, configured to determine the similarity between the two articles according to the Shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test.
In a third aspect, an embodiment of the present invention further provides an apparatus, where the apparatus includes:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the similarity measurement method provided by any of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the similarity measurement method provided in any embodiment of the present invention.
The method comprises the steps that a user set between two articles is determined through user data corresponding to the articles with similarity to be measured, wherein the user set comprises a user intersection set, a user relative complement set and an absolute complement set of a user union set; and determining the similarity between the two articles according to the Shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test. The problem of similarity measurement one-sidedness in the recommendation algorithm based on the articles is solved, and the similarity measurement data is more comprehensively and reasonably utilized, so that the similarity between the articles which is more in line with the reality is obtained.
Drawings
Fig. 1 is a flowchart of a similarity measurement method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a similarity measurement method according to a second embodiment of the present invention;
Fig. 3 is a flowchart of a similarity measurement method according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a similarity measurement apparatus according to a fourth embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an apparatus in the fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
The similarity measurement method provided by the embodiment can be suitable for similarity calculation based on two articles in article recommendation. The method may be performed by a similarity measuring device, which may be implemented by software and/or hardware, and may be integrated into a device with computing and network functions, such as a typical user terminal device, for example, a server, a tablet computer, a desktop computer, or the like. Referring to fig. 1, the method of the present embodiment specifically includes the following steps:
S110, determining a user set between the two articles according to the user data corresponding to the articles with the similarity to be measured.
The articles with the similarity to be measured refer to articles belonging to the same category as articles related to the historical operation behavior of the user, and the articles can be daily consumer goods, learning courses, audios and videos, live broadcasting rooms and the like. For example, if the article related to the user historical operation behavior is a certain live broadcast room, the articles with the similarity to be measured are a plurality of live broadcast rooms including the live broadcast room.
The user data refers to user-related data having an operation behavior on the article to be measured for similarity, and may be, for example, user identification information, operation behavior information of each user on the article, and the like. The user data may be acquired from the network platform corresponding to the article in a demand time period, and the demand time period may be set according to the requirement of the similarity metric, for example, set as a valid storage period of the user data or a fixed time period such as one month.
Let user set I be the set of users corresponding to one article and user set J the set of users corresponding to the other. The user intersection is the intersection of user set I and user set J (denoted I ∩ J); the user relative complement is the relative complement of user set J in user set I (denoted I \ J) and/or the relative complement of user set I in user set J (denoted J \ I); and the absolute complement of the user union is the absolute complement of the union of user set I and user set J (denoted C_Z(I ∪ J), where Z is the universal set of users). In implementation, the user data can be traversed to determine the corresponding user set between every two articles whose similarity needs to be measured.
Specifically, user data corresponding to an article satisfying a set condition is obtained from a network platform corresponding to an article whose similarity is to be measured, where the set condition refers to a condition for screening the article, for example, user data corresponding to all articles is obtained from the network platform, or user data corresponding to an article obtained by sampling according to a certain sampling method is obtained. And traversing the user data according to the user or the article to determine a user set between two articles needing to measure the similarity.
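The four user sets can be derived with ordinary set operations. The sketch below assumes that each article's users are already available as a Python set and that the universal set is simply every user seen in the collected data; the function name and sample identifiers are hypothetical.

```python
from typing import Set, Tuple


def user_set_counts(users_i: Set[str], users_j: Set[str],
                    all_users: Set[str]) -> Tuple[int, int, int, int]:
    """Return the sizes of the four user sets between two articles.

    n11: users in both I and J            (I ∩ J)
    n12: users in I but not in J          (I \\ J)
    n21: users in J but not in I          (J \\ I)
    n22: users in neither I nor J         (complement of I ∪ J)
    """
    n11 = len(users_i & users_j)
    n12 = len(users_i - users_j)
    n21 = len(users_j - users_i)
    n22 = len(all_users - (users_i | users_j))
    return n11, n12, n21, n22


# Hypothetical data: eight users on the platform, two articles.
all_users = {f"u{k}" for k in range(1, 9)}
users_i = {"u1", "u2", "u3"}
users_j = {"u2", "u3", "u4", "u5"}

print(user_set_counts(users_i, users_j, all_users))  # (2, 1, 2, 3)
```

The four counts always sum to the size of the universal set, which is a convenient sanity check when traversing real user data.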
S120, determining the similarity between the two articles according to the Shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test.
The Shannon entropy, also called the information entropy, indicates the degree of disorder of information: the more disordered the information, the larger the information entropy. Shannon entropy is adopted because it represents the amount of information in the measured object more faithfully. The preset similarity measurement rule is a similarity measurement rule determined in advance from the statistic of the maximum likelihood ratio test. The maximum likelihood ratio test is adopted here in order to process the various user sets together in a reasonable way and to make the result as close to the real situation as possible.
Specifically, the above process is: determining a preset similarity measurement rule based on the statistic of the maximum likelihood ratio test; and determining the Shannon entropy of the user set, and determining the similarity between the two articles according to the Shannon entropy and a preset similarity measurement rule.
In specific implementation, the required maximum likelihood ratio test statistic is determined according to the similarity measurement requirements of the embodiment of the invention, namely that the intersection, the relative complements and the absolute complement are all taken into account and the amounts of information in the individual sets are well balanced. The statistic T adopted in the embodiment of the invention is as follows:
T = -2 * (matrix_entropy - row_entropy - column_entropy) (1)
where, denoting N_total = N11 + N12 + N21 + N22,
Figure BDA0001615619100000061 (formula (2))
In the above, entropy is the Shannon entropy; N11 is the number of users in the intersection of user set I (corresponding to item i) and user set J (corresponding to item j); N12 is the number of users in the relative complement of user set J in user set I; N21 is the number of users in the relative complement of user set I in user set J; and N22 is the number of users in the absolute complement of the union of user set I and user set J.
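The statistic in formula (1) can be sketched in code. Because formula (2) is only available as an image, the entropy definition below is an assumption: it uses the unnormalized Shannon entropy N·log(N) − Σ k·log(k) that is common in log-likelihood-ratio implementations, and it assumes that row_entropy and column_entropy are taken over the row sums and column sums of the 2×2 table. All function names are hypothetical.

```python
import math


def x_log_x(x: float) -> float:
    """x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts: float) -> float:
    """Unnormalized Shannon entropy of a list of counts (assumed form)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr_statistic(n11: int, n12: int, n21: int, n22: int) -> float:
    """T = -2 * (matrix_entropy - row_entropy - column_entropy), formula (1)."""
    matrix_entropy = entropy(n11, n12, n21, n22)
    row_entropy = entropy(n11 + n12, n21 + n22)        # assumed: row sums
    column_entropy = entropy(n11 + n21, n12 + n22)     # assumed: column sums
    t = -2.0 * (matrix_entropy - row_entropy - column_entropy)
    return max(t, 0.0)  # clamp tiny negative values caused by rounding


print(llr_statistic(1, 1, 1, 1))      # independent table, statistic is ~0
print(llr_statistic(10, 0, 0, 10))    # perfectly associated table, large statistic
```

Under these assumptions the statistic is zero when the two articles' users are statistically independent and grows as the overlap deviates from what independence would predict.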
However, the value range of the statistic T is [0, ∞), whereas the value range of the similarity is [-1, 1]. Therefore, the statistic T needs to be transformed to obtain a satisfactory similarity measurement rule. Illustratively, the preset similarity measurement rule is as follows:
Figure BDA0001615619100000062 (formula (3))
where S_ij is the similarity between the two items; entropy is the Shannon entropy; N11 is the number of users in the intersection of user set I (corresponding to item i) and user set J (corresponding to item j); N12 is the number of users in the relative complement of user set J in user set I; N21 is the number of users in the relative complement of user set I in user set J; and N22 is the number of users in the absolute complement of the union of user set I and user set J.
As can be seen from the preset similarity measurement rule (3): first, the algorithm does not deliberately apply special processing to articles with higher popularity (i.e., hot articles), such as dividing the similarity of a pair that includes a hot article by a quantity representing its popularity in order to artificially reduce that popularity. Second, when the proportion of the user intersection for the two articles is the same, the higher the article popularity, the higher the obtained similarity. When the user distributions of item i and item j are the same, the better the correlation, the smaller the matrix_entropy and the higher the obtained similarity.
Finally, as required by the preset similarity measurement rule (3), the Shannon entropies of the user set determined in S110 are computed according to the Shannon entropy formula (2). The similarity between the two articles whose similarity is to be measured is then calculated from the obtained Shannon entropies and the preset similarity measurement rule (3).
In the technical scheme of the embodiment, a user set between two articles is determined according to user data corresponding to the articles with similarity to be measured, wherein the user set comprises a user intersection, a user relative complement and an absolute complement of a user union; and determining the similarity between the two articles according to the Shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test. The problem of similarity measurement one-sidedness in the recommendation algorithm based on the articles is solved, and the similarity measurement data is more comprehensively and reasonably utilized, so that the similarity between the articles which is more in line with the reality is obtained.
Example two
In this embodiment, on the basis of the first embodiment, the user data truncation processing is added, and optimization is further performed on "determining a user set between two articles according to the user data corresponding to the articles with the similarity to be measured". Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted. Referring to fig. 2, the similarity measurement method provided in this embodiment includes:
S210, determining the hot articles and the user behavior data of the set types corresponding to the hot articles.
The set type of user behavior data refers to preset type of user behavior data related to the popularity of the article. The items are different, and the corresponding set types of user behavior data are different, for example, when the items are network courses, the set types of user behavior data may be learning duration, learning notes or comments, course sharing, and the like. Illustratively, the item is a live room; accordingly, the set types of user behavior data include viewing duration, number of bullet screen releases, and attention behaviors.
Specifically, in the similarity measurement algorithm in the embodiment of the present invention, the articles to be measured for similarity and the corresponding user data need to be traversed to obtain the user set. The algorithm complexity is high when the process is realized in engineering, especially for hot goods, the user data is usually large, and the corresponding algorithm realization complexity is high. In order to reduce the implementation complexity of the similarity measurement algorithm, the user data of the popular item is cut off in the embodiment of the present invention, so as to retain the user data (i.e., valid user data) corresponding to the set number of users (i.e., valid users) who are really interested in the popular item.
In the embodiment of the invention, the hot articles are chosen as the objects of data truncation for two reasons. On the one hand, the user data corresponding to hot articles contains more useless user behavior data; for example, some users merely click on and view a hot article without being genuinely interested in it. Such invalid user behavior data should be cut off during truncation so that accidental behavior data is minimized. On the other hand, the amount of user data corresponding to a hot article is large, so the similarity measured between the hot article and other articles tends to be high, and the Harry Potter phenomenon easily occurs in the article recommendation results, that is, hot articles are recommended disproportionately often. Therefore, the user data corresponding to the hot articles is truncated to reduce the incidence of the Harry Potter phenomenon.
In actual implementation, the hot articles are determined according to the definition of the hot articles. And then, acquiring the user behavior data of the set type corresponding to the popular item from the network platform.
Illustratively, determining the hot item includes: and determining hot articles according to the number of users corresponding to the articles in the first preset time period and the preset number of users.
The first preset time period is a time period with a preset time length and is used for representing the validity period of the hot goods. The preset user number refers to the number of users corresponding to the preset articles and is used for representing the heat degree of the hot articles.
Specifically, in a first preset time period, the number of users corresponding to the article with the similarity to be measured is counted to obtain the counted number of users. And comparing the statistical user number with a preset user number. If the counted user number is larger than or equal to the preset user number, determining the articles corresponding to the counted user number as popular articles; on the contrary, if the counted user number is smaller than the preset user number, the article corresponding to the counted user number cannot be defined as a popular article.
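The comparison described above can be sketched as a simple threshold filter. The sketch assumes the per-article user counts for the first preset time period have already been computed; the function name, the room identifiers and the threshold value are hypothetical.

```python
from typing import Dict, Set


def find_hot_items(user_counts: Dict[str, int], preset_user_count: int) -> Set[str]:
    """Articles whose counted users reach the preset number are hot articles."""
    return {item for item, count in user_counts.items()
            if count >= preset_user_count}


# Hypothetical counts of distinct users per live broadcast room
# within the first preset time period.
counts_in_period = {"room_a": 120_000, "room_b": 800, "room_c": 50_000}

print(find_hot_items(counts_in_period, preset_user_count=50_000))
```

An article exactly at the preset number counts as hot, matching the "greater than or equal to" condition in the text.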
S220, determining a user preference value of the user corresponding to the popular item according to the user behavior data and the preset type weight value.
The preset type weight value refers to a preset weight value corresponding to user behavior data of each set type.
Specifically, the obtaining process of the user preference value is as follows: and multiplying the user data of the single set type in the popular item and the corresponding preset type weight value one by one, and accumulating all the products corresponding to the popular item to obtain the user preference value of the single user. And for each user corresponding to the popular item, acquiring the user preference value according to the process so as to determine the user preference value of the user corresponding to the popular item.
Exemplarily, determining the user preference value of the user corresponding to the popular item according to the user behavior data and the preset type weight value includes: standardizing the user behavior data in a second preset time period to obtain standard user behavior data; and determining the user preference value of the user corresponding to the popular item according to the standard user behavior data and the preset type weight value.
The second preset time period is a time period with a preset time length and is used for representing the standardized statistical time length. The statistical time length needs to be set to a proper time length, if the statistical time length is set to be too long, the recent data of the user cannot be represented, and if the statistical time length is set to be too short, the statistical significance is not achieved. Therefore, the second preset time period needs to be determined according to specific articles.
Specifically, since user data of different set types has different units and scales, the user data needs to be standardized. In specific implementation, the user behavior data of each set type within the second preset time period is collected to obtain the maximum value and the minimum value for that set type, and the user behavior data of that set type is then standardized using these maximum and minimum values to obtain standard user behavior data. The user preference value of each user corresponding to the popular item is then determined from the standard user behavior data and the corresponding preset type weight values. The advantage is that a user preference value over an appropriate time span is obtained, so that the subsequent data truncation better matches the actual situation and the similarity measurement is more realistic.
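The standardization and weighting steps can be sketched as follows. The text only states that the maximum and minimum values are used, so min-max scaling to [0, 1] is an assumption here, as are the behavior type names, the weights and the function names.

```python
from typing import Dict


def min_max_standardize(values: Dict[str, float]) -> Dict[str, float]:
    """Scale one behavior type to [0, 1] using its min and max (assumed scheme)."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:  # all users identical: no information in this behavior type
        return {user: 0.0 for user in values}
    return {user: (v - lo) / (hi - lo) for user, v in values.items()}


def preference_values(behavior: Dict[str, Dict[str, float]],
                      weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted sum of standardized behavior data per user for one hot article.

    behavior maps behavior type -> {user -> raw value in the second preset period}.
    """
    prefs: Dict[str, float] = {}
    for behavior_type, per_user in behavior.items():
        for user, value in min_max_standardize(per_user).items():
            prefs[user] = prefs.get(user, 0.0) + weights[behavior_type] * value
    return prefs


# Hypothetical data for one hot live broadcast room.
behavior = {
    "watch_minutes": {"u1": 120, "u2": 10, "u3": 65},
    "barrage_count": {"u1": 30, "u2": 0, "u3": 0},
    "followed": {"u1": 1, "u2": 0, "u3": 1},
}
weights = {"watch_minutes": 0.5, "barrage_count": 0.3, "followed": 0.2}

print(preference_values(behavior, weights))
```

With these sample values, the user who watches longest, posts the most bullet screen comments and follows the room gets the highest preference value, while the user with minimal activity gets zero.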
And S230, determining effective user data corresponding to the hot articles according to the user preference value.
Specifically, the user data corresponding to the hot article is cut off according to the user preference value, and only the effective user data is reserved. For example, a preference threshold may be set, valid users having user preference values greater than or equal to the preference threshold are reserved, and user data corresponding to the valid users are determined as valid user data corresponding to hot articles; the users corresponding to the popular items can also be sorted according to the user preference values, a certain number of effective users are reserved according to the sorting result, and the user data corresponding to the effective users are determined as the effective user data corresponding to the popular items.
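Both truncation strategies mentioned above (a preference threshold, or keeping a fixed number of top-ranked users) can be sketched as follows. The function names, the sample preference values and the tie-breaking by user identifier are hypothetical choices made for this illustration.

```python
from typing import Dict, Set


def truncate_by_threshold(prefs: Dict[str, float], threshold: float) -> Set[str]:
    """Keep users whose preference value is at least the preference threshold."""
    return {user for user, p in prefs.items() if p >= threshold}


def truncate_top_n(prefs: Dict[str, float], n: int) -> Set[str]:
    """Keep the n users with the highest preference values.

    Ties are broken by user id so the result is deterministic.
    """
    ranked = sorted(prefs.items(), key=lambda item: (-item[1], item[0]))
    return {user for user, _ in ranked[:n]}


# Hypothetical preference values for one hot article.
sample_prefs = {"u1": 1.0, "u2": 0.0, "u3": 0.45, "u4": 0.45}

print(truncate_by_threshold(sample_prefs, threshold=0.4))
print(truncate_top_n(sample_prefs, n=2))
```

The threshold variant yields a variable number of effective users, whereas the top-n variant bounds the amount of data per hot article, which is what limits the cost of the later set traversal.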
S240, determining the effective user data as the hot user data corresponding to the hot article.
Specifically, in this embodiment, when a similarity measurement involves a hot article, the data used should be the effective user data determined in S230 rather than all user data of that article. Therefore, the effective user data determined in S230 is taken as the hot user data corresponding to the hot article.
S250, determining a user set between the two articles according to the hot user data and/or the user data corresponding to the non-hot articles.
The non-hot articles are the articles, among those whose similarity is to be measured, that are not hot articles.
Specifically, if both articles whose similarity is to be measured are hot articles, the user set is determined from the hot user data corresponding to the two articles; if only one of the two articles is a hot article, the user set is determined from the hot user data corresponding to the hot article and the user data corresponding to the non-hot article; if both articles are non-hot articles, the user set is determined from the user data corresponding to the two non-hot articles.
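The three cases above reduce to one rule applied per article: use the truncated data when the article is hot, and the full data otherwise. A minimal sketch under that reading follows; the function name `data_for_user_set` and the dictionary layout are hypothetical.

```python
from typing import Dict, Set


def data_for_user_set(
    item: str,
    all_users: Dict[str, Set[str]],
    hot_users: Dict[str, Set[str]],
) -> Set[str]:
    """Pick the user data that feeds the user-set computation for one article.

    hot_users holds the truncated (effective) users of hot articles only,
    so membership in it doubles as the "is this article hot" test.
    """
    if item in hot_users:
        return hot_users[item]
    return all_users[item]


# Hypothetical data: article "A" is hot and has been truncated, "B" is not hot.
all_users = {"A": {"u1", "u2", "u3", "u4"}, "B": {"u2", "u5"}}
hot_users = {"A": {"u1", "u2"}}
users_a = data_for_user_set("A", all_users, hot_users)
users_b = data_for_user_set("B", all_users, hot_users)
```

Applying the function to each of the two articles covers the hot/hot, hot/non-hot, and non-hot/non-hot cases without separate branches.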
S260, determining the similarity between the two articles according to the Shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test.
In the technical scheme of this embodiment, truncating the user data corresponding to hot articles effectively reduces both the implementation complexity of the similarity algorithm and the probability that hot articles appear in similarity-based article recommendation, so that similarity-based recommendation results better reflect the actual situation.
Example three
On the basis of the above embodiments, this embodiment describes the similarity measurement taking a live broadcast room as an example of the article, and further adds similarity-based live broadcast room recommendation. Explanations of terms that are the same as or correspond to those in the above embodiments are omitted. Referring to fig. 3, the similarity measurement method provided in this embodiment includes:
S310, determining the hot live broadcast room and the set type of user behavior data corresponding to the hot live broadcast room.
Specifically, a live broadcast room with more than 200,000 viewing users within 30 days is determined as a hot live broadcast room, and the viewing duration, the number of bullet screen releases, and the attention (follow) behavior of the users corresponding to the hot live broadcast room are acquired as the set type of user behavior data of the hot live broadcast room.
S320, determining a user preference value of the user corresponding to the hot live broadcast room according to the user behavior data and the preset type weight value.
Specifically, the user preference value of each user corresponding to the hot live broadcast room is determined from the viewing duration, the number of bullet screen releases, and the attention behavior, together with the corresponding viewing duration weight value, bullet screen release weight value, and attention weight value, according to the following preference value calculation formula (4).
Figure BDA0001615619100000121
Here, Score(u) is the user preference value of user u for the live broadcast room; α, β and γ are the preset type weight values of the viewing duration, the number of bullet screen releases, and the attention behavior, taken as 0.4, 0.4 and 0.2 respectively in this embodiment; std_time(u) is the standardized viewing duration of user u; time(u) is the viewing duration of user u; min_time is the minimum viewing duration among all users with operation behaviors in the live broadcast room within the second preset time period of 30 days, and max_time is the corresponding maximum viewing duration; std_msg_cnt(u) is the standardized number of bullet screen releases of user u; msg_cnt(u) is the number of bullet screen releases of user u; min_msg_cnt is the minimum number of bullet screen releases among all users with operation behaviors in the live broadcast room within the second preset time period of 30 days, and max_msg_cnt is the corresponding maximum number; the attention term takes the value 0 if user u does not follow the live broadcast room.
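Formula (4) survives here only as an image reference. From the symbol definitions above and the min-max standardization described earlier, it plausibly has the following form. This is a reconstruction under assumptions, not the formula as filed: the indicator name follow(u) is a hypothetical label for the attention term, and the weighted-sum structure is inferred from the three weight values.

```latex
\begin{aligned}
\mathrm{Score}(u) &= \alpha \cdot \mathrm{std\_time}(u) + \beta \cdot \mathrm{std\_msg\_cnt}(u) + \gamma \cdot \mathrm{follow}(u) \\
\mathrm{std\_time}(u) &= \frac{\mathrm{time}(u) - \mathrm{min\_time}}{\mathrm{max\_time} - \mathrm{min\_time}} \\
\mathrm{std\_msg\_cnt}(u) &= \frac{\mathrm{msg\_cnt}(u) - \mathrm{min\_msg\_cnt}}{\mathrm{max\_msg\_cnt} - \mathrm{min\_msg\_cnt}}
\end{aligned}
```

With the embodiment's weights of 0.4, 0.4 and 0.2, each term lies in [0, 1] under this reading, so Score(u) also lies in [0, 1].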
S330, determining effective user data corresponding to the hot live broadcast room according to the user preference value.
Specifically, all users with operation behaviors in each hot live broadcast room are sorted in descending order of user preference value, the top 10,000 users in each hot live broadcast room are retained, and the data of these users is determined as the effective user data.
S340, determining the effective user data to be hot user data corresponding to the hot live broadcast room.
S350, determining a user set between the two live broadcast rooms according to the hot user data and/or the user data corresponding to the non-hot live broadcast rooms.
S360, determining the similarity between the two live broadcast rooms according to the Shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test.
Specifically, assume that the relevant viewer counts (i.e., user data) of live broadcast room 1 and live broadcast room 2 are:
N11=1000,N12=5000,N21=2000,N22=100000
thus:
Figure BDA0001615619100000131
Figure BDA0001615619100000132
Figure BDA0001615619100000141
then, the similarity between live rooms 1 and 2 is:
Figure BDA0001615619100000142
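The intermediate values of this worked example are available only as image references, so the sketch below recomputes a similarity for the same counts under stated assumptions. It assumes the commonly used log-likelihood ratio form, in which `entropy` is the unnormalized Shannon entropy of the counts (natural logarithm) and the similarity is twice the sum of the row and column entropies minus the matrix entropy. The patent's exact formula, logarithm base, and normalization may differ, so the printed value is not claimed to match the figures. The function names are hypothetical, and `matrix_entropy` corresponds to the quantity the text spells `maxtrix_entropy`.

```python
import math


def x_log_x(x: float) -> float:
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts: float) -> float:
    """Unnormalized Shannon entropy of counts: N * H, where N = sum(counts)."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr_similarity(n11: float, n12: float, n21: float, n22: float) -> float:
    """Log-likelihood ratio score for a 2x2 table of user counts (assumed form)."""
    row_entropy = entropy(n11 + n12, n21 + n22)
    column_entropy = entropy(n11 + n21, n12 + n22)
    matrix_entropy = entropy(n11, n12, n21, n22)
    # Guard against tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


# Counts from the example: N11=1000, N12=5000, N21=2000, N22=100000.
similarity = llr_similarity(1000, 5000, 2000, 100000)
print(similarity)
```

Under this form the score is zero when the two rooms' audiences overlap exactly as much as chance would predict, and it grows as the observed overlap departs from that expectation.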
S370, determining the live broadcast rooms historically watched by the target user.
The target user is the user to whom live broadcast rooms are to be recommended.
Specifically, the live broadcast rooms historically watched by the target user are obtained from the live broadcast platform.
S380, sorting the live broadcast rooms to be recommended according to the similarity between each live broadcast room to be recommended and the historically watched live broadcast rooms.
A live broadcast room to be recommended is a live broadcast room on the live broadcast platform that can be recommended to the target user.
Specifically, according to the preset similarity measurement rule, the similarity between the historically watched live broadcast rooms and each live broadcast room to be recommended is calculated one by one, and the live broadcast rooms to be recommended are sorted in descending order of similarity value.
S390, determining a preset number of live broadcast rooms to be recommended in the sorting result as target recommended live broadcast rooms.
The preset number is the number of live broadcast rooms to recommend; it can be set by the live broadcast platform by default or set by the user.
Specifically, the preset number of top-ranked live broadcast rooms in the sorting result are determined as target recommended live broadcast rooms, which can then be recommended to the target user.
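Steps S370 to S390 can be sketched as a rank-and-truncate routine. The text does not say how similarities to several historically watched rooms are combined, so this sketch assumes the maximum is used, and it also assumes already-watched rooms are skipped; both choices, like the function name `recommend_rooms` and the toy similarity table, are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def recommend_rooms(
    history: Iterable[str],
    candidates: Iterable[str],
    similarity: Callable[[str, str], float],
    preset_number: int,
) -> List[str]:
    """Rank candidate rooms by similarity to watched rooms and keep the top ones."""
    watched = list(history)
    scored: List[Tuple[float, str]] = []
    for room in candidates:
        if room in watched:
            continue  # Assumption: do not recommend rooms already watched.
        # Assumption: score a candidate by its best similarity to any watched room.
        score = max((similarity(h, room) for h in watched), default=0.0)
        scored.append((score, room))
    # Descending order of similarity, as in S380.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [room for _, room in scored[:preset_number]]


# Toy similarity table standing in for the preset similarity measurement rule.
toy_sims = {("r1", "r2"): 0.9, ("r1", "r3"): 0.2, ("r1", "r4"): 0.5}
top_rooms = recommend_rooms(
    history=["r1"],
    candidates=["r1", "r2", "r3", "r4"],
    similarity=lambda a, b: toy_sims.get((a, b), 0.0),
    preset_number=2,
)
```

Passing the similarity as a callable keeps the ranking step independent of how the similarity itself is computed.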
According to the technical scheme of this embodiment, the similarity between two live broadcast rooms is measured with a preset similarity measurement rule that integrates multiple user sets, which makes the similarity measurement between live broadcast rooms more comprehensive. By truncating the user data, hot live broadcast rooms are handled in a more principled way in the similarity measurement rather than being simply and artificially down-weighted, so that the similarity between live broadcast rooms is more comprehensive and better reflects the actual situation, recommendations based on the similarity better match user interests, and user experience is improved.
The following is an embodiment of a similarity measurement apparatus provided in an embodiment of the present invention, which belongs to the same inventive concept as the similarity measurement methods in the foregoing embodiments, and details that are not described in detail in the embodiment of the similarity measurement apparatus may refer to the above embodiment of the similarity measurement method.
Example four
The present embodiment provides a similarity measurement apparatus, referring to fig. 4, the apparatus specifically includes:
a user set determining module 410, configured to determine a user set between two articles according to user data corresponding to an article with similarity to be measured, where the user set includes a user intersection, a user relative complement, and an absolute complement of a user union;
and the similarity measurement module 420 is used for determining the similarity between the two articles according to the shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test.
Optionally, the preset similarity measure rule is:
Figure BDA0001615619100000151
row_entropy=entropy(N11+N12,N21+N22)
column_entropy=entropy(N11+N21,N12+N22)
maxtrix_entropy=entropy(N11,N12,N21,N22)
where S_ij is the similarity between the two items, entropy is the Shannon entropy, N11 is the user intersection of user set I corresponding to item i and user set J corresponding to item j, N12 is the user relative complement of user set J in user set I, N21 is the user relative complement of user set I in user set J, and N22 is the absolute complement of the user union of user set I and user set J.
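The four quantities N11, N12, N21 and N22 follow directly from set operations. The sketch below is illustrative; the function name `contingency_counts` is hypothetical, and the universe `all_users` (needed for the absolute complement) is assumed to be the full set of users under consideration, which the text does not spell out here.

```python
from typing import Set, Tuple


def contingency_counts(
    users_i: Set[str], users_j: Set[str], all_users: Set[str]
) -> Tuple[int, int, int, int]:
    """Return (N11, N12, N21, N22) for user sets I and J within a universe."""
    n11 = len(users_i & users_j)                # intersection of I and J
    n12 = len(users_i - users_j)                # relative complement of J in I
    n21 = len(users_j - users_i)                # relative complement of I in J
    n22 = len(all_users - (users_i | users_j))  # absolute complement of the union
    return n11, n12, n21, n22


# Hypothetical user sets for two items within a universe of six users.
counts = contingency_counts(
    users_i={"a", "b", "c"},
    users_j={"b", "c", "d"},
    all_users={"a", "b", "c", "d", "e", "f"},
)
```

The four counts partition the universe, so they always sum to the number of users in `all_users` when both sets are subsets of it.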
Optionally, on the basis of the above apparatus, the apparatus further includes:
the hot article determination module is used for determining a hot article and user behavior data of a set type corresponding to the hot article before determining a user set between the two articles according to user data corresponding to the articles with the similarity to be measured;
and the user preference value determining module is used for determining the user preference value of the user corresponding to the popular item according to the user behavior data and the preset type weight value.
Accordingly, the user set determination module 410 is specifically configured to:
determining effective user data as hot user data corresponding to the hot article;
and determining a user set between the two items according to the hot user data and/or the user data corresponding to the non-hot items.
Optionally, the hot item determination module is specifically configured to:
and determining hot articles according to the number of users corresponding to the articles in the first preset time period and the preset number of users.
Optionally, the user preference value determining module is specifically configured to:
standardizing the user behavior data in a second preset time period to obtain standard user behavior data;
and determining the user preference value of the user corresponding to the popular item according to the standard user behavior data and the preset type weight value.
Optionally, the item is a live room; the set type of user behavior data includes viewing duration, bullet screen release times and attention behaviors.
Optionally, on the basis of the above apparatus, the apparatus further includes: the live broadcast room recommending module is used for:
after determining the similarity between two articles according to the Shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test, determining a historical watching live broadcast room of a target user;
sequencing the live broadcast rooms to be recommended according to the similarity between the live broadcast rooms to be recommended and the historical watching live broadcast rooms;
and determining the preset number of to-be-recommended live broadcast rooms in the sequencing result as target recommended live broadcast rooms.
Through the similarity measurement device in the fourth embodiment of the invention, the problem of one-sidedness of similarity measurement in the recommendation algorithm based on the articles is solved, and the similarity measurement data is more comprehensively and reasonably utilized, so that the similarity between the articles which is more in line with the reality is obtained.
The similarity measurement device provided by the embodiment of the invention can execute the similarity measurement method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
It should be noted that, in the embodiment of the similarity measuring apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
Example five
Referring to fig. 5, the present embodiment provides an apparatus 500 comprising: one or more processors 520; and a storage device 510 for storing one or more programs. When the one or more programs are executed by the one or more processors 520, the one or more processors 520 implement the similarity measurement method provided by the embodiments of the present invention, including:
determining a user set between two articles according to user data corresponding to the articles with the similarity to be measured, wherein the user set comprises a user intersection, a user relative complement and an absolute complement of a user union;
and determining the similarity between the two articles according to the Shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test.
Of course, those skilled in the art will understand that the processor 520 may also implement the technical solution of the similarity measure method provided in any embodiment of the present invention.
The device 500 shown in fig. 5 is only an example and should not bring any limitations to the functionality and scope of use of the embodiments of the present invention.
As shown in fig. 5, the apparatus 500 includes a processor 520, a storage device 510, an input device 530, and an output device 540; the number of the processors 520 in the device may be one or more, and one processor 520 is taken as an example in fig. 5; the processor 520, the memory device 510, the input device 530 and the output device 540 of the apparatus may be connected by a bus or other means, such as by a bus 550 in fig. 5.
Storage 510 is provided as a computer-readable storage medium that may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to method … … in accordance with embodiments of the present invention.
The storage device 510 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the storage 510 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, storage 510 may further include memory located remotely from processor 520, which may be connected to devices over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the apparatus. The output device 540 may include a display device such as a display screen.
Example six
The present embodiments provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are operable to perform a method of similarity measurement, the method comprising:
determining a user set between two articles according to user data corresponding to the articles with the similarity to be measured, wherein the user set comprises a user intersection, a user relative complement and an absolute complement of a user union;
and determining the similarity between the two articles according to the Shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test.
Of course, the storage medium provided by the embodiments of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the above method operations, and may also perform related operations in the similarity measurement method provided by any embodiments of the present invention.
It will be understood that the technical solutions of the present invention can be embodied in the form of a software product. The software product can be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, and includes instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the similarity measurement method according to the embodiments of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (9)

1. A method for similarity measurement, comprising:
determining a user set between two articles according to user data corresponding to the articles with similarity to be measured, wherein the user set comprises a user intersection set, a user relative complement set and an absolute complement set of a user union set;
determining the similarity between the two articles according to the Shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test;
wherein the preset similarity measure rule is as follows:
Figure FDA0002529253230000011
row_entropy=entropy(N11+N12,N21+N22)
column_entropy=entropy(N11+N21,N12+N22)
maxtrix_entropy=entropy(N11,N12,N21,N22)
wherein S_ij is the similarity between the two said articles, entropy is the Shannon entropy, N11 is the user intersection of user set I corresponding to item i and user set J corresponding to item j, N12 is the user relative complement of user set J in user set I, N21 is the user relative complement of user set I in user set J, and N22 is the absolute complement of the user union of user set I and user set J.
2. The method of claim 1, further comprising, prior to said determining a set of users between two items based on user data corresponding to the items for which similarity is to be measured:
determining a hot article and user behavior data of a set type corresponding to the hot article;
determining a user preference value of a user corresponding to the popular item according to the user behavior data and a preset type weight value;
determining effective user data corresponding to the hot articles according to the user preference value;
the determining a user set between two items according to user data corresponding to the items with similarity to be measured comprises:
determining that the effective user data is hot user data corresponding to the hot article;
and determining a user set between the two items according to the popular user data and/or the user data corresponding to the non-popular items.
3. The method of claim 2, wherein said determining a hot item comprises:
and determining the popular articles according to the number of users corresponding to the articles in a first preset time period and the preset number of users.
4. The method of claim 2, wherein determining the user preference value of the user corresponding to the popular item according to the user behavior data and the preset type weight value comprises:
standardizing the user behavior data in a second preset time period to obtain standard user behavior data;
and determining the user preference value of the user corresponding to the popular item according to the standard user behavior data and the preset type weight value.
5. The method of claim 2, wherein the item is a live room;
the set type of user behavior data comprises watching duration, bullet screen issuing times and attention behaviors.
6. The method according to claim 5, wherein after determining the similarity between two of the items according to the shannon entropy of the user set and a preset similarity measure rule based on a maximum likelihood ratio test, further comprising:
determining a historical watching live broadcast room of a target user;
sequencing the live broadcast rooms to be recommended according to the similarity between the live broadcast rooms to be recommended and the historical watching live broadcast rooms;
and determining the preset number of the live broadcast rooms to be recommended in the sequencing result as target recommended live broadcast rooms.
7. A similarity metric apparatus, comprising:
a user set determining module, configured to determine a user set between two articles according to user data corresponding to the articles with similarity to be measured, wherein the user set comprises a user intersection, a user relative complement, and an absolute complement of a user union;
the similarity measurement module is used for determining the similarity between the two articles according to the Shannon entropy of the user set and a preset similarity measurement rule based on the maximum likelihood ratio test;
wherein the preset similarity measure rule is as follows:
Figure FDA0002529253230000031
row_entropy=entropy(N11+N12,N21+N22)
column_entropy=entropy(N11+N21,N12+N22)
maxtrix_entropy=entropy(N11,N12,N21,N22)
wherein S_ij is the similarity between the two said articles, entropy is the Shannon entropy, N11 is the user intersection of user set I corresponding to item i and user set J corresponding to item j, N12 is the user relative complement of user set J in user set I, N21 is the user relative complement of user set I in user set J, and N22 is the absolute complement of the user union of user set I and user set J.
8. A similarity metric apparatus, characterized in that the apparatus comprises:
one or more processors;
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the similarity metric method of any of claims 1-6.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the similarity measure method according to any one of claims 1-6.
CN201810284500.0A 2018-04-02 2018-04-02 Similarity measurement method, device, equipment and storage medium Active CN108600792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810284500.0A CN108600792B (en) 2018-04-02 2018-04-02 Similarity measurement method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108600792A (en) 2018-09-28
CN108600792B (en) 2020-08-04

Family

ID=63625196

Country Status (1)

Country Link
CN (1) CN108600792B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant