CN108600792A - Similarity measurement method, apparatus, device and storage medium - Google Patents

Publication number
CN108600792A
Authority
CN
China
Prior art keywords
user
article
similarity
data
popular
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810284500.0A
Other languages
Chinese (zh)
Other versions
CN108600792B (en
Inventor
王璐
陈少杰
张文明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Douyu Network Technology Co Ltd
Original Assignee
Wuhan Douyu Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Douyu Network Technology Co Ltd filed Critical Wuhan Douyu Network Technology Co Ltd
Priority to CN201810284500.0A priority Critical patent/CN108600792B/en
Publication of CN108600792A publication Critical patent/CN108600792A/en
Application granted granted Critical
Publication of CN108600792B publication Critical patent/CN108600792B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H04N21/25866 Management of end-user data
    • G06F18/22 Matching criteria, e.g. proximity measures
    • H04N21/252 Processing of multiple end-users' preferences to derive collaborative data
    • H04N21/25891 Management of end-user data being end-user preferences
    • H04N21/26258 Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists for generating a list of items to be played back in a given order, e.g. playlist, or scheduling item distribution according to such list
    • H04N21/4667 Processing of monitored end-user data, e.g. trend analysis based on the log file of viewer selections
    • H04N21/4668 Learning process for intelligent management, e.g. learning user preferences for recommending movies for recommending content, e.g. movies
    • H04N21/4826 End-user interface for program selection using recommendation lists, e.g. of programs or channels sorted out according to their score

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Graphics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a similarity measurement method, apparatus, device and storage medium. The method includes: determining, according to user data corresponding to the items whose similarity is to be measured, the user sets between two items, the user sets including a user intersection, user relative complements, and the absolute complement of the user union; and determining the similarity between the two items according to the Shannon entropy of the user sets and a preset similarity measurement rule based on the likelihood ratio test. This technical solution solves the problem of one-sided similarity measurement in item-based recommendation algorithms and uses the similarity measurement data more comprehensively and reasonably, so that the resulting similarity better matches the actual relationship between items.

Description

Similarity measurement method, apparatus, device and storage medium
Technical Field
Embodiments of the present invention relate to computer technology, and in particular to a similarity measurement method, apparatus, device and storage medium.
Background
In big-data applications, one important direction is personalized recommendation for users based on massive amounts of data. For an internet live-streaming platform, personalized recommendation specifically means accurately recommending to the current user the live-streaming rooms that the user is interested in.
At present, among the many big-data algorithm solutions for live-streaming room recommendation, one simple and feasible scheme is to recommend to the target user live-streaming rooms similar to those the user has recently watched. The difficulty of this scheme lies in accurately computing the similarity between each pair of live-streaming rooms.
In existing live-streaming room recommendation schemes, the Jaccard coefficient (Jaccard's Coefficient), which is used to compute item similarity in item-based recommendation algorithms, is one of the common similarity measures for live-streaming rooms. The algorithm is set-based: the similarity between two live-streaming rooms equals the number of users who watched both rooms divided by the number of users who watched at least one of them. The drawback of this algorithm for live-streaming room similarity is that it considers only the users who watched the two rooms and ignores what those users watched in other rooms. It therefore uses only part of the information relevant to the similarity, and the resulting similarity is one-sided. For example, some of these viewers may have watched at least one of the two rooms only by chance, with most of their viewing concentrated on other rooms. Simply assuming that these viewers are interested in at least one of the two rooms, and including them directly in the data used to measure the similarity of the two rooms without considering their other viewing behavior, makes the measurement one-sided and distorts the resulting similarity.
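The Jaccard computation described above can be sketched as follows. This is a minimal illustration rather than code from the patent; the function name, the sample viewer sets, and the convention of returning 0.0 for two empty rooms are assumptions made for the example.

```python
from typing import Set


def jaccard_similarity(viewers_a: Set[str], viewers_b: Set[str]) -> float:
    """Users who watched both rooms divided by users who watched at least one."""
    union = viewers_a | viewers_b
    if not union:
        return 0.0  # assumed convention when neither room has any viewers
    return len(viewers_a & viewers_b) / len(union)


# Hypothetical viewer sets for two live-streaming rooms.
room_a = {"u1", "u2", "u3"}
room_b = {"u2", "u3", "u4", "u5"}
jaccard_ab = jaccard_similarity(room_a, room_b)  # 2 shared viewers out of 5
```

The result depends only on the two viewer sets. What those viewers watched in other rooms never enters the computation, which is the limitation discussed in this section.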
Summary of the Invention
Embodiments of the present invention provide a similarity measurement method, apparatus, device and storage medium, so as to use similarity measurement data more comprehensively and reasonably and thereby obtain a similarity that better matches the actual relationship between items.
In a first aspect, an embodiment of the present invention provides a similarity measurement method, including:
determining, according to user data corresponding to the items whose similarity is to be measured, the user sets between two of the items, the user sets including a user intersection, user relative complements, and the absolute complement of the user union;
determining the similarity between the two items according to the Shannon entropy of the user sets and a preset similarity measurement rule based on the likelihood ratio test.
In a second aspect, an embodiment of the present invention further provides a similarity measurement apparatus, including:
a user set determination module, configured to determine, according to user data corresponding to the items whose similarity is to be measured, the user sets between two of the items, the user sets including a user intersection, user relative complements, and the absolute complement of the user union;
a similarity measurement module, configured to determine the similarity between the two items according to the Shannon entropy of the user sets and a preset similarity measurement rule based on the likelihood ratio test.
In a third aspect, an embodiment of the present invention further provides a device, including:
one or more processors;
a storage apparatus for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the similarity measurement method provided by any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the similarity measurement method provided by any embodiment of the present invention.
The embodiments of the present invention determine the user sets between two items from the user data corresponding to the items whose similarity is to be measured, the user sets including a user intersection, user relative complements, and the absolute complement of the user union, and determine the similarity between the two items according to the Shannon entropy of the user sets and a preset similarity measurement rule based on the likelihood ratio test. This solves the problem of one-sided similarity measurement in item-based recommendation algorithms and uses the similarity measurement data more comprehensively and reasonably, so that the resulting similarity better matches the actual relationship between items.
Brief Description of the Drawings
Fig. 1 is a flowchart of a similarity measurement method in Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a similarity measurement method in Embodiment 2 of the present invention;
Fig. 3 is a flowchart of a similarity measurement method in Embodiment 3 of the present invention;
Fig. 4 is a schematic structural diagram of a similarity measurement apparatus in Embodiment 4 of the present invention;
Fig. 5 is a schematic structural diagram of a device in Embodiment 5 of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment 1
The similarity measurement method provided in this embodiment is applicable to computing the pairwise similarity between items in recommendation based on item similarity. The method can be executed by a similarity measurement apparatus, which can be implemented in software and/or hardware and integrated into a device with computing and network capabilities, typically a terminal device or server, such as a tablet computer or desktop computer. Referring to Fig. 1, the method of this embodiment includes the following steps:
S110: Determine the user sets between two items according to the user data corresponding to the items whose similarity is to be measured.
Here, the items whose similarity is to be measured are items belonging to the same category as the item associated with the user's historical operation behavior. An item may be an ordinary consumer product, a course, audio or video content, a live-streaming room, and so on. For example, if the item associated with the user's historical operation behavior is a certain live-streaming room, the items whose similarity is to be measured are multiple live-streaming rooms including that room.
User data refers to data about users who have performed operations on the items whose similarity is to be measured, for example user identification information and information about each user's operations on the items. User data can be obtained from the network platform corresponding to the items within a required time period. This period can be set according to the similarity measurement requirements, for example to the effective storage period of the user data or to a fixed duration such as one month.
A user set is a set formed from the user data corresponding to different items. By way of example, the user sets include the user intersection, the user relative complements, and the absolute complement of the user union. Specifically, the user data corresponding to item i forms user set I, and the user data corresponding to item j forms user set J. The user intersection is then the intersection of I and J (denoted I∩J); the user relative complements are the relative complement of J in I (denoted I\J) and/or the relative complement of I in J (denoted J\I); and the absolute complement of the user union is the absolute complement of the union of I and J (denoted C_Z(I∪J), where Z is the universal set of users). In a specific implementation, the user data can be traversed to determine the user sets for each pair of items whose similarity needs to be measured. The advantage of this arrangement is that the similarity measurement considers not only the operation data of the users common to the two items but also the operations of the two items' users on other items. This makes the similarity measurement data more comprehensive, reflects user interest more faithfully, and makes the measured similarity more accurate.
Specifically, user data corresponding to items that satisfy a set condition is obtained from the network platform of the items whose similarity is to be measured. The set condition is the condition used to select items, for example obtaining the user data of all items on the network platform, or the user data of items selected by some sampling method. The user data is then traversed by user or by item to determine the user sets between the two items whose similarity is to be measured.
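Step S110 can be sketched as follows. This is a minimal illustration under stated assumptions: operation records are modeled as hypothetical `(user_id, item_id)` pairs, the universal set Z is taken to be every user that appears in the collected data, and the function names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_item_users(events: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group (user_id, item_id) operation records into one user set per item."""
    item_users: Dict[str, Set[str]] = defaultdict(set)
    for user_id, item_id in events:
        item_users[item_id].add(user_id)
    return dict(item_users)


def user_set_counts(users_i: Set[str], users_j: Set[str],
                    all_users: Set[str]) -> Tuple[int, int, int, int]:
    """Sizes of the four user sets for items i and j."""
    n11 = len(users_i & users_j)                # I ∩ J
    n12 = len(users_i - users_j)                # I \ J
    n21 = len(users_j - users_i)                # J \ I
    n22 = len(all_users - (users_i | users_j))  # complement of I ∪ J in Z
    return n11, n12, n21, n22


# Hypothetical operation records: (user, item).
events = [("u1", "a"), ("u2", "a"), ("u3", "a"),
          ("u2", "b"), ("u3", "b"), ("u4", "b"), ("u5", "b"),
          ("u6", "c")]
item_users = build_item_users(events)
all_users = {user for user, _ in events}  # assumed universal set Z
counts = user_set_counts(item_users["a"], item_users["b"], all_users)
```

User `u6`, who only touched item `c`, lands in the fourth set. That is the information the Jaccard coefficient discards.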
S120: Determine the similarity between the two items according to the Shannon entropy of the user sets and a preset similarity measurement rule based on the likelihood ratio test.
Here, Shannon entropy, also called information entropy, expresses how disordered information is: the more disordered the information, the larger its entropy. Shannon entropy is used here because it characterizes the information content of the measured object more faithfully. The preset similarity measurement rule is a similarity measurement rule determined in advance from the statistic of the likelihood ratio test. The likelihood ratio test is used here so that the various user sets above can be handled jointly in a reasonable way and the result stays close to the real situation.
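As a small illustration of the statement that more disordered information has larger entropy, the sketch below compares a uniform and a sharply peaked distribution. The function name and the two example distributions are assumptions for illustration; base-2 logarithms are used so the result is in bits.

```python
import math
from typing import Sequence


def shannon_entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # most disordered case
peaked = shannon_entropy([0.97, 0.01, 0.01, 0.01])   # nearly certain outcome
```

The uniform distribution over four outcomes reaches the maximum of 2 bits, while the peaked one is far lower.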
Specifically, the process is: determine the preset similarity measurement rule based on the statistic of the likelihood ratio test; determine the Shannon entropy of the user sets; and determine the similarity between the two items according to the Shannon entropy and the preset similarity measurement rule.
In a specific implementation, the likelihood ratio test statistic to be used is first chosen according to the similarity measurement requirements of the embodiments of the present invention, for example the need to combine the intersection, relative complements and absolute complement and to weigh the information content of each set well. The statistic T used in the embodiments of the present invention is:
T = -2 * (matrix_entropy - row_entropy - column_entropy)    (1)
where, with N_total = N11 + N12 + N21 + N22, the entropies are defined by formula (2), which is not reproduced in this text.
In the above, entropy denotes Shannon entropy; N11 is the number of users in the intersection of user set I of item i and user set J of item j; N12 is the number of users in the relative complement of J in I; N21 is the number of users in the relative complement of I in J; and N22 is the number of users in the absolute complement of the union of I and J.
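The statistic T of formula (1) can be sketched as follows. The entropy definitions are not shown in this text, so the ones used here are an assumption: each entropy is taken as the total count times the Shannon entropy (natural logarithm) of the normalized counts, with matrix_entropy over the four cells, row_entropy over the row sums and column_entropy over the column sums. Under that assumption T is the classical G statistic of a 2×2 contingency table and is non-negative. The function names are hypothetical.

```python
import math


def scaled_entropy(*counts: int) -> float:
    """Assumed definition: N * H(counts / N) in nats; zero counts add nothing."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c * math.log(c / total) for c in counts if c > 0)


def likelihood_ratio_statistic(n11: int, n12: int, n21: int, n22: int) -> float:
    """Statistic T of formula (1) under the assumed entropy definition."""
    matrix_entropy = scaled_entropy(n11, n12, n21, n22)
    row_entropy = scaled_entropy(n11 + n12, n21 + n22)
    column_entropy = scaled_entropy(n11 + n21, n12 + n22)
    t = -2.0 * (matrix_entropy - row_entropy - column_entropy)
    return max(t, 0.0)  # guard against tiny negative rounding errors


# Hypothetical counts: no association versus perfect association.
t_independent = likelihood_ratio_statistic(10, 10, 10, 10)
t_associated = likelihood_ratio_statistic(20, 0, 0, 20)
```

With independent viewing (all four cells equal) T is essentially zero, and it grows as the two items' users overlap more than chance would predict. The mapping from T to the final similarity value is a separate step that this sketch does not attempt.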
Next, note that the statistic T takes values in [0, ∞), whereas the similarity takes values in [-1, 1]. The statistic T therefore needs to be adapted to obtain a suitable similarity measurement rule. By way of example, the preset similarity measurement rule is given by formula (3), which is not reproduced in this text,
where S_ij is the similarity between the two items; entropy denotes Shannon entropy; N11 is the number of users in the intersection of user set I of item i and user set J of item j; N12 is the number of users in the relative complement of J in I; N21 is the number of users in the relative complement of I in J; and N22 is the number of users in the absolute complement of the union of I and J.
The preset similarity measurement rule (3) shows the following. First, the algorithm does not deliberately apply special treatment to items with higher popularity (i.e. popular items), such as dividing by a quantity that characterizes a popular item's popularity when computing a similarity involving it in order to artificially reduce that popularity. Second, when the proportion of the user intersection for two items is the same, the more popular the items, the higher the resulting similarity. The more alike the user distributions of item i and item j, the better the correlation and the smaller matrix_entropy, so the higher the resulting similarity.
Finally, following the preset similarity measurement rule (3), the Shannon entropy corresponding to the user sets determined in S110 is computed with the Shannon entropy formula (2). The resulting Shannon entropy and the preset similarity measurement rule (3) are then used to compute the similarity between the two items whose similarity is to be measured.
In the technical solution of this embodiment, the user sets between two items are determined from the user data corresponding to the items whose similarity is to be measured, the user sets including a user intersection, user relative complements, and the absolute complement of the user union; and the similarity between the two items is determined according to the Shannon entropy of the user sets and a preset similarity measurement rule based on the likelihood ratio test. This solves the problem of one-sided similarity measurement in item-based recommendation algorithms and uses the similarity measurement data more comprehensively and reasonably, so that the resulting similarity better matches the actual relationship between items.
Embodiment 2
On the basis of Embodiment 1 above, this embodiment adds truncation of user data, further optimizing the step "determining the user sets between two items according to the user data corresponding to the items whose similarity is to be measured". Explanations of terms that are the same as or correspond to those in the above embodiments are not repeated here. Referring to Fig. 2, the similarity measurement method provided in this embodiment includes:
S210: Determine the popular items and the user behavior data of set types corresponding to the popular items.
Here, user behavior data of set types refers to user behavior data of preset types that are related to an item's popularity. Different items have different set types of user behavior data. For example, when the item is an online course, the set types of user behavior data may be study duration, study notes or comments, and course sharing. By way of example, the item is a live-streaming room; accordingly, the set types of user behavior data include viewing duration, number of bullet comments posted, and follow behavior.
Specifically, the similarity measurement algorithm in the embodiments of the present invention needs to traverse the items whose similarity is to be measured and their user data to obtain the user sets. This process has high algorithmic complexity when implemented in engineering practice, especially for popular items, whose user data is usually large and correspondingly more complex to process. To reduce the implementation complexity of the similarity measurement algorithm, the embodiments of the present invention truncate the user data of popular items so as to retain a set amount of user data (the valid user data) belonging to users who are genuinely interested in the popular item (the valid users).
The embodiments of the present invention choose popular items as the object of data truncation for two reasons. On the one hand, the user data of a popular item contains much useless user behavior data; for example, some users merely click on the popular item to look at it, which does not reflect their true interest. Such invalid user behavior data should be cut off during truncation to reduce incidental behavior data as far as possible. On the other hand, the amount of user data for a popular item is large, so the similarity measured between the popular item and other items is high, and item-based recommendation results are prone to the "Harry Potter" phenomenon, in which popular items are recommended too often. The user data of popular items is therefore truncated to reduce the incidence of the "Harry Potter" phenomenon.
In actual implementation, the popular items are first determined according to the definition of a popular item. The user behavior data of the set types corresponding to those popular items is then obtained from the network platform.
By way of example, determining the popular items includes: determining the popular items according to the number of users corresponding to each item within a first preset time period and a preset number of users.
Here, the first preset time period is a period of preset duration that characterizes the validity period of a popular item. The preset number of users is a preset number of users corresponding to an item, which characterizes the popularity of a popular item.
Specifically, within the first preset time period, the number of users corresponding to each item whose similarity is to be measured is counted to obtain a counted number of users, which is compared with the preset number of users. If the counted number of users is greater than or equal to the preset number, the corresponding item is determined to be a popular item; conversely, if it is less than the preset number, the corresponding item is not a popular item.
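The popularity check described above can be sketched as follows. This is a minimal illustration that assumes the user data has already been restricted to the first preset time period; the function name, the sample data and the threshold value are hypothetical.

```python
from typing import Dict, Set


def find_popular_items(item_users: Dict[str, Set[str]],
                       preset_user_count: int) -> Set[str]:
    """Items whose counted number of users reaches the preset number of users."""
    return {item for item, users in item_users.items()
            if len(users) >= preset_user_count}


# Hypothetical user sets collected within the first preset time period.
item_users_in_period = {
    "room_1": {"u1", "u2", "u3", "u4"},
    "room_2": {"u1"},
    "room_3": {"u2", "u3", "u5"},
}
popular = find_popular_items(item_users_in_period, preset_user_count=3)
```

An item whose count exactly equals the preset number (`room_3` here) is treated as popular, matching the "greater than or equal to" comparison in the text.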
S220: Determine the user preference values of the users corresponding to the popular items according to the user behavior data and preset type weight values.
Here, the preset type weight values are the preset weight values corresponding to the user behavior data of each set type.
Specifically, a user preference value is obtained as follows: for a popular item, each set type of user behavior data of a single user is multiplied by its corresponding preset type weight value, and all the products for that popular item are summed to give the user preference value of that user. A user preference value is obtained in this way for every user corresponding to the popular item, which determines the user preference values of the users corresponding to the popular item.
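The weighted sum described above can be sketched as follows. The behavior type names and the weight values are hypothetical, chosen only to mirror the live-streaming example (viewing duration, bullet comments, follow behavior); the patent does not specify concrete weights.

```python
from typing import Dict

# Hypothetical preset type weight values, one per set type of behavior data.
TYPE_WEIGHTS: Dict[str, float] = {
    "watch_duration": 0.5,
    "bullet_comments": 0.3,
    "followed": 0.2,
}


def preference_value(behavior: Dict[str, float],
                     weights: Dict[str, float]) -> float:
    """Sum of each behavior value times its type weight; missing types count as 0."""
    return sum(weight * behavior.get(kind, 0.0)
               for kind, weight in weights.items())


# Hypothetical (already standardized) behavior data of one user on one popular room.
user_behavior = {"watch_duration": 0.8, "bullet_comments": 0.5, "followed": 1.0}
user_preference = preference_value(user_behavior, TYPE_WEIGHTS)
```

Repeating the call for every user of a popular item yields the per-user preference values used for truncation.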
By way of example, determining the user preference values of the users corresponding to the popular items according to the user behavior data and the preset type weight values includes: standardizing the user behavior data within a second preset time period to obtain standardized user behavior data; and determining the user preference values of the users corresponding to the popular items according to the standardized user behavior data and the preset type weight values.
Here, the second preset time period is a period of preset duration that characterizes the statistics window used for standardization. This window needs to be set to a suitable length: if it is too long, it cannot represent the user's recent data; if it is too short, it has no statistical significance. The second preset time period therefore needs to be determined for the specific item.
Specifically, because different set types of user data have different units, the user data needs to be standardized. In a specific implementation, the user behavior data of each set type within the second preset time period is first aggregated to obtain the maximum and minimum values for that type. The user behavior data of that type is then standardized using these maximum and minimum values to obtain standardized user behavior data. The user preference values of the users corresponding to the popular item are then determined from the standardized user behavior data and the corresponding preset type weight values. The advantage of this arrangement is that user preference values are obtained over a suitable time window, so that the subsequent data truncation better matches actual conditions and the similarity measurement is more faithful.
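The standardization step can be sketched as follows. The text says only that the maximum and minimum of each behavior type are used, so the min-max formula (value − min) / (max − min) is an assumption, as are the function name, the sample values and the choice of 0.0 when all users have the same value.

```python
from typing import Dict


def min_max_standardize(values: Dict[str, float]) -> Dict[str, float]:
    """Rescale one behavior type across users to [0, 1] (assumed min-max formula)."""
    lowest, highest = min(values.values()), max(values.values())
    if highest == lowest:
        return {user: 0.0 for user in values}  # assumed convention for constant data
    span = highest - lowest
    return {user: (value - lowest) / span for user, value in values.items()}


# Hypothetical viewing durations (minutes) within the second preset time period.
watch_minutes = {"u1": 10.0, "u2": 60.0, "u3": 110.0}
standardized = min_max_standardize(watch_minutes)
```

After this step, viewing minutes, comment counts and follow flags are all on the same scale, so the type weights act on comparable quantities.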
S230, determine the valid user data corresponding to the popular item according to the user preference values.
Specifically, the user data corresponding to the popular item is truncated according to the user preference values so that only valid user data is retained. For example, a preference value threshold may be set: users whose preference value is greater than or equal to the threshold are retained as valid users, and the user data of these valid users is determined to be the valid user data corresponding to the popular item. Alternatively, the users corresponding to the popular item may be sorted by user preference value and, according to the ranking result, a certain number of users retained as valid users, with their user data determined to be the valid user data corresponding to the popular item.
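Both truncation strategies can be sketched in a few lines. The function names are hypothetical, and breaking ties by user ID in the ranking variant is an assumption made only to keep the result deterministic.

```python
from typing import Dict, Set


def truncate_by_threshold(scores: Dict[str, float], threshold: float) -> Set[str]:
    """Keep users whose preference value is greater than or equal to the threshold."""
    return {user for user, score in scores.items() if score >= threshold}


def truncate_top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Keep the k users with the highest preference values."""
    # Assumption: ties are broken by user ID so the result is reproducible.
    ranked = sorted(scores, key=lambda user: (-scores[user], user))
    return set(ranked[:k])


scores = {"a": 0.9, "b": 0.5, "c": 0.1}
valid_by_threshold = truncate_by_threshold(scores, 0.5)
valid_top_1 = truncate_top_k(scores, 1)
```

The threshold variant yields a variable number of valid users per item, while the ranking variant bounds the size of every popular item's user set, which is what bounds the cost of the later set operations.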
S240, determine the valid user data to be the popular user data corresponding to the popular item.
Specifically, in this embodiment, for a similarity measurement involving a popular item, the data used should be the valid user data of the popular item determined in S230, not all of its user data. The valid user data determined in S230 is therefore taken as the popular user data corresponding to the popular item.
S250, determine the user sets between the two items according to the popular user data and/or the user data corresponding to non-popular items.
Here, a non-popular item is any item, among the items whose similarity is to be measured, other than the popular items.
Specifically, if both items whose similarity is to be measured are popular items, the user sets are determined from the popular user data of the two items; if exactly one of the two items is a popular item, the user sets are determined from the popular user data of the popular item and the user data of the non-popular item; if both items are non-popular items, the user sets are determined from the user data of the non-popular items.
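The three cases reduce to one rule: use the truncated data for an item when it exists, and the full data otherwise. A minimal sketch follows; the function names and dictionary layout are hypothetical, and it assumes the absolute complement is taken relative to the total number of users on the platform.

```python
from typing import Dict, Set, Tuple


def users_for(
    item: str,
    all_user_data: Dict[str, Set[int]],
    popular_user_data: Dict[str, Set[int]],
) -> Set[int]:
    """Return truncated users for a popular item, full users otherwise."""
    if item in popular_user_data:
        return popular_user_data[item]
    return all_user_data[item]


def user_set_counts(
    item_i: str,
    item_j: str,
    all_user_data: Dict[str, Set[int]],
    popular_user_data: Dict[str, Set[int]],
    total_users: int,
) -> Tuple[int, int, int, int]:
    """Sizes of the intersection, both relative complements, and the
    absolute complement of the union."""
    set_i = users_for(item_i, all_user_data, popular_user_data)
    set_j = users_for(item_j, all_user_data, popular_user_data)
    n11 = len(set_i & set_j)                # users of both items
    n12 = len(set_i - set_j)                # users of item i only
    n21 = len(set_j - set_i)                # users of item j only
    n22 = total_users - len(set_i | set_j)  # users of neither item
    return n11, n12, n21, n22


all_user_data = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}}
popular_user_data = {"A": {1, 3}}  # item A is popular and has been truncated
counts = user_set_counts("A", "B", all_user_data, popular_user_data, total_users=10)
```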
S260, determine the similarity between the two items according to the Shannon entropy of the user sets and the preset similarity measurement rule based on the likelihood ratio test.
In the technical solution of this embodiment, truncating the user data corresponding to popular items effectively reduces the implementation complexity of the similarity algorithm and effectively lowers the probability that popular items appear in similarity-based item recommendations, so that the recommendation results better match actual conditions.
Embodiment three
On the basis of the above embodiments, this embodiment takes a live room as the item to illustrate the similarity measurement, and further adds live room recommendation based on the similarity. Terms that are identical or correspond to those in the above embodiments are not explained again here. Referring to Fig. 3, the similarity measurement method provided in this embodiment includes:
S310, determine the popular live rooms and the user behavior data of the set types corresponding to the popular live rooms.
Specifically, a live room watched by more than 200,000 users within 30 days is determined to be a popular live room, and the watch duration, the number of bullet comments posted, and the follow behavior (whether the user follows the room) of the users corresponding to the popular live room are obtained as the user behavior data of the set types for the popular live room.
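A minimal sketch of the popular-room test, assuming that "number of users" means distinct viewers and that view events are available as `(room_id, user_id, timestamp)` tuples; the function name and event layout are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Set, Tuple


def popular_rooms(
    view_events: Iterable[Tuple[str, str, datetime]],
    now: datetime,
    window_days: int = 30,
    min_users: int = 200_000,
) -> Set[str]:
    """Return rooms with more than min_users distinct viewers in the window."""
    cutoff = now - timedelta(days=window_days)
    viewers: Dict[str, Set[str]] = {}
    for room_id, user_id, timestamp in view_events:
        if timestamp >= cutoff:
            viewers.setdefault(room_id, set()).add(user_id)
    return {room for room, users in viewers.items() if len(users) > min_users}


# Tiny illustration with a lowered threshold.
now = datetime(2018, 4, 2)
events = [
    ("r1", "u1", now - timedelta(days=1)),
    ("r1", "u2", now - timedelta(days=2)),
    ("r2", "u1", now - timedelta(days=1)),
    ("r1", "u3", now - timedelta(days=40)),  # outside the 30-day window
]
hot = popular_rooms(events, now, min_users=1)
```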
S320, determine the user preference values of the users corresponding to the popular live room according to the user behavior data and the preset type weights.
Specifically, according to the watch duration, the number of bullet comments posted, and the follow behavior, together with the corresponding watch duration weight, bullet comment weight, and follow weight, the user preference value of each user corresponding to the popular live room is determined according to the following preference value formula (4).
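The image of formula (4) is not reproduced in this text. A reconstruction consistent with the symbol definitions that follow, assuming min-max standardization and a weighted sum of the three behaviors, would be:

```latex
\begin{aligned}
\mathrm{score}(u) &= \alpha \cdot \mathrm{std\_time}(u) + \beta \cdot \mathrm{std\_msg\_cnt}(u) + \gamma \cdot \mathrm{is\_attention} \\
\mathrm{std\_time}(u) &= \frac{\mathrm{time}(u) - \mathrm{min\_time}}{\mathrm{max\_time} - \mathrm{min\_time}} \\
\mathrm{std\_msg\_cnt}(u) &= \frac{\mathrm{msg\_cnt}(u) - \mathrm{min\_msg\_cnt}}{\mathrm{max\_msg\_cnt} - \mathrm{min\_msg\_cnt}}
\end{aligned}
\tag{4}
```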
Where score(u) is the preference value of user u for the live room; α, β, and γ are the preset type weights of the watch duration, the number of bullet comments posted, and the follow behavior, respectively, set to 0.4, 0.4, and 0.2 in this embodiment; std_time(u) is the standardized watch duration of user u, time(u) is the watch duration of user u, min_time is the minimum watch duration among all users who had any operation behavior in the live room within the second preset time period of 30 days, and max_time is the corresponding maximum watch duration; std_msg_cnt(u) is the standardized number of bullet comments posted by user u, msg_cnt(u) is the number of bullet comments posted by user u, min_msg_cnt is the minimum number of bullet comments posted among all users who had any operation behavior in the live room within the second preset time period of 30 days, and max_msg_cnt is the corresponding maximum; is_attention indicates whether user u follows the live room, taking the value 1 if the user follows it and 0 otherwise.
S330, determine the valid user data corresponding to the popular live room according to the user preference values.
Specifically, all users who had any operation behavior in a popular live room are sorted in descending order of user preference value, and the 10,000 users with the highest preference values in each popular live room are retained and determined to be the valid user data.
S340, determine the valid user data to be the popular user data corresponding to the popular live room.
S350, determine the user sets between the two live rooms according to the popular user data and/or the user data corresponding to non-popular live rooms.
S360, determine the similarity between the two live rooms according to the Shannon entropy of the user sets and the preset similarity measurement rule based on the likelihood ratio test.
Specifically, assume that the viewer counts (i.e., user data) related to live room 1 and live room 2 are, respectively:
N11 = 1000, N12 = 5000, N21 = 2000, N22 = 100000
Then:
The similarity between live room 1 and live room 2 is then:
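The intermediate entropy values and the resulting similarity are not reproduced in this text. As a minimal sketch, the calculation could look as follows, assuming the common log-likelihood-ratio formulation (as used, for example, in Apache Mahout): `entropy` is the unnormalized Shannon entropy in nats, the ratio is 2 × (Row_entropy + Column_entropy − Maxtrix_entropy), and the similarity is 1 − 1/(1 + ratio). The mapping from the three entropies to the final score is an assumption, not taken from the text above.

```python
import math


def x_log_x(x: float) -> float:
    """x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts: float) -> float:
    """Unnormalized Shannon entropy of a list of counts (N times H, in nats)."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(n11: int, n12: int, n21: int, n22: int) -> float:
    row_entropy = entropy(n11 + n12, n21 + n22)
    column_entropy = entropy(n11 + n21, n12 + n22)
    matrix_entropy = entropy(n11, n12, n21, n22)
    # Clamp at zero to absorb floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


def llr_similarity(n11: int, n12: int, n21: int, n22: int) -> float:
    """Map the ratio to [0, 1); this mapping is an assumption."""
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(n11, n12, n21, n22))


# Counts from the example above.
similarity_1_2 = llr_similarity(1000, 5000, 2000, 100000)
```

When the two rooms' audiences are statistically independent the ratio is zero and so is the similarity; the stronger the association between the audiences, the closer the score gets to 1.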
S370, determine the historically watched live rooms of the target user.
Here, the target user is the user to whom live rooms are to be recommended.
Specifically, the live rooms that the target user has historically watched on the live streaming platform are obtained.
S380, sort the live rooms to be recommended according to the similarity between each live room to be recommended and the historically watched live rooms.
Here, the live rooms to be recommended are the live rooms on the live streaming platform that can be recommended to the target user.
Specifically, according to the preset similarity measurement rule, the similarity between the historically watched live rooms and each live room to be recommended is calculated one by one, and the live rooms to be recommended are sorted in descending order of similarity.
S390, determine a preset number of live rooms to be recommended in the ranking result to be the target recommended live rooms.
Here, the preset number is the preconfigured number of live rooms to recommend; it may be a default of the live streaming platform or may be set by the user.
Specifically, in the ranking result of the live rooms to be recommended, the set number of top-ranked live rooms to be recommended are determined to be the target live rooms, and these target live rooms can be recommended to the target user.
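Steps S370 to S390 can be sketched as a small ranking function. The text does not say how similarities to several historically watched rooms are combined; taking the maximum is an assumption here, as are the function name and the tie-breaking by room ID.

```python
from typing import Callable, Iterable, List, Tuple


def recommend(
    history_rooms: Iterable[str],
    candidate_rooms: Iterable[str],
    similarity: Callable[[str, str], float],
    top_n: int,
) -> List[str]:
    """Rank candidate rooms by similarity to the watch history and keep top_n."""
    history = list(history_rooms)
    scored: List[Tuple[float, str]] = []
    for candidate in candidate_rooms:
        # Assumption: a candidate is scored by its best match in the history.
        best = max((similarity(h, candidate) for h in history), default=0.0)
        scored.append((best, candidate))
    # Descending by similarity; ties broken by room ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [room for _, room in scored[:top_n]]


# Hypothetical precomputed similarities between room pairs.
table = {("h1", "a"): 0.9, ("h1", "b"): 0.2, ("h1", "c"): 0.5}
top_two = recommend(["h1"], ["a", "b", "c"], lambda h, c: table[(h, c)], top_n=2)
```

In practice the `similarity` callable would look up precomputed pairwise scores rather than recomputing them per request.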
In the technical solution of this embodiment, the similarity between two live rooms is measured using a preset similarity measurement rule that combines multiple user sets, which makes the similarity measurement between live rooms more comprehensive. Truncating the user data makes the treatment of popular live rooms in the similarity measurement more principled than simply lowering their popularity by hand, so that the live room similarity is more comprehensive and better matches actual conditions; live room recommendation based on this similarity therefore better matches user interests and improves the user experience.
The following is an embodiment of the similarity measurement device provided by the embodiments of the present invention. The device belongs to the same inventive concept as the similarity measurement methods of the above embodiments; for details not described in the device embodiment, refer to the embodiments of the similarity measurement method above.
Embodiment four
This embodiment provides a similarity measurement device. Referring to Fig. 4, the device specifically includes:
a user set determining module 410, configured to determine the user sets between two items according to the user data corresponding to the items whose similarity is to be measured, the user sets including the user intersection, the user relative complements, and the absolute complement of the user union;
a similarity measurement module 420, configured to determine the similarity between the two items according to the Shannon entropy of the user sets and the preset similarity measurement rule based on the likelihood ratio test.
Optionally, the preset similarity measurement rule is:
Row_entropy = entropy(N11 + N12, N21 + N22)
Column_entropy = entropy(N11 + N21, N12 + N22)
Maxtrix_entropy = entropy(N11, N12, N21, N22)
Where S_ij is the similarity between the two items, entropy is the Shannon entropy, N11 is the user intersection of the user set I corresponding to item i and the user set J corresponding to item j, N12 is the relative complement of user set J in user set I (users in I but not in J), N21 is the relative complement of user set I in user set J (users in J but not in I), and N22 is the absolute complement of the union of user set I and user set J.
Optionally, on the basis of the above device, the device further includes:
a popular item determining module, configured to determine, before the user sets between the two items are determined according to the user data corresponding to the items whose similarity is to be measured, the popular items and the user behavior data of the set types corresponding to the popular items;
a user preference value determining module, configured to determine the user preference values of the users corresponding to the popular items according to the user behavior data and the preset type weights.
Correspondingly, the user set determining module 410 is specifically configured to:
determine the valid user data to be the popular user data corresponding to the popular item;
determine the user sets between the two items according to the popular user data and/or the user data corresponding to non-popular items.
Optionally, the popular item determining module is specifically configured to:
determine the popular items according to the number of users corresponding to each item within a first preset time period and a preset number of users.
Optionally, the user preference value determining module is specifically configured to:
standardize the user behavior data within the second preset time period to obtain standardized user behavior data;
determine the user preference values of the users corresponding to the popular item according to the standardized user behavior data and the preset type weights.
Optionally, the item is a live room, and the user behavior data of the set types includes the watch duration, the number of bullet comments posted, and the follow behavior.
Optionally, on the basis of the above device, the device further includes a live room recommending module, configured to:
after the similarity between the two items is determined according to the Shannon entropy of the user sets and the preset similarity measurement rule based on the likelihood ratio test, determine the historically watched live rooms of the target user;
sort the live rooms to be recommended according to the similarity between each live room to be recommended and the historically watched live rooms;
determine a preset number of live rooms to be recommended in the ranking result to be the target recommended live rooms.
The similarity measurement device of Embodiment four of the present invention solves the problem that the similarity measurement in item-based recommendation algorithms is one-sided, and uses the similarity measurement data more comprehensively and reasonably, so that the resulting similarity better matches the actual similarity between items.
The similarity measurement device provided by the embodiments of the present invention can execute the similarity measurement method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
It is worth noting that, in the above embodiment of the similarity measurement device, the units and modules included are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present invention.
Embodiment five
Referring to Fig. 5, this embodiment provides a device 500, which includes: one or more processors 520; and a storage device 510 for storing one or more programs which, when executed by the one or more processors 520, cause the one or more processors 520 to implement the similarity measurement method provided by the embodiments of the present invention, including:
determining the user sets between two items according to the user data corresponding to the items whose similarity is to be measured, the user sets including the user intersection, the user relative complements, and the absolute complement of the user union;
determining the similarity between the two items according to the Shannon entropy of the user sets and the preset similarity measurement rule based on the likelihood ratio test.
Of course, those skilled in the art will understand that the processor 520 can also implement the technical solution of the similarity measurement method provided by any embodiment of the present invention.
The device 500 shown in Fig. 5 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 5, the device 500 includes a processor 520, a storage device 510, an input device 530, and an output device 540. The device may have one or more processors 520; Fig. 5 takes one processor 520 as an example. The processor 520, the storage device 510, the input device 530, and the output device 540 in the device may be connected by a bus or in other ways; Fig. 5 takes connection by a bus 550 as an example.
As a computer-readable storage medium, the storage device 510 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method in the embodiments of the present invention ....
The storage device 510 may mainly include a program storage area and a data storage area, where the program storage area can store the operating system and the application programs required for at least one function, and the data storage area can store data created through use of the terminal, and so on. In addition, the storage device 510 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some examples, the storage device 510 may further include memory located remotely from the processor 520, and such remote memory may be connected to the device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 can be used to receive input numeric or character information and to generate key signal inputs related to the user settings and function control of the device. The output device 540 may include a display device such as a display screen.
Embodiment six
This embodiment provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a similarity measurement method, the method including:
determining the user sets between two items according to the user data corresponding to the items whose similarity is to be measured, the user sets including the user intersection, the user relative complements, and the absolute complement of the user union;
determining the similarity between the two items according to the Shannon entropy of the user sets and the preset similarity measurement rule based on the likelihood ratio test.
Of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present invention, the computer-executable instructions are not limited to the method operations described above and can also perform related operations in the similarity measurement method provided by any embodiment of the present invention.
From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and of course also by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), a hard disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the similarity measurement method of each embodiment of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; it may include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims (10)

1. A similarity measurement method, characterized by comprising:
determining, according to user data corresponding to items whose similarity is to be measured, user sets between two of the items, the user sets comprising a user intersection, user relative complements, and an absolute complement of a user union;
determining the similarity between the two items according to the Shannon entropy of the user sets and a preset similarity measurement rule based on a likelihood ratio test.
2. The method according to claim 1, characterized in that the preset similarity measurement rule is:
Row_entropy = entropy(N11 + N12, N21 + N22)
Column_entropy = entropy(N11 + N21, N12 + N22)
Maxtrix_entropy = entropy(N11, N12, N21, N22)
wherein S_ij is the similarity between the two items, entropy is the Shannon entropy, N11 is the user intersection of the user set I corresponding to item i and the user set J corresponding to item j, N12 is the relative complement of user set J in user set I (users in I but not in J), N21 is the relative complement of user set I in user set J (users in J but not in I), and N22 is the absolute complement of the union of user set I and user set J.
3. The method according to claim 1, characterized in that, before the determining, according to user data corresponding to items whose similarity is to be measured, user sets between two of the items, the method further comprises:
determining a popular item and user behavior data of set types corresponding to the popular item;
determining, according to the user behavior data and preset type weights, user preference values of users corresponding to the popular item;
determining, according to the user preference values, valid user data corresponding to the popular item;
wherein the determining, according to user data corresponding to items whose similarity is to be measured, user sets between two of the items comprises:
determining the valid user data to be popular user data corresponding to the popular item;
determining the user sets between the two items according to the popular user data and/or user data corresponding to non-popular items.
4. The method according to claim 3, characterized in that the determining a popular item comprises:
determining the popular item according to the number of users corresponding to each item within a first preset time period and a preset number of users.
5. The method according to claim 3, characterized in that the determining, according to the user behavior data and preset type weights, user preference values of users corresponding to the popular item comprises:
standardizing the user behavior data within a second preset time period to obtain standardized user behavior data;
determining the user preference values of the users corresponding to the popular item according to the standardized user behavior data and the preset type weights.
6. The method according to claim 3, characterized in that the item is a live room;
the user behavior data of the set types comprises watch duration, number of bullet comments posted, and follow behavior.
7. The method according to claim 6, characterized in that, after the determining the similarity between the two items according to the Shannon entropy of the user sets and a preset similarity measurement rule based on a likelihood ratio test, the method further comprises:
determining historically watched live rooms of a target user;
sorting live rooms to be recommended according to the similarity between each live room to be recommended and the historically watched live rooms;
determining a preset number of live rooms to be recommended in the ranking result to be target recommended live rooms.
8. A similarity measurement device, characterized by comprising:
a user set determining module, configured to determine, according to user data corresponding to items whose similarity is to be measured, user sets between two of the items, the user sets comprising a user intersection, user relative complements, and an absolute complement of a user union;
a similarity measurement module, configured to determine the similarity between the two items according to the Shannon entropy of the user sets and a preset similarity measurement rule based on a likelihood ratio test.
9. A device, characterized in that the device comprises:
one or more processors;
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the similarity measurement method according to any one of claims 1-7.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the similarity measurement method according to any one of claims 1-7.
CN201810284500.0A 2018-04-02 2018-04-02 Similarity measurement method, device, equipment and storage medium Active CN108600792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810284500.0A CN108600792B (en) 2018-04-02 2018-04-02 Similarity measurement method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810284500.0A CN108600792B (en) 2018-04-02 2018-04-02 Similarity measurement method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108600792A true CN108600792A (en) 2018-09-28
CN108600792B CN108600792B (en) 2020-08-04

Family

ID=63625196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810284500.0A Active CN108600792B (en) 2018-04-02 2018-04-02 Similarity measurement method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108600792B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299316A (en) * 2018-11-09 2019-02-01 平安科技(深圳)有限公司 Music recommended method, device and computer equipment
CN109413461A (en) * 2018-09-30 2019-03-01 武汉斗鱼网络科技有限公司 A kind of recommended method and relevant device of direct broadcasting room
CN111209713A (en) * 2020-01-03 2020-05-29 长江存储科技有限责任公司 Wafer data processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160004937A1 (en) * 2014-07-07 2016-01-07 General Electric Company System and method for determining string similarity
CN105260414A (en) * 2015-09-24 2016-01-20 精硕世纪科技(北京)有限公司 User behavior similarity computing method and device
JP2016066135A (en) * 2014-09-24 2016-04-28 日本電信電話株式会社 Similarity evaluation device, similarity evaluation system, similarity evaluation device, and similarity evaluation program
CN106651542A (en) * 2016-12-31 2017-05-10 珠海市魅族科技有限公司 Goods recommendation method and apparatus
CN107172452A (en) * 2017-04-25 2017-09-15 北京潘达互娱科技有限公司 Direct broadcasting room recommends method and device
CN107613395A (en) * 2017-08-28 2018-01-19 武汉斗鱼网络科技有限公司 Recommend method and system in live room

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109413461A (en) * 2018-09-30 2019-03-01 武汉斗鱼网络科技有限公司 A kind of recommended method and relevant device of direct broadcasting room
CN109299316A (en) * 2018-11-09 2019-02-01 平安科技(深圳)有限公司 Music recommended method, device and computer equipment
CN109299316B (en) * 2018-11-09 2023-04-18 平安科技(深圳)有限公司 Music recommendation method and device and computer equipment
CN111209713A (en) * 2020-01-03 2020-05-29 长江存储科技有限责任公司 Wafer data processing method and device
CN111209713B (en) * 2020-01-03 2023-08-18 长江存储科技有限责任公司 Wafer data processing method and device

Also Published As

Publication number Publication date
CN108600792B (en) 2020-08-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant