CN105718488A

CN105718488A - Computer system based recommendation method and apparatus

Info

Publication number: CN105718488A
Application number: CN201410736666.3A
Authority: CN
Inventors: 潘晓彤; 金柯; 刘忠义; 魏虎
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2014-12-04
Filing date: 2014-12-04
Publication date: 2016-06-29
Also published as: WO2016086802A1

Abstract

The present invention relates to recommendation technologies implemented with computer systems, and discloses a computer system based recommendation method and apparatus. In the recommendation method, clustering is first performed according to the item scoring records of each user, so that user feature data is divided into a plurality of categories; then, within the user feature data of each category, items are recommended to target users on an item basis. This yields an efficient recommendation method for big data while ensuring system stability and recommendation diversity. In addition, each computing node does not need to store the user feature data of all categories, which avoids running out of memory.

Description

Recommendation method based on computer system and device thereof

Technical field

The present invention relates to recommendation technology implemented with computer systems, and in particular to a recommendation method and apparatus based on a computer system.

Background

Recommendation algorithms are generally divided into content-based recommendation, association-rule-based recommendation, collaborative-filtering-based recommendation, and combinations of these basic methods. However, the inventors of the present invention have found that the current CF (Collaborative Filtering) algorithm has several problems, which become more apparent in a distributed environment. From the way CF operates, the algorithm has three main bottlenecks:

The first is data scale. For every recommendation, each computing node of the distributed framework has to retain the global data, because a reducer cannot know in advance which users will be assigned to the current node, and storing only local data would reduce accuracy. Each reducer is in effect instantiated as a small recommendation scenario. Assuming t units of computing resources in total, the global data is stored redundantly t-1 extra times, while each reducer only touches a small fraction of the data during the actual recommendation process; the rest is a substantial waste of resources. When the data scale is large, this places a heavy burden on every computing node in both time and storage. In our experiments, owing to limitations of the programming language and compiler, array out-of-bounds problems inevitably occurred once the number of users or items exceeded the order of millions; at the order of tens of millions, because the computing nodes in the cluster are configured unevenly, some low-configuration nodes ran out of memory.

The second is data skew. Whether the CF algorithm is item-based or user-based, it needs to compute similarities between items. This hides a problem: in real application scenarios some items are "active" and others are "inactive". For example, with the MapReduce framework and a <key, value> data schema, some keys have many values and others very few; this inconsistent, uneven distribution is called data skew. When the number of values differs by three or more orders of magnitude between keys, computing item similarities suffers from serious data skew, and the "active" items produce a long tail in computation time. Likewise, during recommendation some users have accumulated many behaviors and others few, and the "active users" slow down the whole computation.

The third is data sparsity: within the object set, very few object pairs have a relation. All objects can be arranged in a matrix in which entry (i, j) represents the relation between the i-th user and the j-th item; if most entries are 0 (meaning no relation), the data is defined as sparse. Dense data is the opposite. Initial data in particular is often incomplete, so sparsity easily arises when item similarities are computed, that is, most positions of the user-item matrix are 0.

Summary of the invention

The object of the present invention is to provide a recommendation method based on a computer system and an apparatus thereof, which achieve efficient recommendation on big data while ensuring system stability and recommendation diversity.

To solve the above technical problem, embodiments of the present invention disclose a recommendation method based on a computer system, the method comprising the following steps:

obtaining the item scoring records of each user for the items;

clustering according to the item scoring records of each user, so as to divide the user feature data into R categories, R being an integer greater than 1;

within the user feature data of each category, recommending items to a target user on an item basis.

Embodiments of the present invention also disclose a recommendation apparatus based on a computer system, the apparatus comprising:

a user-item initial relation computing module, configured to obtain the item scoring records of each user for the items;

a clustering module, configured to cluster according to the item scoring records of each user obtained by the user-item initial relation computing module, so as to divide the user feature data into R categories, R being an integer greater than 1; and

a recommending module, configured to recommend items to a target user on an item basis within the user feature data of each category produced by the clustering module.

Compared with the prior art, the main differences and effects of the embodiments of the present invention are as follows:

In the recommendation method of the present invention, clustering is first performed according to the item scoring records of each user, dividing the user feature data into multiple categories; items are then recommended to the target user on an item basis within the user feature data of each category. This achieves efficient recommendation on big data while ensuring system stability and recommendation diversity.

Further, each computing node does not need to store the user feature data of all categories, which avoids running out of memory.

Further, for each item or each user in each category, only the few items with the strongest relation to it are selected, rather than all related items being retained, which avoids the data skew caused by weakly related items.

Further, a data sparsity degree is used to detect data sparsity, and when sparsity is found, similarity completion is performed through second-degree relations between items, so that sparsity does not degrade recommendation accuracy.

Further, whether to cluster the users is decided according to the number of users, so as to better accommodate item recommendation on both small data and big data.

Brief description of the drawings

Fig. 1 is a schematic flow chart of a recommendation method based on a computer system in the first embodiment of the present invention;

Fig. 2 is a schematic flow chart of the clustering decision in a recommendation method based on a computer system in the first embodiment of the present invention;

Fig. 3 is a schematic flow chart of the recommending step in a recommendation method based on a computer system in the second embodiment of the present invention;

Fig. 4 is a schematic flow chart of the recommending step in a recommendation method based on a computer system in the second embodiment of the present invention;

Fig. 5 is a schematic flow chart of the recommending step in a recommendation method based on a computer system in the second embodiment of the present invention;

Fig. 6 is a schematic flow chart of data completion in a recommendation method based on a computer system in the second embodiment of the present invention;

Fig. 7 is a schematic diagram of an existing computation of user similarity;

Fig. 8 and Fig. 9 are schematic diagrams of existing user-based collaborative filtering;

Fig. 10 and Fig. 11 are schematic diagrams of existing item-based collaborative filtering;

Fig. 12 is a diagram of an existing MapReduce framework implementing a distributed CF algorithm;

Fig. 13 is a schematic flow chart of a recommendation method based on a computer system in the second embodiment of the present invention;

Fig. 14 is a schematic flow chart of a recommendation method based on a computer system in the second embodiment of the present invention;

Fig. 15 is a schematic structural diagram of a recommendation apparatus based on a computer system in the third embodiment of the present invention;

Fig. 16 is a schematic structural diagram of the recommending module in a recommendation apparatus based on a computer system in the fourth embodiment of the present invention.

Detailed description

In the following description, many technical details are set forth so that the reader may better understand the present application. However, those skilled in the art will understand that the technical solutions claimed in the claims of the present application can be realized even without these technical details and with various changes and modifications based on the following embodiments.

To make the objects, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

The first embodiment of the present invention relates to a recommendation method based on a computer system. Fig. 1 is a schematic flow chart of this method. As shown in Fig. 1, the method comprises the following steps:

In step 101, the item scoring records of each user for the items are obtained. It will be appreciated that, in the embodiments of the present invention, an item may be a commodity, a service, or another object of recommendation.

The method then proceeds to step 102, in which clustering is performed according to the item scoring records of each user, dividing the user feature data into R categories, R being an integer greater than 1. It will be appreciated that, in the embodiments of the present invention, the K-means algorithm may be used to cluster the user feature data directly, or the Canopy algorithm may first be used for coarse clustering, followed by the K-means algorithm for fine clustering.

Coarse clustering with the Canopy algorithm followed by fine clustering with the K-means algorithm improves clustering speed while preserving accuracy.

It will further be appreciated that user feature data is data composed of user information, item information, and the users' scoring records for items.

The method then proceeds to step 103, in which, within the user feature data of each category, items are recommended to the target user on an item basis. It will be appreciated that, in the embodiments of the present invention, a recommendation algorithm based on collaborative filtering, on association rules, or on utility may be used to recommend items to the target user.

The flow then ends.

Of course, in other embodiments of the present invention, clustering may instead take items as its objects, with items then recommended to the target user on a user basis within the user feature data of each category; or clustering and recommendation may both be user-based or both be item-based.

In the recommendation method of this embodiment, clustering is first performed according to the item scoring records of each user, dividing the user feature data into multiple categories; items are then recommended to the target user on an item basis within the user feature data of each category. This achieves efficient recommendation on big data while ensuring system stability and recommendation diversity.

Preferably, the computer system is a distributed system comprising at least two computing nodes.

In step 103, the user feature data of the categories is distributed to multiple computing nodes; each computing node stores the user feature data of at most R-1 categories and, for the user feature data of each category it stores, recommends items to the target user on an item basis. Each computing node therefore does not need to store the user feature data of all categories, which avoids running out of memory.

Preferably, each computing node stores and processes the user feature data of one category. It will also be appreciated that, in the embodiments of the present invention, the user feature data of two or more categories may be distributed to a high-configuration computing node for processing, according to the configuration of each computing node. Of course, when the amount of user feature data is not very large, it may also be processed by a single computing node.

As an optional embodiment, as shown in Fig. 2, the following steps are further included before step 102:

In step 201, it is determined whether the number of users is greater than a user scale threshold. If the number of users is less than the user scale threshold, the method proceeds to step 202; if the number of users is greater than the user scale threshold, the method proceeds to step 102.

In step 202, items are recommended to the target user on an item basis directly over all user feature data.

The flow then ends.

Deciding whether to cluster the users according to the number of users better accommodates item recommendation on both small data and big data.

It will further be appreciated that, in other embodiments of the present invention, the user feature data may be clustered directly without judging the number of users.

The second embodiment of the present invention relates to a recommendation method based on a computer system. Fig. 3 is a schematic flow chart of the recommending step of this method.

The second embodiment makes the following two main improvements on the basis of the first embodiment:

The first improvement is that, for each item or each user in each category, only the few items with the strongest relation to it are selected, rather than all related items being retained, which avoids the data skew caused by weakly related items. Specifically:

In step 103, item-based collaborative filtering is used to recommend items to the target user. As shown in Fig. 3, step 103 includes the following sub-steps:

In sub-step 301, according to the item scoring records of the users in the category, the similarities between all items in the category are computed, and for each item the M items with the highest similarity are selected, M being a predefined integer.

The method then proceeds to sub-step 302, in which, according to the target user's item scoring records in the category, the T items with the highest scores are selected for the target user, T being a predefined integer.

The method then proceeds to sub-step 303, in which the T items selected for the target user and the M items selected for each of those T items are combined, and the items in the target user's item list are removed, forming an initial recommendation result. For example, if the T items selected for the target user are A, B and C, and the M items selected for items A, B and C are (D, E), (C, F) and (B, H) respectively, the initial recommendation result formed is (D, E, F, H).

The flow then ends.
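Sub-steps 302 and 303 can be illustrated with a minimal Python sketch. The function name, the dictionary layout, and the example scores (5, 4, 3) are assumptions made for illustration only; the similar-item lists are taken as given, as if sub-step 301 had already produced them.

```python
def initial_recommendation(user_scores, top_similar, T):
    """user_scores: {item: score} for the target user within one category.
    top_similar: {item: [its M most similar items]} (output of sub-step 301)."""
    # Sub-step 302: the T items the target user scored highest.
    top_t = sorted(user_scores, key=user_scores.get, reverse=True)[:T]
    # Sub-step 303: combine the T items with the M items chosen for each ...
    combined = list(top_t)
    for item in top_t:
        combined.extend(top_similar.get(item, []))
    # ... then drop everything already in the user's item list, keeping order.
    result = []
    for item in combined:
        if item not in user_scores and item not in result:
            result.append(item)
    return result


scores = {"A": 5, "B": 4, "C": 3}          # hypothetical scores
similar = {"A": ["D", "E"], "B": ["C", "F"], "C": ["B", "H"]}
print(initial_recommendation(scores, similar, T=3))  # ['D', 'E', 'F', 'H']
```

With the A, B, C example from the text, the sketch reproduces the initial recommendation result (D, E, F, H).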

Preferably, as shown in Fig. 4, the following sub-steps are further included after sub-step 303:

In sub-step 401, it is determined whether the number of items in the initial recommendation result is greater than N, N being a predefined integer. If the number of items in the initial recommendation result is greater than N, the method proceeds to sub-step 402; if it is less than N, the method proceeds to sub-step 403.

In sub-step 402, the N items with the highest similarity are selected from the initial recommendation result and recommended to the target user.

The flow then ends.

In sub-step 403, all items in the target user's item list and the M items selected for each item in that list are combined, and all items in the target user's item list are removed, forming a user-data-completion recommendation result. It will be appreciated that the user-data-completion recommendation result is formed in a manner similar to the initial recommendation result, which is not repeated here.

The flow then ends.

More preferably, as shown in Fig. 5, the following sub-steps are further included after sub-step 403:

In sub-step 501, it is determined whether the number of items in the user-data-completion recommendation result is greater than N, N being a predefined integer. If the number of items in the user-data-completion recommendation result is greater than N, the method proceeds to sub-step 502; if it is less than N, the method proceeds to sub-step 503.

In sub-step 502, the N items with the highest similarity are selected from the user-data-completion recommendation result and recommended to the target user.

The flow then ends.

In sub-step 503, all items in the target user's item list and all items having a similarity relation with each item in that list are combined, and all items in the target user's item list are removed, forming an item-data-completion recommendation result. It will be appreciated that the item-data-completion recommendation result is formed in a manner similar to the initial recommendation result, which is not repeated here.

The flow then ends.
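The cascade of sub-steps 401 to 503 can be sketched as follows. This is only an illustration under stated assumptions: the function names and data layout are hypothetical, a result holding exactly N items is treated as sufficient (the text only covers "greater than" and "less than"), and the final ranking by similarity in sub-steps 402 and 502 is left out.

```python
def candidates(seed_items, neighbours, owned):
    """Combine seed items with their neighbours, minus items the user already has."""
    out = []
    for item in seed_items:
        for other in [item] + list(neighbours.get(item, [])):
            if other not in owned and other not in out:
                out.append(other)
    return out


def recommend_with_fallback(top_t, user_items, top_m, all_similar, n):
    """Return (stage, candidate list), widening the search when too few items are found.

    top_t: the T best-scored items of the target user
    user_items: the target user's full item list
    top_m: {item: its M most similar items}
    all_similar: {item: every item with a similarity relation to it}
    """
    owned = set(user_items)
    initial = candidates(top_t, top_m, owned)             # sub-step 303
    if len(initial) >= n:                                 # sub-step 401 (>= is assumed)
        return "initial", initial
    user_completed = candidates(user_items, top_m, owned)  # sub-step 403
    if len(user_completed) >= n:                          # sub-step 501
        return "user data completion", user_completed
    return "item data completion", candidates(user_items, all_similar, owned)  # 503
```

Each stage widens one input: first the T preferred items become the whole item list, then the top M similar items become all similar items.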

The second improvement is that a data sparsity degree is used to detect data sparsity, and when sparsity is found, similarity completion is performed through second-degree relations between items, so that sparsity does not degrade recommendation accuracy. Specifically:

As shown in Fig. 6, the following steps are further included after step 103:

In step 601, it is determined whether the data sparsity degree is greater than a data sparsity threshold. The data sparsity degree is computed from k and l (the defining formula is not reproduced in this text), where k is the number of item pairs found to have a similarity relation within the category and l is the number of items in the category. If the data sparsity degree is less than the data sparsity threshold, the method proceeds to step 602; if the data sparsity degree is greater than the data sparsity threshold, the method proceeds to step 603.

In step 602, a first item, a second item and a third item are taken as a group, where similarity relations exist between the first item and the second item and between the second item and the third item; a similarity relation is established between the first item and the third item through the second item, and items are recommended to the target user again on an item basis within the category according to the supplemented similarity relations between items.

The flow then ends.

In step 603, the recommended items computed on an item basis are recommended to the target user.

The flow then ends.
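Steps 601 and 602 can be illustrated with the following Python sketch. Two points are assumptions and not taken from the text: the sparsity degree is computed as the number of related item pairs divided by the number of possible item pairs, since the defining formula is not legible above, and a newly established first-to-third relation is scored as the product of the two existing similarities, since the text does not say how the new similarity is valued.

```python
from itertools import combinations


def sparsity_degree(similar_pairs, item_count):
    """Assumed definition: related item pairs divided by all possible item pairs."""
    possible = item_count * (item_count - 1) // 2
    return len(similar_pairs) / possible if possible else 0.0


def complete_second_degree(sim):
    """sim: {frozenset({a, b}): similarity}. Returns a copy with second-degree pairs added."""
    neighbours = {}
    for pair in sim:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    completed = dict(sim)
    for second, around in neighbours.items():
        # Every two neighbours of `second` form a first/third pair bridged by it.
        for first, third in combinations(sorted(around), 2):
            key = frozenset((first, third))
            if key not in sim:
                # Assumption: value the new relation as the product of both hops.
                value = sim[frozenset((first, second))] * sim[frozenset((second, third))]
                completed[key] = max(completed.get(key, 0.0), value)
    return completed


sim = {frozenset(("A", "B")): 0.8, frozenset(("B", "C")): 0.5}
if sparsity_degree(sim, item_count=3) < 0.9:   # 0.9 is a hypothetical threshold
    sim = complete_second_degree(sim)          # adds the A-C relation through B
```

Applying the completion function again to its own output would correspond to the third-degree relations.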

It will further be appreciated that if data sparsity persists after similarity completion through second-degree relations between items, similarity completion may continue through third-degree, fourth-degree or higher-degree relations between items, so that sparsity does not degrade recommendation accuracy.

Recommendation algorithms are generally divided into content-based recommendation, association-rule-based recommendation, collaborative-filtering-based recommendation, and combinations of these basic methods. Content-based recommendation recommends according to how similar a user and an item are on certain attributes, a typical example being the vector space model. Association-rule-based recommendation takes purchased items as the rule head and recommends the rule body. Collaborative-filtering-based recommendation mines deeper relations between items or between users and recommends to a user according to group behavior patterns (that is, which other items do people who bought this item tend toward?), for example by recommending strongly related items. A strong relation between two users (or items) means that the two are highly similar; a weak relation is the opposite.

Collaborative filtering has two implementations: the first is user-based and the second is item-based.

1. User-based collaborative filtering

As the name suggests, user-based collaborative filtering first computes the n neighbor users most similar to the current user and recommends from the preferred items of those n neighbors. The similarity between two users is computed from their item preferences, as shown in Fig. 7.

The whole process builds connections through relations between users, and the concrete relation between users is computed with items as the intermediary. As shown in Fig. 8 and Fig. 9, the steps may be as follows:

(1) Compute the neighbor list of the current user (i.e. the target user). The computation uses the item preference lists of the current user and of each candidate neighbor, with the relations between items serving as the bridge between users.

(2) Take the top n neighbor users as recommendation candidates.

(3) Among the top n neighbor users, find the items that do not appear in the current user's preference list and build a recommendation candidate list.

(4) For each item i in the candidate list, compute its preference with respect to each item in the current user's preference list and derive a final score.

(5) Sort the items in the candidate list by final score and take the top m items as the recommendation result.
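The five steps above can be sketched in Python as follows. This is a simplified illustration, not the scoring described in step (4): here a candidate's final score is the similarity-weighted average of the neighbors' scores, which is a common substitute. The function name, the data layout and the `overlap` similarity in the example are all assumptions.

```python
def user_based_recommend(prefs, target, similarity, n, m):
    """prefs: {user: {item: score}}; similarity: function of two preference dicts."""
    # Steps (1)-(2): rank the other users by similarity and keep the top n.
    others = [u for u in prefs if u != target]
    neighbours = sorted(others, key=lambda u: similarity(prefs[target], prefs[u]),
                        reverse=True)[:n]
    # Steps (3)-(4): score items the target has not seen, weighting each
    # neighbour's score by that neighbour's similarity (assumed scoring rule).
    totals, weights = {}, {}
    for u in neighbours:
        w = similarity(prefs[target], prefs[u])
        for item, score in prefs[u].items():
            if item not in prefs[target]:
                totals[item] = totals.get(item, 0.0) + w * score
                weights[item] = weights.get(item, 0.0) + w
    final = {i: totals[i] / weights[i] for i in totals if weights[i] > 0}
    # Step (5): take the top m candidates by final score.
    return sorted(final, key=final.get, reverse=True)[:m]


def overlap(a, b):
    """Toy similarity: number of items both users have scored."""
    return len(set(a) & set(b))


prefs = {"u1": {"A": 5, "B": 3}, "u2": {"A": 4, "B": 2, "C": 5}, "u3": {"D": 4}}
print(user_based_recommend(prefs, "u1", overlap, n=1, m=2))  # ['C']
```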

2. Item-based collaborative filtering

Item-based collaborative filtering first computes the similarities between items from the user-item relations and then, according to the current user's existing behavior, recommends the n most similar items, as shown in Fig. 10 and Fig. 11.

The whole flow builds connections through the similarities between items, and the steps may be as follows:

(1) With users as the bridge, compute the similarity between item i and item j.

(2) Construct a matrix in which entry (i, j) represents the similarity between item i and item j.

(3) For each item in the current user's preference list, compute its top n similar items.

(4) Sort all the similar items by score and take the top n items as the recommendation result.

Both CF algorithms require similarity computation, but the overall algorithm framework is not tied to any particular similarity measure; the system simply exposes similarity computation as an open interface. In practice multiple similarity algorithms may be used, such as Euclidean distance and the Jaccard coefficient.
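The idea of similarity as an open interface can be sketched with two interchangeable Python functions. The type aliases and function names are hypothetical, and converting a Euclidean distance d into a similarity as 1 / (1 + d) is one common convention, assumed here rather than taken from the text.

```python
import math
from typing import Callable, Dict

Ratings = Dict[str, float]
Similarity = Callable[[Ratings, Ratings], float]


def euclidean_similarity(a: Ratings, b: Ratings) -> float:
    """Map Euclidean distance over shared keys into (0, 1]; 0 if nothing is shared."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    distance = math.sqrt(sum((a[k] - b[k]) ** 2 for k in shared))
    return 1.0 / (1.0 + distance)


def jaccard_similarity(a: Ratings, b: Ratings) -> float:
    """Jaccard coefficient of the two key sets; the scores themselves are ignored."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def most_similar(target: Ratings, others: Dict[str, Ratings], sim: Similarity) -> str:
    """Return the key in `others` whose ratings are most similar to `target`."""
    return max(others, key=lambda name: sim(target, others[name]))
```

Because both functions share one signature, the caller can swap the measure without touching the rest of the recommendation flow.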

In application scenarios it is hard to say which algorithm is better; algorithm performance depends on the actual data distribution:

1. When the item-item matrix is relatively dense, so that the relation between most items can be expressed by a score, and that relation is well discriminated (scores are distributed fairly evenly rather than confined to a narrow interval), the item-based algorithm tends to perform better.

2. Another scenario favoring the item-based algorithm is when the number of items is significantly smaller than the number of users; conversely, if the number of users is smaller than the number of items, the user-based algorithm is chosen.

3. Data stability is another factor in choosing an algorithm: whichever of items and users is more stable, the corresponding algorithm tends to give better results.

4. If diversity of recommendation is pursued rather than accuracy, the user-based algorithm performs better.

The above rules of thumb are not always valid; in practice the preferable recommendation scheme has to be found through extensive experiments.

There is as yet no unified industry standard for evaluating the recommendation effect of a recommendation system. Besides metrics common in machine learning such as precision and recall, attention is usually also paid to recommendation diversity, that is, whether a user's recommendation results are rich.

In the big data era, single-machine recommendation algorithms are no longer adequate. Using the MapReduce framework, the Hadoop ecosystem already provides a complete set of CF algorithms in a package named Mahout, which implements not only the item-based and user-based algorithms but also multiple similarity and neighborhood algorithms. Collaborative filtering can also be implemented on higher-level computing frameworks such as Spark.

User-based algorithm:

(1) Build the data model and initialize the user2item and item2user data structures.

(2) According to the user-item-neighborhood relationship, use a chosen similarity algorithm to compute the top n neighborhood of each user globally (over all users).

(3) Using the user-neighborhood-item relationship, compute the possible items.

(4) Using the item-possible item similarity, recommend for the current user.

Item-based algorithm:

(1) Build the data model and initialize the user2item and item2user data structures.

(2) According to the user-item-user-item relationship, compute the possible items of each user.

(3) Compute the degree of association between each possible item and the current user:

\mathrm{pref}_{i2i}(j) = \sum_{i=0}^{n} \mathrm{sim}_{i2i}(i, j) \cdot \mathrm{pref}(i)

\mathrm{sim}_{i2i}(j) = \sum_{i=0}^{n} \mathrm{sim}_{i2i}(i, j)

\mathrm{preference}(j) = \frac{\mathrm{pref}_{i2i}(j)}{\mathrm{sim}_{i2i}(j)}

(4) Sort by preference score and select the highest-scoring items as the recommended items.
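The preference score of step (3) is a similarity-weighted average of the current user's existing preferences. A minimal Python sketch follows; the function name, the lookup-table similarity and the zero-denominator guard are assumptions added for illustration.

```python
def preference(candidate, user_prefs, sim):
    """user_prefs: {item i: pref(i)}; sim(i, j): item-to-item similarity."""
    weighted = sum(sim(i, candidate) * p for i, p in user_prefs.items())  # pref_i2i(j)
    total = sum(sim(i, candidate) for i in user_prefs)                    # sim_i2i(j)
    return weighted / total if total else 0.0   # guard added for unrelated items


table = {("A", "C"): 0.5, ("B", "C"): 0.5}      # hypothetical similarities
score = preference("C", {"A": 4, "B": 2}, lambda i, j: table.get((i, j), 0.0))
print(score)  # 3.0
```

Here item C is equally similar to A and B, so its score is the plain average of the user's preferences for A and B.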

The MapReduce framework is a distributed computing framework that decomposes a task into a map process and a reduce process; the map process outputs data in a <key, value> schema, and the reduce process applies a specific algorithm to all the values of each key. As shown in Fig. 12, to implement a distributed CF algorithm in the MapReduce framework, the input data is prepared in the map process, for example by parsing the input data, loading the initial data schema, and unifying the data into <key, value> form, where the key is the userID (user identifier) and the value is the itemID (item identifier) and score. In the reduce process, the Mahout data model and some global data structures (the neighborhood object, recommender object, similarity object, etc.) are initialized, and then the actual recommendation process (user-based or item-based recommendation) is carried out.
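The division of work between map and reduce can be imitated in plain Python without Hadoop or Mahout. In this sketch the 'userID,itemID,score' line format, the function names, and the placeholder reducer (which merely orders each user's items by score instead of running a recommender) are all assumptions.

```python
from collections import defaultdict


def map_phase(lines):
    """Parse 'userID,itemID,score' lines into <key, value> pairs keyed by userID."""
    for line in lines:
        user_id, item_id, score = line.strip().split(",")
        yield user_id, (item_id, float(score))


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(user_id, values):
    """Placeholder reducer: return the user's items ordered by descending score."""
    ordered = sorted(values, key=lambda v: v[1], reverse=True)
    return user_id, [item for item, _ in ordered]


records = ["u1,B,3", "u1,A,5", "u2,A,4"]
results = dict(reduce_phase(k, v) for k, v in shuffle(map_phase(records)).items())
print(results)  # {'u1': ['A', 'B'], 'u2': ['A']}
```

A reducer only ever sees the values of its own keys, which is why a conventional distributed CF implementation has to load global data into every reducer.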

However, the existing CF algorithm still has the big data problem, the data skew problem and the data sparsity problem. These problems can be solved by the recommendation method based on a computer system described above. The method is described in further detail below from these three aspects.

1. Solving the big data problem by clustering

In practical scenarios with large data scale, for example at the order of hundreds of millions of records, we use clustering to reduce the scale of the problem. Clustering is an unsupervised learning algorithm: objects of a given kind, such as users or items, are divided into multiple categories according to their attributes without manual labeling. That is, with no manual intervention, each item is expressed as a feature list and the clustering algorithm completes the clustering process automatically.

Preferably, users are chosen as the clustering objects, so that users with similar features are gathered into the same category. Why not choose items as the clustering objects? One reason is that if items were clustered, the items of a given category in the final clustering result would be confined to a few items, which runs counter to the recommendation diversity metric and harms the diversity of the recommendation results; users are therefore taken as the clustering objects. Another reason is that the item-based algorithm serves as the main recommendation algorithm; if clustering also used items, it would to some extent duplicate the item-based recommendation and likewise harm the diversity of the results. Of course, in other embodiments of the present invention, the user-based algorithm may be used as the main recommendation algorithm, with items chosen as the clustering objects.

Feature preparation: each user is treated as an object and is characterized by features, each historical record of the user counting as one feature. For example, a record <i, t, s> of user i indicates that user i's preference for item t is s, so a feature "t:s" is added for that user. In this way every user is characterized.
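Feature preparation can be sketched in a few lines of Python. The function name and the in-memory tuple format of the records are assumptions; only the "t:s" feature form comes from the text.

```python
from collections import defaultdict


def build_features(records):
    """records: iterable of (user, item, score). Returns {user: ["item:score", ...]}."""
    features = defaultdict(list)
    for user, item, score in records:
        features[user].append(f"{item}:{score}")
    return dict(features)


print(build_features([("i", "t", 4), ("i", "u", 2), ("j", "t", 5)]))
# {'i': ['t:4', 'u:2'], 'j': ['t:5']}
```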

Optionally, the cluster scale is calculated so that about 10,000,000 users are gathered into one category, which ensures that no deadlock occurs on the distributed computing platform. Of course, a different number of users per category may be set as actually needed.

The bottleneck of the clustering algorithm lies in computing the similarities between items. Preferably, the Canopy algorithm is used to determine the initial centers, and K-means is then used for the final clustering. The Canopy algorithm first divides all the data into r subsets, where two subsets may overlap; each subset is then clustered with the K-means algorithm, and no similarity is computed between data in different subsets. The flow chart of the clustering method is shown in Fig. 13. Of course, in other embodiments of the present invention, the K-means algorithm or another clustering algorithm may be applied to all the data directly.

Wherein, Canopy algorithmic procedure is specific as follows:

(1) put into internal memory after data set vectorization being obtained a list (list), select two distances Threshold value: T1 and T2, wherein T1 > value of T2, T1 and T2 can determine with cross check；

(2) appoint from list and take 1 P, quickly calculate a P with all by the low this method that is calculated as Distance between Canopy is (if there is currently no Canopy, then using a P as one Canopy), if fruit dot P and certain Canopy distance are within T1, then a P is joined this Canopy；

(3) If the distance between point P and a Canopy is within T2, delete point P from the list; at this step point P is considered close enough to that Canopy that it can no longer serve as the center of another Canopy;

(4) Repeat steps (2) and (3) until the list is empty.
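Steps (1) to (4) can be sketched in Python as below. This is a simplified single-machine illustration under stated assumptions: Euclidean distance stands in for the low-cost distance measure, the point that starts a Canopy is used as that Canopy's center, and a point that is not within T2 of any existing Canopy starts a new Canopy (the steps above do not spell this case out). The function names and sample points are hypothetical.

```python
import math


def euclidean(a, b):
    """Stand-in for the low-cost distance measure (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2, distance=euclidean):
    """Group points into possibly overlapping Canopies using thresholds T1 > T2."""
    if t1 <= t2:
        raise ValueError("T1 must be greater than T2")
    canopies = []  # each Canopy: {"center": point, "members": [points]}
    for p in points:  # every point is taken from the list exactly once
        strongly_bound = False
        for c in canopies:
            d = distance(p, c["center"])
            if d < t1:
                c["members"].append(p)  # step (2): within T1, join this Canopy
            if d < t2:
                strongly_bound = True   # step (3): too close to become a center
        if not strongly_bound:
            # Assumed rule: an unbound point becomes the center of a new Canopy.
            canopies.append({"center": p, "members": [p]})
    return canopies


points = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (10.0, 0.0)]
canopies = canopy(points, t1=4.0, t2=1.0)
centers = [c["center"] for c in canopies]
```

In this sample the point (3.0, 0.0) falls within T1 of the first Canopy but not within T2, so it is a member of the first Canopy and also the center of a second one, which illustrates the overlap between subsets. The resulting centers could then be passed to K-means as initial centers.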

2. Reconstruct the CF algorithm and solve the data skew problem with a top-N method

As shown in Figure 14, the reconstructed CF algorithm is as follows:

(1) From the historical data records of each user, compute the relations between different items under the same user; the data schema is <item1, score1, item2, score2>.

(2) With item1_item2 as the key, compute the similarity between the two items.

(3) For each item, retain only the top M similar items to form topItemList; data are also fetched from this topItemList when recommending for a user.

(4) In userItemList (i.e. the user's item list), take only the top T items for each user to generate betterItemList (i.e. the user's preference list).

(5) Take the items from each user's betterItemList, combine them with each item's topItemList, filter out the items on which the user already has behavior, and generate itemCandidateList (i.e. the initial recommendation result).

(6) If itemCandidateList contains fewer than N items, first restore betterItemList to userItemList; if itemCandidateList still contains too few items, restore topItemList to the full data.

(7) From itemCandidateList, compute the top N items according to similarity and user preference as the recommendation result.
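Steps (1) to (7) can be sketched on a single machine as below. This is an illustration under stated assumptions rather than the distributed implementation: cosine similarity between item score vectors is assumed as the similarity measure, candidates are ranked by the sum of preference multiplied by similarity, and all function names and sample data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def item_similarities(user_items):
    """Steps (1)-(2): cosine similarity per item pair (an assumed measure)."""
    dot = defaultdict(float)
    norm = defaultdict(float)
    for items in user_items.values():
        for item, score in items.items():
            norm[item] += score * score
        # Pairs of different items under the same user, keyed by (item1, item2).
        for (a, sa), (b, sb) in combinations(sorted(items.items()), 2):
            dot[(a, b)] += sa * sb
    sims = defaultdict(dict)
    for (a, b), d in dot.items():
        denom = math.sqrt(norm[a] * norm[b])
        if denom == 0:
            continue
        sims[a][b] = sims[b][a] = d / denom
    return sims


def top_k(scored, k):
    """Keep the k highest-scoring entries of a dict (all entries if k is None)."""
    ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked if k is None else ranked[:k])


def recommend(user, user_items, sims, m, t, n):
    seen = user_items[user]  # userItemList
    # Step (6): restore betterItemList first, then topItemList, while the
    # candidate list holds fewer than N items.
    for t_limit, m_limit in ((t, m), (None, m), (None, None)):
        better = top_k(seen, t_limit)        # step (4): betterItemList
        candidates = defaultdict(float)      # step (5): itemCandidateList
        for item, pref in better.items():
            top_items = top_k(sims.get(item, {}), m_limit)  # step (3): topItemList
            for other, sim in top_items.items():
                if other not in seen:        # filter items the user already acted on
                    candidates[other] += pref * sim
        if len(candidates) >= n:
            break
    # Step (7): top N by preference-weighted similarity (assumed scoring).
    return list(top_k(candidates, n))


user_items = {
    "u1": {"a": 5, "b": 3},
    "u2": {"a": 4, "b": 2, "c": 5},
    "u3": {"b": 4, "c": 3, "d": 5},
}
sims = item_similarities(user_items)
result = recommend("u1", user_items, sims, m=2, t=2, n=2)
```

Truncating to the top M similar items and the top T preferred items bounds the work per user, which is how the top-N approach limits data skew; the fallback in step (6) only widens the lists for users whose candidate list would otherwise be too short.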

3. Method for solving data sparsity

During experiments we found that some experimental data exhibit a serious data sparsity problem: when the similarity between items is computed, only a small fraction of item pairs are related, and most items have no direct relation. We therefore define a data sparsity degree, DSP (its formula is not reproduced in this text), where l is the number of i2i pairs computed by the CF algorithm and k is the number of distinct items; the smaller this metric, the sparser the data. It can be appreciated that in other embodiments of the present invention, other definitions of data sparsity may be used to detect the data sparsity problem.

Preferably, data sparsity is solved as follows:

(1) Compute CF with the traditional method.

(2) Compute the statistic DSP; if DSP is less than the threshold (i.e. the data sparsity degree threshold), perform i2i completion. Specifically, the threshold is defined as DST = α, where α is user-defined.

(3) The i2i completion algorithm is itemA -> itemB -> itemC, i.e. an intermediate item is used to establish a link between the items on either side, where itemA and itemB, and itemB and itemC, are neighbors. For example, if itemA and itemB have similarity S_AB and itemB and itemC have similarity S_BC, then itemA and itemC have similarity S_AC = S_AB * S_BC, or a value given by an alternative formula that is not reproduced in this text.
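The completion in step (3) can be sketched as follows, using the product rule S_AC = S_AB * S_BC. This is a minimal illustration: the function name complete_i2i is hypothetical, similarities are assumed to be stored symmetrically, directly computed similarities are left untouched, and when several intermediate items link the same pair the largest product is kept (that tie-breaking rule is an assumption).

```python
def complete_i2i(sims):
    """Add S_AC = S_AB * S_BC for item pairs (A, C) with no direct similarity."""
    completed = {a: dict(neighbours) for a, neighbours in sims.items()}
    for b, neighbours in sims.items():          # b is the intermediate item
        for a, s_ab in neighbours.items():
            for c, s_bc in neighbours.items():
                if a == c or c in sims.get(a, {}):
                    continue                    # keep directly computed similarities
                inferred = s_ab * s_bc
                # Assumption: keep the largest product over all intermediate items.
                if inferred > completed.setdefault(a, {}).get(c, 0.0):
                    completed[a][c] = inferred
    return completed


# itemA and itemB are neighbours, itemB and itemC are neighbours.
sims = {
    "A": {"B": 0.8},
    "B": {"A": 0.8, "C": 0.5},
    "C": {"B": 0.5},
}
completed = complete_i2i(sims)
```

Here itemA and itemC gain a similarity through itemB, while the existing A-B and B-C similarities are unchanged.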

Experiments show that the completion algorithm can generally add about 30% new data, providing a strong data supplement for recommendation.

The above is only a preferred embodiment of the present invention. The improvements, when combined, form a preferred embodiment of the present invention, but each improvement may also be used separately. Moreover, the parameters mentioned in the above embodiments may be set as required.

Each method embodiment of the present invention may be implemented in software, hardware, firmware, or the like. Regardless of whether the present invention is implemented in software, hardware, or firmware, the instruction code may be stored in any type of computer-accessible memory (for example, permanent or modifiable, volatile or non-volatile, solid-state or non-solid-state, fixed or removable media, and so on). Likewise, the memory may be, for example, programmable array logic (Programmable Array Logic, "PAL"), random access memory (Random Access Memory, "RAM"), programmable read-only memory (Programmable Read Only Memory, "PROM"), read-only memory (Read-Only Memory, "ROM"), electrically erasable programmable read-only memory (Electrically Erasable Programmable ROM, "EEPROM"), a magnetic disk, an optical disc, a digital versatile disc (Digital Versatile Disc, "DVD"), and so on.

The third embodiment of the present invention relates to a recommendation apparatus based on a computer system. Figure 15 is a schematic structural diagram of this recommendation apparatus. As shown in Figure 15, the apparatus includes:

A user-item initial relation calculation module, configured to obtain each user's item scoring records for the items.

A clustering module, configured to perform clustering according to the item scoring records of each user obtained by the user-item initial relation calculation module, and to divide the user feature data into R categories, where R is an integer greater than 1; and

A recommendation module, configured to recommend items to a target user on an item basis within the user feature data of each category produced by the clustering module. It can be appreciated that, in various embodiments of the present invention, the recommendation module may use a recommendation algorithm based on collaborative filtering, on association rules, or on utility to recommend items to the target user.

In addition, it can be appreciated that in other embodiments of the present invention, the clustering module may instead cluster the items, with the recommendation module then recommending items to the target user on a user basis within the user feature data of each category; alternatively, clustering and recommendation may both be user-based or both be item-based.

In the recommendation apparatus of this embodiment, the clustering module first performs clustering according to the item scoring records of each user and divides the user feature data into multiple categories, and the recommendation module then recommends items to the target user on an item basis within the user feature data of each category. This achieves an efficient recommendation method under big data and ensures the stability of the system and the diversity of recommendations.

The recommendation module is configured to distribute the user feature data of the categories to multiple computing nodes, where each computing node stores the user feature data of at most R-1 categories and recommends items to the target user on an item basis within the user feature data of each category it stores. No computing node needs to store the user feature data of all categories, which avoids the problem of insufficient memory.
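One possible way to distribute the categories so that no computing node stores all of them is sketched below. The round-robin assignment is an assumption made for illustration; the text above only requires that each node stores at most R-1 categories. The function name assign_categories is hypothetical.

```python
def assign_categories(num_categories, num_nodes):
    """Round-robin assignment of category ids to computing node ids (assumed policy)."""
    if num_categories < 2 or num_nodes < 2:
        raise ValueError("expects R > 1 categories and at least two computing nodes")
    assignment = {node: [] for node in range(num_nodes)}
    for category in range(num_categories):
        assignment[category % num_nodes].append(category)
    return assignment


# R = 5 categories spread over 2 computing nodes.
assignment = assign_categories(num_categories=5, num_nodes=2)
largest_share = max(len(categories) for categories in assignment.values())
```

With at least two nodes, round-robin gives each node at most ceil(R / 2) categories, which never exceeds R-1 for R greater than 1, so the memory bound described above holds.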

As an optional implementation, the apparatus further includes a user scale judgment module, configured to judge, before the clustering module performs clustering, whether the number of users is greater than a user scale threshold.

The recommendation module is configured to, if the user scale judgment module confirms that the number of users is less than the user scale threshold, directly recommend items to the target user on an item basis within all the user feature data.

The clustering module is configured to, if the user scale judgment module confirms that the number of users is greater than the user scale threshold, perform clustering according to the item scoring records of each user and divide the user feature data into R categories, where R is an integer greater than 1.

In addition, it can be appreciated that in other embodiments of the present invention, the users may be clustered directly without judging the number of users.

The first embodiment is the method embodiment corresponding to this embodiment, and this embodiment may be implemented in cooperation with the first embodiment. The relevant technical details mentioned in the first embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here. Correspondingly, the relevant technical details mentioned in this embodiment are also applicable to the first embodiment.

The fourth embodiment of the present invention relates to a recommendation apparatus based on a computer system. Figure 16 is a schematic structural diagram of the recommendation module in this recommendation apparatus.

The fourth embodiment mainly makes the following two improvements on the basis of the third embodiment:

The recommendation module uses an item-based collaborative filtering algorithm to recommend items to the target user. As shown in Figure 16, the recommendation module includes:

An item similarity submodule, configured to compute, according to the item scoring records of each user in a category, the similarity between all items in the category, and to select for each item the M items with the highest similarity, where M is a predefined integer.

A user recommendation submodule, configured to select for the target user, according to the item scoring records of the target user in the category, the T items with the highest scores, where T is a predefined integer; and

An initial recommendation submodule, configured to combine the T items selected for the target user by the user recommendation submodule with the M items selected by the item similarity submodule for each of the T items, and to remove from the combination the items in the target user's item list, forming an initial recommendation result.

Preferably, the recommendation module further includes:

An initial recommendation judgment submodule, configured to judge whether the number of items in the initial recommendation result formed by the initial recommendation submodule is greater than N, where N is a predefined integer.

An initial recommendation screening submodule, configured to, if the initial recommendation judgment submodule confirms that the number of items in the initial recommendation result is greater than N, select from the initial recommendation result the N items with the highest similarity and recommend them to the target user; and

A user data scale restoration submodule, configured to, if the initial recommendation judgment submodule confirms that the number of items in the initial recommendation result is less than N, combine all items in the target user's item list with the M items selected for each item in the target user's item list, and to remove from the combination all items in the target user's item list, forming a user data completion recommendation result.

More preferably, the recommendation module further includes:

A completion recommendation judgment submodule, configured to judge whether the number of items in the user data completion recommendation result formed by the user data scale restoration submodule is greater than N, where N is a predefined integer.

A completion recommendation screening submodule, configured to, if the completion recommendation judgment submodule confirms that the number of items in the user data completion recommendation result is greater than N, select from the user data completion recommendation result the N items with the highest similarity and recommend them to the target user; and

An item data scale restoration submodule, configured to, if the completion recommendation judgment submodule confirms that the number of items in the user data completion recommendation result is less than N, combine all items in the target user's item list with all items that have a similarity relation with each item in the target user's item list, and to remove from the combination all items in the target user's item list, forming an item data completion recommendation result.

The apparatus further includes:

A recommendation result data sparsity degree judgment module, configured to judge whether the data sparsity degree is greater than a data sparsity degree threshold, where the data sparsity degree is defined by a formula (not reproduced in this text) in which k is the number of item pairs computed to have a similarity relation in the category and l is the number of items in the category; and

A data sparsity completion module, configured to, if the recommendation result data sparsity degree judgment module confirms that the data sparsity degree is less than the data sparsity degree threshold, take a first item, a second item, and a third item as a group, where a similarity relation exists between the first item and the second item and between the second item and the third item, and establish a similarity relation between the first item and the third item through the second item.

The recommendation module recommends items to the target user again on an item basis within the category according to the inter-item similarity relations supplemented by the data sparsity completion module; and if the recommendation result data sparsity degree judgment module confirms that the data sparsity degree is greater than the data sparsity degree threshold, the recommended items computed on an item basis are recommended to the target user.

The above improvements, when combined, form a preferred embodiment of the present invention, but each improvement may also be used separately.

The second embodiment is the method embodiment corresponding to this embodiment, and this embodiment may be implemented in cooperation with the second embodiment. The relevant technical details mentioned in the second embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here. Correspondingly, the relevant technical details mentioned in this embodiment are also applicable to the second embodiment.

In summary, because the application scenario we face involves numbers of users and items that are both on the order of hundreds of millions, traditional algorithms cannot meet our needs. In the above recommendation method and apparatus based on a computer system, the two techniques of clustering and reconstructing the CF algorithm solve this problem. After the improvement, with 600 reducers, recommendation over a data volume on the order of hundreds of millions can be completed within 90 minutes. In addition, an evaluation metric for data sparsity is defined: after the item-item similarity computation finishes, if the result is below a threshold of the evaluation metric, higher-order relations between items are computed to complete the similarities and improve recommendation accuracy.

It should be noted that the modules mentioned in the apparatus embodiments of the present invention are all logical modules. Physically, a logical module may be a physical module, a part of a physical module, or a combination of multiple physical modules; the physical implementation of these logical modules is not what matters most, and the combination of functions they implement is the key to solving the technical problem posed by the present invention. In addition, to highlight the innovative part of the present invention, the above apparatus embodiments do not introduce modules that are not closely related to solving the technical problem posed by the present invention, which does not mean that no other modules exist in the above apparatus embodiments.

It should be noted that, in the claims and description of this patent, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relation or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the statement "including a" does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.

Although the present invention has been illustrated and described with reference to certain preferred embodiments thereof, those of ordinary skill in the art should understand that various changes in form and detail may be made without departing from the spirit and scope of the present invention.

Claims

1. A recommendation method based on a computer system, characterized in that the method comprises the following steps:

obtaining each user's item scoring records for the items;

performing clustering according to the item scoring records of each user, and dividing the user feature data into R categories, where R is an integer greater than 1;

within the user feature data of each said category, recommending items to a target user on an item basis.

2. The recommendation method based on a computer system according to claim 1, characterized in that the computer system comprises at least two computing nodes;

in the step of "within the user feature data of each said category, recommending items to a target user on an item basis", the user feature data of the categories are distributed to multiple computing nodes, each computing node stores the user feature data of at most R-1 said categories, and each computing node recommends items to the target user on an item basis within the user feature data of each said category that it stores.

3. The recommendation method based on a computer system according to claim 1, characterized in that, in the step of "within the user feature data of each said category, recommending items to a target user on an item basis", an item-based collaborative filtering algorithm is used to recommend items to the target user;

the step of "within the user feature data of each said category, recommending items to a target user on an item basis" comprises the following sub-steps:

computing, according to the item scoring records of each user in the category, the similarity between all items in the category, and selecting for each item the M items with the highest similarity, where M is a predefined integer;

selecting for the target user, according to the item scoring records of the target user in the category, the T items with the highest scores, where T is a predefined integer;

combining the T items selected for the target user with the M items selected for each of the T items, and removing from the combination the items in the target user's item list, forming an initial recommendation result.

4. The recommendation method based on a computer system according to claim 3, characterized in that the following sub-steps are further included after the sub-step of forming the initial recommendation result:

judging whether the number of items in the initial recommendation result is greater than N, where N is a predefined integer;

if the number of items in the initial recommendation result is greater than N, selecting from the initial recommendation result the N items with the highest similarity and recommending them to the target user;

if the number of items in the initial recommendation result is less than N, combining all items in the target user's item list with the M items selected for each item in the target user's item list, and removing from the combination all items in the target user's item list, forming a user data completion recommendation result.

5. The recommendation method based on a computer system according to claim 4, characterized in that the following sub-steps are further included after the sub-step of forming the user data completion recommendation result:

judging whether the number of items in the user data completion recommendation result is greater than N, where N is a predefined integer;

if the number of items in the user data completion recommendation result is greater than N, selecting from the user data completion recommendation result the N items with the highest similarity and recommending them to the target user;

if the number of items in the user data completion recommendation result is less than N, combining all items in the target user's item list with all items that have a similarity relation with each item in the target user's item list, and removing from the combination all items in the target user's item list, forming an item data completion recommendation result.

6. The recommendation method based on a computer system according to claim 1, characterized in that the following steps are further included after the step of "within the user feature data of each said category, recommending items to a target user on an item basis":

judging whether the data sparsity degree is greater than a data sparsity degree threshold, where the data sparsity degree is defined by a formula (not reproduced in this text) in which k is the number of item pairs computed to have a similarity relation in the category and l is the number of items in the category;

if the data sparsity degree is less than the data sparsity degree threshold, taking a first item, a second item, and a third item as a group, where a similarity relation exists between the first item and the second item and between the second item and the third item, establishing a similarity relation between the first item and the third item through the second item, and recommending items to the target user again on an item basis within the category according to the supplemented inter-item similarity relations;

if the data sparsity degree is greater than the data sparsity degree threshold, recommending to the target user the recommended items computed on an item basis.

7. The recommendation method based on a computer system according to any one of claims 1 to 6, characterized in that the following steps are further included before the step of "performing clustering according to the item scoring records of each user, and dividing the user feature data into R categories, where R is an integer greater than 1":

judging whether the number of users is greater than a user scale threshold;

if the number of users is less than the user scale threshold, directly recommending items to the target user on an item basis within all the user feature data;

if the number of users is greater than the user scale threshold, proceeding to the step of "performing clustering according to the item scoring records of each user, and dividing the user feature data into R categories, where R is an integer greater than 1".

8. A recommendation apparatus based on a computer system, characterized in that the apparatus comprises:

a clustering module, configured to perform clustering according to the item scoring records of each user obtained by the user-item initial relation calculation module, and to divide the user feature data into R categories, where R is an integer greater than 1; and

a recommendation module, configured to recommend items to a target user on an item basis within the user feature data of each said category produced by the clustering module.

9. The recommendation apparatus based on a computer system according to claim 8, characterized in that the computer system comprises at least two computing nodes;

the recommendation module is configured to distribute the user feature data of the categories to multiple computing nodes, where each computing node stores the user feature data of at most R-1 said categories and recommends items to the target user on an item basis within the user feature data of each said category that it stores.

10. The recommendation apparatus based on a computer system according to claim 8, characterized in that the recommendation module uses an item-based collaborative filtering algorithm to recommend items to the target user;

the recommendation module comprises:

an item similarity submodule, configured to compute, according to the item scoring records of each user in the category, the similarity between all items in the category, and to select for each item the M items with the highest similarity, where M is a predefined integer;

a user recommendation submodule, configured to select for the target user, according to the item scoring records of the target user in the category, the T items with the highest scores, where T is a predefined integer; and

an initial recommendation submodule, configured to combine the T items selected for the target user by the user recommendation submodule with the M items selected by the item similarity submodule for each of the T items, and to remove from the combination the items in the target user's item list, forming an initial recommendation result.

11. The recommendation apparatus based on a computer system according to claim 10, characterized in that the recommendation module further comprises:

an initial recommendation judgment submodule, configured to judge whether the number of items in the initial recommendation result formed by the initial recommendation submodule is greater than N, where N is a predefined integer;

an initial recommendation screening submodule, configured to, if the initial recommendation judgment submodule confirms that the number of items in the initial recommendation result is greater than N, select from the initial recommendation result the N items with the highest similarity and recommend them to the target user; and

a user data scale restoration submodule, configured to, if the initial recommendation judgment submodule confirms that the number of items in the initial recommendation result is less than N, combine all items in the target user's item list with the M items selected for each item in the target user's item list, and to remove from the combination all items in the target user's item list, forming a user data completion recommendation result.

12. The recommendation apparatus based on a computer system according to claim 11, characterized in that the recommendation module further comprises:

a completion recommendation judgment submodule, configured to judge whether the number of items in the user data completion recommendation result formed by the user data scale restoration submodule is greater than N, where N is a predefined integer;

a completion recommendation screening submodule, configured to, if the completion recommendation judgment submodule confirms that the number of items in the user data completion recommendation result is greater than N, select from the user data completion recommendation result the N items with the highest similarity and recommend them to the target user; and

an item data scale restoration submodule, configured to, if the completion recommendation judgment submodule confirms that the number of items in the user data completion recommendation result is less than N, combine all items in the target user's item list with all items that have a similarity relation with each item in the target user's item list, and to remove from the combination all items in the target user's item list, forming an item data completion recommendation result.

13. The recommendation apparatus based on a computer system according to claim 8, characterized in that the apparatus further comprises:

a recommendation result data sparsity degree judgment module, configured to judge whether the data sparsity degree is greater than a data sparsity degree threshold, where the data sparsity degree is defined by a formula (not reproduced in this text) in which k is the number of item pairs computed to have a similarity relation in the category and l is the number of items in the category; and

a data sparsity completion module, configured to, if the recommendation result data sparsity degree judgment module confirms that the data sparsity degree is less than the data sparsity degree threshold, take a first item, a second item, and a third item as a group, where a similarity relation exists between the first item and the second item and between the second item and the third item, and establish a similarity relation between the first item and the third item through the second item;

the recommendation module recommends items to the target user again on an item basis within the category according to the inter-item similarity relations supplemented by the data sparsity completion module, and if the recommendation result data sparsity degree judgment module confirms that the data sparsity degree is greater than the data sparsity degree threshold, the recommended items computed on an item basis are recommended to the target user.

14. The recommendation apparatus based on a computer system according to any one of claims 8 to 13, characterized in that the apparatus further comprises a user scale judgment module, configured to judge, before the clustering module performs clustering, whether the number of users is greater than a user scale threshold;

the recommendation module is configured to, if the user scale judgment module confirms that the number of users is less than the user scale threshold, directly recommend items to the target user on an item basis within all the user feature data;

the clustering module is configured to, if the user scale judgment module confirms that the number of users is greater than the user scale threshold, perform clustering according to the item scoring records of each user and divide the user feature data into R categories, where R is an integer greater than 1.