CN107798083A

CN107798083A - Information recommendation method, system and device based on big data

Info

Publication number: CN107798083A
Application number: CN201710967315.7A
Authority: CN
Inventors: 陈贤耿; 孔祥明; 胡旭
Original assignee: Guangdong Industry Kaiyuan Science And Technology Co Ltd
Current assignee: Guangdong Industry Kaiyuan Science And Technology Co Ltd
Priority date: 2017-10-17
Filing date: 2017-10-17
Publication date: 2018-03-13

Abstract

The invention discloses an information recommendation method, system and device based on big data. Large-scale information documents are trained quickly and effectively by a parallel Latent Dirichlet Allocation (LDA) method, and optimal coefficients are obtained by Logistic regression, so that a weighted score can be calculated for each new information document. The topic distribution of the information documents is fully taken into account and combined with individual user behavior to provide the user with personalized recommendations. Compared with document-based similarity methods, the algorithm complexity is greatly reduced, execution efficiency is effectively improved, memory usage and model error are reduced, and accuracy is greatly improved. The invention can be widely applied in information recommendation.

Description

An information recommendation method, system and device based on big data

Technical field

The present invention relates to the technical field of big data, and more particularly to an information recommendation method, system and device based on big data.

Background art

With the popularization of the internet and the development of technology, various information publishing platforms have gradually come into public view, making it simpler for people to obtain information by more diverse means and bringing the media closer to people. At the same time, the large amount of information produced every day has led to an information explosion. Although people can easily obtain a great deal of information every day, they are easily overwhelmed, because picking out the information useful to oneself from a large and complicated volume of information is very difficult and costly. Filtering information with classical collaborative filtering is not very accurate for large amounts of data. Moreover, collecting user behavior data with conventional methods is usually costly and difficult to do in a timely manner.

Summary of the invention

In order to solve the above technical problems, the object of the present invention is to provide an information recommendation method, system and device based on big data that can collect data in a timely manner and achieve higher accuracy.

The technical solution adopted by the present invention is:

An information recommendation method based on big data, comprising the following steps:

collecting user behavior data and analyzing it to obtain information set data and user behavior analysis data;

preprocessing the information set data to obtain a corpus;

performing LDA modeling on the obtained corpus;

sampling the information set data by distributed Gibbs Sampling to obtain a training set, and thereby obtaining a topic distribution probability matrix;

calculating, according to the training set and the topic distribution probability matrix, a weighted score for each new information document by a Logistic regression algorithm;

recommending the n information documents with the highest weighted scores to the user, where n is a preset value.

As a further improvement of the information recommendation method based on big data, the step of collecting user behavior data and analyzing it to obtain information set data and user behavior analysis data specifically comprises:

collecting logs and classifying them to obtain user behavior logs;

collecting user behavior data according to the user behavior logs;

classifying and storing information documents;

grouping users with similar interests by a clustering method;

for the class of users to whom recommendations are to be made, labeling browsed information documents as 1 and unbrowsed information documents as 0 to obtain a browsed information set and an unbrowsed information set, i.e. the information set data;

obtaining the ID of each information document in the information sets and the user dwell time for each information document to obtain the user behavior analysis data.

As a further improvement of the information recommendation method based on big data, the step of preprocessing the information set data to obtain a corpus specifically comprises:

performing word segmentation on the information documents in the information set data and identifying out-of-vocabulary words to obtain the words in the information documents;

removing stop words from the obtained words according to a preset stop-word list to obtain the corpus.

As a further improvement of the information recommendation method based on big data, the step of performing LDA modeling on the obtained corpus specifically comprises:

performing LDA modeling according to the corpus to obtain an LDA model;

optimizing the parameters in the LDA model;

performing parameter estimation according to the established LDA model.

As a further improvement of the information recommendation method based on big data, the step of performing LDA modeling according to the corpus to obtain an LDA model is specifically embodied as:

wherein the topic distribution θ follows a Dirichlet distribution with hyperparameter α, the word distribution φ follows a Dirichlet distribution with hyperparameter β, the word w follows the topic distribution with parameter θ, and the topic index z follows a multinomial distribution with parameter φ.

As a further improvement of the information recommendation method based on big data, the specific calculation formula for the step of optimizing the parameters in the LDA model is:

wherein the updated values are the parameter α and the parameter β after optimization, α_k denotes the parameter α before optimization, β_t denotes the parameter β before optimization, the Digamma function is the derivative of the logarithm of the Gamma function with respect to the variable x, n_ik denotes the count for topic k in the i-th article, n_kt denotes the count of word t under topic k, n_i=∑_k n_ik, and n_k=∑_t n_kt.

As a further improvement of the information recommendation method based on big data, the specific calculation formula for the step of performing parameter estimation according to the established LDA model is:

wherein φ_{k,t} denotes the distribution probability of word t under topic k, θ_{m,k} denotes the distribution probability that the topic of the m-th document is k, n_k^{(t)} denotes the count of word t under topic k, n_m^{(k)} denotes the count for word t in the m-th document, α_t denotes the parameter α for word t, and β_t denotes the parameter β for word t.

As a further improvement of the information recommendation method based on big data, the weighted score is calculated as:

Score(i) = c₁*Topic1 + c₂*Topic2 + …… + c_k*TopicK;

wherein i denotes the i-th document, k denotes the k-th topic, TopicK denotes the distribution probability of the k-th topic, and [c₁, c₂, ..., c_k] denotes the optimal regression coefficient values for the respective topics obtained by the Logistic regression algorithm.

Another technical solution adopted by the present invention is:

An information recommendation system based on big data, comprising:

a collecting unit, configured to collect user behavior data and analyze it to obtain information set data and user behavior analysis data;

a preprocessing unit, configured to preprocess the information set data to obtain a corpus;

a modeling unit, configured to perform LDA modeling on the obtained corpus;

a distributed processing unit, configured to sample the information set data by distributed Gibbs Sampling to obtain a training set, and thereby obtain a topic distribution probability matrix;

a weight calculation unit, configured to calculate, according to the training set and the topic distribution probability matrix, a weighted score for each new information document by a Logistic regression algorithm;

a recommendation unit, configured to recommend the n information documents with the highest weighted scores to the user, where n is a preset value.

Another technical solution adopted by the present invention is:

An information recommendation device based on big data, comprising:

a memory for storing a program;

a processor for executing the program to:

preprocess the information set data to obtain a corpus;

perform LDA modeling on the obtained corpus;

The beneficial effects of the present invention are:

The information recommendation method, system and device based on big data of the present invention train large-scale information documents quickly and effectively by a parallel Latent Dirichlet Allocation (LDA) method and obtain optimal coefficients by Logistic regression, so that a weighted score can be calculated for each new information document. The topic distribution of the information documents is fully taken into account and combined with individual user behavior to provide the user with personalized recommendations. Compared with document-based similarity methods, the algorithm complexity is greatly reduced, execution efficiency is effectively improved, memory usage and model error are reduced, and accuracy is greatly improved.

Brief description of the drawings

Fig. 1 is a flow chart of the steps of the information recommendation method based on big data of the present invention;

Fig. 2 is a block diagram of the information recommendation system based on big data of the present invention.

Embodiment

The embodiments of the present invention are further described below with reference to the accompanying drawings:

With reference to Fig. 1, an information recommendation method based on big data of the present invention comprises the following steps:

preprocessing the information set data to obtain a corpus;

performing LDA modeling on the obtained corpus;

As a further preferred embodiment, the step of collecting user behavior data and analyzing it to obtain information set data and user behavior analysis data specifically comprises:

collecting user behavior data according to the user behavior logs;

classifying and storing information documents;

grouping users with similar interests by a clustering method;

In this embodiment, logs are collected using Flume and divided into two classes: user behavior logs, which record user behavior data including the user's browsing traces and range of activity, and routine system logs. For user behavior logs, a Flume component is started every 5 minutes to transfer the log information to Kafka and store it in HDFS. Routine system logs are imported into HDFS at 2:00 a.m. each day in a T+1 manner.

In this embodiment, user behavior data are collected from the user behavior logs through an API and front-end event tracking. The data include how long the user viewed an item (a view longer than 5 minutes is treated as invalid), whether the user gave a thumbs-up, whether the user commented, and whether the comment was positive or negative. The data are fed into Spark Streaming and continuously written to HDFS.

In this embodiment, HDFS in the Hadoop big data platform is used to store the massive volume of information documents by category, and external tables can be created based on Hive so that information documents can be found quickly and effectively among the massive collection. Using Hive table queries, the user behavior data analysis is computed in real time in a serial manner and the results are saved to HDFS, so that the next computation can be performed incrementally on the basis of the previous results.

In this embodiment, user behavior data are analyzed by K-means clustering, and users with similar interests are grouped into one class. Information documents browsed by a given class of users are labeled 1 (indicating that the user is interested in the information document). Among the articles immediately before and after an information document of interest, those the user did not click to browse are labeled 0 (indicating that the user is not interested in the information document). The ID of each information document and the user dwell time are also obtained.
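The clustering and labeling described above can be sketched as follows. This is a minimal, illustrative K-means in plain Python rather than the embodiment's implementation. The feature vectors (average dwell time and likes per day), the function names `kmeans` and `label_documents`, and the deterministic choice of the first k points as initial centroids are all assumptions made for the example.

```python
def kmeans(points, k, iterations=20):
    """Group feature vectors into k clusters; returns one cluster index per point."""
    # Assumption: the first k points serve as initial centroids (deterministic).
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for idx, p in enumerate(points):
            # Assign each user to the nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            assignment[idx] = nearest
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def label_documents(candidate_ids, browsed_ids):
    """Label browsed documents 1 and unbrowsed candidate documents 0."""
    browsed = set(browsed_ids)
    return {doc_id: (1 if doc_id in browsed else 0) for doc_id in candidate_ids}


# Hypothetical users described by (average dwell time in minutes, likes per day).
users = [(0.5, 0.0), (0.6, 1.0), (4.0, 9.0), (4.2, 10.0)]
groups = kmeans(users, k=2)
labels = label_documents(["d1", "d2", "d3"], browsed_ids=["d2"])
```

In practice the initial centroids would more likely be chosen at random or by a seeding heuristic; the fixed choice here only keeps the example reproducible.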

As a further preferred embodiment, the step of preprocessing the information set data to obtain a corpus specifically comprises:

In this embodiment of the present invention, word segmentation is performed with jieba, and out-of-vocabulary words are identified by a hidden Markov model. A custom dictionary assigns a certain weight to proper nouns and popular words to ensure that they are segmented accurately. Words with no practical meaning, such as prepositions, articles, modal particles, adverbs, conjunctions and punctuation, are automatically filtered out according to the stop-word list.
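The preprocessing step can be illustrated with a short sketch. It assumes a pluggable `tokenize` callable and defaults to whitespace splitting so that the example runs without third-party packages; for Chinese text a segmenter such as jieba's `lcut` could be passed instead. The function name `build_corpus`, the stop-word list and the sample sentence are hypothetical.

```python
def build_corpus(documents, stop_words, tokenize=str.split):
    """Tokenize each document, then drop stop words and pure punctuation."""
    corpus = []
    for text in documents:
        tokens = [
            tok
            for tok in tokenize(text)
            # Keep a token only if it is not a stop word and contains
            # at least one letter or digit (this removes bare punctuation).
            if tok not in stop_words and any(ch.isalnum() for ch in tok)
        ]
        corpus.append(tokens)
    return corpus


# Hypothetical stop-word list and pre-spaced sample text.
stop_words = {"the", "of", "a", "and"}
docs = ["the price of oil rose , and markets fell"]
corpus = build_corpus(docs, stop_words)
```

Passing the tokenizer as a parameter keeps the filtering logic independent of the segmentation library, which is a design choice of this sketch rather than something stated in the embodiment.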

As a further preferred embodiment, the step of performing LDA modeling on the obtained corpus specifically comprises:

performing LDA modeling according to the corpus to obtain an LDA model;

optimizing the parameters in the LDA model;

performing parameter estimation according to the established LDA model.

As a further preferred embodiment, the step of performing LDA modeling according to the corpus to obtain an LDA model is specifically embodied as:

As a further preferred embodiment, the specific calculation formula for the step of performing parameter estimation according to the established LDA model is:

In this embodiment, the joint probability density formula is obtained from the dependencies between the variables as follows:

To reduce calculation error, θ and φ are each integrated out, and the formula finally simplifies to

$$p(w, z \mid \alpha, \beta) = p(w \mid z, \beta)\, p(z \mid \alpha);$$

From this p(w, z) can be obtained. With Collapsed Gibbs Sampling, the topic of the current word is sampled cyclically within the set number of iterations until the topic distribution of the words converges. The specific formula is as follows:
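A single-machine collapsed Gibbs sampler for LDA can be sketched as below. The sketch assumes symmetric scalar hyperparameters and the widely used full conditional p(z = k | rest) ∝ (n_mk + α)(n_kt + β) / (n_k + Vβ); the embodiment's own sampling formula may differ in detail. The function name `gibbs_lda`, the default values and the integer word ids are assumptions for illustration.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids.

    Returns (n_mk, n_kt): document-topic and topic-word count matrices.
    """
    rng = random.Random(seed)
    n_mk = [[0] * num_topics for _ in docs]           # topic counts per document
    n_kt = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics                             # total words per topic
    z = []                                             # topic of every token

    # Random initial topic assignment.
    for m, doc in enumerate(docs):
        topics = []
        for t in doc:
            k = rng.randrange(num_topics)
            topics.append(k)
            n_mk[m][k] += 1
            n_kt[k][t] += 1
            n_k[k] += 1
        z.append(topics)

    for _ in range(iterations):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                # Remove the current token from the counts.
                k = z[m][i]
                n_mk[m][k] -= 1
                n_kt[k][t] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (n_mk[m][j] + alpha) * (n_kt[j][t] + beta)
                    / (n_k[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                # Draw a new topic and add the token back.
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[m][i] = k
                n_mk[m][k] += 1
                n_kt[k][t] += 1
                n_k[k] += 1
    return n_mk, n_kt
```

A fixed iteration count stands in for a convergence check here; a real run would monitor a quantity such as perplexity before stopping.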

Next, the topic distribution and the word distribution are each obtained by posterior estimation; both follow Dirichlet distributions. From the properties of the Dirichlet distribution, the word distribution probability and the topic distribution probability can be derived, i.e.:

wherein φ_{k,t} denotes the distribution probability of word t under topic k, θ_{m,k} denotes the distribution probability that the topic of the m-th document is k, n_k^{(t)} denotes the count of word t under topic k, and n_m^{(k)} denotes the count for word t in the m-th document.
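The estimates of θ and φ from the sampled counts can be sketched as below. The sketch assumes symmetric scalar hyperparameters and the usual Dirichlet posterior mean, in which each count is smoothed by its hyperparameter and divided by the smoothed row total. The function name `estimate_theta_phi` and the sample counts are hypothetical.

```python
def estimate_theta_phi(n_mk, n_kt, alpha, beta):
    """Posterior-mean estimates of the document-topic and topic-word distributions.

    n_mk[m][k]: number of tokens in document m assigned to topic k.
    n_kt[k][t]: number of times word t is assigned to topic k.
    alpha, beta: symmetric Dirichlet hyperparameters (an assumption of this sketch).
    """
    theta = [
        [(count + alpha) / (sum(row) + len(row) * alpha) for count in row]
        for row in n_mk
    ]
    phi = [
        [(count + beta) / (sum(row) + len(row) * beta) for count in row]
        for row in n_kt
    ]
    return theta, phi


# Hypothetical counts: one document, two topics, three vocabulary words.
theta, phi = estimate_theta_phi([[3, 1]], [[2, 1, 0], [0, 0, 1]], alpha=0.5, beta=0.1)
```

Each row of `theta` and `phi` sums to 1, so the rows can be used directly as probability distributions in the scoring step.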

As a further preferred embodiment, the specific calculation formula for the step of optimizing the parameters in the LDA model is:

In this embodiment, the change in model perplexity is calculated as k takes different values, and the number of topics with the lowest perplexity is taken as the optimal number of topics for fitting the model to the data. For a given corpus D, the perplexity is:

wherein w_m denotes the words of the m-th document and N_m denotes the length of the m-th document. As shown in the figure below, the perplexity is lowest when the number of topics K=40, so the optimal number of topics is set to 40.
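Perplexity can be computed from the estimated distributions as in the following sketch. It assumes the common definition exp(−Σ_m log p(w_m) / Σ_m N_m), with the probability of each word taken as Σ_k θ_{m,k} φ_{k,t}; the function name `perplexity` and the toy inputs are hypothetical.

```python
import math


def perplexity(docs, theta, phi):
    """Perplexity of a corpus under document-topic (theta) and topic-word (phi) estimates."""
    log_likelihood = 0.0
    token_count = 0
    for m, doc in enumerate(docs):
        for t in doc:
            # Probability of word t in document m, mixing over all topics.
            p = sum(theta[m][k] * phi[k][t] for k in range(len(phi)))
            log_likelihood += math.log(p)
            token_count += 1
    return math.exp(-log_likelihood / token_count)


# Toy check: with a uniform word distribution over 4 words, perplexity equals 4.
uniform_phi = [[0.25] * 4, [0.25] * 4]
value = perplexity([[0, 1, 2]], [[0.5, 0.5]], uniform_phi)
```

To select the number of topics, the model would be retrained for several candidate values and the value with the lowest perplexity kept.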

In this embodiment of the present invention, the parameters α and β are optimized as follows:

wherein the updated values are the parameter α and the parameter β after optimization, the Digamma function is the derivative of the logarithm of the Gamma function with respect to the variable x, n_ik denotes the count for topic k in the i-th article, n_kt denotes the count of word t under topic k, n_i=∑_k n_ik, and n_k=∑_t n_kt.
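One common way to optimize a Dirichlet hyperparameter from count data with the Digamma function is Minka's fixed-point iteration, sketched below for α. This is offered as an assumption about the kind of update intended, not as the embodiment's exact formula. The helper `digamma` uses a standard recurrence plus asymptotic series, and the names `digamma` and `update_alpha` and the sample counts are hypothetical.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence, then asymptotic expansion)."""
    result = 0.0
    while x < 6.0:
        # Shift the argument upward using digamma(x) = digamma(x + 1) - 1/x.
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))


def update_alpha(n_ik, alpha):
    """One fixed-point step for the per-topic hyperparameters alpha.

    n_ik[i][k]: count for topic k in document i; alpha: list of current values.
    """
    num_docs = len(n_ik)
    alpha_sum = sum(alpha)
    denominator = (
        sum(digamma(sum(row) + alpha_sum) for row in n_ik)
        - num_docs * digamma(alpha_sum)
    )
    new_alpha = []
    for k, a in enumerate(alpha):
        numerator = sum(digamma(row[k] + a) for row in n_ik) - num_docs * digamma(a)
        new_alpha.append(a * numerator / denominator)
    return new_alpha


# Hypothetical counts in which topic 0 dominates every document.
updated = update_alpha([[9, 1], [8, 2], [10, 0]], [0.5, 0.5])
```

The same step applies to β by substituting the topic-word counts n_kt for n_ik; repeating the step until the values stop changing gives the optimized hyperparameters.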

In this embodiment, when Gibbs Sampling is parallelized, the original data set is divided by a conflict-free data partitioning method into P*P parts (P is the configured degree of parallelism). The partitioned blocks are reordered and finally combined into P groups of data blocks, which are placed on the machines for execution, and each data set is then sampled. Execution is parallel within a group and serial between groups. A commonly used strategy is the diagonal method: because blocks in the same row or the same column cannot be selected at the same time, the blocks on a diagonal are selected for calculation. After one iteration has been executed in parallel within a group, the group's statistics, such as the document and word counts, are synchronized to the next group. Sampling within each block of a group uses the same method as single-machine Gibbs Sampling, and the results are then merged.

To reduce the volume of transmitted data, blocks with the same row number among the partitioned data are placed on the same computer node, and the data of the vocabulary V are divided as evenly as possible. Gibbs Sampling is then performed on each computer node, and the results are finally merged. This completes the parallel sampling.
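The diagonal grouping can be sketched as below. The sketch assumes that block (r, c) pairs the r-th document partition with the c-th vocabulary partition, and that group g contains the blocks (r, (r + g) mod P), so that no two blocks in a group share a row or a column. The function name `diagonal_schedule` is hypothetical.

```python
def diagonal_schedule(p):
    """Split a P x P grid of data blocks into P conflict-free groups.

    Blocks within one group can be sampled in parallel because they touch
    disjoint rows and disjoint columns; the groups run one after another.
    """
    return [[(row, (row + group) % p) for row in range(p)] for group in range(p)]


schedule = diagonal_schedule(3)
# schedule[0] is the main diagonal; later groups are its cyclic shifts.
```

Because the row index of every block in a group is its position in the list, keeping row r on worker r means only vocabulary-side counts need to move between groups, which matches the stated aim of reducing transmission.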

As a further preferred embodiment, the weighted score is calculated as:

Score(i) = c₁*Topic1 + c₂*Topic2 + …… + c_k*TopicK;

In this embodiment, when the Logistic regression algorithm is used, the output is restricted to between 0 and 1, i.e. 0 ≤ h_u(x) ≤ 1. Linear regression cannot guarantee this, so a function g is introduced and the hypothesis of logistic regression is expressed as h_u(x) = g(u^T x), where g is called the sigmoid function or logistic function and is expressed as:

$$g(z) = \frac{1}{1 + \exp(-z)}, \qquad h_u(x) = g(u^{T}x) = \frac{1}{1 + \exp(-u^{T}x)},$$ where u is the parameter.

The parameter u is optimized by minimizing the log-likelihood loss function (cost function) of the logistic regression.

The minimum of the loss function is found by gradient descent to obtain the optimal value; the parameter update at the n-th iteration is as follows:

until the parameter u converges. The regression coefficient values finally obtained are the optimal solution that minimizes the loss function. Here u_j denotes the j-th parameter, xⁱ denotes the i-th component, and yⁱ denotes the estimated value of the i-th variable.
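The sigmoid hypothesis and gradient descent training can be sketched as follows. The sketch assumes the standard batch update u_j ← u_j − η · (1/m) Σ_i (h_u(xⁱ) − yⁱ) x_jⁱ for the log-likelihood loss; the learning rate, the iteration count, the function names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(u, x):
    """h_u(x) = g(u^T x)."""
    return sigmoid(sum(uj * xj for uj, xj in zip(u, x)))


def train_logistic(samples, targets, learning_rate=0.5, iterations=500):
    """Fit logistic regression coefficients by batch gradient descent."""
    u = [0.0] * len(samples[0])
    m = len(samples)
    for _ in range(iterations):
        predictions = [hypothesis(u, x) for x in samples]
        # Gradient of the average log-likelihood loss with respect to each u_j.
        gradient = [
            sum((p - y) * x[j] for p, y, x in zip(predictions, targets, samples)) / m
            for j in range(len(u))
        ]
        u = [uj - learning_rate * g for uj, g in zip(u, gradient)]
    return u


# Toy data: topic distributions of four documents and whether each was clicked.
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
y = [1, 1, 0, 0]
coefficients = train_logistic(X, y)
```

A fixed number of iterations replaces a convergence test in this sketch; checking that the change in u falls below a tolerance would be closer to the "until u converges" criterion in the text.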

In this embodiment, each topic in the topic distribution matrix of the training set documents is treated as an independent variable x, and whether the user clicked the information is treated as the dependent variable h(x). The logistic regression algorithm is then applied to obtain the optimal regression coefficient values [c₁, c₂, ..., c_k], which are combined with the topic distribution probability values of each new information document to calculate its score Score(i) = c₁*Topic1 + c₂*Topic2 + …… + c_k*TopicK, where i denotes the i-th document. Finally, the new information documents are ranked by score, and the n documents with the highest scores are taken as the recommendations for the user.
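The scoring and top-n selection can be sketched in a few lines. The document ids, topic probabilities and coefficient values below are made up for illustration, and the function name `recommend` is hypothetical.

```python
def recommend(topic_probabilities, coefficients, n):
    """Score each new document as sum(c_k * Topic_k) and return the top-n ids."""
    scores = {
        doc_id: sum(c * p for c, p in zip(coefficients, probs))
        for doc_id, probs in topic_probabilities.items()
    }
    # Sort document ids by score, highest first, and keep the first n.
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical new documents with two-topic distributions.
new_documents = {"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [0.1, 0.9]}
top_two = recommend(new_documents, coefficients=[2.0, -1.0], n=2)
```

With these values the scores are 1.7, 0.5 and −0.7 for documents a, b and c, so `top_two` is `["a", "b"]`.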

With reference to Fig. 2, an information recommendation system based on big data of the present invention comprises:

a modeling unit, configured to perform LDA modeling on the obtained corpus;

An information recommendation device based on big data of the present invention comprises:

a memory for storing a program;

a processor for executing the program to:

preprocess the information set data to obtain a corpus;

perform LDA modeling on the obtained corpus;

As can be seen from the above, the information recommendation method, system and device based on big data of the present invention train large-scale information documents quickly and effectively by a parallel Latent Dirichlet Allocation (LDA) method and obtain optimal coefficients by Logistic regression, so that a weighted score can be calculated for each new information document. The topic distribution of the information documents is fully taken into account and combined with individual user behavior to provide the user with personalized recommendations. Compared with document-based similarity methods, the algorithm complexity is greatly reduced, execution efficiency is effectively improved, memory usage and model error are reduced, and accuracy is greatly improved.

The above describes the preferred implementations of the present invention, but the invention is not limited to the described embodiments. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the invention, and such equivalent variations or substitutions are all included within the scope defined by the claims of this application.

Claims

1. An information recommendation method based on big data, characterized by comprising the following steps:

preprocessing the information set data to obtain a corpus;

performing LDA modeling on the obtained corpus;

calculating, according to the training set and the topic distribution probability matrix, a weighted score for each new information document by a Logistic regression algorithm;

2. The information recommendation method based on big data according to claim 1, characterized in that the step of collecting user behavior data and analyzing it to obtain information set data and user behavior analysis data specifically comprises:

collecting user behavior data according to the user behavior logs;

classifying and storing information documents;

grouping users with similar interests by a clustering method;

for the class of users to whom recommendations are to be made, labeling browsed information documents as 1 and unbrowsed information documents as 0 to obtain a browsed information set and an unbrowsed information set, i.e. the information set data;

obtaining the ID of each information document in the information sets and the user dwell time for each information document to obtain the user behavior analysis data.

3. The information recommendation method based on big data according to claim 1, characterized in that the step of preprocessing the information set data to obtain a corpus specifically comprises:

performing word segmentation on the information documents in the information set data and identifying out-of-vocabulary words to obtain the words in the information documents;

4. The information recommendation method based on big data according to claim 1, characterized in that the step of performing LDA modeling on the obtained corpus specifically comprises:

performing LDA modeling according to the corpus to obtain an LDA model;

optimizing the parameters in the LDA model;

performing parameter estimation according to the established LDA model.

5. The information recommendation method based on big data according to claim 4, characterized in that the step of performing LDA modeling according to the corpus to obtain an LDA model is specifically embodied as:

wherein the topic distribution θ follows a Dirichlet distribution with hyperparameter α, the word distribution φ follows a Dirichlet distribution with hyperparameter β, the word w follows the topic distribution with parameter θ, and the topic index z follows a multinomial distribution with parameter φ.

6. The information recommendation method based on big data according to claim 4, characterized in that the specific calculation formula for the step of optimizing the parameters in the LDA model is:

wherein the updated values are the parameter α and the parameter β after optimization, α_k denotes the parameter α before optimization, β_t denotes the parameter β before optimization, the Digamma function is the derivative of the logarithm of the Gamma function with respect to the variable x, n_ik denotes the count for topic k in the i-th article, n_kt denotes the count of word t under topic k, n_i=∑_k n_ik, and n_k=∑_t n_kt.

7. The information recommendation method based on big data according to claim 5, characterized in that the specific calculation formula for the step of performing parameter estimation according to the established LDA model is:

$$\theta_{m,k} = \frac{n_m^{(k)} + \alpha_t}{\sum_{k=1}^{K} n_m^{(k)} + \alpha_t};$$

wherein φ_{k,t} denotes the distribution probability of word t under topic k, θ_{m,k} denotes the distribution probability that the topic of the m-th document is k, n_k^{(t)} denotes the count of word t under topic k, n_m^{(k)} denotes the count for word t in the m-th document, α_t denotes the parameter α for word t, and β_t denotes the parameter β for word t.

8. The information recommendation method based on big data according to claim 1, characterized in that the weighted score is calculated as:

Score(i) = c₁*Topic1 + c₂*Topic2 + …… + c_k*TopicK;

9. An information recommendation system based on big data, characterized by comprising:

a collecting unit, configured to collect user behavior data and analyze it to obtain information set data and user behavior analysis data;

a modeling unit, configured to perform LDA modeling on the obtained corpus;

a distributed processing unit, configured to sample the information set data by distributed Gibbs Sampling to obtain a training set, and thereby obtain a topic distribution probability matrix;

a weight calculation unit, configured to calculate, according to the training set and the topic distribution probability matrix, a weighted score for each new information document by a Logistic regression algorithm;

a recommendation unit, configured to recommend the n information documents with the highest weighted scores to the user, where n is a preset value.

10. An information recommendation device based on big data, characterized by comprising:

a memory for storing a program;

a processor for executing the program to:

preprocess the information set data to obtain a corpus;

perform LDA modeling on the obtained corpus;