CN107798083A - Information recommendation method, system and device based on big data - Google Patents

Information recommendation method, system and device based on big data

Info

Publication number
CN107798083A
Authority
CN
China
Prior art keywords
information
data
theme
document
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710967315.7A
Other languages
Chinese (zh)
Inventor
陈贤耿
孔祥明
胡旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Industry Kaiyuan Science And Technology Co Ltd
Original Assignee
Guangdong Industry Kaiyuan Science And Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Industry Kaiyuan Science And Technology Co Ltd
Priority to CN201710967315.7A
Publication of CN107798083A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information recommendation method, system and device based on big data. Large-scale information documents are trained quickly and effectively by parallel latent Dirichlet allocation (LDA), and optimal coefficients are obtained by logistic regression in order to calculate the weight score of each new information document. The method takes full account of the topic distribution of the information documents and combines it with the user's individual behavior to provide personalized recommendations. Compared with document-based similarity methods, its algorithmic complexity is greatly reduced, execution efficiency is effectively improved, memory usage is reduced, model error is reduced, and accuracy is greatly improved. The invention can be widely applied to information recommendation.

Description

Information recommendation method, system and device based on big data
Technical field
The present invention relates to the technical field of big data, and in particular to an information recommendation method, system and device based on big data.
Background art
With the spread of the internet and the development of technology, a variety of information publishing platforms have come into view, making it simpler for people to obtain information in more diverse ways and narrowing the distance between the media and the public. At the same time, the large volume of information produced every day has brought the problem of information explosion. Although people can easily obtain a great deal of information every day, they are easily overwhelmed by it, because picking out the information that is useful to them from such a large and complex volume is difficult and costly. Filtering information with classical collaborative filtering is not very accurate on large volumes of data. In addition, collecting user behavior data with conventional methods is usually costly and hard to do in a timely manner.
Summary of the invention
In order to solve the above technical problems, the object of the invention is to provide an information recommendation method, system and device based on big data that can collect data in a timely manner and achieve higher accuracy.
The technical solution adopted by the present invention is:
An information recommendation method based on big data, comprising the following steps:
Collecting user behavior data and analyzing it to obtain information set data and user behavior analysis data;
Preprocessing the information set data to obtain a corpus;
Performing LDA modeling on the obtained corpus;
Sampling the information set data by distributed Gibbs sampling to obtain a training set, and from it a topic distribution probability matrix;
Calculating the weight score of each new information document by a logistic regression algorithm according to the training set and the topic distribution probability matrix;
Recommending the n information documents with the highest weight scores to the user, where n is a preset value.
As a further improvement of the information recommendation method based on big data, the step of collecting user behavior data and analyzing it to obtain information set data and user behavior analysis data specifically includes:
Collecting logs and classifying them to obtain user behavior logs;
Collecting user behavior data from the user behavior logs;
Classifying and storing the information documents;
Grouping users with similar interests by a clustering method;
For the group of users to whom recommendations are to be made, marking browsed information documents as 1 and unbrowsed information documents as 0 to obtain a browsed information set and an unbrowsed information set, i.e. the information set data;
Obtaining the ID of each information document in the information sets and the user dwell time for each information document, to obtain the user behavior analysis data.
As a further improvement of the information recommendation method based on big data, the step of preprocessing the information set data to obtain a corpus specifically includes:
Performing word segmentation on the information documents in the information set data and identifying out-of-vocabulary words, to obtain the words in the information documents;
Removing stop words from the obtained words according to a preset stop-word list, to obtain the corpus.
As a further improvement of the information recommendation method based on big data, the step of performing LDA modeling on the obtained corpus specifically includes:
Performing LDA modeling on the corpus to obtain an LDA model;
Optimizing the parameters of the LDA model;
Performing parameter estimation with the established LDA model.
As a further improvement of the information recommendation method based on big data, the step of performing LDA modeling on the corpus to obtain an LDA model is embodied as follows:
Here, the topic distribution θ follows a Dirichlet distribution with hyperparameter α, the word distribution φ follows a Dirichlet distribution with hyperparameter β, the topic index z follows a multinomial distribution with parameter θ, and the word w follows a multinomial distribution with parameter φ.
As a further improvement of the information recommendation method based on big data, the specific calculation formula for the step of optimizing the parameters of the LDA model is:
Here, the updated values are the parameters α and β after optimization, $\alpha_k$ and $\beta_t$ are the parameters α and β before optimization, the digamma function is the derivative of the logarithm of the gamma function at x, $n_{ik}$ is the count of topic k in the i-th document, $n_{kt}$ is the count of word t under topic index k, $n_i = \sum_k n_{ik}$, and $n_k = \sum_t n_{kt}$.
As a further improvement of the information recommendation method based on big data, the specific calculation formula for the step of performing parameter estimation with the established LDA model is:
Here, $\varphi_{k,t}$ is the distribution probability of word t under topic k, $\theta_{m,k}$ is the distribution probability of topic k in the m-th document, $n_k^{(t)}$ is the count of word t under topic k, $n_m^{(k)}$ is the count of topic k in the m-th document, $\alpha_t$ is the parameter α with respect to word t, and $\beta_t$ is the parameter β with respect to word t.
As a further improvement of the information recommendation method based on big data, the calculation formula of the weight score is:
Score(i) = c1*Topic1 + c2*Topic2 + ... + ck*TopicK;
Here, i denotes the i-th document, k denotes the k-th topic, TopicK denotes the distribution probability of the k-th topic, and [c1, c2, ..., ck] denotes the optimal regression coefficients for the respective topics, computed by the logistic regression algorithm.
Another technical solution of the present invention is:
An information recommendation system based on big data, comprising:
A collecting unit, configured to collect user behavior data and analyze it to obtain information set data and user behavior analysis data;
A preprocessing unit, configured to preprocess the information set data to obtain a corpus;
A modeling unit, configured to perform LDA modeling on the obtained corpus;
A distributed processing unit, configured to sample the information set data by distributed Gibbs sampling to obtain a training set, and from it a topic distribution probability matrix;
A weight calculation unit, configured to calculate the weight score of each new information document by a logistic regression algorithm according to the training set and the topic distribution probability matrix;
A recommendation unit, configured to recommend the n information documents with the highest weight scores to the user, where n is a preset value.
Another technical solution of the present invention is:
An information recommendation device based on big data, comprising:
A memory, configured to store a program;
A processor, configured to execute the program in order to:
Collect user behavior data and analyze it to obtain information set data and user behavior analysis data;
Preprocess the information set data to obtain a corpus;
Perform LDA modeling on the obtained corpus;
Sample the information set data by distributed Gibbs sampling to obtain a training set, and from it a topic distribution probability matrix;
Calculate the weight score of each new information document by a logistic regression algorithm according to the training set and the topic distribution probability matrix;
Recommend the n information documents with the highest weight scores to the user, where n is a preset value.
The beneficial effects of the invention are:
The information recommendation method, system and device based on big data of the present invention train large-scale information documents quickly and effectively by parallel latent Dirichlet allocation, and obtain optimal coefficients by logistic regression in order to calculate the weight score of each new information document. They take full account of the topic distribution of the information documents and combine it with the user's individual behavior to provide personalized recommendations. Compared with document-based similarity methods, algorithmic complexity is greatly reduced, execution efficiency is effectively improved, memory usage is reduced, model error is reduced, and accuracy is greatly improved.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the information recommendation method based on big data of the present invention;
Fig. 2 is a block diagram of the information recommendation system based on big data of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are further described below with reference to the drawings:
With reference to Fig. 1, the information recommendation method based on big data of the present invention comprises the following steps:
Collecting user behavior data and analyzing it to obtain information set data and user behavior analysis data;
Preprocessing the information set data to obtain a corpus;
Performing LDA modeling on the obtained corpus;
Sampling the information set data by distributed Gibbs sampling to obtain a training set, and from it a topic distribution probability matrix;
Calculating the weight score of each new information document by a logistic regression algorithm according to the training set and the topic distribution probability matrix;
Recommending the n information documents with the highest weight scores to the user, where n is a preset value.
As a further preferred embodiment, the step of collecting user behavior data and analyzing it to obtain information set data and user behavior analysis data specifically includes:
Collecting logs and classifying them to obtain user behavior logs;
Collecting user behavior data from the user behavior logs;
Classifying and storing the information documents;
Grouping users with similar interests by a clustering method;
For the group of users to whom recommendations are to be made, marking browsed information documents as 1 and unbrowsed information documents as 0 to obtain a browsed information set and an unbrowsed information set, i.e. the information set data;
Obtaining the ID of each information document in the information sets and the user dwell time for each information document, to obtain the user behavior analysis data.
In this embodiment, logs are collected with Flume and divided into two classes: user behavior logs, which record user behavior data including browsing traces and the user's scope of activity, and system log data. For user behavior logs, a Flume component is started every 5 minutes to transfer the log information to Kafka and store it in HDFS. System log data is imported into HDFS at 2:00 a.m. each day on a T+1 basis.
In this embodiment, user behavior data is collected from the user behavior logs through APIs and front-end event tracking. It includes how long the user viewed an information document (a view longer than 5 minutes is treated as invalid), whether the user liked it, whether the user commented, and whether the comment was positive or negative. The data is fed into Spark Streaming and continuously written to HDFS.
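The behavior fields listed above could be represented and filtered roughly as in the following minimal sketch. The record layout (`BehaviorEvent` and its field names) and the helper `valid_events` are hypothetical and used only for illustration; the one detail taken from the text is the five-minute cut-off for a valid view.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_DWELL_SECONDS = 5 * 60  # views longer than 5 minutes are treated as invalid


@dataclass
class BehaviorEvent:
    """One user-behavior record; the field names are illustrative assumptions."""
    user_id: str
    doc_id: str
    dwell_seconds: float
    liked: bool
    comment_sentiment: Optional[str]  # "positive", "negative", or None if no comment


def valid_events(events: List[BehaviorEvent]) -> List[BehaviorEvent]:
    """Keep only events whose dwell time does not exceed the 5-minute limit."""
    return [e for e in events if 0 <= e.dwell_seconds <= MAX_DWELL_SECONDS]


events = [
    BehaviorEvent("u1", "d1", 42.0, True, "positive"),
    BehaviorEvent("u1", "d2", 900.0, False, None),  # 15 minutes: discarded
    BehaviorEvent("u2", "d1", 300.0, False, "negative"),
]
kept = valid_events(events)
print([(e.user_id, e.doc_id) for e in kept])
```

In a deployment like the one described, such a filter would presumably run inside the streaming job before the records are written to HDFS; that placement is an assumption.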
In this embodiment, HDFS on the Hadoop big data platform is used for classified storage of the massive volume of information documents, and external tables can be built on Hive so that information documents can be found quickly and efficiently. Using Hive table queries, the user behavior data analysis is computed in real time in a serial fashion and the results are saved to HDFS, so that the next computation can be done incrementally on the basis of the previous results.
In this embodiment, user behavior data is analyzed with the K-means clustering algorithm, and users with similar interests are placed in the same class. Information documents browsed by users of a given class are marked 1 (indicating that the user is interested in the information document). Among the articles immediately before and after an information document of interest, those that the user did not click to browse are marked 0 (indicating that the user is not interested in the information document). The ID of each information document and the user dwell time are also obtained.
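The clustering and labeling step can be sketched as follows. This is a plain-Python K-means under stated assumptions, not the patent's implementation: the per-user feature vectors, the choice of the first k points as initial centroids, and the helper names `kmeans` and `label_documents` are all hypothetical.

```python
from typing import Dict, List, Sequence, Set


def kmeans(points: Sequence[Sequence[float]], k: int, iters: int = 20) -> List[int]:
    """Plain k-means; centroids start at the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def label_documents(browsed: Set[str], shown: Sequence[str]) -> Dict[str, int]:
    """Mark each shown document 1 if the user group browsed it, else 0."""
    return {doc: int(doc in browsed) for doc in shown}


# Hypothetical per-user interest vectors (e.g. share of reads per category).
users = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
groups = kmeans(users, k=2)
print(groups)
print(label_documents({"d1", "d3"}, ["d1", "d2", "d3", "d4"]))
```

The 1/0 labels produced this way later serve as the dependent variable for the logistic regression step.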
As a further preferred embodiment, the step of preprocessing the information set data to obtain a corpus specifically includes:
Performing word segmentation on the information documents in the information set data and identifying out-of-vocabulary words, to obtain the words in the information documents;
Removing stop words from the obtained words according to a preset stop-word list, to obtain the corpus.
In this embodiment, word segmentation uses the Jieba segmenter, and out-of-vocabulary words are identified with a hidden Markov model. A custom dictionary assigns weights to proper nouns and popular words to ensure that they are segmented accurately. Words with no practical meaning, such as prepositions, articles, modal particles, adverbs, conjunctions and punctuation, are filtered out automatically according to the stop-word list.
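The stop-word filtering part of this step can be illustrated with a short sketch. It assumes the documents have already been segmented into tokens (the segmentation itself, which the text attributes to Jieba with a hidden Markov model, is not reproduced here). The stop-word list and the function names are hypothetical.

```python
from typing import Iterable, List, Set

# Illustrative stop-word list; a real list would be loaded from a file.
STOP_WORDS: Set[str] = {"的", "了", "和", "在", "，", "。"}


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop stop words and whitespace-only tokens, keeping the original order."""
    return [t for t in tokens if t.strip() and t not in stop_words]


def build_corpus(segmented_docs: Iterable[Iterable[str]]) -> List[List[str]]:
    """Turn already-segmented documents into a corpus of filtered token lists."""
    return [remove_stop_words(doc) for doc in segmented_docs]


docs = [["大数据", "的", "推荐", "系统", "。"], ["用户", "和", "资讯", "，", "主题"]]
print(build_corpus(docs))
```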
As a further preferred embodiment, the step of performing LDA modeling on the obtained corpus specifically includes:
Performing LDA modeling on the corpus to obtain an LDA model;
Optimizing the parameters of the LDA model;
Performing parameter estimation with the established LDA model.
As a further preferred embodiment, the step of performing LDA modeling on the corpus to obtain an LDA model is embodied as follows:
Here, the topic distribution θ follows a Dirichlet distribution with hyperparameter α, the word distribution φ follows a Dirichlet distribution with hyperparameter β, the topic index z follows a multinomial distribution with parameter θ, and the word w follows a multinomial distribution with parameter φ.
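For reference, the standard LDA generative process to which these distributional assumptions correspond can be written as below. In the standard formulation the topic index is drawn from θ and the word is drawn from the word distribution of the chosen topic. The indices m (document), n (word position) and k (topic) and the symbol Mult are notational assumptions, not taken from the original.

```latex
\begin{aligned}
\theta_m &\sim \mathrm{Dirichlet}(\alpha), \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta), \\
z_{m,n} \mid \theta_m &\sim \mathrm{Mult}(\theta_m), \\
w_{m,n} \mid z_{m,n}, \varphi &\sim \mathrm{Mult}(\varphi_{z_{m,n}}).
\end{aligned}
```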
As a further preferred embodiment, the specific calculation formula for the step of performing parameter estimation with the established LDA model is:
Here, $\varphi_{k,t}$ is the distribution probability of word t under topic k, $\theta_{m,k}$ is the distribution probability of topic k in the m-th document, $n_k^{(t)}$ is the count of word t under topic k, $n_m^{(k)}$ is the count of topic k in the m-th document, $\alpha_t$ is the parameter α with respect to word t, and $\beta_t$ is the parameter β with respect to word t.
In this embodiment, the joint probability density formula follows from the dependencies between the variables:
To reduce computational error, θ and φ are each integrated out, and the formula simplifies to:
p(w, z | α, β) = p(w | z, β) p(z | α);
From this, p(w, z) can be obtained. Using collapsed Gibbs sampling, the topic of the current word is sampled repeatedly within a set number of iterations until the topic distribution of the words converges. The specific formula is as follows:
Next, posterior probability estimation is used to obtain the topic distribution and the word distribution, both of which follow Dirichlet distributions. The word distribution probability and the topic distribution probability can then be derived from the properties of the Dirichlet distribution, i.e.:
Here, $\varphi_{k,t}$ is the distribution probability of word t under topic k, $\theta_{m,k}$ is the distribution probability of topic k in the m-th document, $n_k^{(t)}$ is the count of word t under topic k, and $n_m^{(k)}$ is the count of topic k in the m-th document.
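The sampling and estimation procedure described above can be sketched with a single-machine collapsed Gibbs sampler. This is a minimal illustration under stated assumptions, not the patent's code: it uses symmetric scalar hyperparameters `alpha` and `beta`, a fixed number of iterations instead of a convergence test, and the commonly used conditional in which the probability of assigning topic k to a token is proportional to (count of the word under k + beta) / (total words under k + V * beta) times (count of k in the document + alpha). The function name `lda_gibbs` is hypothetical.

```python
import random
from typing import List, Tuple


def lda_gibbs(docs: List[List[int]], K: int, V: int, alpha: float = 0.1,
              beta: float = 0.01, iters: int = 200, seed: int = 0
              ) -> Tuple[List[List[float]], List[List[float]]]:
    """Collapsed Gibbs sampling for LDA; returns (theta, phi)."""
    rng = random.Random(seed)
    n_mk = [[0] * K for _ in docs]      # topic counts per document
    n_kt = [[0] * V for _ in range(K)]  # word counts per topic
    n_k = [0] * K                       # total word count per topic

    def bump(m: int, k: int, t: int, delta: int) -> None:
        """Add or remove one token (document m, topic k, word t) from the counts."""
        n_mk[m][k] += delta
        n_kt[k][t] += delta
        n_k[k] += delta

    # Random initial topic assignment for every token.
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    for m, doc in enumerate(docs):
        for i, t in enumerate(doc):
            bump(m, z[m][i], t, +1)

    for _ in range(iters):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                bump(m, z[m][i], t, -1)  # exclude the current token
                # Conditional p(z_i = k | z_-i, w), up to a constant factor.
                weights = [(n_kt[k][t] + beta) / (n_k[k] + V * beta)
                           * (n_mk[m][k] + alpha) for k in range(K)]
                z[m][i] = rng.choices(range(K), weights=weights)[0]
                bump(m, z[m][i], t, +1)

    # Posterior mean estimates of the topic and word distributions.
    theta = [[(n_mk[m][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for m, doc in enumerate(docs)]
    phi = [[(n_kt[k][t] + beta) / (n_k[k] + V * beta) for t in range(V)]
           for k in range(K)]
    return theta, phi


# Documents are lists of word IDs in range(V); the data is made up.
theta, phi = lda_gibbs([[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 0, 1]], K=2, V=4, iters=50)
print(theta)
```

Each row of `theta` is one document's topic distribution, so stacking the rows gives a topic distribution probability matrix of the kind used in the later scoring step.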
As a further preferred embodiment, the specific calculation formula for the step of optimizing the parameters of the LDA model is:
Here, the updated values are the parameters α and β after optimization, $\alpha_k$ and $\beta_t$ are the parameters α and β before optimization, the digamma function is the derivative of the logarithm of the gamma function at x, $n_{ik}$ is the count of topic k in the i-th document, $n_{kt}$ is the count of word t under topic index k, $n_i = \sum_k n_{ik}$, and $n_k = \sum_t n_{kt}$.
In this embodiment, the model perplexity is computed for different values of the topic number K, and the topic number with the lowest perplexity is taken as the optimal number of topics for fitting the model to the data. For a given corpus D, the perplexity is:
Here, $w_m$ denotes the words of the m-th document and $N_m$ denotes the length of the m-th document. The perplexity is lowest when the topic number K = 40, so the optimal number of topics is set to 40.
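A commonly used definition of corpus perplexity that matches the symbols $w_m$ and $N_m$ is given below as a reference form. That the patent uses exactly this expression is an assumption, and M (the number of documents in D) is introduced here for illustration.

```latex
\mathrm{perplexity}(D) = \exp\left( - \frac{\sum_{m=1}^{M} \log p(w_m)}{\sum_{m=1}^{M} N_m} \right)
```

Lower values indicate that the model assigns higher probability to the observed words, which is why the topic number with the smallest perplexity is selected.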
In this embodiment, the parameters α and β are optimized as follows:
Here, the updated values are the parameters α and β after optimization, the digamma function is the derivative of the logarithm of the gamma function at x, $n_{ik}$ is the count of topic k in the i-th document, $n_{kt}$ is the count of word t under topic index k, $n_i = \sum_k n_{ik}$, and $n_k = \sum_t n_{kt}$.
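One standard update that fits these definitions is the fixed-point iteration for Dirichlet hyperparameters, written here with the counts $n_{ik}$, $n_{kt}$, $n_i$ and $n_k$ from the text. It is offered as a plausible reference form, not as the patent's confirmed formula. The hat notation for the updated values, the symbol $\Psi$ for the digamma function, M for the number of documents and K for the number of topics are assumptions.

```latex
\begin{aligned}
\hat{\alpha}_k &= \alpha_k \,
  \frac{\sum_{i=1}^{M} \Psi(n_{ik} + \alpha_k) - M\,\Psi(\alpha_k)}
       {\sum_{i=1}^{M} \Psi\!\left(n_i + \sum_{k} \alpha_k\right) - M\,\Psi\!\left(\sum_{k} \alpha_k\right)}, \\
\hat{\beta}_t &= \beta_t \,
  \frac{\sum_{k=1}^{K} \Psi(n_{kt} + \beta_t) - K\,\Psi(\beta_t)}
       {\sum_{k=1}^{K} \Psi\!\left(n_k + \sum_{t} \beta_t\right) - K\,\Psi\!\left(\sum_{t} \beta_t\right)}, \\
\Psi(x) &= \frac{d}{dx} \log \Gamma(x).
\end{aligned}
```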
In this embodiment, for parallel Gibbs sampling, the original data set is divided by a conflict-free data partitioning method into P*P parts (P is the configured degree of parallelism). The resulting blocks are reordered and finally combined into P groups of data blocks, which are placed on the machines for execution, and each data set is then sampled. Execution is parallel within a group and serial between groups. The usual strategy is the diagonal method: because blocks in the same row or the same column cannot be selected at the same time, blocks on a diagonal are selected for computation. After one parallel iteration within a group, the group's statistics, such as the document and word counts, are synchronized to the next group. Sampling within each block of a group uses the same method as single-machine Gibbs sampling, and the results are then merged.
To reduce the volume of data transmitted, blocks with the same row number in the partitioned data are placed on the same compute node. The vocabulary V is divided as evenly as possible, again to reduce transmission volume. Gibbs sampling is then run on each compute node, and the results are finally merged. This completes the parallel sampling.
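The diagonal partitioning strategy can be illustrated with a small scheduling sketch. It only shows which blocks of a P*P grid could run together and how a vocabulary could be split evenly; it does not perform any sampling. The function names and the use of wrapped diagonals (column index `(p + r) % P`) are assumptions about how the description would be realized.

```python
from typing import List, Tuple


def diagonal_schedule(P: int) -> List[List[Tuple[int, int]]]:
    """For each of P rounds, list the (doc_block, word_block) pairs run in parallel.

    Round r takes the r-th wrapped diagonal, so no two blocks in a round share
    a row (document partition) or a column (vocabulary partition).
    """
    return [[(p, (p + r) % P) for p in range(P)] for r in range(P)]


def split_evenly(n_items: int, P: int) -> List[range]:
    """Split indices 0..n_items-1 into P contiguous, near-equal ranges."""
    base, extra = divmod(n_items, P)
    bounds = [0]
    for p in range(P):
        bounds.append(bounds[-1] + base + (1 if p < extra else 0))
    return [range(bounds[p], bounds[p + 1]) for p in range(P)]


for r, blocks in enumerate(diagonal_schedule(3)):
    print(f"round {r}: {blocks}")
print(split_evenly(10, 3))
```

In this layout worker p always holds document block p (the same row), which is consistent with keeping blocks of the same row number on one node; only the vocabulary partition it works on changes from round to round.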
As a further preferred embodiment, the calculation formula of the weight score is:
Score(i) = c1*Topic1 + c2*Topic2 + ... + ck*TopicK;
Here, i denotes the i-th document, k denotes the k-th topic, TopicK denotes the distribution probability of the k-th topic, and [c1, c2, ..., ck] denotes the optimal regression coefficients for the respective topics, computed by the logistic regression algorithm.
In this embodiment, when the logistic regression algorithm is used, the output is bounded between 0 and 1, i.e. $0 \le h_u(x) \le 1$. Linear regression cannot guarantee this, so a function g is introduced and the hypothesis of logistic regression is expressed as $h_u(x) = g(u^T x)$, where g is called the sigmoid function or logistic function and has the expression:
$g(z) = 1/(1+\exp(-z))$, $h_u(x) = g(u^T x) = 1/(1+\exp(-u^T x))$, where u is the parameter.
Optimizing the parameter u means minimizing the log-likelihood loss function (cost function) of logistic regression.
The minimum of the loss function is found by gradient descent to obtain the optimal value. The parameter update at the n-th iteration is as follows:
The iteration continues until the parameter u converges; the regression coefficients finally obtained are the optimal solution that minimizes the loss function. Here $u_j$ denotes the j-th parameter, $x_i$ denotes the i-th component, and $y_i$ denotes the estimated value of the i-th variable.
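The gradient descent procedure can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: it uses batch gradient descent on the mean log-loss with a fixed learning rate `lr` and a fixed number of iterations rather than a convergence test, it has no intercept term (matching the form of the score formula), and the training data is made up.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """g(z) = 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, iters: int = 2000) -> List[float]:
    """Batch gradient descent on the mean log-loss; returns the coefficients u."""
    n, d = len(X), len(X[0])
    u = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            # Prediction error h_u(x_i) - y_i for one training example.
            err = sigmoid(sum(uj * xj for uj, xj in zip(u, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j]
        # Step against the averaged gradient.
        u = [uj - lr * gj / n for uj, gj in zip(u, grad)]
    return u


# Hypothetical training data: rows are topic probabilities, labels are clicks.
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
y = [1, 1, 0, 0]
c = fit_logistic(X, y)
print(c)
```

On this toy data the first coefficient comes out positive and the second negative, reflecting that documents dominated by the first topic were clicked.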
In this embodiment, each topic in the topic distribution matrix of the training set documents is treated as an independent variable x, and whether the user clicked the information document is treated as the dependent variable h(x). The logistic regression algorithm then yields the optimal regression coefficients [c1, c2, ..., ck], which are combined with the topic distribution probabilities of each new information document to compute its score: Score(i) = c1*Topic1 + c2*Topic2 + ... + ck*TopicK, where i denotes the i-th document. Finally, the new information documents are ranked by score, and the n documents with the highest scores are recommended to the user.
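The scoring and top-n selection can be sketched directly from the Score(i) formula. The coefficient values, document IDs and topic probabilities below are made up for illustration, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def score(coeffs: Sequence[float], topic_probs: Sequence[float]) -> float:
    """Score(i) = c1*Topic1 + c2*Topic2 + ... + ck*TopicK."""
    return sum(c * p for c, p in zip(coeffs, topic_probs))


def recommend(coeffs: Sequence[float], new_docs: Dict[str, Sequence[float]],
              n: int) -> List[str]:
    """Return the IDs of the n new documents with the highest scores."""
    ranked = sorted(new_docs, key=lambda d: score(coeffs, new_docs[d]), reverse=True)
    return ranked[:n]


coeffs = [2.0, -1.0, 0.5]  # hypothetical regression coefficients
new_docs = {
    "a": [0.7, 0.2, 0.1],  # score 1.25
    "b": [0.1, 0.8, 0.1],  # score -0.55
    "c": [0.3, 0.1, 0.6],  # score 0.80
}
print(recommend(coeffs, new_docs, n=2))
```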
With reference to Fig. 2, the information recommendation system based on big data of the present invention comprises:
A collecting unit, configured to collect user behavior data and analyze it to obtain information set data and user behavior analysis data;
A preprocessing unit, configured to preprocess the information set data to obtain a corpus;
A modeling unit, configured to perform LDA modeling on the obtained corpus;
A distributed processing unit, configured to sample the information set data by distributed Gibbs sampling to obtain a training set, and from it a topic distribution probability matrix;
A weight calculation unit, configured to calculate the weight score of each new information document by a logistic regression algorithm according to the training set and the topic distribution probability matrix;
A recommendation unit, configured to recommend the n information documents with the highest weight scores to the user, where n is a preset value.
The information recommendation device based on big data of the present invention comprises:
A memory, configured to store a program;
A processor, configured to execute the program in order to:
Collect user behavior data and analyze it to obtain information set data and user behavior analysis data;
Preprocess the information set data to obtain a corpus;
Perform LDA modeling on the obtained corpus;
Sample the information set data by distributed Gibbs sampling to obtain a training set, and from it a topic distribution probability matrix;
Calculate the weight score of each new information document by a logistic regression algorithm according to the training set and the topic distribution probability matrix;
Recommend the n information documents with the highest weight scores to the user, where n is a preset value.
As can be seen from the above, the information recommendation method, system and device based on big data of the present invention train large-scale information documents quickly and effectively by parallel latent Dirichlet allocation, and obtain optimal coefficients by logistic regression in order to calculate the weight score of each new information document. They take full account of the topic distribution of the information documents and combine it with the user's individual behavior to provide personalized recommendations. Compared with document-based similarity methods, algorithmic complexity is greatly reduced, execution efficiency is effectively improved, memory usage is reduced, model error is reduced, and accuracy is greatly improved.
The above describes preferred implementations of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of this application.

Claims (10)

1. An information recommendation method based on big data, characterized in that it comprises the following steps:
Collecting user behavior data and analyzing it to obtain information set data and user behavior analysis data;
Preprocessing the information set data to obtain a corpus;
Performing LDA modeling on the obtained corpus;
Sampling the information set data by distributed Gibbs sampling to obtain a training set, and from it a topic distribution probability matrix;
Calculating the weight score of each new information document by a logistic regression algorithm according to the training set and the topic distribution probability matrix;
Recommending the n information documents with the highest weight scores to the user, where n is a preset value.
2. The information recommendation method based on big data according to claim 1, characterized in that the step of collecting user behavior data and analyzing it to obtain information set data and user behavior analysis data specifically includes:
Collecting logs and classifying them to obtain user behavior logs;
Collecting user behavior data from the user behavior logs;
Classifying and storing the information documents;
Grouping users with similar interests by a clustering method;
For the group of users to whom recommendations are to be made, marking browsed information documents as 1 and unbrowsed information documents as 0 to obtain a browsed information set and an unbrowsed information set, i.e. the information set data;
Obtaining the ID of each information document in the information sets and the user dwell time for each information document, to obtain the user behavior analysis data.
3. The information recommendation method based on big data according to claim 1, characterized in that the step of preprocessing the information set data to obtain a corpus specifically includes:
Performing word segmentation on the information documents in the information set data and identifying out-of-vocabulary words, to obtain the words in the information documents;
Removing stop words from the obtained words according to a preset stop-word list, to obtain the corpus.
4. The information recommendation method based on big data according to claim 1, characterized in that the step of performing LDA modeling on the obtained corpus specifically includes:
Performing LDA modeling on the corpus to obtain an LDA model;
Optimizing the parameters of the LDA model;
Performing parameter estimation with the established LDA model.
5. The information recommendation method based on big data according to claim 4, characterized in that the step of performing LDA modeling on the corpus to obtain an LDA model is embodied as follows:
Here, the topic distribution θ follows a Dirichlet distribution with hyperparameter α, the word distribution φ follows a Dirichlet distribution with hyperparameter β, the topic index z follows a multinomial distribution with parameter θ, and the word w follows a multinomial distribution with parameter φ.
6. The information recommendation method based on big data according to claim 4, characterized in that the specific calculation formula for the step of optimizing the parameters of the LDA model is:
Here, the updated values are the parameters α and β after optimization, $\alpha_k$ and $\beta_t$ are the parameters α and β before optimization, the digamma function is the derivative of the logarithm of the gamma function at x, $n_{ik}$ is the count of topic k in the i-th document, $n_{kt}$ is the count of word t under topic index k, $n_i = \sum_k n_{ik}$, and $n_k = \sum_t n_{kt}$.
7. The big-data-based information recommendation method according to claim 5, characterized in that the step of performing parameter estimation according to the established LDA model uses the following calculation formula:
$$\theta_{m,k} = \frac{n_m^{(k)} + \alpha_t}{\sum_{k=1}^{K} n_m^{(k)} + \alpha_t};$$
where φ_{k,t} denotes the distribution probability of word t under topic k, θ_{m,k} denotes the distribution probability that the topic of the m-th document is k, n_k^{(t)} denotes the count for word t under topic k, n_m^{(k)} denotes the count for word t in the m-th document, α_t denotes the parameter α for word t, and β_t denotes the parameter β for word t.
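As a rough illustration of this estimation step, the sketch below turns a count matrix into probabilities by adding the prior to each count and normalizing each row. It uses the textbook smoothed estimator, in which the prior is added to every term of the denominator sum; that choice, the function name `smoothed_distribution` and the example counts are assumptions rather than a transcription of the claimed formula.

```python
def smoothed_distribution(counts, prior):
    """Row-wise (count + prior) / sum(count + prior), so every row sums to 1."""
    result = []
    for row in counts:
        smoothed = [n + p for n, p in zip(row, prior)]
        total = sum(smoothed)
        result.append([s / total for s in smoothed])
    return result

# Invented document-topic counts: 2 documents, 2 topics.
doc_topic_counts = [[3, 1], [0, 4]]
alpha = [0.5, 0.5]
theta = smoothed_distribution(doc_topic_counts, alpha)

# The same helper gives the topic-word probabilities from topic-word counts and beta.
topic_word_counts = [[2, 0, 1], [0, 3, 1]]
beta = [0.1, 0.1, 0.1]
phi = smoothed_distribution(topic_word_counts, beta)
```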
8. The big-data-based information recommendation method according to claim 1, characterized in that the weighted score is calculated as:
Score(i) = c1*Topic1 + c2*Topic2 + … + cK*TopicK;
where i denotes the i-th document, k denotes the k-th topic, Topick denotes the distribution probability of the k-th topic, and [c1, c2, …, cK] denotes the optimal regression coefficients of the respective topics, computed by the Logistic regression algorithm.
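The score formula is a dot product between a document's topic probabilities and the regression coefficients. A minimal sketch, with hypothetical coefficient and probability values:

```python
def weighted_score(topic_probs, coefficients):
    """Score(i) = c1*Topic1 + c2*Topic2 + ... + cK*TopicK for one document."""
    if len(topic_probs) != len(coefficients):
        raise ValueError("need exactly one coefficient per topic")
    return sum(c * p for c, p in zip(coefficients, topic_probs))

coefficients = [1.2, -0.4, 0.3]    # hypothetical regression coefficients c1..c3
doc_topics = [0.5, 0.25, 0.25]     # hypothetical topic probabilities of one document
score = weighted_score(doc_topics, coefficients)
```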
9. A big-data-based information recommendation system, characterized by comprising:
A collecting unit, configured to collect user behavior data and analyze it, to obtain information collection data and user behavior analysis data;
A preprocessing unit, configured to preprocess the information collection data to obtain a corpus;
A modeling unit, configured to perform LDA modeling on the obtained corpus;
A distributed processing unit, configured to sample the information collection data by distributed Gibbs sampling to obtain a training set, and from it a topic distribution probability matrix;
A weight calculation unit, configured to calculate the weighted score of each new information document by the Logistic regression algorithm, according to the training set and the topic distribution probability matrix;
A recommendation unit, configured to recommend the n information documents with the highest weighted scores to the user, where n is a preset value.
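To make the sampling unit more concrete, here is a minimal single-process collapsed Gibbs sampler for LDA that returns a document-topic probability matrix. It is a sketch under stated assumptions: the claim calls for a distributed sampler, and how the work is partitioned across machines is not shown here. The function name `gibbs_lda`, the default hyperparameters and the toy documents (lists of word ids) are hypothetical.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; returns theta with one row per document."""
    rng = random.Random(seed)
    n_mk = [[0] * n_topics for _ in docs]                # topic counts per document
    n_kt = [[0] * vocab_size for _ in range(n_topics)]   # word counts per topic
    n_k = [0] * n_topics                                 # total words per topic
    z = []                                               # topic assignment per word
    for m, doc in enumerate(docs):                       # random initialization
        z_m = []
        for t in doc:
            k = rng.randrange(n_topics)
            z_m.append(k)
            n_mk[m][k] += 1
            n_kt[k][t] += 1
            n_k[k] += 1
        z.append(z_m)
    for _ in range(n_iters):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]                              # remove the current assignment
                n_mk[m][k] -= 1
                n_kt[k][t] -= 1
                n_k[k] -= 1
                weights = [
                    (n_mk[m][j] + alpha) * (n_kt[j][t] + beta) / (n_k[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[m][i] = k                              # record the new assignment
                n_mk[m][k] += 1
                n_kt[k][t] += 1
                n_k[k] += 1
    return [
        [(n + alpha) / (len(doc) + n_topics * alpha) for n in row]
        for row, doc in zip(n_mk, docs)
    ]

# Toy corpus of word ids over a vocabulary of size 4.
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 2, 3]]
theta = gibbs_lda(docs, n_topics=2, vocab_size=4)
```

Each row of `theta` is one document's topic distribution, which is the kind of matrix the weight calculation unit would consume.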
10. A big-data-based information recommendation apparatus, characterized by comprising:
A memory, configured to store a program;
A processor, configured to execute the program so as to:
Collect user behavior data and analyze it, to obtain information collection data and user behavior analysis data;
Preprocess the information collection data to obtain a corpus;
Perform LDA modeling on the obtained corpus;
Sample the information collection data by distributed Gibbs sampling to obtain a training set, and from it a topic distribution probability matrix;
Calculate the weighted score of each new information document by the Logistic regression algorithm, according to the training set and the topic distribution probability matrix;
Recommend the n information documents with the highest weighted scores to the user, where n is a preset value.
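The last two steps can be sketched end to end: fit logistic regression coefficients on topic vectors, then rank new documents by their weighted score and keep the top n. This is only an illustration. The labels (1 where the user engaged with a document, 0 otherwise) are an assumption about how the training set is built, the plain stochastic gradient loop stands in for whatever solver the applicant uses, and all names and numbers are hypothetical.

```python
import math

def train_logistic(features, labels, lr=0.5, epochs=200):
    """Fit one coefficient per topic by stochastic gradient ascent on the log-likelihood."""
    coef = [0.0] * len(features[0])
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = 1.0 / (1.0 + math.exp(-sum(c * v for c, v in zip(coef, x))))
            coef = [c + lr * (y - p) * v for c, v in zip(coef, x)]
    return coef

def recommend(new_docs, coef, n):
    """Return the ids of the n documents with the highest weighted score."""
    def score(topic_probs):
        return sum(c * v for c, v in zip(coef, topic_probs))
    ranked = sorted(new_docs.items(), key=lambda item: score(item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:n]]

# Hypothetical training set: topic vectors of documents the user did (1) or did not (0) engage with.
train_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_y = [1, 1, 0, 0]
coef = train_logistic(train_x, train_y)

# Hypothetical new documents keyed by id, each with its topic distribution.
new_docs = {"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.6, 0.4]}
top = recommend(new_docs, coef, n=2)
```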
CN201710967315.7A 2017-10-17 2017-10-17 A kind of information based on big data recommends method, system and device Pending CN107798083A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710967315.7A CN107798083A (en) 2017-10-17 2017-10-17 A kind of information based on big data recommends method, system and device

Publications (1)

Publication Number Publication Date
CN107798083A true CN107798083A (en) 2018-03-13

Family

ID=61534122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710967315.7A Pending CN107798083A (en) 2017-10-17 2017-10-17 A kind of information based on big data recommends method, system and device

Country Status (1)

Country Link
CN (1) CN107798083A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509793A (en) * 2018-04-08 2018-09-07 北京明朝万达科技股份有限公司 A kind of user's anomaly detection method and device based on User action log data
CN111309873A (en) * 2018-11-23 2020-06-19 北京嘀嘀无限科技发展有限公司 Data processing method and device, electronic equipment and storage medium
CN111309874A (en) * 2018-11-23 2020-06-19 北京嘀嘀无限科技发展有限公司 Data processing method and device, electronic equipment and storage medium
WO2021035955A1 (en) * 2019-08-29 2021-03-04 苏州朗动网络科技有限公司 Text news processing method and device and storage medium
CN115203578A (en) * 2022-09-16 2022-10-18 深圳云威网络科技有限公司 User behavior analysis system based on big data platform

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831234A (en) * 2012-08-31 2012-12-19 北京邮电大学 Personalized news recommendation device and method based on news content and theme feature
US20130339367A1 (en) * 2012-06-14 2013-12-19 Santhosh Adayikkoth Method and system for preferential accessing of one or more critical entities
CN105824911A (en) * 2016-03-15 2016-08-03 山东大学 Video recommending method based on LDA user theme model
CN106815369A (en) * 2017-01-24 2017-06-09 中山大学 A kind of file classification method based on Xgboost sorting algorithms

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
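Liang Jian (梁剑): "Personalized push based on LDA text topic mining and its implementation on the Spark platform" (基于LDA文本主题挖掘的个性化推送及其在Spark平台的实现), China Master's Theses Full-text Database, Information Science and Technology series *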

Similar Documents

Publication Publication Date Title
CN107908669A (en) A kind of big data news based on parallel LDA recommends method, system and device
CN107798083A (en) A kind of information based on big data recommends method, system and device
CN110059311B (en) Judicial text data-oriented keyword extraction method and system
CN112001185B (en) Emotion classification method combining Chinese syntax and graph convolution neural network
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN106844632B (en) Product comment emotion classification method and device based on improved support vector machine
CN105279495B (en) A kind of video presentation method summarized based on deep learning and text
CN108108449A (en) A kind of implementation method based on multi-source heterogeneous data question answering system and the system towards medical field
CN108108426B (en) Understanding method and device for natural language question and electronic equipment
CN110019770A (en) The method and apparatus of train classification models
CN110020189A (en) A kind of article recommended method based on Chinese Similarity measures
CN104778256B (en) A kind of the quick of field question answering system consulting can increment clustering method
CN107153658A (en) A kind of public sentiment hot word based on weighted keyword algorithm finds method
CN103631858B (en) A kind of science and technology item similarity calculating method
CN107943824A (en) A kind of big data news category method, system and device based on LDA
CN104484380A (en) Personalized search method and personalized search device
CN108446408A (en) A kind of short text method of abstracting based on PageRank
CN111090811B (en) Massive news hot topic extraction method and system
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN113988053A (en) Hot word extraction method and device
CN111858842A (en) Judicial case screening method based on LDA topic model
CN110457472A (en) The emotion association analysis method for electric business product review based on SOM clustering algorithm
CN104794209B (en) Chinese microblogging mood sorting technique based on Markov logical network and system
CN107451116B (en) Statistical analysis method for mobile application endogenous big data
CN107066585A (en) A kind of probability topic calculates the public sentiment monitoring method and system with matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180313)