CN107798083A - Big-data-based information recommendation method, system and device - Google Patents
- Publication number: CN107798083A (application CN201710967315.7A)
- Authority
- CN
- China
- Prior art keywords
- information
- data
- theme
- document
- word
- Prior art date
- Legal status (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis): Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a big-data-based information recommendation method, system and device. Large-scale collections of information documents are trained quickly and effectively with parallel Latent Dirichlet Allocation (LDA), and optimal coefficients are obtained by Logistic regression, so that a weighted score can be computed for each new information document. The method fully considers the topic distribution of the information documents together with individual user behaviour, and thereby provides personalised recommendations to the user. Compared with document-similarity-based methods, its algorithmic complexity is greatly reduced, execution efficiency is effectively increased, the memory footprint is lowered, model error is reduced, and accuracy is greatly improved. The invention can be widely applied in information recommendation.
Description
Technical field
The present invention relates to the technical field of big data, and more particularly to a big-data-based information recommendation method, system and device.
Background technology
With the popularisation of the internet and the development of technology, information publishing platforms of all kinds have gradually come into people's view, so that obtaining information has become simpler, the available methods more varied, and the distance between the media and people shorter. At the same time, the vast amount of information produced every day brings the problem of information explosion. Although people can easily obtain a large amount of information every day, it is easy to become lost in it, because picking out the information that is useful to oneself from such a complicated mass of information has become very difficult and costly. Classical collaborative filtering can filter information, but for large amounts of data its accuracy is not high, and the cost of collecting user behaviour data by conventional methods is very high and hard to do in a timely manner.
The content of the invention
In order to solve the above technical problems, an object of the invention is to provide a big-data-based information recommendation method, system and device that can collect data in a timely manner and with comparatively high accuracy.
The technical solution used in the present invention is:
A big-data-based information recommendation method comprises the following steps:
collecting user behaviour data and analysing it, to obtain information-set data and user behaviour analysis data;
preprocessing the information-set data to obtain a corpus;
performing LDA modelling on the obtained corpus;
sampling the information-set data by distributed Gibbs sampling to obtain a training set and, from it, a topic distribution probability matrix;
computing a weighted score for each new information document by the Logistic regression algorithm according to the training set and the topic distribution probability matrix;
recommending the n information documents with the highest weighted scores to the user, where n is a preset value.
As a further improvement of the big-data-based information recommendation method, the step of collecting user behaviour data and analysing it, to obtain information-set data and user behaviour analysis data, specifically comprises:
collecting logs and classifying them, to obtain user behaviour logs;
collecting user behaviour data according to the user behaviour logs;
classifying and storing the information documents;
grouping users with similar interests into classes by a clustering method;
for the class of users to be recommended to, marking browsed information documents as 1 and unbrowsed information documents as 0, to obtain a browsed information set and an unbrowsed information set, i.e. the information-set data;
obtaining the ID of each information document in the information sets, and obtaining the user's dwell time on each information document, to obtain the user behaviour analysis data.
As a further improvement of the big-data-based information recommendation method, the step of preprocessing the information-set data to obtain a corpus specifically comprises:
performing word segmentation on the information documents in the information-set data and identifying unregistered (out-of-vocabulary) words, to obtain the words in the information documents;
removing stop words from the obtained words according to a preset stop-word list, to obtain the corpus.
As a further improvement of the big-data-based information recommendation method, the step of performing LDA modelling on the obtained corpus specifically comprises:
building an LDA model from the corpus;
optimising the parameters of the LDA model;
performing parameter estimation according to the established LDA model.
As a further improvement of the big-data-based information recommendation method, the step of building an LDA model from the corpus is embodied as follows:
the topic distribution θ obeys a Dirichlet distribution with hyperparameter α, the word distribution φ obeys a Dirichlet distribution with hyperparameter β, the topic index z obeys a multinomial distribution with parameter θ, and the word w obeys a multinomial distribution with parameter φ.
As a further improvement of the big-data-based information recommendation method, the specific calculation formulas for the step of optimising the parameters of the LDA model are:
α̂_k = α_k · (Σ_i [Ψ(n_ik + α_k) − Ψ(α_k)]) / (Σ_i [Ψ(n_i + Σ_k α_k) − Ψ(Σ_k α_k)]);
β̂_t = β_t · (Σ_k [Ψ(n_kt + β_t) − Ψ(β_t)]) / (Σ_k [Ψ(n_k + Σ_t β_t) − Ψ(Σ_t β_t)]);
where α̂_k denotes the parameter α after optimisation, β̂_t denotes the parameter β after optimisation, α_k denotes the parameter α before optimisation, β_t denotes the parameter β before optimisation, Ψ(x) is the Digamma function, i.e. the derivative of the logarithm of the Gamma function of the variable x, n_ik denotes the count of topic k in the i-th article, n_kt denotes the count of word t under topic k, and n_i = Σ_k n_ik, n_k = Σ_t n_kt.
As a further improvement of the big-data-based information recommendation method, the specific calculation formulas for the step of performing parameter estimation according to the established LDA model are:
φ_{k,t} = (n_k^{(t)} + β_t) / (Σ_{t=1}^{V} n_k^{(t)} + β_t);
θ_{m,k} = (n_m^{(k)} + α_t) / (Σ_{k=1}^{K} n_m^{(k)} + α_t);
where φ_{k,t} denotes the distribution probability of word t under topic k, θ_{m,k} denotes the probability that the topic of the m-th document is k, n_k^{(t)} denotes the count of word t under topic k, n_m^{(k)} denotes the count of topic k in the m-th document, α_t denotes the parameter α for word t, and β_t denotes the parameter β for word t.
As a further improvement of the big-data-based information recommendation method, the weighted score is calculated as:
Score(i) = c1*Topic1 + c2*Topic2 + … + ck*TopicK;
where i denotes the i-th document, k denotes the k-th topic, TopicK denotes the distribution probability of the k-th topic, and [c1, c2, …, ck] denote the optimal regression coefficient values for the corresponding topics obtained by the Logistic regression algorithm.
Another technical scheme adopted by the invention is:
A big-data-based information recommendation system, comprising:
a collecting unit for collecting user behaviour data and analysing it, to obtain information-set data and user behaviour analysis data;
a preprocessing unit for preprocessing the information-set data, to obtain a corpus;
a modelling unit for performing LDA modelling on the obtained corpus;
a distributed processing unit for sampling the information-set data by distributed Gibbs sampling, to obtain a training set and, from it, a topic distribution probability matrix;
a weight calculation unit for computing a weighted score for each new information document by the Logistic regression algorithm according to the training set and the topic distribution probability matrix;
a recommendation unit for recommending the n information documents with the highest weighted scores to the user, where n is a preset value.
Another technical scheme adopted by the invention is:
A big-data-based information recommendation apparatus, comprising:
a memory for storing a program;
a processor for executing the program so as to:
collect user behaviour data and analyse it, to obtain information-set data and user behaviour analysis data;
preprocess the information-set data to obtain a corpus;
perform LDA modelling on the obtained corpus;
sample the information-set data by distributed Gibbs sampling to obtain a training set and, from it, a topic distribution probability matrix;
compute a weighted score for each new information document by the Logistic regression algorithm according to the training set and the topic distribution probability matrix;
recommend the n information documents with the highest weighted scores to the user, where n is a preset value.
The beneficial effects of the invention are as follows:
The big-data-based information recommendation method, system and device of the invention train large-scale collections of information documents quickly and effectively with parallel Latent Dirichlet Allocation, and obtain the optimal coefficients by Logistic regression, so that the weighted score of each new information document can be computed. The topic distribution of the information documents is fully considered together with individual user behaviour, so that personalised recommendations are provided to the user. Compared with document-similarity-based methods, the algorithmic complexity is greatly reduced, execution efficiency is effectively increased, the memory footprint is lowered, model error is reduced, and accuracy is greatly improved.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the big-data-based information recommendation method of the invention;
Fig. 2 is a block diagram of the big-data-based information recommendation system of the invention.
Embodiments
Specific embodiments of the invention are further described below with reference to the drawings:
With reference to Fig. 1, the big-data-based information recommendation method of the invention comprises the following steps:
collecting user behaviour data and analysing it, to obtain information-set data and user behaviour analysis data;
preprocessing the information-set data to obtain a corpus;
performing LDA modelling on the obtained corpus;
sampling the information-set data by distributed Gibbs sampling to obtain a training set and, from it, a topic distribution probability matrix;
computing a weighted score for each new information document by the Logistic regression algorithm according to the training set and the topic distribution probability matrix;
recommending the n information documents with the highest weighted scores to the user, where n is a preset value.
As a further preferred embodiment, the step of collecting user behaviour data and analysing it, to obtain information-set data and user behaviour analysis data, specifically comprises:
collecting logs and classifying them, to obtain user behaviour logs;
collecting user behaviour data according to the user behaviour logs;
classifying and storing the information documents;
grouping users with similar interests into classes by a clustering method;
for the class of users to be recommended to, marking browsed information documents as 1 and unbrowsed information documents as 0, to obtain a browsed information set and an unbrowsed information set, i.e. the information-set data;
obtaining the ID of each information document in the information sets, and obtaining the user's dwell time on each information document, to obtain the user behaviour analysis data.
In this embodiment, logs are collected with Flume and divided into two classes: user behaviour logs concerning user behaviour data, including the user's browsing traces and scope of activity, and system logs concerning regular system data. A Flume agent is started every 5 minutes for the user behaviour logs; the log messages are transferred into Kafka and stored in HDFS. The regular system data are imported into HDFS at 2:00 a.m. every day in a t+1 manner.
In this embodiment, user behaviour data are collected from the user behaviour logs by means of API and front-end event tracking ("burying points"), including how long the user viewed an information item (more than 5 minutes is treated as invalid), whether the user gave a like, whether the user commented, and whether a comment was positive or negative; the data are fed into Spark Streaming and continuously written into HDFS.
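The event attributes collected above can be sketched as a small labelling function. This is a minimal illustrative sketch; the field names and the rule that a dwell time over 5 minutes is invalid follow the description above, but the exact record schema is an assumption, not the patent's.

```python
FIVE_MINUTES = 5 * 60  # seconds; dwell times beyond this are treated as invalid

def label_event(dwell_seconds, liked, comment_sentiment):
    """Turn one raw click-stream event into a behaviour record (hypothetical schema).

    comment_sentiment is "positive", "negative", or None when there is no comment.
    """
    return {
        "valid": dwell_seconds <= FIVE_MINUTES,
        "liked": liked,
        "sentiment": comment_sentiment,
    }

record = label_event(90, liked=True, comment_sentiment="positive")
```

In a real pipeline such records would be emitted into the streaming layer (Kafka/Spark Streaming in the embodiment) rather than returned as dictionaries.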
In this embodiment, HDFS on the Hadoop big-data platform is used to store the massive classified information documents, and tables can be built on Hive so that the massive information documents can be found quickly and effectively. Using Hive table queries, the user behaviour data analysis is computed in real time in a serial fashion and the results are saved into HDFS, so that the next computation can be done incrementally on the basis of the previous ones.
In this embodiment, the user behaviour data are analysed by K-means clustering, and users with similar interests are grouped into one class. An information document browsed by users of a given class is marked 1 (indicating that the users are interested in that information document); among the articles adjacent to an information document of interest, those that the user did not click to browse are marked 0 (indicating that the user is not interested in them); and the ID of each information document and the user's dwell time are obtained.
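The K-means grouping step can be sketched with a minimal Lloyd's-algorithm implementation. This is an illustrative sketch only: the two-dimensional behaviour features and the deterministic initialisation from the first k rows are assumptions for the example, not the patent's feature design.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal Lloyd's k-means: group users by behaviour feature vectors."""
    centers = X[:k].copy()  # deterministic init for the example
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each user to the nearest cluster centre
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # move each centre to the mean of its assigned users
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two clearly separated (hypothetical) behaviour profiles
X = np.array([[0.1, 0.2], [0.0, 0.1], [5.0, 5.1], [5.2, 4.9]])
labels = kmeans(X, 2)
```

Users in the same cluster then share the 1/0 document labels described above when the training set is assembled.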
As a further preferred embodiment, the step of preprocessing the information-set data to obtain a corpus specifically comprises:
performing word segmentation on the information documents in the information-set data and identifying unregistered (out-of-vocabulary) words, to obtain the words in the information documents;
removing stop words from the obtained words according to a preset stop-word list, to obtain the corpus.
In the embodiment of the invention, the word segmentation uses jieba segmentation, unregistered words are identified according to a hidden Markov model, and a custom dictionary assigns suitable weights to proper nouns and popular words, ensuring that such words are segmented accurately during word segmentation. Words with no practical meaning, such as prepositions, articles, modal particles, adverbs, conjunctions and punctuation, are automatically filtered out according to the stop-word list.
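The preprocessing step can be sketched as follows. Real segmentation would use jieba (e.g. `jieba.lcut` with a custom dictionary, as in the embodiment); here a plain whitespace split stands in so the stop-word filtering logic is runnable on its own, and the tiny English stop-word list is an assumption for the example.

```python
STOPWORDS = {"the", "a", "of", "and"}  # stand-in for the preset stop-word list

def preprocess(text):
    """Segment a document and drop stop words (jieba replaced by split() here)."""
    tokens = text.lower().split()  # placeholder for jieba segmentation
    return [t for t in tokens if t not in STOPWORDS]

corpus = [preprocess(d) for d in ["The rise of big data", "A model of topics"]]
```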
As a further preferred embodiment, the step of performing LDA modelling on the obtained corpus specifically comprises:
building an LDA model from the corpus;
optimising the parameters of the LDA model;
performing parameter estimation according to the established LDA model.
As a further preferred embodiment, the step of building an LDA model from the corpus is embodied as follows:
the topic distribution θ obeys a Dirichlet distribution with hyperparameter α, the word distribution φ obeys a Dirichlet distribution with hyperparameter β, the topic index z obeys a multinomial distribution with parameter θ, and the word w obeys a multinomial distribution with parameter φ.
As a further preferred embodiment, the specific calculation formulas for the step of performing parameter estimation according to the established LDA model are:
φ_{k,t} = (n_k^{(t)} + β_t) / (Σ_{t=1}^{V} n_k^{(t)} + β_t);
θ_{m,k} = (n_m^{(k)} + α_t) / (Σ_{k=1}^{K} n_m^{(k)} + α_t);
where φ_{k,t} denotes the distribution probability of word t under topic k, θ_{m,k} denotes the probability that the topic of the m-th document is k, n_k^{(t)} denotes the count of word t under topic k, n_m^{(k)} denotes the count of topic k in the m-th document, α_t denotes the parameter α for word t, and β_t denotes the parameter β for word t.
In this embodiment, according to the dependencies between the variables, the joint probability density formula can be obtained. To reduce calculation error, θ and φ are integrated out respectively, and the final simplified formula is
p(w, z | α, β) = p(w | z, β) · p(z | α);
from which p(w, z) can be derived. By collapsed Gibbs sampling, the topic of the current word is re-drawn in a loop within the set number of iterations until the topic distribution of the words converges. The sampling formula is implemented as:
p(z_i = k | z_¬i, w) ∝ (n_{m,¬i}^{(k)} + α_k) · (n_{k,¬i}^{(t)} + β_t) / (Σ_{t=1}^{V} (n_{k,¬i}^{(t)} + β_t));
Next, using posterior probability estimation, the topic distribution and the word distribution are obtained; both obey Dirichlet distributions. From the properties of the Dirichlet distribution the word distribution probabilities and topic distribution probabilities can be derived, i.e.:
φ_{k,t} = (n_k^{(t)} + β_t) / (Σ_{t=1}^{V} n_k^{(t)} + β_t), θ_{m,k} = (n_m^{(k)} + α_t) / (Σ_{k=1}^{K} n_m^{(k)} + α_t);
where φ_{k,t} denotes the distribution probability of word t under topic k, θ_{m,k} denotes the probability that the topic of the m-th document is k, n_k^{(t)} denotes the count of word t under topic k, and n_m^{(k)} denotes the count of topic k in the m-th document.
As a further preferred embodiment, the specific calculation formulas for the step of optimising the parameters of the LDA model are:
α̂_k = α_k · (Σ_i [Ψ(n_ik + α_k) − Ψ(α_k)]) / (Σ_i [Ψ(n_i + Σ_k α_k) − Ψ(Σ_k α_k)]);
β̂_t = β_t · (Σ_k [Ψ(n_kt + β_t) − Ψ(β_t)]) / (Σ_k [Ψ(n_k + Σ_t β_t) − Ψ(Σ_t β_t)]);
where α̂_k denotes the parameter α after optimisation, β̂_t denotes the parameter β after optimisation, α_k denotes the parameter α before optimisation, β_t denotes the parameter β before optimisation, Ψ(x) is the Digamma function, i.e. the derivative of the logarithm of the Gamma function of the variable x, n_ik denotes the count of topic k in the i-th article, n_kt denotes the count of word t under topic k, and n_i = Σ_k n_ik, n_k = Σ_t n_kt.
In this embodiment, the model perplexity is computed as k takes different values, and the topic number with the smallest perplexity is taken as the optimal topic number for fitting the model to the data. For a given corpus D, the perplexity is:
perplexity(D) = exp( − Σ_m log p(w_m) / Σ_m N_m );
where w_m denotes the words of the m-th document and N_m denotes the length of the m-th document. As the perplexity curve shows, the perplexity is smallest when the topic number K = 40, so the optimal topic number is set to 40.
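The perplexity used to select K can be sketched directly from the formula above, assuming a fitted theta (doc-topic) and phi (topic-word) matrix; p(w) = Σ_k θ_{m,k} φ_{k,w}.

```python
import numpy as np

def perplexity(docs, theta, phi):
    """Perplexity of documents under fitted theta/phi; lower is better."""
    log_lik, n_words = 0.0, 0
    for m, doc in enumerate(docs):
        for w in doc:
            log_lik += np.log(theta[m] @ phi[:, w])  # log p(w | doc m)
            n_words += 1
    return np.exp(-log_lik / n_words)

# Sanity check: under a uniform model over V = 4 words, perplexity is 4
theta_u = np.full((1, 2), 0.5)
phi_u = np.full((2, 4), 0.25)
pp = perplexity([[0, 1, 2]], theta_u, phi_u)
```

One would evaluate this for a range of K values and keep the minimiser, which in the embodiment is K = 40.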
In the embodiment of the invention, the parameters α and β are optimised as:
α̂_k = α_k · (Σ_i [Ψ(n_ik + α_k) − Ψ(α_k)]) / (Σ_i [Ψ(n_i + Σ_k α_k) − Ψ(Σ_k α_k)]);
β̂_t = β_t · (Σ_k [Ψ(n_kt + β_t) − Ψ(β_t)]) / (Σ_k [Ψ(n_k + Σ_t β_t) − Ψ(Σ_t β_t)]);
where α̂_k denotes the parameter α after optimisation, β̂_t denotes the parameter β after optimisation, Ψ(x) is the Digamma function, i.e. the derivative of the logarithm of the Gamma function of the variable x, n_ik denotes the count of topic k in the i-th article, n_kt denotes the count of word t under topic k, and n_i = Σ_k n_ik, n_k = Σ_t n_kt.
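One fixed-point step of the α update above can be sketched as follows. The digamma implementation (recurrence plus asymptotic series) and the sample counts are illustrative assumptions; a library digamma such as scipy.special.digamma would normally be used instead.

```python
import math

def digamma(x):
    """Digamma Psi(x) via the recurrence Psi(x) = Psi(x+1) - 1/x plus an
    asymptotic series; accurate enough for the hyperparameter update."""
    r = 0.0
    while x < 6:
        r -= 1.0 / x
        x += 1
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f / 252))

def update_alpha(alpha, n_ik, n_i):
    """One fixed-point step: alpha_k <- alpha_k * num_k / denom (formula above)."""
    A = sum(alpha)
    M = len(n_i)
    denom = sum(digamma(n + A) for n in n_i) - M * digamma(A)
    return [a * (sum(digamma(n_ik[i][k] + a) for i in range(M)) - M * digamma(a)) / denom
            for k, a in enumerate(alpha)]

new_alpha = update_alpha([0.5, 0.5], n_ik=[[3, 1], [2, 2]], n_i=[4, 4])
```

Iterating this step (and the analogous one for β) until the change is small gives the optimised hyperparameters.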
In this embodiment, when the parallelised Gibbs sampling is performed, the original data set is divided, by a conflict-free data partitioning method, into P×P parts (P being the number of concurrent groups); the split data blocks are re-ordered and finally combined into P data blocks, which are placed on the individual machines for execution, so that each data subset is sampled again: parallel within a group, serial between groups. The strategy used is the diagonal method: since the same row or the same column cannot be selected at the same time, blocks along a diagonal are selected for computation. After one iteration has been executed in parallel within a group, the group's statistics, such as the document and word counts, are synchronised to the next group; within a group each block uses the same Gibbs sampling method as the standalone version, and the results are then merged.
To reduce the volume of data transmitted, data blocks with the same row number in the split data are placed on the same computer node, and the vocabulary V is divided as evenly as possible. Gibbs sampling is then performed on each computer node and the results are finally merged, at which point the parallelised sampling ends.
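The conflict-free diagonal schedule described above can be sketched as a pure scheduling function: the corpus is viewed as a P×P grid of (document-block, vocabulary-block) cells, and in each of P rounds the P workers process one diagonal, so no two workers ever touch the same document row or vocabulary column at once. The exact cell-to-worker mapping here is an illustrative assumption.

```python
def diagonal_schedule(P):
    """Return P rounds; in round r, worker p processes grid cell (p, (p + r) % P).

    Within a round, all row indices are distinct and all column indices are
    distinct, so the P workers can sample in parallel without conflicts.
    """
    return [[(p, (p + r) % P) for p in range(P)] for r in range(P)]

rounds = diagonal_schedule(3)
```

After each round the count statistics for the processed blocks are synchronised before the next diagonal is started, mirroring the group-to-group synchronisation in the embodiment.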
As a further preferred embodiment, the weighted score is calculated as:
Score(i) = c1*Topic1 + c2*Topic2 + … + ck*TopicK;
where i denotes the i-th document, k denotes the k-th topic, TopicK denotes the distribution probability of the k-th topic, and [c1, c2, …, ck] denote the optimal regression coefficient values for the corresponding topics obtained by the Logistic regression algorithm.
In this embodiment, when the Logistic regression algorithm is used, the output is confined between 0 and 1, i.e. 0 ≤ h_u(x) ≤ 1, which linear regression cannot achieve. A function g is therefore introduced so that the hypothesis of logistic regression is expressed as h_u(x) = g(uᵀx), where g is called the Sigmoid or Logistic function, expressed as:
g(z) = 1/(1+exp(−z)), h_u(x) = g(uᵀx) = 1/(1+exp(−uᵀx)), where u is the parameter vector.
Optimising the parameters u means minimising the log-likelihood loss (cost) function of the logistic regression. The minimum of the loss function is sought by gradient descent; the parameter update at each iteration is:
u_j := u_j − η · Σ_i (h_u(x_i) − y_i) · x_{i,j};
iterating until the parameters u converge, the regression coefficient values finally obtained being the optimal solution minimising the loss function, where u_j denotes the j-th parameter, x_{i,j} denotes the j-th component of the i-th sample, and y_i denotes the label of the i-th sample.
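The sigmoid hypothesis and gradient-descent update above can be sketched together as a minimal batch logistic regression. The learning rate, iteration count and the tiny separable data set are illustrative assumptions, not the patent's settings.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, iters=2000):
    """Batch gradient descent on the logistic loss; returns coefficients u."""
    u = np.zeros(X.shape[1])
    for _ in range(iters):
        # gradient of the log-likelihood loss: X^T (h_u(X) - y) / n
        u -= lr * X.T @ (sigmoid(X @ u) - y) / len(y)
    return u

# Hypothetical topic distributions (rows) with click labels y
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
u = fit_logistic(X, y)
```

The fitted u plays the role of [c1, …, ck] when new documents are scored.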
In this embodiment, in the topic distribution matrix of the training-set documents, each topic is treated as an independent variable x and whether the user clicked the information document as the dependent variable h(x). With the logistic regression algorithm the optimal regression coefficient values [c1, c2, …, ck] are obtained, and are then combined with the topic distribution probability values of the new information documents: the score of each information document is calculated as Score(i) = c1*Topic1 + c2*Topic2 + … + ck*TopicK, where i denotes the i-th document. Finally, according to the scores of the new information documents, the n highest-scoring information documents are taken as the recommendation to the user.
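The final scoring-and-ranking step can be sketched as a dot product of each new document's topic distribution with the fitted coefficients, followed by a top-n selection. The topic distributions and coefficients below are illustrative values.

```python
import numpy as np

def score_and_recommend(theta_new, coef, n):
    """Score(i) = c1*Topic1 + ... + ck*TopicK per document; return top-n indices."""
    scores = theta_new @ coef          # one weighted score per new document
    return np.argsort(-scores)[:n]     # indices of the n highest scores

theta_new = np.array([[0.7, 0.3],      # hypothetical new-document topic mixes
                      [0.2, 0.8],
                      [0.5, 0.5]])
coef = np.array([1.0, -0.5])           # stand-in for [c1, ..., ck]
top = score_and_recommend(theta_new, coef, n=2)
```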
With reference to Fig. 2, the big-data-based information recommendation system of the invention comprises:
a collecting unit for collecting user behaviour data and analysing it, to obtain information-set data and user behaviour analysis data;
a preprocessing unit for preprocessing the information-set data, to obtain a corpus;
a modelling unit for performing LDA modelling on the obtained corpus;
a distributed processing unit for sampling the information-set data by distributed Gibbs sampling, to obtain a training set and, from it, a topic distribution probability matrix;
a weight calculation unit for computing a weighted score for each new information document by the Logistic regression algorithm according to the training set and the topic distribution probability matrix;
a recommendation unit for recommending the n information documents with the highest weighted scores to the user, where n is a preset value.
The big-data-based information recommendation apparatus of the invention comprises:
a memory for storing a program;
a processor for executing the program so as to:
collect user behaviour data and analyse it, to obtain information-set data and user behaviour analysis data;
preprocess the information-set data to obtain a corpus;
perform LDA modelling on the obtained corpus;
sample the information-set data by distributed Gibbs sampling to obtain a training set and, from it, a topic distribution probability matrix;
compute a weighted score for each new information document by the Logistic regression algorithm according to the training set and the topic distribution probability matrix;
recommend the n information documents with the highest weighted scores to the user, where n is a preset value.
From the foregoing it can be seen that the big-data-based information recommendation method, system and device of the invention train large-scale collections of information documents quickly and effectively with parallel Latent Dirichlet Allocation, and obtain the optimal coefficients by Logistic regression, so that the weighted score of each new information document can be computed. The topic distribution of the information documents is fully considered together with individual user behaviour, so that personalised recommendations are provided to the user. Compared with document-similarity-based methods, the algorithmic complexity is greatly reduced, execution efficiency is effectively increased, the memory footprint is lowered, model error is reduced, and accuracy is greatly improved.
The preferred embodiments of the invention have been described above, but the invention is not limited to these embodiments; those skilled in the art can also make various equivalent variations or substitutions without departing from the spirit of the invention, and such equivalent variations or substitutions are all contained within the scope defined by the claims of this application.
Claims (10)
1. A big-data-based information recommendation method, characterised by comprising the following steps:
collecting user behaviour data and analysing it, to obtain information-set data and user behaviour analysis data;
preprocessing the information-set data to obtain a corpus;
performing LDA modelling on the obtained corpus;
sampling the information-set data by distributed Gibbs sampling to obtain a training set and, from it, a topic distribution probability matrix;
computing a weighted score for each new information document by the Logistic regression algorithm according to the training set and the topic distribution probability matrix;
recommending the n information documents with the highest weighted scores to the user, where n is a preset value.
2. The big-data-based information recommendation method according to claim 1, characterised in that the step of collecting user behaviour data and analysing it, to obtain information-set data and user behaviour analysis data, specifically comprises:
collecting logs and classifying them, to obtain user behaviour logs;
collecting user behaviour data according to the user behaviour logs;
classifying and storing the information documents;
grouping users with similar interests into classes by a clustering method;
for the class of users to be recommended to, marking browsed information documents as 1 and unbrowsed information documents as 0, to obtain a browsed information set and an unbrowsed information set, i.e. the information-set data;
obtaining the ID of each information document in the information sets, and obtaining the user's dwell time on each information document, to obtain the user behaviour analysis data.
3. The big-data-based information recommendation method according to claim 1, characterised in that the step of preprocessing the information-set data to obtain a corpus specifically comprises:
performing word segmentation on the information documents in the information-set data and identifying unregistered (out-of-vocabulary) words, to obtain the words in the information documents;
removing stop words from the obtained words according to a preset stop-word list, to obtain the corpus.
4. The big-data-based information recommendation method according to claim 1, characterised in that the step of performing LDA modelling on the obtained corpus specifically comprises:
building an LDA model from the corpus;
optimising the parameters of the LDA model;
performing parameter estimation according to the established LDA model.
5. The big-data-based information recommendation method according to claim 4, characterised in that the step of building an LDA model from the corpus is embodied as follows:
the topic distribution θ obeys a Dirichlet distribution with hyperparameter α, the word distribution φ obeys a Dirichlet distribution with hyperparameter β, the topic index z obeys a multinomial distribution with parameter θ, and the word w obeys a multinomial distribution with parameter φ.
6. The big-data-based information recommendation method according to claim 4, characterised in that the specific calculation formulas for the step of optimising the parameters of the LDA model are:
α̂_k = α_k · (Σ_i [Ψ(n_ik + α_k) − Ψ(α_k)]) / (Σ_i [Ψ(n_i + Σ_k α_k) − Ψ(Σ_k α_k)]);
β̂_t = β_t · (Σ_k [Ψ(n_kt + β_t) − Ψ(β_t)]) / (Σ_k [Ψ(n_k + Σ_t β_t) − Ψ(Σ_t β_t)]);
where α̂_k denotes the parameter α after optimisation, β̂_t denotes the parameter β after optimisation, α_k denotes the parameter α before optimisation, β_t denotes the parameter β before optimisation, Ψ(x) is the Digamma function, i.e. the derivative of the logarithm of the Gamma function of the variable x, n_ik denotes the count of topic k in the i-th article, n_kt denotes the count of word t under topic k, and n_i = Σ_k n_ik, n_k = Σ_t n_kt.
7. The big-data-based information recommendation method according to claim 5, characterized in that the step of performing parameter estimation according to the established LDA model uses the following calculation formulas:
θ_{m,k} = (n_m^{(k)} + α_k) / (Σ_{k=1}^{K} (n_m^{(k)} + α_k));
φ_{k,t} = (n_k^{(t)} + β_t) / (Σ_{t=1}^{V} (n_k^{(t)} + β_t));
wherein φ_{k,t} denotes the distribution probability of word t under topic k, θ_{m,k} denotes the distribution probability of topic k in the m-th document, n_k^{(t)} denotes the count of word t under topic k, n_m^{(k)} denotes the count of topic k in the m-th document, α_k denotes the component of the parameter α for topic k, and β_t denotes the component of the parameter β for word t.
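The claim-7 estimators reduce to normalizing smoothed count matrices. A minimal sketch, assuming hypothetical count matrices and component-wise hyperparameters:

```python
def estimate_theta(n_mk, alpha):
    """theta[m][k] = (n_mk + alpha_k) / sum_k (n_mk + alpha_k)."""
    theta = []
    for row in n_mk:                      # one row per document
        denom = sum(c + a for c, a in zip(row, alpha))
        theta.append([(c + a) / denom for c, a in zip(row, alpha)])
    return theta

def estimate_phi(n_kt, beta):
    """phi[k][t] = (n_kt + beta_t) / sum_t (n_kt + beta_t)."""
    phi = []
    for row in n_kt:                      # one row per topic
        denom = sum(c + b for c, b in zip(row, beta))
        phi.append([(c + b) / denom for c, b in zip(row, beta)])
    return phi

n_mk = [[8, 2], [1, 9]]                   # hypothetical doc-topic counts
theta = estimate_theta(n_mk, [0.5, 0.5])
```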
8. The big-data-based information recommendation method according to claim 1, characterized in that the weighted score is calculated as:
Score(i) = c1*Topic1 + c2*Topic2 + … + cK*TopicK;
wherein i denotes the i-th document, k denotes the k-th topic, TopicK denotes the distribution probability of the K-th topic, and [c1, c2, …, cK] denotes the optimal regression coefficient of each corresponding topic obtained by the logistic regression algorithm.
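The weighted-score ranking can be sketched as follows, assuming the regression coefficients have already been fitted; the coefficient values, document ids, and topic distributions here are hypothetical.

```python
def weighted_score(topic_dist, coef):
    """Score(i) = c1*Topic1 + ... + cK*TopicK."""
    return sum(c * t for c, t in zip(coef, topic_dist))

def recommend_top_n(docs, coef, n):
    """Return the ids of the n documents with the highest weighted scores.
    docs maps a document id to its topic distribution."""
    ranked = sorted(docs, key=lambda d: weighted_score(docs[d], coef),
                    reverse=True)
    return ranked[:n]

coef = [0.9, 0.1, 0.4]                    # hypothetical fitted coefficients
docs = {"d1": [0.7, 0.2, 0.1],
        "d2": [0.1, 0.8, 0.1],
        "d3": [0.2, 0.2, 0.6]}
top = recommend_top_n(docs, coef, 2)      # the n documents to recommend
```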
9. A big-data-based information recommendation system, characterized by comprising:
a collecting unit for collecting and analyzing user behavior data to obtain information collection data and user behavior analysis data;
a preprocessing unit for preprocessing the information collection data to obtain a corpus;
a modeling unit for performing LDA modeling on the obtained corpus;
a distributed processing unit for processing the information collection data by distributed Gibbs sampling to obtain a training set and, in turn, a topic distribution probability matrix;
a weight calculation unit for calculating the weighted score of each new information document by the logistic regression algorithm according to the training set and the topic distribution probability matrix;
a recommendation unit for recommending the n information documents with the highest weighted scores to the user, wherein n is a preset value.
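The distributed processing unit relies on Gibbs sampling; a minimal single-machine collapsed Gibbs sampler conveys the idea (the patent's distributed variant would partition documents across workers, a detail not specified here). The corpus, vocabulary size, and hyperparameters below are hypothetical.

```python
import random

def gibbs_lda(docs, V, K, alpha, beta, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of documents, each a list of word ids in [0, V)."""
    rng = random.Random(seed)
    n_mk = [[0] * K for _ in docs]       # doc-topic counts
    n_kt = [[0] * V for _ in range(K)]   # topic-word counts
    n_k = [0] * K
    z = [[rng.randrange(K) for _ in d] for d in docs]
    for m, d in enumerate(docs):         # initialize counts from random z
        for i, w in enumerate(d):
            k = z[m][i]
            n_mk[m][k] += 1; n_kt[k][w] += 1; n_k[k] += 1
    for _ in range(iters):
        for m, d in enumerate(docs):
            for i, w in enumerate(d):
                k = z[m][i]              # remove current assignment
                n_mk[m][k] -= 1; n_kt[k][w] -= 1; n_k[k] -= 1
                # unnormalized full conditional p(z = j | rest)
                p = [(n_mk[m][j] + alpha) * (n_kt[j][w] + beta)
                     / (n_k[j] + V * beta) for j in range(K)]
                r = rng.random() * sum(p)
                k, acc = 0, p[0]
                while r >= acc:
                    k += 1; acc += p[k]
                z[m][i] = k              # record new assignment
                n_mk[m][k] += 1; n_kt[k][w] += 1; n_k[k] += 1
    return n_mk, n_kt

docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 0, 1]]  # hypothetical word ids
n_mk, n_kt = gibbs_lda(docs, V=4, K=2, alpha=0.5, beta=0.1)
```

The resulting count matrices feed the claim-7 estimators to produce the topic distribution probability matrix.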
10. A big-data-based information recommendation apparatus, characterized by comprising:
a memory for storing a program;
a processor for executing the program to:
collect and analyze user behavior data to obtain information collection data and user behavior analysis data;
preprocess the information collection data to obtain a corpus;
perform LDA modeling on the obtained corpus;
process the information collection data by distributed Gibbs sampling to obtain a training set and, in turn, a topic distribution probability matrix;
calculate the weighted score of each new information document by the logistic regression algorithm according to the training set and the topic distribution probability matrix;
recommend the n information documents with the highest weighted scores to the user, wherein n is a preset value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710967315.7A CN107798083A (en) | 2017-10-17 | 2017-10-17 | A kind of information based on big data recommends method, system and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107798083A true CN107798083A (en) | 2018-03-13 |
Family
ID=61534122
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710967315.7A Pending CN107798083A (en) | 2017-10-17 | 2017-10-17 | A kind of information based on big data recommends method, system and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107798083A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108509793A (en) * | 2018-04-08 | 2018-09-07 | 北京明朝万达科技股份有限公司 | A kind of user's anomaly detection method and device based on User action log data |
CN111309873A (en) * | 2018-11-23 | 2020-06-19 | 北京嘀嘀无限科技发展有限公司 | Data processing method and device, electronic equipment and storage medium |
CN111309874A (en) * | 2018-11-23 | 2020-06-19 | 北京嘀嘀无限科技发展有限公司 | Data processing method and device, electronic equipment and storage medium |
WO2021035955A1 (en) * | 2019-08-29 | 2021-03-04 | 苏州朗动网络科技有限公司 | Text news processing method and device and storage medium |
CN115203578A (en) * | 2022-09-16 | 2022-10-18 | 深圳云威网络科技有限公司 | User behavior analysis system based on big data platform |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102831234A (en) * | 2012-08-31 | 2012-12-19 | 北京邮电大学 | Personalized news recommendation device and method based on news content and theme feature |
US20130339367A1 (en) * | 2012-06-14 | 2013-12-19 | Santhosh Adayikkoth | Method and system for preferential accessing of one or more critical entities |
CN105824911A (en) * | 2016-03-15 | 2016-08-03 | 山东大学 | Video recommending method based on LDA user theme model |
CN106815369A (en) * | 2017-01-24 | 2017-06-09 | 中山大学 | A kind of file classification method based on Xgboost sorting algorithms |
Non-Patent Citations (1)
Title |
---|
LIANG, Jian: "Personalized Push Based on LDA Text Topic Mining and Its Implementation on the Spark Platform", China Master's Theses Full-text Database, Information Science and Technology |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107908669A (en) | A kind of big data news based on parallel LDA recommends method, system and device | |
CN107798083A (en) | A kind of information based on big data recommends method, system and device | |
CN110059311B (en) | Judicial text data-oriented keyword extraction method and system | |
CN112001185B (en) | Emotion classification method combining Chinese syntax and graph convolution neural network | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN106844632B (en) | Product comment emotion classification method and device based on improved support vector machine | |
CN105279495B (en) | A kind of video presentation method summarized based on deep learning and text | |
CN108108449A (en) | A kind of implementation method based on multi-source heterogeneous data question answering system and the system towards medical field | |
CN108108426B (en) | Understanding method and device for natural language question and electronic equipment | |
CN110019770A (en) | The method and apparatus of train classification models | |
CN110020189A (en) | A kind of article recommended method based on Chinese Similarity measures | |
CN104778256B (en) | A kind of the quick of field question answering system consulting can increment clustering method | |
CN107153658A (en) | A kind of public sentiment hot word based on weighted keyword algorithm finds method | |
CN103631858B (en) | A kind of science and technology item similarity calculating method | |
CN107943824A (en) | A kind of big data news category method, system and device based on LDA | |
CN104484380A (en) | Personalized search method and personalized search device | |
CN108446408A (en) | A kind of short text method of abstracting based on PageRank | |
CN111090811B (en) | Massive news hot topic extraction method and system | |
CN107357785A (en) | Theme feature word abstracting method and system, feeling polarities determination methods and system | |
CN113988053A (en) | Hot word extraction method and device | |
CN111858842A (en) | Judicial case screening method based on LDA topic model | |
CN110457472A (en) | The emotion association analysis method for electric business product review based on SOM clustering algorithm | |
CN104794209B (en) | Chinese microblogging mood sorting technique based on Markov logical network and system | |
CN107451116B (en) | Statistical analysis method for mobile application endogenous big data | |
CN107066585A (en) | A kind of probability topic calculates the public sentiment monitoring method and system with matching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180313 |