CN109726222A - Data stream topic feature extraction method, apparatus, device, and storage medium - Google Patents

Data stream topic feature extraction method, apparatus, device, and storage medium

Info

Publication number
CN109726222A
CN109726222A
Authority
CN
China
Prior art keywords
word
topic
probability
data stream
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811641140.1A
Other languages
Chinese (zh)
Other versions
CN109726222B (en)
Inventor
杨璐
王猛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201811641140.1A priority Critical patent/CN109726222B/en
Publication of CN109726222A publication Critical patent/CN109726222A/en
Application granted granted Critical
Publication of CN109726222B publication Critical patent/CN109726222B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The data stream topic feature extraction method provided by the present invention uses an LDA model whose vocabulary size is not fixed: the topic-word distribution obeys a Dirichlet process, whose number of atoms is not fixed, rather than a Dirichlet distribution with a fixed number of atoms. When the new model encounters a word that does not yet appear in the vocabulary, it adds the word to the vocabulary and continues executing the algorithm. By continually encountering and adding new words, the method makes full use of the information in the stream without increasing memory pressure, fits the vocabulary of the LDA model more closely to the corpus to be processed, improves the precision of the model, and enhances the ability of online LDA algorithms to process data streams. The invention also discloses a data stream topic feature extraction apparatus, a device, and a readable storage medium, which have the above beneficial effects.

Description

Data stream topic feature extraction method, apparatus, device, and storage medium
Technical field
The present invention relates to the field of text data processing, and in particular to a data stream topic feature extraction method, apparatus, device, and readable storage medium.
Background technique
A topic model is a technique for finding the information a user needs in massive data. It analyzes each document in a corpus, counts the words in the documents, and, from the resulting statistics, infers which topics the current document contains and what proportion each topic occupies.
LDA (Latent Dirichlet Allocation) is the mainstream topic model today, with applications in text mining that include text topic identification, text classification, and text similarity computation. Various derived algorithms based on the LDA topic model have been produced for different application scenarios. Among them, the class of LDA topic models that can process data streams is known as online LDA algorithms, for example the online Gibbs sampling algorithm (OGS), the online variational inference algorithm (OVB), and the online belief propagation algorithm (OBP).
The execution of an online LDA algorithm is based on a vocabulary: before the algorithm runs, the whole corpus must be scanned and all words appearing in it organized into the vocabulary; only then can the algorithm start. During execution, an online LDA algorithm cannot add new words to the vocabulary. It can therefore only handle words in the data stream that are already present in the vocabulary and cannot handle words that are absent from it, which causes information loss; and if a very large vocabulary is used to cover all the words that might appear in the data stream, memory becomes overburdened.
How to make full use of the information while keeping memory processing pressure low is therefore a technical problem that those skilled in the art need to solve.
Summary of the invention
The object of the present invention is to provide a data stream topic feature extraction method. By making the topic-word distribution obey a Dirichlet process, whose number of atoms is not fixed, the method can handle new words, thereby making full use of the corpus while only adding new words to the vocabulary, so the change in processing pressure is small and the ability of online LDA algorithms to process data streams is enhanced. A further object of the present invention is to provide a data stream topic feature extraction apparatus, a device, and a readable storage medium, which have the above beneficial effects.
To solve the above technical problems, the present invention provides a data stream topic feature extraction method based on an online LDA algorithm, comprising:
organizing the received data stream into several batch corpora in order of arrival time, and determining the batch corpus currently to be processed;
scanning and recognizing the words contained in the batch corpus to be processed, obtaining the words to be processed;
comparing the words to be processed with the words in the vocabulary, and judging whether the words to be processed include a new word that is not present in the vocabulary;
if so, adding the new word to the vocabulary, obtaining an updated vocabulary;
assigning each topic probability to the words to be processed according to a stick-breaking construction, obtaining initial topic probabilities;
running the new LDA model to process the initial topic probabilities according to the new vocabulary, obtaining the topic probability of each word to be processed; wherein the new LDA model is an LDA model that obeys a Dirichlet process under the belief propagation framework.
Preferably, assigning each topic probability to the words to be processed according to the stick-breaking construction to obtain the initial topic probabilities comprises:
assigning each topic probability to the words to be processed according to Formula 1, obtaining the initial topic probabilities;
wherein Formula 1 is:
φ_w(k) = V_w(k) · ∏_{j=1}^{LOC(w,k)−1} (1 − V_{WORD(j,k)}(k))   (1)
wherein LOC(w, k) is the function locating the position of word w in the word distribution of topic k, and WORD(j, k) is the word at coordinate j in the word distribution of topic k.
Preferably, running the new LDA model to process the initial topic probabilities according to the new vocabulary and obtain the topic probability of each word to be processed comprises:
substituting the initial topic probabilities into Formula 2 according to the new vocabulary for data processing, obtaining the topic probability of each word to be processed;
wherein Formula 2 is:
μ_{w,d}(k) ∝ (μ_{−w,d}(k) + α) · (μ̂_{w,−d}(k) + 1) / (μ̂_{−(w,d)}(k) + 1 + α_β)   (2)
wherein μ_{w,d}(k) is the probability that word w in text d belongs to topic k; μ̂_{w,−d}(k) is, in the word distribution of topic k, the probability of word w in all texts other than document d; μ_{−w,d}(k) is the count of words other than word w in text d that belong to topic k; μ̂_{−(w,d)}(k) is, in the word distribution of topic k, the probability of all words other than word w in text d.
The present invention discloses a data stream topic feature extraction apparatus based on an online LDA algorithm, comprising:
a corpus determination unit for organizing the received data stream into several batch corpora in order of arrival time and determining the batch corpus currently to be processed;
a word recognition unit for scanning and recognizing the words contained in the batch corpus to be processed, obtaining the words to be processed;
a comparison unit for comparing the words to be processed with the words in the vocabulary and judging whether the words to be processed include a new word that is not present in the vocabulary;
a vocabulary updating unit for adding the new word to the vocabulary when the words to be processed include a new word that is not present in the vocabulary, obtaining an updated vocabulary;
a stick-breaking construction unit for assigning each topic probability to the words to be processed according to the stick-breaking construction, obtaining initial topic probabilities;
an LDA processing unit for running the LDA model to process the initial topic probabilities according to the new vocabulary, obtaining the topic probability of each word to be processed; wherein the LDA model is an LDA model that obeys a Dirichlet process under the belief propagation framework.
The present invention discloses a data stream topic feature extraction device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the data stream topic feature extraction method when executing the computer program.
The present invention discloses a readable storage medium on which a program is stored, the program implementing the steps of the data stream topic feature extraction method when executed by a processor.
The data stream topic feature extraction method provided by the present invention uses an LDA model whose vocabulary size is not fixed: the topic-word distribution obeys a Dirichlet process, whose number of atoms is not fixed, rather than a Dirichlet distribution with a fixed number of atoms, so that when the new model encounters a word not yet in the vocabulary, it can add the word to the vocabulary and continue executing the algorithm. By continually encountering and adding new words, the method makes full use of the information without increasing memory processing pressure, fits the vocabulary of the LDA model more closely to the corpus to be processed, improves the precision of the model, and enhances the ability of online LDA algorithms to process data streams.
The invention also discloses a data stream topic feature extraction apparatus, a device, and a readable storage medium, which have the above beneficial effects and are not described again here.
Detailed description of the invention
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the operation of a traditional LDA model based on the Dirichlet distribution;
Fig. 2 is a flowchart of a data stream topic feature extraction method provided in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the overall procedure provided in an embodiment of the present invention;
Fig. 4 is a comparison chart of results provided in an embodiment of the present invention;
Fig. 5 is a structural block diagram of a data stream topic feature extraction apparatus provided in an embodiment of the present invention;
Fig. 6 is a structural block diagram of a data stream topic feature extraction device provided in an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of a data stream topic feature extraction device provided in an embodiment of the present invention.
Specific embodiment
The core of the invention is to provide a data stream topic feature extraction method that can enhance the ability of online LDA algorithms to process data streams; another core of the invention is to provide a data stream topic feature extraction apparatus, a device, and a readable storage medium, which have the above beneficial effects.
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The working principle of an online LDA algorithm processing a data stream is to organize the data stream into corpora in order of arrival, i.e., to divide a complete corpus into several small corpora, processing only one small corpus at a time. The product of the topic model, the topic-word matrix, is treated as a global variable: after each small batch corpus is processed, the online LDA algorithm updates and saves the topic-word matrix for processing the next small batch. By running the LDA algorithm over many small batch corpora, the complete corpus appearing in the data stream can finally be processed.
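The batching scheme above can be sketched as follows. This is an illustrative sketch only, not code from the patent; the function name stream_to_batches is invented for illustration.

```python
def stream_to_batches(stream, batch_size):
    """Group documents from a stream into fixed-size mini-batches, in arrival order."""
    batch = []
    for doc in stream:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                      # flush the final partial batch
        yield batch

# Toy stream of five documents, batch size 2 -> batches of sizes 2, 2, 1
docs = [["a", "b"], ["b", "c"], ["c"], ["a"], ["b", "c"]]
batches = list(stream_to_batches(docs, 2))
```

Each yielded batch corresponds to one "small corpus" processed before the global topic-word matrix is updated and carried to the next batch.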
Under the framework of the LDA model, every document contains multiple topics, each occupying a different proportion. In a document, each word is generated by one of the topics. The LDA model extracts the topic features of a document by finding all the topics in each document and finding the relevant words that describe each topic. In the LDA model, each document is regarded as a document-word matrix; by decomposing the document-word matrix into a document-topic matrix and a topic-word matrix, the topic distribution of each document and the word distribution of each topic are finally obtained.
In the LDA model, the topic distribution of each document d is represented by θ_d, a vector of length K, where K is the set number of topics. Each topic k is represented by a word distribution φ_k. The document-topic distribution θ_d and the topic-word distribution φ_k obey Dirichlet distributions, i.e., θ_d ~ Dirichlet(α) and φ_k ~ Dirichlet(β).
The working process of the traditional LDA model is shown in Fig. 1: first, the distribution θ_d of a document d is drawn from a Dirichlet prior with parameter α, and the distribution φ_k of each topic k is drawn from a Dirichlet prior with parameter β; then a topic z_i is drawn from θ_d, and a word x_i is drawn from the topic-word distribution φ_{z_i}. This process is repeated until all documents are obtained. The key problem affecting the practicality of this approach is that an online LDA algorithm needs a fixed vocabulary set in advance before it runs.
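The generative process of Fig. 1 can be illustrated numerically as below. This is a minimal sketch under assumed toy values of K, W, α, and β, not the patent's code.

```python
import numpy as np

rng = np.random.default_rng(0)
K, W, alpha, beta = 3, 6, 0.5, 0.1            # topics, vocabulary size, Dirichlet priors

phi = rng.dirichlet([beta] * W, size=K)       # phi_k ~ Dirichlet(beta), one row per topic
theta_d = rng.dirichlet([alpha] * K)          # theta_d ~ Dirichlet(alpha) for one document d

doc = []
for _ in range(10):                           # generate a 10-word document
    z_i = rng.choice(K, p=theta_d)            # draw topic z_i from theta_d
    x_i = rng.choice(W, p=phi[z_i])           # draw word x_i from phi_{z_i}
    doc.append(int(x_i))
```

Note that the sampler can only emit word indices 0..W−1: the fixed vocabulary size W is baked into φ, which is exactly the limitation the invention removes.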
By analysis, the topic-word distribution of the traditional LDA model above obeys a Dirichlet distribution, and a Dirichlet distribution requires a fixed number of atoms. This forces the vocabulary size to be fixed before the LDA algorithm starts executing, so the algorithm must use a preset fixed vocabulary and cannot handle words in the data stream that are absent from the vocabulary, which causes information loss; and if a very large vocabulary is used to cover all the words that might appear in the data stream, memory becomes overburdened.
The present invention proposes an LDA model whose vocabulary size is not fixed. By making the topic-word distribution obey a Dirichlet process, whose number of atoms is not fixed, rather than a fixed-atom Dirichlet distribution, the new model can add a new word to the vocabulary when it encounters one that has not yet appeared there and continue executing the algorithm. By continually encountering and adding new words, the vocabulary of the LDA model fits the corpus to be processed more closely, the precision of the model improves, and the ability of online LDA algorithms to process data streams is enhanced.
Referring to Fig. 2, Fig. 2 is a flowchart of a data stream topic feature extraction method provided in an embodiment of the present invention; the method is based on an online LDA algorithm and mainly comprises the following steps:
Step s210: organize the received data stream into several batch corpora in order of arrival time, and determine the batch corpus currently to be processed.
The input is a data stream, and the data stream is organized into corpora in order of arrival time. When the amount of corpus arriving in the data stream reaches the amount of one small batch, it is determined as the batch corpus currently to be processed. The amount of corpus per batch is not limited here. If the set amount is too small, the data stream will be divided into a large number of batches, and each batch must read the vocabulary for comparison, so the vocabulary will be read many times, affecting extraction efficiency; if the batch is too large, a long wait is needed before the next batch corpus can be processed, so the data stream is not handled in time. How much data is organized into one batch therefore needs to be set as appropriate, for example between several hundred and one or two thousand items. In addition, the amount can also be set according to the type of the data stream; for example, when processing a microblog data stream, it can be set to 140 words. The above two constraints are only examples; the specific value is not limited here and can be customized.
Step s220: scan and recognize the words contained in the batch corpus to be processed, obtaining the words to be processed.
The scanning process can refer to word-scanning methods in the prior art and is not described again here.
Step s230: compare the words to be processed with the words in the vocabulary, and judge whether the words to be processed include a new word that is not present in the vocabulary.
Step s240: add the new word to the vocabulary, obtaining an updated vocabulary.
The vocabulary stores in advance words with a high probability of occurrence and their corresponding codes. All the words appearing in the current batch are compared with the words in the vocabulary to judge whether the words to be processed in the current batch include a new word not present in the vocabulary; if there is a new word, it is added to the vocabulary. In this embodiment only new words are added, which avoids ignoring the influence of new words on the topic features of documents, while also avoiding the sharp drop in data processing efficiency caused by adding a large number of words to the vocabulary.
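Steps s230/s240 can be sketched as a simple dictionary update. This is an assumption of this edit, not the patent's code; the function name update_vocabulary and the word-to-code mapping are invented for illustration.

```python
def update_vocabulary(vocab, batch_words):
    """Add only words absent from the vocabulary; existing entries keep their codes."""
    new_words = []
    for w in batch_words:
        if w not in vocab:
            vocab[w] = len(vocab)      # assign the next free code to the new word
            new_words.append(w)
    return new_words

vocab = {"bread": 0, "aircraft": 1}
added = update_vocabulary(vocab, ["bread", "tank", "aircraft", "instant noodles"])
```

Only the two unseen words are appended; the codes of existing words are untouched, so probability tables indexed by those codes remain valid across batches.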
By making the topic-word distribution obey a Dirichlet process, whose number of atoms is not fixed, rather than a fixed-atom Dirichlet distribution, the new model can add a new word to the vocabulary upon encountering one that has not yet appeared there and continue executing the algorithm. By continually encountering and adding new words, the vocabulary of the LDA model fits the corpus to be processed more closely, the precision of the model improves, and the ability of online LDA algorithms to process data streams is enhanced.
The Dirichlet process (DP) is a stochastic process whose sample paths are probability measures and whose marginal distributions on finite dimensions are Dirichlet distributions. It is widely used as the prior of the Dirichlet process mixture model (DPMM, also called the infinite mixture model). The present invention exploits its infinite nature: the fixed-atom Dirichlet distribution is adjusted to a Dirichlet process that can add new atoms, which handles the addition of new words.
The process of data processing based on the Dirichlet process mainly includes determining the initial word probabilities by the stick-breaking construction, and processing the new vocabulary and the probability values with the new LDA model, i.e., the following steps s250 and s260.
Step s250: assign each topic probability to the words to be processed according to the stick-breaking construction, obtaining initial topic probabilities.
In this embodiment, initial topic probabilities are assigned to the words to be processed according to the stick-breaking construction, for both new words and words already existing in the vocabulary. Since it is not limited to words in the vocabulary, the stick-breaking construction supports the addition of new words.
The specific formula used by the stick-breaking construction to determine the topic probabilities is not limited in this embodiment; it can be determined from the theoretical stick-breaking distribution of the Dirichlet process.
The Dirichlet process underlying the stick-breaking construction is a two-parameter distribution, expressed as DP(α, G_0), where α is the concentration parameter of the Beta distribution and G_0 is the base distribution of the Dirichlet process. If π (a random measure) obeys DP(α, G_0), then π can be constructed by stick-breaking as follows:
V_i ~ Beta(1, α),  θ_i ~ G_0,  π_i = V_i · ∏_{j=1}^{i−1} (1 − V_j)
In φ_k, the φ_w(k) corresponding to each word of topic k depends only on the weight coefficient V_w(k) generated by stick-breaking, with p(V_w(k) | α_β) = Beta(1, α_β).
Then, specifically, in topic k, the probability φ_w(k) corresponding to word w can be written as Formula 1:
φ_w(k) = V_w(k) · ∏_{j=1}^{LOC(w,k)−1} (1 − V_{WORD(j,k)}(k))   (1)
wherein LOC(w, k) is the function locating the position of word w in the word distribution of topic k, and WORD(j, k) is the word at coordinate j in the word distribution of topic k. By setting the LOC and WORD functions, the positions of the words in each topic's word distribution are fixed.
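The stick-breaking weights of Formula 1 can be illustrated numerically. This is a sketch under the assumption that words are stored in LOC/WORD coordinate order, so that coordinate j breaks fraction v[j] off the stick left over by coordinates 1..j−1; the function names are invented for illustration.

```python
import numpy as np

def stick_breaking_probs(v):
    """phi_j = v_j * prod_{i<j} (1 - v_i): each word takes a slice of the remaining stick."""
    v = np.asarray(v, dtype=float)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))  # prod_{i<j} (1 - v_i)
    return v * remaining

def append_new_word(v, alpha_beta, rng):
    """A newly encountered word draws a Beta(1, alpha_beta) fraction of the leftover stick."""
    return np.append(v, rng.beta(1.0, alpha_beta))

v = np.array([0.5, 0.5, 0.5])
probs = stick_breaking_probs(v)       # [0.5, 0.25, 0.125]; mass remains for unseen words
v2 = append_new_word(v, 1.0, np.random.default_rng(0))
```

Because the weights never exhaust the stick, probability mass is always left over for atoms (words) not yet observed, which is what lets the model grow the vocabulary online.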
Determining the initial topic probability distribution of a word by Formula 1 above allows each topic probability value to be computed simply and conveniently; Formula 1 is used for the stick-breaking construction to ensure that the initial values are closer to the actual values. This embodiment only introduces a unified determination of initial values for new words and existing words in the vocabulary; other ways of determining initial topic probabilities based on the stick-breaking construction can refer to this embodiment.
Step s260: run the new LDA model to process the initial topic probabilities according to the new vocabulary, obtaining the topic probability of each word to be processed.
The new LDA model is an LDA model that obeys a Dirichlet process under the belief propagation framework. Belief propagation is a message-passing algorithm for inference on graphical models; executing it in the present invention makes the likelihood function of the new LDA model converge on the word-topic probabilities. The concrete implementation of the belief propagation framework can refer to related introductions; the specific propagation algorithm used is not limited in this embodiment. Based on the new LDA model, the topic features of both new words and old words can be extracted.
The specific formula of the new LDA model is not limited in this embodiment and can be set according to the application scenario, the required processing precision, and so on.
This embodiment is introduced with a high-precision, high-efficiency LDA model algorithm as an example; other model algorithms can refer to this embodiment.
Currently, the posterior probability of the LDA model is computed from the complete-data likelihood, Formula 3:
p(w, z, θ, φ | α, β) = ∏_{d=1}^{D} p(θ_d | α) · ∏_{k=1}^{K} p(φ_k | β) · ∏_{d,w} p(z_{w,d} | θ_d) p(w | z_{w,d}, φ)   (3)
wherein α and β are the parameters of the Dirichlet distributions; 1 ≤ d ≤ D is the index of a text in the corpus; 1 ≤ w ≤ W is the index of a word in the vocabulary; 1 ≤ k ≤ K is the topic index; z is the topic of a word; θ is the document-topic distribution; φ is the topic-word distribution; and θ_d is the topic distribution of text d.
The posterior probability of the LDA model cannot be computed directly, so several approximate inference and parameter estimation methods are used to obtain the LDA model. Existing LDA inference algorithms include Gibbs sampling (GS), variational inference (VB), and belief propagation (BP). Different inference algorithms differ considerably in time, space, and precision. In terms of time, when the set number of topics K is small, GS and BP consume less time than VB; when K is large, GS consumes less time. In terms of memory consumption, GS requires 1/K of the memory of VB and BP. In terms of precision, BP has an absolute advantage. This embodiment uses the BP inference algorithm, which has higher precision and better overall performance, to process the data stream.
The belief propagation algorithm removes θ_d and φ(k) from the likelihood function (3) of the complete LDA model by integration, obtaining the likelihood function of the collapsed LDA model:
p(w, z | α, β) = ∏_{d=1}^{D} [ Γ(Kα) / Γ(Σ_k n_d(k) + Kα) · ∏_k Γ(n_d(k) + α) / Γ(α) ] · ∏_{k=1}^{K} [ Γ(Wβ) / Γ(Σ_w n_w(k) + Wβ) · ∏_w Γ(n_w(k) + β) / Γ(β) ]
wherein x_{w,d} is the word frequency of word w in document d; n_{w,d}(k) is the number of occurrences of word w in document d that belong to topic k, with Σ_k n_{w,d}(k) = x_{w,d}; n_d(k) = Σ_w n_{w,d}(k); and n_w(k) = Σ_d n_{w,d}(k).
Since the final purpose of the LDA model is to obtain the probability of the topic label z of each word in every document, in order to maximize the probability of z the belief propagation algorithm computes the posterior probability μ_{w,d}(k), Formula 4:
μ_{w,d}(k) ∝ (μ_{−w,d}(k) + α) · (μ̂_{w,−d}(k) + β) / (μ̂_{−(w,d)}(k) + Wβ)   (4)
wherein μ_{−w,d}(k) is the count of words other than word w in text d that belong to topic k; μ̂_{w,−d}(k) is the probability that word w belongs to topic k in all texts other than text d; μ̂_{−(w,d)}(k) is the probability that all words other than word w in text d belong to topic k; and μ_{w,d}(k) is the probability that word w in text d belongs to topic k.
The topic distribution θ_d of document d and the topic-word distribution φ_w of topic k are then:
θ_d(k) = (μ_{·,d}(k) + α) / Σ_{k′} (μ_{·,d}(k′) + α),  φ_w(k) = (μ̂_w(k) + β) / Σ_{w′} (μ̂_{w′}(k) + β)
wherein μ_{·,d}(k) = Σ_w x_{w,d} μ_{w,d}(k) and μ̂_w(k) = Σ_d x_{w,d} μ_{w,d}(k).
The belief propagation algorithm iterates the computation of the topic distributions θ_d over the corpus until θ_d(k) and φ_w(k) converge to stable values. After the belief propagation algorithm obtains the topic-word matrix φ, the topic distribution of each document and the word distribution within each topic are available, realizing the extraction of topic features.
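The iteration described above can be sketched as a compact synchronous BP loop for the standard fixed-vocabulary update (4). This is an illustrative sketch, not the patent's implementation: dense numpy arrays stand in for the message tables, and the function name belief_propagation is invented.

```python
import numpy as np

def belief_propagation(X, K, alpha, beta, iters=50, seed=0):
    """Synchronous BP for LDA on a D x W word-frequency matrix X.
    mu[d, w, k] approximates the posterior that word w in doc d belongs to topic k."""
    D, W = X.shape
    rng = np.random.default_rng(seed)
    mu = rng.random((D, W, K))
    mu /= mu.sum(axis=2, keepdims=True)
    for _ in range(iters):
        weighted = X[:, :, None] * mu                  # x_{w,d} * mu_{w,d}(k)
        doc_k = weighted.sum(axis=1, keepdims=True)    # topic mass per document
        word_k = weighted.sum(axis=0, keepdims=True)   # topic mass per word
        all_k = weighted.sum(axis=(0, 1))              # total topic mass
        # Formula (4): exclude the current (w, d) contribution from each count
        msg = (doc_k - weighted + alpha) * (word_k - weighted + beta) \
              / (all_k - weighted + W * beta)
        mu = msg / msg.sum(axis=2, keepdims=True)
    theta = (X[:, :, None] * mu).sum(axis=1) + alpha   # theta_d(k) before normalization
    theta /= theta.sum(axis=1, keepdims=True)
    return mu, theta

X = np.array([[3, 0, 1, 0],
              [0, 2, 0, 2]])                           # two toy documents, four words
mu, theta = belief_propagation(X, K=2, alpha=0.5, beta=0.1)
```

The loop runs for a fixed number of sweeps for simplicity; a production version would instead stop when θ_d(k) and φ_w(k) change by less than a tolerance between sweeps.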
In the likelihood function of the collapsed LDA model, the likelihood of LDA can be decomposed into the product of p(z | α) and p(w | z, α_β). By changing the distribution obeyed by the topic-word distribution to the stick-breaking prior V ~ Beta(1, α_β), p(w | z, α_β) becomes:
p(w | z, α_β) = ∏_{k=1}^{K} ∏_{j} α_β · Γ(1 + n_j(k)) Γ(α_β + n_{>j}(k)) / Γ(1 + α_β + n_j(k) + n_{>j}(k))
wherein Γ(·) is the gamma function; n_j(k) is the count of the word at coordinate j in the word distribution of topic k; and n_{>j}(k) is the total count, in topic k, of the words at coordinates after j.
In order to maximize the likelihood function of z, the topic probability matrix is computed by Formula 2.
Specifically, Formula 2 is:
μ_{w,d}(k) ∝ (μ_{−w,d}(k) + α) · (μ̂_{w,−d}(k) + 1) / (μ̂_{−(w,d)}(k) + 1 + α_β)   (2)
wherein μ_{w,d}(k) is the probability that word w in text d belongs to topic k; μ̂_{w,−d}(k) is, in the word distribution of topic k, the probability of word w in all texts other than document d; μ_{−w,d}(k) is the count of words other than word w in text d that belong to topic k; μ̂_{−(w,d)}(k) is, in the word distribution of topic k, the probability of all words other than word w in text d. Preferably, the construction of the initial topic probabilities of words can also be carried out according to Formula 2.
The topic distribution θ_d of document d follows an existing topic-distribution formula, for example:
θ_d(k) = (μ_{·,d}(k) + α) / Σ_{k′} (μ_{·,d}(k′) + α)
The word distribution φ_k of topic k is then computed by the stick-breaking formula:
φ_{WORD(j,k)}(k) = V_j(k) · ∏_{i=1}^{j−1} (1 − V_i(k))
wherein φ_{WORD(j,k)}(k) is, in the word distribution of topic k, the probability of the word at coordinate j.
Formula 2 is iterated over the corpus until θ_d(k) and φ_w(k) converge to stable values, yielding the topic-word matrix and thereby an LDA topic model algorithm based on the Dirichlet process.
Based on the above technical solution, the data stream topic feature extraction method provided by the present invention changes the distribution obeyed by the topic-word distribution of the LDA topic model from a Dirichlet distribution to a Dirichlet process, and obtains a new LDA topic model under the belief propagation framework whose topic-word distribution is the stick-breaking construction of a Dirichlet process. By using an LDA model whose vocabulary size is not fixed, with a topic-word distribution obeying a Dirichlet process (whose number of atoms is not fixed) rather than a fixed-atom Dirichlet distribution, the new model can, upon encountering a word not yet in the vocabulary, add it to the vocabulary and continue executing the algorithm. By continually encountering and adding new words, the method makes full use of the information without increasing memory processing pressure, fits the vocabulary of the LDA model more closely to the corpus to be processed, improves the precision of the model, and enhances the ability of online LDA algorithms to process data streams.
Embodiment two:
Specifically, to deepen the understanding to overall plan in embodiment, overall data process circulation process is carried out at this It is whole to introduce.Fig. 3 show a kind of overall procedure schematic diagram.
Input is data flow, and data flow is sequentially organized into corpus by arrival time.
When the corpus quantity reached in data flow meets the amount of a small batch data, by batch data DsIn not in word The new term that remittance table occurs is added in vocabulary, based on new vocabulary execute LDA algorithm (including carry out folding bar construction and The optimization of theme probability is carried out by new LDA model), obtain subject word matrix φk
The topic-word matrix φk serves as a global variable: after each mini-batch of corpus is processed, the algorithm updates and saves φk for processing the next mini-batch. The final output is the global topic-word matrix φk, from which the topic distribution of every document in the data stream can be analyzed — that is, the theme features of the data stream are extracted.
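The mini-batch pipeline described above can be sketched as follows. This is an illustrative outline, not the patented implementation: the stick-breaking step uses a simple beta draw per topic, and `update_fn` stands in for running the new LDA model on a batch.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_word_stick_breaking(phi, remaining, alpha=1.0):
    """Assign a brand-new word a probability under every topic by
    breaking off a piece of each topic's remaining stick mass
    (the Dirichlet-process view: the number of atoms is not fixed)."""
    betas = rng.beta(1.0, alpha, size=remaining.shape)
    piece = betas * remaining            # new word's probability per topic
    remaining = remaining * (1.0 - betas)
    return np.vstack([phi, piece]), remaining

def process_stream(stream, batch_size, n_topics, update_fn):
    vocab = {}                           # word -> row index in phi
    phi = np.zeros((0, n_topics))        # global topic-word matrix (words x topics)
    remaining = np.ones(n_topics)        # unbroken stick mass per topic
    batch = []
    for doc in stream:                   # documents ordered by arrival time
        batch.append(doc)
        if len(batch) < batch_size:
            continue
        for words in batch:              # grow the vocabulary with unseen words
            for w in words:
                if w not in vocab:
                    vocab[w] = len(vocab)
                    phi, remaining = add_word_stick_breaking(phi, remaining)
        phi = update_fn(batch, vocab, phi)   # run the LDA update on this batch
        batch = []
    return vocab, phi, remaining
```

Note that after every stick-breaking step, each topic's word probabilities plus its remaining stick mass still sum to one, which is what allows the vocabulary to grow indefinitely without renormalizing the whole matrix.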
Extracting data stream theme features in this way can greatly improve the efficiency and effectiveness of data stream processing.
The data processing flow in this embodiment is compared below with traditional online LDA algorithms, specifically: the online Gibbs sampling algorithm (Online Gibbs Sampling, OGS), the online variational inference algorithm (Online Variational Inference, OVB), and the online belief propagation algorithm (Online Belief Propagation, OBP).
Fig. 4 shows the comparison results. The ordinate is the evaluation criterion — perplexity (Perp), the most common index for testing LDA models, which accurately evaluates modeling performance — and the abscissa is the number of documents. Fig. 4 clearly shows that the LDA model whose topic-word matrix is based on the Dirichlet process outperforms OBP, OGS, and OVB.
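Perplexity, the ordinate of the comparison, is computed from the model's held-out log-likelihood. The patent does not give its evaluation code, so the following is a standard sketch of the metric, assuming `doc_topic` holds each document's topic proportions θ and `topic_word` each topic's word probabilities φ.

```python
import math

def perplexity(docs, doc_topic, topic_word):
    """Perp = exp(-(1/N) * sum over tokens of log p(w | d)),
    where p(w | d) = sum_k theta_d(k) * phi_k(w); lower is better."""
    log_lik, n_tokens = 0.0, 0
    for d, words in enumerate(docs):
        for w in words:
            p = sum(theta_k * phi[w]
                    for theta_k, phi in zip(doc_topic[d], topic_word))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)
```

For instance, a model that assigns every token probability 0.5 has perplexity exactly 2.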
Embodiment three:
The above embodiment was introduced using only a single batch of documents as an example. To deepen understanding of the data stream theme feature extraction method provided by the present invention, this embodiment uses two consecutive batches of documents as an example.
1. Determine the documents of the currently processed batch.
2. Scan this mini-batch of documents and obtain the words in the batch: instant noodles, bread, aircraft, tank.
After comparison with the vocabulary, the current batch contains no new words.
3. Assign a topic to each word.
The topic can be assigned randomly; for example, the word "instant noodles" is assigned to the topic "weapon" with probability 75% and to the topic "food" with probability 25%.
Instant noodles: weapon;
Bread: finance;
Aircraft: food;
Tank: weapon.
4. Assign each topic probability to each word according to the stick-breaking construction, obtaining the initial topic probabilities.
Instant noodles: weapon 10%, food 60%, finance 40%;
Bread: weapon 5%, food 80%, finance 70%;
Aircraft: weapon 60%, food 30%, finance 50%;
Tank: weapon 75%, food 23%, finance 21%.
5. Run the new LDA model: based on the current vocabulary and the initial topic probabilities, adjust the probability of each word in the vocabulary, obtaining each topic matrix.
Processing of this mini-batch is now finished, and its product has been obtained:
Food theme: 70% instant noodles, 20% bread, 5% aircraft, 5% tank;
Weapon theme: 60% aircraft, 20% tank, 10% bread, 10% instant noodles;
Financial theme: 35% bread, 20% aircraft, 15% instant noodles, 2% tank.
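The per-topic "product" lists above can be obtained by normalizing the per-word topic scores from step 4 within each topic and ranking the words. The sketch below illustrates this step; the percentages in the example above are illustrative and are not outputs of this code.

```python
def topic_products(word_topic_scores):
    """word_topic_scores: {word: {topic: score}} as produced in step 4.
    Normalizes each topic's scores into shares and ranks the words,
    yielding one ranked word list per topic."""
    totals = {}
    for scores in word_topic_scores.values():
        for topic, s in scores.items():
            totals[topic] = totals.get(topic, 0.0) + s
    products = {}
    for topic, total in totals.items():
        ranked = sorted(
            ((w, scores[topic] / total)
             for w, scores in word_topic_scores.items() if topic in scores),
            key=lambda pair: -pair[1])
        products[topic] = ranked
    return products
```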
6. Start processing the next mini-batch: it is scanned and compared against the vocabulary, and a new word is read: pudding.
Using the stick-breaking construction, pudding is assigned a new probability under each topic:
Food: 65% instant noodles, 15% bread, 7.5% aircraft, 7.5% tank, 5% pudding
Weapon: 62% aircraft, 18% tank, 7.5% bread, 7.5% instant noodles, 5% pudding
7. Run the new LDA model: based on the current new vocabulary, adjust the probability of each word in the vocabulary again, obtaining the product of the second mini-batch:
Food theme: 70% instant noodles, 68% pudding, 50% bread, 5% aircraft, 5% tank;
Weapon theme: 60% aircraft, 20% tank, 10% bread, 10% instant noodles, 2% pudding;
Financial theme: 35% bread, 20% aircraft, 16% pudding, 15% instant noodles, 2% tank.
And so on: data processing continues with the next mini-batch until all the data in the data stream have been processed, yielding the overall product of the data stream.
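The pudding step — making room for a new word under an existing topic — amounts to one stick-breaking step on that topic's word distribution. A minimal sketch follows; the `share` parameter is an assumption for illustration, and the percentages in the walkthrough above are illustrative rather than outputs of this exact rescaling.

```python
def add_new_word(topic_dist, word, share=0.05):
    """One stick-breaking step on a topic's word distribution:
    the new word takes `share` of the probability mass and every
    existing word is rescaled by (1 - share), so the distribution
    still sums to one."""
    rescaled = {w: p * (1.0 - share) for w, p in topic_dist.items()}
    rescaled[word] = share
    return rescaled
```

For example, adding "pudding" to the weapon topic of the first batch rescales "aircraft" from 0.60 to 0.57 while the distribution remains normalized.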
The data stream theme feature extraction method provided by the present invention can incorporate new words into processing in real time: whenever a word absent from the vocabulary is encountered, it is added to the vocabulary. By continually encountering and adding new words, the vocabulary in the LDA model fits the corpus to be processed more closely.
Embodiment four:
This embodiment takes a practical e-commerce application scenario as an example.
In e-commerce, extracting the themes in customers' negative reviews helps merchants improve their services and raise customer satisfaction. Because the volume of reviews is large and new reviews appear at all times, reviews are well suited to being processed in the form of a data stream. A topic model can be used to extract the themes and mine valuable information from them, making more efficient use of customer reviews.
The data stream formed by customers' negative reviews is processed with the online LDA algorithm and, separately, with the Dirichlet-process-based LDA algorithm proposed by the present invention; 10 themes are extracted from the data stream, and the 5 highest-frequency words of each theme are analyzed.
Theme 1 Theme 2 Theme 3 Theme 4 Theme 5 Theme 6 Theme 7 Theme 8 Theme 9 Theme 10
1 Bought Not good Expired Poor Tastes bad Bad review Tastes bad Not good Fresh Taste
2 Not good Expired Poor Bad review Poor Tastes bad Taste Taste Date Received
3 Poor Bad review Not good Expired Taste Received Received Tastes bad Rubbish Goods
4 Expired Quality Date Too Bad review Goods Goods Fake goods Expired Rubbish
5 Bad review A bit Too Tastes bad Tasty Bought Bought Bought Tasty Fresh
Table 1
Table 1 shows the theme features extracted from the data stream by the traditional online LDA algorithm. Analysis of the themes that online LDA extracts from the data stream shows that customer complaints mainly concern bad food taste and expired food.
Theme 1 Theme 2 Theme 3 Theme 4 Theme 5 Theme 6 Theme 7 Theme 8 Theme 9 Theme 10
1 Poor Tastes bad Date Expired Bad review Rubbish Not good Fake goods Not good Invoice
2 Too Bought Fresh Thing Bought Broken Bought Egg Express delivery Not received
3 Quality Taste Too small Must not Effect Eat Shelf life Bought Too slow Bought
4 Bought Thin Ugly Customer service Fake Customer service One bottle Cargo Slow Increasingly
5 Tasty Cheap Not good Taken in Clumping Date of manufacture Satisfied Brought Logistics Member
Table 2
Table 2 shows the theme features extracted from the data stream by the data stream theme feature extraction method provided by the present invention. Compared with online LDA, the themes extracted by the present invention contain many new words: besides the customer demands about food taste and production date found by online LDA, the algorithm of the present invention discovers more new information, such as customer service failing to solve customers' problems and logistics being too slow, achieving a large improvement in data processing effectiveness.
Embodiment five:
The data stream theme feature extraction device provided by the present invention is introduced below. Please refer to FIG. 5, which is a structural block diagram of a data stream theme feature extraction device provided by an embodiment of the present invention. The device may include:
The corpus determination unit 510 is mainly used to organize the received data stream into several batch corpora in order of arrival time and to determine the currently pending batch corpus;
The word identification unit 520 is mainly used to scan and recognize the words contained in the batch corpus to be processed, obtaining the words to be processed;
The comparing unit 530 is mainly used to compare the words to be processed with the words in the vocabulary and judge whether the words to be processed include new words that are absent from the vocabulary;
The vocabulary updating unit 540 is mainly used to add the new words to the vocabulary when the words to be processed include new words absent from the vocabulary, obtaining an updated vocabulary;
The stick-breaking construction unit 550 is mainly used to assign each topic probability to the words to be processed according to the stick-breaking construction, obtaining the initial topic probabilities;
The LDA processing unit 560 is mainly used to run the LDA model to process the initial topic probabilities according to the new vocabulary, obtaining the topic probability of each word to be processed; the LDA model is an LDA model whose topic-word distribution obeys a Dirichlet process under the belief propagation framework.
Preferably, the stick-breaking construction unit may specifically be used to:
assign each topic probability to the words to be processed according to formula 1, obtaining the initial topic probabilities;
wherein formula 1 is specifically:
where LOC(w, k) is the function locating the position of word w in the stick-breaking distribution of topic k, and WORD(j, k) is the word at coordinate j in the word distribution of topic k.
Preferably, the LDA processing unit may specifically be used to:
substitute the initial topic probabilities into formula 2 according to the new vocabulary and perform data processing, obtaining the topic probability of each word to be processed;
wherein formula 2 is specifically:
where μw,d(k) is the probability that word w in document d belongs to topic k; the second quantity is the probability, in the word distribution of topic k, of word w in all documents other than document d; the third is the count of words in document d, other than word w, that belong to topic k; and the fourth is the probability, in the word distribution of topic k, that all words other than word w in document d belong to topic k.
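The update that formula 2 describes can be sketched as a synchronous belief propagation sweep. Because the formula image itself is not reproduced in the text, this is a schematic reconstruction under common BP-for-LDA assumptions — the message is proportional to the product of a document-side term and a topic-side term, each excluding the current token as the definitions above indicate — and `alpha` and `beta` are hypothetical smoothing hyperparameters, not values from the patent.

```python
import numpy as np

def bp_update(mu, counts, alpha=0.1, beta=0.01):
    """One synchronous belief propagation sweep for LDA.
    mu[d, w, k]   current belief that word w in document d belongs to topic k
    counts[d, w]  occurrences of word w in document d
    Returns messages renormalized over topics."""
    n_docs, n_words, n_topics = mu.shape
    weighted = mu * counts[:, :, None]            # expected topic counts per (d, w)
    theta = weighted.sum(axis=1)                  # document-topic counts
    phi = weighted.sum(axis=0)                    # word-topic counts
    # exclude the current (d, w) token from each aggregate, as in formula 2
    doc_term = theta[:, None, :] - weighted + alpha
    word_term = phi[None, :, :] - weighted + beta
    topic_total = phi.sum(axis=0)[None, None, :] - weighted + n_words * beta
    new_mu = doc_term * word_term / topic_total
    return new_mu / new_mu.sum(axis=2, keepdims=True)
```

Iterating this sweep until the messages stop changing corresponds to the convergence of θd(k) and φw(k) described earlier.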
The data stream theme feature extraction device provided in this embodiment can enhance the ability of the online LDA algorithm to process data streams.
Embodiment six:
The data stream theme feature extraction apparatus provided by the present invention is introduced below; for the introduction of the apparatus, reference may be made to the data stream theme feature extraction method and device described above. Fig. 6 is a structural block diagram of a data stream theme feature extraction apparatus provided by an embodiment of the present invention; the apparatus may include:
a memory 500 for storing a computer program;
a processor 600 which, when executing the computer program, implements the steps of the data stream theme feature extraction method.
The data stream theme feature extraction apparatus provided by the present invention can achieve the purpose of improving the efficiency of data stream theme feature extraction.
Referring to FIG. 7, a structural schematic diagram of a data stream theme feature extraction apparatus provided by an embodiment of the present invention: the apparatus may differ considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 322 (for example, one or more processors) and memory 332, as well as one or more storage media 330 (such as one or more mass storage devices) storing application programs 342 or data 344. The memory 332 and the storage medium 330 may be transient or persistent storage. The programs stored in the storage medium 330 may include one or more modules (not shown), and each module may include a series of instruction operations on the apparatus. Further, the central processing unit 322 may be configured to communicate with the storage medium 330 and execute, on the data stream theme feature extraction apparatus 301, the series of instruction operations in the storage medium 330.
The data stream theme feature extraction apparatus 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps of the data stream theme feature extraction method described above can be implemented by the structure of the data stream theme feature extraction apparatus.
A readable storage medium provided by an embodiment of the present invention is introduced below; the readable storage medium described below and the data stream theme feature extraction method described above may be referred to in correspondence with each other.
The present invention discloses a readable storage medium on which a program is stored; when the program is executed by a processor, the steps of the data stream theme feature extraction method are implemented.
It should be noted that, for the working process of each unit of the data stream theme feature extraction device in the specific embodiments of the present application, please refer to the corresponding specific embodiments of the data stream theme feature extraction method, which will not be repeated here.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, storage media, and units described above may refer to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed devices, apparatuses, storage media, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division of units is only a logical functional division, and there may be other division manners in actual implementation — multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present application — in essence, the part that contributes to the prior art, or all or part of the technical solution — may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a mobile terminal (which may be a mobile phone, a tablet computer, etc.) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another. Since the devices disclosed in the embodiments correspond to the methods disclosed in the embodiments, their description is relatively simple; for relevant details, refer to the description of the method part.
Those skilled in the art will further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium well known in the art.
The data stream theme feature extraction method, device, apparatus, and readable storage medium provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can be made to the present invention without departing from the principles of the present invention, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims (8)

1. A data stream theme feature extraction method based on an online LDA algorithm, characterized by comprising:
organizing a received data stream into several batch corpora in order of arrival time, and determining a currently pending batch corpus;
scanning and recognizing the words contained in the batch corpus to be processed, obtaining words to be processed;
comparing the words to be processed with the words in a vocabulary, and judging whether the words to be processed include new words that are absent from the vocabulary;
if so, adding the new words to the vocabulary, obtaining an updated vocabulary;
assigning each topic probability to the words to be processed according to a stick-breaking construction, obtaining initial topic probabilities;
running a new LDA model to process the initial topic probabilities according to the new vocabulary, obtaining a topic probability of each word to be processed; wherein the new LDA model is an LDA model whose topic-word distribution obeys a Dirichlet process under a belief propagation framework.
2. The data stream theme feature extraction method according to claim 1, characterized in that assigning each topic probability to the words to be processed according to the stick-breaking construction to obtain the initial topic probabilities comprises:
assigning each topic probability to the words to be processed according to formula 1, obtaining the initial topic probabilities;
wherein formula 1 is specifically:
where LOC(w, k) is the function locating the position of word w in the stick-breaking distribution of topic k, and WORD(j, k) is the word at coordinate j in the word distribution of topic k.
3. The data stream theme feature extraction method according to claim 1, characterized in that running the new LDA model to process the initial topic probabilities according to the new vocabulary to obtain the topic probability of each word to be processed comprises:
substituting the initial topic probabilities into formula 2 according to the new vocabulary and performing data processing, obtaining the topic probability of each word to be processed;
wherein formula 2 is specifically:
where μw,d(k) is the probability that word w in document d belongs to topic k; the second quantity is the probability, in the word distribution of topic k, of word w in all documents other than document d; the third is the count of words in document d, other than word w, that belong to topic k; and the fourth is the probability, in the word distribution of topic k, that all words other than word w in document d belong to topic k.
4. A data stream theme feature extraction device based on an online LDA algorithm, characterized by comprising:
a corpus determination unit, configured to organize a received data stream into several batch corpora in order of arrival time and to determine a currently pending batch corpus;
a word identification unit, configured to scan and recognize the words contained in the batch corpus to be processed, obtaining words to be processed;
a comparing unit, configured to compare the words to be processed with the words in a vocabulary and judge whether the words to be processed include new words that are absent from the vocabulary;
a vocabulary updating unit, configured to add the new words to the vocabulary when the words to be processed include new words absent from the vocabulary, obtaining an updated vocabulary;
a stick-breaking construction unit, configured to assign each topic probability to the words to be processed according to a stick-breaking construction, obtaining initial topic probabilities;
an LDA processing unit, configured to run an LDA model to process the initial topic probabilities according to the new vocabulary, obtaining a topic probability of each word to be processed; wherein the LDA model is an LDA model whose topic-word distribution obeys a Dirichlet process under a belief propagation framework.
5. The data stream theme feature extraction device according to claim 4, characterized in that the stick-breaking construction unit is specifically configured to:
assign each topic probability to the words to be processed according to formula 1, obtaining the initial topic probabilities;
wherein formula 1 is specifically:
where LOC(w, k) is the function locating the position of word w in the stick-breaking distribution of topic k, and WORD(j, k) is the word at coordinate j in the word distribution of topic k.
6. The data stream theme feature extraction device according to claim 4, characterized in that the LDA processing unit is specifically configured to:
substitute the initial topic probabilities into formula 2 according to the new vocabulary and perform data processing, obtaining the topic probability of each word to be processed;
wherein formula 2 is specifically:
where μw,d(k) is the probability that word w in document d belongs to topic k; the second quantity is the probability, in the word distribution of topic k, of word w in all documents other than document d; the third is the count of words in document d, other than word w, that belong to topic k; and the fourth is the probability, in the word distribution of topic k, that all words other than word w in document d belong to topic k.
7. A data stream theme feature extraction apparatus, characterized by comprising:
a memory for storing a computer program;
a processor which, when executing the computer program, implements the steps of the data stream theme feature extraction method according to any one of claims 1 to 3.
8. A readable storage medium, characterized in that a program is stored on the readable storage medium, and when the program is executed by a processor, the steps of the data stream theme feature extraction method according to any one of claims 1 to 3 are implemented.
CN201811641140.1A 2018-12-29 2018-12-29 Data stream theme feature extraction method, device, equipment and storage medium Active CN109726222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811641140.1A CN109726222B (en) 2018-12-29 2018-12-29 Data stream theme feature extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811641140.1A CN109726222B (en) 2018-12-29 2018-12-29 Data stream theme feature extraction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109726222A true CN109726222A (en) 2019-05-07
CN109726222B CN109726222B (en) 2023-06-13

Family

ID=66299302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811641140.1A Active CN109726222B (en) 2018-12-29 2018-12-29 Data stream theme feature extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109726222B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470739A (en) * 2021-07-03 2021-10-01 中国科学院新疆理化技术研究所 Protein interaction prediction method and system based on mixed membership degree random block model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120095952A1 (en) * 2010-10-19 2012-04-19 Xerox Corporation Collapsed gibbs sampler for sparse topic models and discrete matrix factorization
CN103870447A (en) * 2014-03-11 2014-06-18 北京优捷信达信息科技有限公司 Keyword extracting method based on implied Dirichlet model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120095952A1 (en) * 2010-10-19 2012-04-19 Xerox Corporation Collapsed gibbs sampler for sparse topic models and discrete matrix factorization
CN103870447A (en) * 2014-03-11 2014-06-18 北京优捷信达信息科技有限公司 Keyword extracting method based on implied Dirichlet model

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470739A (en) * 2021-07-03 2021-10-01 中国科学院新疆理化技术研究所 Protein interaction prediction method and system based on mixed membership degree random block model

Also Published As

Publication number Publication date
CN109726222B (en) 2023-06-13

Similar Documents

Publication Publication Date Title
Chen et al. Selection and estimation for mixed graphical models
CN109615452B (en) Product recommendation method based on matrix decomposition
CN109948036B (en) Method and device for calculating weight of participle term
CN113468227B (en) Information recommendation method, system, equipment and storage medium based on graph neural network
Giraud et al. Graph selection with GGMselect
CN108427756B (en) Personalized query word completion recommendation method and device based on same-class user model
CN110442733A (en) A kind of subject generating method, device and equipment and medium
CN112231584A (en) Data pushing method and device based on small sample transfer learning and computer equipment
CN109885674B (en) Method and device for determining and recommending information of subject label
Fienberg Introduction to papers on the modeling and analysis of network data
CN108509793A (en) A kind of user's anomaly detection method and device based on User action log data
CN110750629A (en) Robot dialogue generation method and device, readable storage medium and robot
CN109960791A (en) Judge the method and storage medium, terminal of text emotion
CN113657421A (en) Convolutional neural network compression method and device and image classification method and device
CN111309718B (en) Distribution network voltage data missing filling method and device
CN109636212A (en) The prediction technique of operation actual run time
CN109726222A (en) A kind of data flow theme feature extracting method, device, equipment and storage medium
Kaya et al. Analytical comparison of clustering techniques for the recognition of communication patterns
CN110019662B (en) Label reconstruction method and device
Elidan Bagged structure learning of bayesian network
CN112559877A (en) CTR (China railway) estimation method and system based on cross-platform heterogeneous data and behavior context
CN117236999A (en) Activity determination method and device, electronic equipment and storage medium
CN113254788B (en) Big data based recommendation method and system and readable storage medium
CN113051126B (en) Portrait construction method, apparatus, device and storage medium
Eastoe et al. Nonparametric estimation of the spectral measure, and associated dependence measures, for multivariate extreme values using a limiting conditional representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant