CN106997382A - Innovative creative tag automatic labeling method and system based on big data (Google Patents)

Info

Publication number
CN106997382A (application CN201710173029.3A; granted publication CN106997382B)
Authority
CN (China)
Prior art keywords
word, theme, words, text, data file
Prior art date
2017-03-22
Legal status
Granted
Application number
CN201710173029.3A
Other languages
Chinese (zh)
Other versions
CN106997382B (en)
Inventor
Lu Xudong (鹿旭东)
Zhang Panlong (张盘龙)
Chen Zhiyong (陈志勇)
Guo Wei (郭伟)
Cui Lizhen (崔立真)
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
2017-03-22
Filing date
2017-03-22
Publication date
2017-08-01
Application filed by Shandong University on 2017-03-22; priority to CN201710173029.3A.
Publication of CN106997382A: 2017-08-01. Application granted; publication of CN106997382B: 2020-12-01.
Legal status: Active.

Classifications

    • G06F16/951 Indexing; Web crawling techniques (G Physics; G06 Computing; Calculating or counting; G06F Electric digital data processing; G06F16/00 Information retrieval; G06F16/90 Details of database functions; G06F16/95 Retrieval from the web)
    • G06F16/355 Class or cluster creation or modification (G06F16/30 Information retrieval of unstructured textual data; G06F16/35 Clustering; Classification)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a big-data-based innovative creative tag automatic labeling method and system. The method includes: training Word2vector and LDA on the Sogou corpus to obtain the training result sets; segmenting the document data of the pages the user browses and filtering out stop words and useless words; computing, for the pre-processed document data, the labels that come from the document body by combining an improved TextRank algorithm with Word2vector; computing labels for the document data's theme with LDA; and realizing visualization by generating a tag cloud, with all the text-label words marked in the document data, making it convenient for the user to read and to find the key content.

Description

Innovative creative tag automatic labeling method and system based on big data
Technical field
The present invention relates to a big-data-based innovative creative tag automatic labeling method and system.
Background technology
With the rapid development and popularization of the Internet, information has grown explosively, and a huge amount of information has accumulated online. Internet users are no longer mere viewers of Internet content; they also create information of many kinds, which diversifies the forms that online information takes and makes information screening very difficult. Text-based information accounts for a very large share of the information on the Internet. The growth of information volume and the disorder of its structure give people more to consult when searching for information; the coverage of information is ever more complete, touching every aspect of life and bringing great convenience. However, the sheer quantity of information easily leaves people stuck at the stage of not knowing what to choose, and quickly selecting the effective items from massive information is no easy task.
When an enterprise carries out innovation work, it takes big data as the basis of analysis and planning, and it must discriminate and examine the data worth analysing. How to make full use of big data to obtain, quickly and effectively, the data related to the enterprise's themes of interest, to mark the critical data, and to exclude the cluttered, useless information so that the enterprise can focus on the more valuable and important information, is a current difficulty of innovation. Text labeling arose in this context. Text labeling means indexing a text with several words or phrases that are specific and can reflect the text's subject; these words or phrases are usually called labels. By reading these labels a reader can quickly grasp the subject of the text and so decide whether the text is of interest.
Automatic text labeling is an emerging research subject that grew up with the development of the Internet. It derives from information extraction and text classification and combines research methods from directions such as information retrieval and collaborative filtering. The automatic text labeling techniques developed in recent years include socialized (user-based) tagging, multi-label classification annotation, and keyword-extraction labeling.
The above are the main methods of current text labeling. Among them, socialized user-based tagging suffers from the cold-start problem at the initial stage of a service, because there is no past data to provide a reference. Multi-label classification annotation methods are mostly based on supervised learning algorithms and require a large manually labeled data set as the training set; manual labeling is not only time-consuming and laborious but also highly subjective.
Summary of the invention
To remedy the deficiencies of the prior art, the invention provides a big-data-based innovative creative tag automatic labeling method and system. It labels text by keyword extraction, which belongs to the category of unsupervised learning, and therefore has the effect of requiring no manually labeled data set.
The big-data-based innovative creative tag automatic labeling method includes:
Step (1): model training:
the text deep-representation model Word2vector is trained on a corpus; after training, all the words in the corpus and a vector model file with the corresponding vector of every word are obtained, i.e. the trained Word2vector model;
the document topic generation model LDA is trained on the corpus to obtain an LDA result set and the trained LDA model; the LDA result set includes several themes, and each theme includes the words belonging to that theme and the probability with which each word belongs to it;
Step (2): the data file of the webpage the user is currently browsing is segmented with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and stop words are then removed, yielding the pre-processed data file;
Step (3): the text labels and theme labels are generated;
Step (4): the final text labels and theme labels are visualized.
The stop words of step (2) include words whose frequency of use exceeds a set threshold and words without practical meaning.
The words without practical meaning include mood particles, adverbs, prepositions and conjunctions.
The step of removing stop words includes: after word segmentation, the parts of speech are tagged; nouns, verbs and adjectives are retained, the words of the remaining parts of speech are filtered out, and words whose frequency of use exceeds the set threshold are also filtered out.
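By way of illustration, the preprocessing of step (2) can be sketched as follows. This is a minimal sketch, not the patent's own code: the open-source jieba segmenter stands in for ICTCLAS, and the sample stop-word list and frequency threshold are assumptions.

    from collections import Counter
    import jieba.posseg as pseg

    STOPWORDS = {"的", "了", "和", "就", "我"}    # sample stop words; a real list is far larger
    KEEP_POS = ("n", "v", "a")                    # retain nouns, verbs and adjectives only

    def preprocess(text, freq_threshold=0.05):
        # segment and part-of-speech tag, then drop stop words,
        # non-retained parts of speech and single-character words
        words = [w for w, flag in pseg.cut(text)
                 if w not in STOPWORDS and flag.startswith(KEEP_POS) and len(w) >= 2]
        # also drop words whose relative frequency exceeds the set threshold
        freq = Counter(words)
        n = len(words) or 1
        return [w for w in words if freq[w] / n <= freq_threshold]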
The step of step (3) is:
Step (31):Pretreated data file mark is marked herein using the TextRank algorithm of unsupervised learning Label, and using the Word2vector models trained, the correlation between word and word, profit are calculated based on vector model file Correlation between word and word is modified to this paper labels;The final this paper labels of generation;
Step (32):Subject analysis is carried out to pretreated data file using LDA result sets, theme label is generated.
Step (31) includes:
Step (311): the pre-processed data file is read and the information of every word in it is counted; the information of each word includes: the word frequency, the position of the word's first occurrence, the position of the word's last occurrence, and the total number of words;
Step (312): the word weights are computed: the values of the word-frequency factor, the word-location factor and the word-span factor are computed respectively;
the weight m(wi) of word wi is computed as:
m(wi) = tf(wi) * loc(wi) * span(wi); (1)
where tf(wi) is the word-frequency factor of word wi, loc(wi) is its location factor, and span(wi) is its span factor.
The word-frequency factor is computed as:
tf(wi) = fre(wi) / (1 + fre(wi)); (2)
where fre(wi) is the number of occurrences of word wi in the data file.
The word-location factor is computed as:
loc(wi) = (area(wi) - 1) / (area(wi) + 1); (3)
where area(wi) is the positional value of word wi.
Words play different roles depending on their position in the text: the words in the first 10% are the most important for expressing the text's subject, and the words in the first 10%-30% of the text come second. The text data is therefore divided into three regions: the first 10% is the first region, with positional value 50; the first 10%-30% is the second region, with positional value 30; the final region has positional value 20. A word that occurs in several regions takes the maximum value.
The word-span factor is computed as:
span(wi) = (last(wi) - first(wi) + 1) / sum; (4)
where first(wi) is the position at which word wi first appears in the text, last(wi) is the position at which it last appears, and sum is the total number of words in the text.
The word span reflects the word's coverage in the text: the larger the span, the more the word reflects global information. In label extraction, words with a large span can reflect the global theme of the text.
Step (313): the word spacing is computed, taking the sentence as the unit: if two words appear in the same sentence, their co-occurrence count is incremented by 1; the word spacing is the reciprocal of the co-occurrence count, and if the co-occurrence count of two words is 0, the distance of the two words is infinite;
Step (314): the word attraction is computed by substituting the word spacing of step (313) into the attraction quantification formula, which yields the quantified attraction of the two words; if the distance between two words is infinite, the attraction of the two words is 0 and their occurrences do not affect one another;
the attraction quantification formula is:
conn(wi, wj) = m(wi) * m(wj) / r(wi, wj)²; (5)
where m(wi) is the weight of word wi, m(wj) is the weight of word wj, conn(wi, wj) reflects the connection between two words of different weights, and r(wi, wj) is the spacing of words wi and wj;
Step (315): the relevance between words is computed; the cosine value expressing the degree of relevance is computed with the trained Word2vector model.
While training the text deep-representation model Word2vector on the corpus, after the corpus words and the corresponding vector of every word are obtained, all words are clustered with k-means by vector relevance, which yields clusters of highly relevant words. Relevance is determined by the cosine value of two words: the larger the cosine value, the greater the relevance.
Assuming words wi and wj are both n-dimensional vectors, the relevance cos(wi, wj) is computed as:
cos(wi, wj) = Σ_{k=1..n}(wik * wjk) / (sqrt(Σ_{k=1..n} wik²) * sqrt(Σ_{k=1..n} wjk²)); (6)
The improved word relation Conn(wi, wj) is then:
Conn(wi, wj) = conn(wi, wj) * (1 + cos(wi, wj)); (7)
and the improved TextRank formula is:
TextRank(wi) = (1 - d) + d * Σ_{wj ∈ L(wi)} [Conn(wi, wj) / Σ_{wk ∈ L(wj)} Conn(wk, wj)] * TextRank(wj); (8)
where TextRank(wi) and TextRank(wj) denote the importance of words wi and wj, L(wi) is the set of words directly connected to wi, and d is the damping factor.
Step (316): the TextRank value of each word is computed: the TextRank values are initialized to 1, the word-relation results are substituted into the improved TextRank formula, the iteration-termination threshold is set to 0.0001, and the improved TextRank formula is iterated until the result converges, yielding the TextRank value of each word;
Step (317): the words are sorted from high to low by their computed TextRank values;
Step (318): the first 20 words of the ranking are taken as the text labels.
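For concreteness, the whole of step (31) can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: words arrive pre-segmented and filtered as in step (2), sentences is the document as a list of tokenised sentences, w2v is a trained gensim Word2Vec model, and all names are illustrative.

    from collections import Counter
    from itertools import combinations

    def word_weights(words):
        # per-word frequency, first and last positions (steps (311)-(312))
        n = len(words)
        freq = Counter(words)
        first, last = {}, {}
        for i, w in enumerate(words):
            first.setdefault(w, i)
            last[w] = i
        weights = {}
        for w in freq:
            tf = freq[w] / (1 + freq[w])                   # word-frequency factor, formula (2)
            pos = first[w] / n                             # earliest occurrence -> best region
            area = 50 if pos < 0.10 else 30 if pos < 0.30 else 20
            loc = (area - 1) / (area + 1)                  # location factor, formula (3)
            span = (last[w] - first[w] + 1) / n            # span factor, formula (4)
            weights[w] = tf * loc * span                   # weight m(w), formula (1)
        return weights

    def improved_textrank(sentences, w2v, d=0.85, eps=1e-4, topn=20):
        words = [w for s in sentences for w in s]
        if not words:
            return []
        m = word_weights(words)
        cooc = Counter()                                   # sentence-level co-occurrence (313)
        for s in sentences:
            for wi, wj in combinations(set(s), 2):
                cooc[frozenset((wi, wj))] += 1
        conn = {}
        for pair, c in cooc.items():
            wi, wj = tuple(pair)
            r = 1.0 / c                                    # word spacing = 1 / co-occurrence count
            attract = m[wi] * m[wj] / r ** 2               # attraction, formula (5)
            try:
                cos = w2v.wv.similarity(wi, wj)            # cosine relevance, formula (6)
            except KeyError:                               # out-of-vocabulary word
                cos = 0.0
            conn[pair] = attract * (1 + cos)               # improved relation Conn, formula (7)
        neigh = {w: set() for w in m}
        for pair in conn:
            wi, wj = tuple(pair)
            neigh[wi].add(wj)
            neigh[wj].add(wi)
        tr = {w: 1.0 for w in m}                           # step (316): initialise values to 1
        while True:
            new = {}
            for wi in m:
                s = sum(conn[frozenset((wi, wj))]
                        / sum(conn[frozenset((wk, wj))] for wk in neigh[wj])
                        * tr[wj]
                        for wj in neigh[wi])
                new[wi] = (1 - d) + d * s                  # improved TextRank, formula (8)
            converged = max(abs(new[w] - tr[w]) for w in m) < eps
            tr = new
            if converged:                                  # termination threshold 0.0001 reached
                break
        ranked = sorted(tr, key=tr.get, reverse=True)      # step (317)
        return ranked[:topn]                               # step (318): top-20 text labels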
Step (32) includes:
Step (321): the pre-processed data file is read, the total number of words in the text is recorded, and the information of each word in the data is counted;
Step (322): the theme distribution probability of the data file is computed from the LDA result set.
The LDA result set includes several themes; each theme includes the words belonging to that theme and the probability with which each word belongs to it, and all words are sorted by probability from large to small. The pre-processed data file is treated as a sequence [w1, w2, w3, ..., wn], where wi denotes the i-th word and n is the total number of words. The expected number of the data file's words contained in each theme is denoted n̄_Ti. Assuming there are K themes, the probability distribution of the data file over the different themes is obtained by computing the probability p_Ti that the data file belongs to the i-th theme Ti:
p_Ti = n̄_Ti / n; (9)
where n̄_Ti is the expected number of words belonging to the i-th theme Ti; assuming that the probability of word wj belonging to theme Ti is p(wj, Ti), n̄_Ti is computed as:
n̄_Ti = Σ_{j=1..n} p(wj, Ti); (10)
Step (323): the theme of maximum probability is selected and the 5 highest-probability words it contains are taken, constituting the theme label of the text.
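A matching sketch of step (32), assuming the LDA result set is held as a trained gensim LdaModel lda with its Dictionary dct from step (1); the names and parameters are illustrative, not the patent's.

    def theme_label(words, lda, dct, topn=5):
        # expected number of the document's words in each theme:
        # n̄_Ti = Σ_j p(wj, Ti), formula (10)
        expected = [0.0] * lda.num_topics
        for w in words:
            if w not in dct.token2id:
                continue
            for t, p in lda.get_term_topics(dct.token2id[w], minimum_probability=0.0):
                expected[t] += p
        # probability that the file belongs to theme Ti: p_Ti = n̄_Ti / n, formula (9)
        n = len(words) or 1
        probs = [e / n for e in expected]
        best = max(range(lda.num_topics), key=probs.__getitem__)
        # step (323): the 5 highest-probability words of the winning theme
        return [w for w, _ in lda.show_topic(best, topn=topn)]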
Further, the invention also adopts a big-data-based innovative creative tag automatic labeling system, which can automatically add text labels and theme labels to the data file the user is browsing, making it convenient for the user to find the important information of the text and improving reading efficiency.
The big-data-based innovative creative tag automatic labeling system includes:
a model training unit:
the text deep-representation model Word2vector is trained on a corpus; after training, all the words in the corpus and a vector model file with the corresponding vector of every word are obtained, i.e. the trained Word2vector model;
the document topic generation model LDA is trained on the corpus to obtain an LDA result set and the trained LDA model; the LDA result set includes several themes, and each theme includes the words belonging to that theme and the probability with which each word belongs to it;
a data file processing unit: segments the data file of the webpage the user is currently browsing with the ICTCLAS word segmentation system of the Chinese Academy of Sciences and then removes stop words, yielding the pre-processed data file;
a label generation unit: generates the text labels and theme labels;
a visualization unit: visualizes the final text labels and theme labels.
Compared with the prior art, the beneficial effects of the invention are:
the keywords of the data file are obtained with the improved TextRank algorithm; compared with other algorithms, the results have higher accuracy and representativeness; the extracted labels come from the document itself, represent it well, and achieve the effect of accurately expressing the text content;
the theme labels of the text are generated with the LDA model, which overcomes the difficulty that the characteristic words of a text may not be contained in the text itself and better reflects the subject content of the text; combined with the text labels, labels that accurately express both the text content and the theme are realized.
Brief description of the drawings
The accompanying drawings, which form a part of this application, provide a further understanding of the application; the exemplary embodiments of the application and their descriptions explain the application and do not constitute an improper limitation of it.
Fig. 1 is the preprocessing flow chart of the invention;
Fig. 2 is the text-label generation flow chart of the invention;
Fig. 3 is the theme-label generation flow chart of the invention.
Embodiments
It is noted that the following detailed description is exemplary and is intended to provide a further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by a person of ordinary skill in the technical field to which the application belongs.
It should be noted that the terms used here merely describe the embodiments and are not intended to restrict the illustrative embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well; additionally, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
The invention combines an improved TextRank-based text labeling algorithm, Word2vector (a text analysis tool from Google) for computing word relevance, and LDA (a document topic generation model) for extracting document topics, to realize automatic text labeling. The original TextRank algorithm only considers the relations between words in its computation and ignores the characteristic attributes of the words themselves, so it fails to make full use of the text information when extracting keywords. The invention improves this relation: the word weight is first computed from information such as word frequency, word position and word span, and the attraction relation between words is then built from these weights and the word activation force model, replacing the original relation between words. With this improvement, on the one hand the word frequency, position and span of each individual word are fully used; on the other hand, for the relations between words, both the co-occurrence rate within sentences and the relevance between words are considered, the latter computed with Google's Word2vector. The topic of a document may not be contained among the words of the document, in which case it cannot be labeled with phrases from the document content; therefore the topic of the document is determined with LDA, which provides the label of that topic.
The technical scheme is: for the user's query results or browsed pages, automatically tag the data relevant to the intended purpose, remove useless information, and rank by relevance. Against the background of big data, the visualization of data is ever more important; this patent displays the labeling results in the form of a tag cloud and highlights the keywords. With the invention, automatic labeling of data sets is achieved in an unsupervised manner; the labels come from the data file, with little noise and good representativeness. While querying and browsing, the user can preferentially read the automatically marked key content and focus attention on the more important information.
The invention realizes the big-data-based innovative creative automatic labeling method through the following technical scheme, with the following concrete steps:
Step one: train LDA and Word2vector.
Step two: segment the pages the user browses and filter out useless words, as shown in Fig. 1;
Step three: generate labels with the TextRank algorithm combined with LDA, and mark them automatically, as shown in Fig. 2;
Step four: visualize the labels and key content, as shown in Fig. 3;
In step one, LDA and Word2vector are trained on the Sogou corpus.
1. Word2vector is a tool developed by Google. By converting words into vectors, it turns the processing of the training-set content into vector operations in a fixed-dimension vector space and uses the distances between the computed vectors to express the relevance between text words. The larger the training corpus, the better the word-vector representation; training on the Sogou corpus yields a model file containing all the words in the corpus and their corresponding vectors, with which the task of computing relevance between words can be realized.
2. Word2vec is an efficient tool, open-sourced by Google in 2013, that represents words as real-valued vectors. Drawing on ideas from deep learning, it reduces the processing of text content, by training, to vector operations in a K-dimensional vector space, where similarity in the vector space can express semantic similarity of text. The word vectors output by Word2vec can be used for many NLP tasks, such as clustering, finding synonyms, and part-of-speech analysis. Put another way, taking words as features, Word2vec maps those features to a K-dimensional vector space and seeks deeper feature representations for text data.
3. Word2vec uses the Distributed Representation of word vectors, first proposed by Hinton in 1986. Its basic idea is to map, by training, each word to a K-dimensional real vector (K is generally a hyperparameter of the model) and to judge the semantic similarity between words by the distance between their vectors (cosine similarity, Euclidean distance, etc.). It uses a three-layer neural network: input layer - hidden layer - output layer. A core technique is Huffman coding by word frequency, so that the hidden-layer content activated by words of similar frequency is basically consistent; the more frequent a word, the fewer hidden layers it activates, which effectively reduces the computational complexity. One reason for Word2vec's popularity is precisely its efficiency: Mikolov pointed out in his article that an optimized single-machine version can train over a hundred billion words in a day.
4. This three-layer neural network models the language model itself, but it simultaneously obtains a representation of words in vector space, and this side effect is Word2vec's real goal.
5. Compared with classical approaches such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA), Word2vec makes use of the context of words, so its semantic information is richer.
6. LDA (Latent Dirichlet Allocation) is a document topic generation model comprising a three-level structure of words, topics and documents. The generative model holds that each word of an article is obtained through a process of "choosing a topic with a certain probability, and choosing a word from that topic with a certain probability". Document-to-topic follows a multinomial distribution, and topic-to-word follows a multinomial distribution.
LDA is an unsupervised machine-learning technique that can identify the latent topic information in a corpus. It uses the bag-of-words method, treating each document as a word-frequency vector, thereby transforming text information into numerical information that is easy to model. Each document represents a probability distribution over topics, and each topic represents a probability distribution over many words. Training on the Sogou corpus yields several topics and, for each topic, the set of word probabilities; the LDA training result set can then be used to compute the probability distribution of document data over all topics.
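Step one can be sketched with the open-source gensim library standing in for the Word2vector and LDA training; the toy corpus, file names and parameters below are illustrative assumptions (gensim 4.x API), not values from the patent.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, Word2Vec

    # `docs` stands in for the segmented Sogou corpus: one list of words per document
    docs = [["创新", "标签", "标注"], ["文本", "主题", "标签"]]

    # Word2vector training: yields a vector for every word in the corpus
    w2v = Word2Vec(sentences=docs, vector_size=100, window=5, min_count=1, workers=4)
    w2v.save("sogou.w2v")                      # the trained vector model file

    # LDA training: yields themes, each a probability distribution over words
    dct = Dictionary(docs)
    bow = [dct.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=bow, id2word=dct, num_topics=2, passes=5)
    lda.save("sogou.lda")                      # the LDA result set / trained LDA model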
In step two, the ICTCLAS word segmentation system developed by the Institute of Computing Technology of the Chinese Academy of Sciences is used to segment the text data, after which stop words are removed and the parts of speech are filtered.
1. Current Chinese word segmentation algorithms fall into three broad classes: string-matching-based segmentation algorithms, understanding-based segmentation algorithms, and statistics-based segmentation algorithms. Although these algorithms are quite mature, the complexity of the Chinese language itself, the ambiguity of Chinese content and the constant emergence of new words mean that current segmentation systems all combine several segmentation algorithms. Tsinghua University, Peking University, Harbin Institute of Technology, Microsoft Research, the Chinese Academy of Sciences, Hylanda and others have all carried out Chinese word segmentation research; among them, the ICTCLAS segmentation system developed by the Institute of Computing Technology of the Chinese Academy of Sciences stands out the most.
Specifically, the ICTCLAS segmentation system uses a five-layer hidden Markov model. The main segmentation process includes preliminary segmentation, unknown-word recognition, re-segmentation and part-of-speech tagging; preliminary segmentation applies the shortest-path method to a rough segmentation of the Chinese text, and unknown-word recognition handles person names, place names and complex organization names, ensuring segmentation precision as far as possible. Open evaluations by domestic and international authorities show that the system segments quickly and accurately. The API used is as follows:
(1) Initialization: bool ICTCLAS_Init(const char* pszInitDir = NULL);
pszInitDir is the initialization path; returns true on successful initialization, false otherwise.
(2) Exit: bool ICTCLAS_Exit();
releases the memory occupied by the dictionary and clears buffers and other system resources.
(3) File processing: bool ICTCLAS_FileProcess(const char* sSrcFilename, eCodeType eCt, const char* sDsnFilename, int bPOStagged);
sSrcFilename is the path of the source file to analyse, eCodeType is the character encoding of the source file, sDsnFilename is the destination file after segmentation, and bPOStagged indicates whether part-of-speech tagging is required (0 = no, 1 = yes); returns true if the file is segmented successfully, false otherwise.
2. Stop words generally fall into two classes: one class comprises words whose use is so widespread as to be excessively frequent, such as "I" and "just"; the other comprises words that occur very frequently in the text but carry little practical meaning, mainly mood particles, adverbs, prepositions and conjunctions. Removing stop words means removing these two classes from the words that form the text network's nodes, reducing the complexity of the network. The part of speech of a label is usually a noun, verb or adjective, and the word length is generally at least two characters; therefore the segmented text must be part-of-speech tagged, and only words of these three parts of speech are retained.
3. The concrete flow, as shown in Fig. 1, is:
(1) segment the document data with the ICTCLAS segmentation system;
(2) remove the useless stop words from the segmentation result;
(3) part-of-speech tag the result, retain the nouns, verbs and adjectives that can serve as labels, and filter out the remaining words to exclude interference.
In step three, the text data is automatically labeled with the unsupervised TextRank algorithm, which is improved, and Word2vector is used in combination to compute the relevance between words. The text data is then topic-analysed with LDA, and the labels are generated comprehensively.
Specifically, the PageRank algorithm is the sole criterion Google uses to weigh how good a website is; it was proposed by Google founders Larry Page and Sergey Brin in 1998. The algorithm makes full use of the structure of hyperlinks between webpages to rank pages. Its basic idea is to interpret a link from one webpage to another as a vote by the former for the latter: the more times a page is linked, the more votes from other pages it possesses and the more important it is. At the same time, the weight of a voting page's vote depends on the importance of that page itself: if a page is itself important, the pages it links to are comparatively important as well. The PageRank algorithm can be applied to the extraction of keywords and sentences: words or sentences are regarded as webpages, the connections between words or sentences are regarded as page links, the importance of each word or sentence is computed with the algorithm, and the important words or sentences are extracted.
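The voting idea can be illustrated with a short sketch on a toy undirected word graph; the graph and the scores it yields are an invented example, not data from the patent.

    def pagerank(neighbors, d=0.85, iters=50):
        # every node starts with the same score; each iteration redistributes the
        # votes: (1 - d) plus d times the shares received from the node's neighbours
        pr = {v: 1.0 for v in neighbors}
        for _ in range(iters):
            pr = {v: (1 - d) + d * sum(pr[u] / len(neighbors[u]) for u in neighbors[v])
                  for v in neighbors}
        return pr

    graph = {"data": {"mining", "analysis"},
             "mining": {"data"},
             "analysis": {"data"}}
    print(pagerank(graph))   # "data" collects the most votes, hence the highest score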
1. Rada Mihalcea and Paul Tarau proposed the TextRank algorithm in 2004, based on the PageRank algorithm. TextRank is in essence a graph-based algorithm: in it, words or sentences are equated with the nodes of a graph, and the connections between words or sentences with its edges. The text network is represented as DN = (W, R), where W is the set of words forming the text network and R is the set of relations between any two words in W. The connection between words is expressed by their co-occurrence count within a sliding window of specific length.
(1) Similar to the PageRank idea, if a word is directly connected to another word by an edge, the former is considered to have cast a vote for the latter, and the importance of that vote depends in turn on the importance of the voting word itself; thus the importance of a word is jointly decided by the votes it obtains and the importance of the words that vote for it. PageRank considers the probability of a page linking to other pages to be random and equal, so the resulting graph is unweighted. In a text network, however, there are various connections between two words, and the strength of the connection between words must be considered. Assuming conn(wi, wj) denotes the connection between words wi and wj (here, their co-occurrence count within a word window of given length), the TextRank value of word wi is defined as:
TextRank(wi) = (1 - d) + d * Σ_{wj ∈ In(wi)} [conn(wj, wi) / Σ_{wk ∈ Out(wj)} conn(wj, wk)] * TextRank(wj)
where In(wi) denotes the set of words pointing to wi, Out(wj) denotes the set of words pointed to by wj, and d denotes the damping factor, taken as 0.85.
(2) Rada Mihalcea and Paul Tarau confirmed experimentally that mapping text to a directed graph extracts keywords with lower accuracy than mapping text to an undirected graph, which shows there is no directionality between words. The directed-graph TextRank definition is therefore changed to:
TextRank(wi) = (1 - d) + d * Σ_{wj ∈ L(wi)} [conn(wi, wj) / Σ_{wk ∈ L(wj)} conn(wk, wj)] * TextRank(wj)
where L(wi) and L(wj) denote the sets of words directly connected to wi and wj respectively.
2. Improving the TextRank algorithm.
In the TextRank algorithm proposed by Rada Mihalcea and Paul Tarau, the relation between words only considers their co-occurrence count within a certain window length, while the characteristic information of the words themselves in the whole text, such as word frequency, word position and word span, is ignored; in addition, the relevance between words is analysed only from the current text, so the relevance of words is not accurate enough. The invention improves the algorithm in the following three respects: first, the word's own information (including word frequency, word position and word span) is used to compute the word weight; then the closeness of the connection between two words is weighed by the word weights and the frequency of co-occurrence between the words; finally, the relevance between words is computed with Word2vector.
(1) Computing the word weight. The word weight is computed from the word frequency, word position and word span; the weight of word wi is computed as:
m(wi) = tf(wi) * loc(wi) * span(wi)
where m(wi) is the weight of word wi, tf(wi) is its word-frequency factor, loc(wi) its location factor and span(wi) its span factor. The factors are computed as follows:
【1】The word-frequency factor. The higher a word's frequency, the more important the word is in the text. The word-frequency factor is computed with a nonlinear function: assuming word wi occurs fre(wi) times in the text, the word-frequency factor is computed as:
tf(wi) = fre(wi) / (1 + fre(wi))
【2】The word-location factor. Words play different roles depending on their position in the text: the words in the first 10% are the most important for expressing the text's subject, and the words in the first 10%-30% of the text come second. The text data is divided into three regions: the first 10% is the first region, with positional value 50; the first 10%-30% is the second region, with positional value 30; the final region has positional value 20; a word that occurs in several regions takes the maximum value. The positional value of word wi is denoted area(wi), and the factor is computed as:
loc(wi) = (area(wi) - 1) / (area(wi) + 1)
【3】The word-span factor. The word span reflects the word's coverage in the text: the larger the span, the more the word reflects global information. In tag extraction, words with a large span are needed, since they can reflect the global theme of the text. The factor is computed as:
span(wi) = (last(wi) - first(wi) + 1) / sum
where first(wi) and last(wi) denote the positions of the word's first and last occurrence in the text respectively, and sum is the total number of words in the text.
(2) Computing the word relation.
Words activate one another: some words always appear in pairs with other words, and when one word appears it often leads people naturally to expect the other; this effect between words is called word activation force. On the other hand, a word often collocates with more than one word, and the actual collocate must be judged from the specific language environment. The strength with which words activate one another also differs from text to text, so within one text the connections between words can be established from the importance of the words themselves and the activation between words.
The physical meaning of word activation force is similar to gravitation. It was initially defined as follows: suppose words wi and wj occur fre(wi) and fre(wj) times in the corpus respectively, and the frequency of their co-occurrence is co-occur(wi, wj); then the activation force of word wi on word wj is:
af(wi, wj) = [co-occur(wi, wj) / fre(wi)] * [co-occur(wi, wj) / fre(wj)] / d(wi, wj)²
where d(wi, wj) is the average distance between wi and wj when they co-occur.
【1】Comparing with the formula of universal gravitation, it can be seen that in the word activation force formula the first and second terms represent the masses of the two objects and d(wi, wj) represents the distance between the objects. The word activation force reflects the strength of the "attraction" between two words. However, the original formula only considers the words' own frequencies and their co-occurrence count; it does not take the words' other characteristics into account and so cannot make full use of the information in the text.
In a document data file, the word frequency, position and span of a word are intrinsic attributes of the word in that text. Likewise, connections exist between words; by analogy with the formula of universal gravitation, the quantified formula for the "attraction" between words is obtained:
conn(wi, wj) = m(wi) * m(wj) / r(wi, wj)²
where m(wi) and m(wj) are the weights of words wi and wj respectively, and conn(wi, wj) reflects the connection between two words of different weights.
【2】Word2vector computes the relevance between words. In the training process, after the corpus words and their corresponding vectors are obtained, all words are clustered with k-means by their vectors, yielding clusters of highly relevant words. Relevance is determined by the cosine value of two words: the larger the cosine value, the greater the relevance. Assuming words wi, wj are both n-dimensional vectors, the relevance value cos(wi, wj) is computed as:
cos(wi, wj) = Σ_{k=1..n}(wik * wjk) / (sqrt(Σ_{k=1..n} wik²) * sqrt(Σ_{k=1..n} wjk²))
From this, the improved word relation Conn(wi, wj) is obtained:
Conn(wi, wj) = conn(wi, wj) * (1 + cos(wi, wj))
Replacing the conn(wi, wj) above with Conn(wi, wj) gives the improved TextRank formula:
TextRank(wi) = (1 - d) + d * Σ_{wj ∈ L(wi)} [Conn(wi, wj) / Σ_{wk ∈ L(wj)} Conn(wk, wj)] * TextRank(wj)
3. The LDA result set includes several themes; each theme includes the words belonging to that theme and the probability with which each word belongs to it, and all words are sorted by probability from large to small. The processed data file is treated as a sequence [w1, w2, ..., wn], where wi denotes the i-th word and n the total number of words. The expected number of the data file's words contained in each theme is denoted n̄_Ti. Assuming there are K themes T, the probability distribution of the data file over the different themes can be obtained by computing the probability of each theme:
p_Ti = n̄_Ti / n
where n̄_Ti is the expected number of words belonging to theme i; assuming the probability that word wj belongs to theme i is p(wj, Ti), n̄_Ti is computed as:
n̄_Ti = Σ_{j=1..n} p(wj, Ti)
4. The concrete flow:
(1) read the pre-processed data file; record the word frequencies, the positions of first and last occurrence and the total number of words in the text, and count the information of every word in the data;
(2) compute the word weights: the values of the word-frequency factor, the word-location factor and the word-span factor respectively;
(3) compute the word spacing, taking the sentence as the unit: if two words appear in the same sentence, their co-occurrence count is incremented by 1; the word spacing is the reciprocal of the co-occurrence count, and if the co-occurrence count of two words is 0, their distance is infinite;
(4) compute the word attraction: substitute the word spacing of the previous step into the attraction quantification formula to obtain the quantified attraction of the two words; if the distance between two words is infinite, their attraction is 0 and their occurrences do not affect one another;
(5) compute the relevance between words: the cosine value expressing the degree of relevance is computed with Word2vector;
(6) compute the word TextRank values: initialize the TextRank values to 1, substitute the results into the improved TextRank formula, set the iteration-termination threshold to 0.0001, and iterate the formula until the result converges, obtaining the TextRank value of each word;
(7) sort the words from high to low by their computed TextRank values;
(8) take the top 20 words of the ranking as the text labels;
(9) compute the theme distribution probability of the data file with LDA;
(10) select the theme of maximum probability and take the 5 highest-probability words it contains, forming the theme label of the text.
Step 3-1 comprises (1)(2)(3)(4)(5)(6)(7)(8); as shown in Fig. 2, it generates the text labels, whose label data all come from the data file.
Step 3-2 comprises (1)(9)(10); as shown in Fig. 3, it generates the document theme label, whose label data do not necessarily come from the data file.
In step four, the visualization of the document data's labels and key content is realized. The invention uses two kinds of labels: one is the text label, whose content comes from the body of the document; the other is the theme label, whose label data come from the theme of the document data; it can reflect the theme of the document data and also solves the problem that the document body may not contain the theme.
The invention displays the results in tag-cloud form, generated with PyTagCloud, a Python extension library based on the Wordle technique. The generated tag cloud shows different words in different colors; the first five words of the text-label ranking are displayed, with the font size reflecting the word weight: the greater the word weight, the more eye-catching it appears in the tag cloud. In addition, in the document data the 20 text labels are marked in a color different from the other words, so that the user can quickly find the key points while reading the document content.
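A minimal sketch of this display step, assuming the PyTagCloud package named above; the label words, weights and output path are invented illustrative values, not results from the patent.

    from pytagcloud import create_tag_image, make_tags

    # the first five text labels with their TextRank weights (invented values)
    counts = [("big data", 12), ("label", 9), ("TextRank", 7), ("LDA", 6), ("theme", 5)]
    tags = make_tags(counts, maxsize=80)               # font size reflects word weight
    create_tag_image(tags, "labels.png", size=(600, 400))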
The above are only preferred embodiments of the application and are not intended to limit it; for those skilled in the art, the application may have various modifications and variations. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the application shall be included within the scope of protection of the application.

Claims (10)

1. A big-data-based innovative creative tag automatic labeling method, characterized by including:
Step (1): model training:
the text deep-representation model Word2vector is trained on a corpus; after training, all the words in the corpus and a vector model file with the corresponding vector of every word are obtained, i.e. the trained Word2vector model;
the document topic generation model LDA is trained on the corpus to obtain an LDA result set and the trained LDA model; the LDA result set includes several themes, and each theme includes the words belonging to that theme and the probability with which each word belongs to it;
Step (2): the data file of the webpage the user is currently browsing is segmented with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and stop words are then removed, yielding the pre-processed data file;
Step (3): the text labels and theme labels are generated;
Step (4): the final text labels and theme labels are visualized.
2. The big-data-based innovative creative tag automatic labeling method as claimed in claim 1, characterized in that the stop words of step (2) include words whose frequency of use exceeds a set threshold and words without practical meaning;
the words without practical meaning include mood particles, adverbs, prepositions and conjunctions;
the step of removing stop words includes: after word segmentation, the parts of speech are tagged; nouns, verbs and adjectives are retained, the words of the remaining parts of speech are filtered out, and words whose frequency of use exceeds the set threshold are also filtered out.
3. The big-data-based innovative creative tag automatic labeling method as claimed in claim 1, characterized in that the steps of step (3) are:
Step (31): the pre-processed data file is labeled with text labels using the unsupervised TextRank algorithm; using the trained Word2vector model, the relevance between words is computed from the vector model file, and the text labels are corrected using the relevance between words, generating the final text labels;
Step (32): topic analysis is carried out on the pre-processed data file using the LDA result set, and the theme labels are generated.
4. The big-data-based innovative creative tag automatic labeling method as claimed in claim 3, characterized in that step (31) includes:
Step (311): the pre-processed data file is read and the information of every word in it is counted; the information of each word includes: the word frequency, the position of the word's first occurrence, the position of the word's last occurrence, and the total number of words;
Step (312): the word weights are computed: the values of the word-frequency factor, the word-location factor and the word-span factor are computed respectively;
the weight m(wi) of word wi is computed as:
m(wi) = tf(wi) * loc(wi) * span(wi); (1)
where tf(wi) is the word-frequency factor of word wi, loc(wi) is its location factor, and span(wi) is its span factor;
Step (313): the word spacing is computed, taking the sentence as the unit: if two words appear in the same sentence, their co-occurrence count is incremented by 1; the word spacing is the reciprocal of the co-occurrence count, and if the co-occurrence count of two words is 0, the distance of the two words is infinite;
Step (314): the word attraction is computed by substituting the word spacing of step (313) into the attraction quantification formula, which yields the quantified attraction of the two words; if the distance between two words is infinite, the attraction of the two words is 0 and their occurrences do not affect one another;
Step (315): the relevance between words is computed; the cosine value expressing the degree of relevance is computed with the trained Word2vector model;
Step (316): the TextRank value of each word is computed: the TextRank values are initialized to 1, the word-relation results are substituted into the improved TextRank formula, the iteration-termination threshold is set to 0.0001, and the improved TextRank formula is iterated until the result converges, yielding the TextRank value of each word;
Step (317): the words are sorted from high to low by their computed TextRank values;
Step (318): the first 20 words of the ranking are taken as the text labels.
5. The big-data-based innovative creative tag automatic labeling method as claimed in claim 4, characterized in that the word-frequency factor is computed as:
tf(wi) = fre(wi) / (1 + fre(wi)); (2)
where fre(wi) is the number of occurrences of word wi in the data file.
6. The big-data-based innovative creative tag automatic labeling method as claimed in claim 4, characterized in that the word-location factor is computed as:
loc(wi) = (area(wi) - 1) / (area(wi) + 1); (3)
where area(wi) is the positional value of word wi;
words play different roles depending on their position in the text: the words in the first 10% are the most important for expressing the text's subject, and the words in the first 10%-30% of the text come second; the text data is divided into three regions: the first 10% is the first region, with positional value 50; the first 10%-30% is the second region, with positional value 30; the final region has positional value 20; a word that occurs in several regions takes the maximum value.
7. The big-data-based innovative creative tag automatic labeling method as claimed in claim 4, characterized in that the word-span factor is computed as:
span(wi) = (last(wi) - first(wi) + 1) / sum; (4)
where first(wi) is the position at which the word first appears in the text, last(wi) is the position at which the word last appears in the text, and sum is the total number of words in the text;
the word span reflects the word's coverage in the text: the larger the span, the more the word reflects global information; in tag extraction, words with a large span can reflect the global theme of the text;
the attraction quantification formula is:
conn(wi, wj) = m(wi) * m(wj) / r(wi, wj)²; (5)
where m(wi) is the weight of word wi, m(wj) is the weight of word wj, conn(wi, wj) reflects the connection between two words of different weights, and r(wi, wj) is the spacing of words wi and wj.
8. The big-data-based innovative creative tag automatic labeling method as claimed in claim 4, characterized in that:
while training the text deep-representation model Word2vector on the corpus, after the corpus words and the corresponding vector of every word are obtained, all words are clustered with k-means by vector relevance, yielding clusters of highly relevant words; relevance is determined by the cosine value of two words: the larger the cosine value, the greater the relevance;
assuming words wi, wj are both n-dimensional vectors, the relevance cos(wi, wj) is computed as:
cos(wi, wj) = Σ_{k=1..n}(wik * wjk) / (sqrt(Σ_{k=1..n} wik²) * sqrt(Σ_{k=1..n} wjk²)); (6)
the improved word relation Conn(wi, wj) is then:
Conn(wi, wj) = conn(wi, wj) * (1 + cos(wi, wj)); (7)
and the improved TextRank formula is:
TextRank(wi) = (1 - d) + d * Σ_{wj ∈ L(wi)} [Conn(wi, wj) / Σ_{wk ∈ L(wj)} Conn(wk, wj)] * TextRank(wj); (8)
where TextRank(wi) denotes the importance of wi, TextRank(wj) denotes the importance of word wj, L(wi) is the set of words directly connected to wi, and d is the damping factor.
9. The big-data-based innovative creative tag automatic labeling method as claimed in claim 3, characterized in that step (32) includes:
Step (321): the pre-processed data file is read, the total number of words in the text is recorded, and the information of each word in the data is counted;
Step (322): the theme distribution probability of the data file is computed from the LDA result set;
the LDA result set includes several themes; each theme includes the words belonging to that theme and the probability with which each word belongs to it, and all words are sorted by probability from large to small; the pre-processed data file is treated as a sequence [w1, w2, w3, ..., wn], where wi denotes the i-th word and n is the total number of words; the expected number of the data file's words contained in each theme is denoted n̄_Ti; assuming there are K themes, the probability distribution of the data file over the different themes is obtained by computing the probability p_Ti that the data file belongs to the i-th theme Ti:
p_Ti = n̄_Ti / n; (9)
where n̄_Ti is the expected number of words belonging to the i-th theme Ti; assuming that the probability of word wj belonging to theme Ti is p(wj, Ti), n̄_Ti is computed as:
n̄_Ti = Σ_{j=1..n} p(wj, Ti); (10)
Step (323): the theme of maximum probability is selected and the 5 highest-probability words it contains are taken, constituting the theme label of the text.
10. A big-data-based innovative creative tag automatic labeling system, characterized by including:
a model training unit:
the text deep-representation model Word2vector is trained on a corpus; after training, all the words in the corpus and a vector model file with the corresponding vector of every word are obtained, i.e. the trained Word2vector model;
the document topic generation model LDA is trained on the corpus to obtain an LDA result set and the trained LDA model; the LDA result set includes several themes, and each theme includes the words belonging to that theme and the probability with which each word belongs to it;
a data file processing unit: segments the data file of the webpage the user is currently browsing with the ICTCLAS word segmentation system of the Chinese Academy of Sciences and then removes stop words, yielding the pre-processed data file;
a label generation unit: generates the text labels and theme labels;
a visualization unit: visualizes the final text labels and theme labels.
CN201710173029.3A — filed 2017-03-22 — Innovative creative tag automatic labeling method and system based on big data — Active

Publications (2)

CN106997382A — published 2017-08-01
CN106997382B — granted, published 2020-12-01

Family ID: 59431684
Country: CN (China)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164394A (en) * 2012-07-16 2013-06-19 上海大学 Text similarity calculation method based on universal gravitation
CN106021620A (en) * 2016-07-14 2016-10-12 北京邮电大学 Method for automatic detection of power outage events using social media
CN106469187A (en) * 2016-08-29 2017-03-01 东软集团股份有限公司 Keyword extraction method and device
CN106372064A (en) * 2016-11-18 2017-02-01 北京工业大学 Feature word weight calculation method for text mining

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
南江霞: "Research on Automatic Annotation Technology for Chinese Text and Its Application", China Master's Theses Full-text Database, Information Science and Technology Series *
夏天: "Keyword Extraction with Word Vector Clustering Weighted TextRank", Data Analysis and Knowledge Discovery *

Cited By (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019041521A1 (en) * 2017-08-29 2019-03-07 平安科技(深圳)有限公司 Apparatus and method for extracting user keyword, and computer-readable storage medium
CN107861948A (en) * 2017-11-16 2018-03-30 百度在线网络技术(北京)有限公司 Tag extraction method, apparatus, device and medium
CN108415953A (en) * 2018-02-05 2018-08-17 华融融通(北京)科技有限公司 Non-performing asset management knowledge management method based on natural language processing
CN108415953B (en) * 2018-02-05 2021-08-13 华融融通(北京)科技有限公司 Method for managing non-performing asset management knowledge based on natural language processing technology
CN108549626A (en) * 2018-03-02 2018-09-18 广东技术师范学院 Keyword extraction method for MOOCs
WO2019165678A1 (en) * 2018-03-02 2019-09-06 广东技术师范学院 Keyword extraction method for MOOCs
CN108763189A (en) * 2018-04-12 2018-11-06 武汉斗鱼网络科技有限公司 Live broadcast room content label weight calculation method, device and electronic equipment
CN108763189B (en) * 2018-04-12 2022-03-25 武汉斗鱼网络科技有限公司 Live broadcast room content label weight calculation method and device and electronic equipment
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Named entity recognition method, device, equipment and computer-readable storage medium
CN108536679B (en) * 2018-04-13 2022-05-20 腾讯科技(成都)有限公司 Named entity recognition method, device, equipment and computer readable storage medium
CN108959431A (en) * 2018-06-11 2018-12-07 中国科学院上海高等研究院 Automatic label generation method, system, computer-readable storage medium and device
CN108959431B (en) * 2018-06-11 2022-07-05 中国科学院上海高等研究院 Automatic label generation method, system, computer readable storage medium and equipment
CN110738033B (en) * 2018-07-03 2023-09-19 百度在线网络技术(北京)有限公司 Report template generation method, device and storage medium
CN110738033A (en) * 2018-07-03 2020-01-31 百度在线网络技术(北京)有限公司 Report template generation method, device and storage medium
CN109344248A (en) * 2018-07-27 2019-02-15 中山大学 Academic topic life cycle analysis method based on scientific and technological literature abstract clustering
CN109344248B (en) * 2018-07-27 2021-10-22 中山大学 Academic topic life cycle analysis method based on scientific and technological literature abstract clustering
CN108920466A (en) * 2018-07-27 2018-11-30 杭州电子科技大学 Scientific text keyword extraction method based on word2vec and TextRank
CN110807097A (en) * 2018-08-03 2020-02-18 北京京东尚科信息技术有限公司 Method and device for analyzing data
CN109344253A (en) * 2018-09-18 2019-02-15 平安科技(深圳)有限公司 Method, apparatus, computer device and storage medium for adding user tags
CN111125355A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Information processing method and related equipment
CN109710916B (en) * 2018-11-02 2024-02-23 广州财盟科技有限公司 Label extraction method and device, electronic equipment and storage medium
CN109710916A (en) * 2018-11-02 2019-05-03 武汉斗鱼网络科技有限公司 Tag extraction method, apparatus, electronic device and storage medium
CN109614455A (en) * 2018-11-28 2019-04-12 武汉大学 Automatic annotation method and device for geographic information based on deep learning
CN110399606B (en) * 2018-12-06 2023-04-07 国网信息通信产业集团有限公司 Unsupervised electric power document theme generation method and system
CN110399606A (en) * 2018-12-06 2019-11-01 国网信息通信产业集团有限公司 Unsupervised electric power document theme generation method and system
CN109783798A (en) * 2018-12-12 2019-05-21 平安科技(深圳)有限公司 Method, apparatus, terminal and storage medium for adding pictures to text information
CN111382265A (en) * 2018-12-28 2020-07-07 中国移动通信集团贵州有限公司 Search method, apparatus, device and medium
CN109686445A (en) * 2018-12-29 2019-04-26 成都睿码科技有限责任公司 Intelligent diagnosis guiding algorithm based on automatic labeling and multi-model fusion
CN109686445B (en) * 2018-12-29 2023-07-21 成都睿码科技有限责任公司 Intelligent diagnosis guiding algorithm based on automatic label and multi-model fusion
CN109885674A (en) * 2019-02-14 2019-06-14 腾讯科技(深圳)有限公司 Topic label determination and information recommendation method and device
CN109885674B (en) * 2019-02-14 2022-10-25 腾讯科技(深圳)有限公司 Method and device for determining topic labels and recommending information
CN110162592A (en) * 2019-05-24 2019-08-23 东北大学 News keyword extraction method based on gravitation-improved TextRank
CN110263343A (en) * 2019-06-24 2019-09-20 北京理工大学 Keyword extraction method and system based on phrase vectors
CN110347977A (en) * 2019-06-28 2019-10-18 太原理工大学 Automatic news tagging method based on the LDA model
CN110413796A (en) * 2019-07-03 2019-11-05 北京信息科技大学 Domain ontology construction method for typical coal mine dynamic disasters
CN110557504A (en) * 2019-08-30 2019-12-10 Oppo广东移动通信有限公司 Dynamic ringtone update method, device, equipment and medium for intelligent terminal devices
CN110557504B (en) * 2019-08-30 2021-06-04 Oppo广东移动通信有限公司 Dynamic ringtone update method, device, equipment and medium for intelligent terminal devices
CN110717329A (en) * 2019-09-10 2020-01-21 上海开域信息科技有限公司 Method for performing approximate search based on word vectors to rapidly extract advertisement text themes
CN110717329B (en) * 2019-09-10 2023-06-16 上海开域信息科技有限公司 Method for performing approximate search based on word vector to rapidly extract advertisement text theme
CN112559853A (en) * 2019-09-26 2021-03-26 北京沃东天骏信息技术有限公司 User label generation method and device
CN112559853B (en) * 2019-09-26 2024-01-12 北京沃东天骏信息技术有限公司 User tag generation method and device
CN111177321A (en) * 2019-12-27 2020-05-19 东软集团股份有限公司 Method, device and equipment for determining corpus and storage medium
CN111177321B (en) * 2019-12-27 2023-10-20 东软集团股份有限公司 Method, device, equipment and storage medium for determining corpus
CN112270192B (en) * 2020-11-23 2023-12-19 科大国创云网科技有限公司 Semantic recognition method and system based on part-of-speech and stop-word filtering
CN112270192A (en) * 2020-11-23 2021-01-26 科大国创云网科技有限公司 Semantic recognition method and system based on part-of-speech and stop-word filtering
CN112905741B (en) * 2021-02-08 2022-04-12 合肥供水集团有限公司 Water supply user focus mining method considering space-time characteristics
CN112905741A (en) * 2021-02-08 2021-06-04 合肥供水集团有限公司 Water supply user focus mining method considering space-time characteristics
CN112860919A (en) * 2021-02-20 2021-05-28 平安科技(深圳)有限公司 Data labeling method, device and equipment based on generative model and storage medium
CN113761911A (en) * 2021-03-17 2021-12-07 中科天玑数据科技股份有限公司 Domain text labeling method based on weak supervision
CN113128234A (en) * 2021-06-17 2021-07-16 明品云(北京)数据科技有限公司 Method and system for establishing entity recognition model, electronic equipment and medium
CN113128234B (en) * 2021-06-17 2021-11-02 明品云(北京)数据科技有限公司 Method and system for establishing entity recognition model, electronic equipment and medium
CN114661900A (en) * 2022-02-25 2022-06-24 安阳师范学院 Text annotation recommendation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN106997382B (en) 2020-12-01

Similar Documents

Publication Publication Date Title
CN106997382A (en) Innovation intention label automatic marking method and system based on big data
Ray et al. An ensemble-based hotel recommender system using sentiment analysis and aspect categorization of hotel reviews
Yan et al. Network-based bag-of-words model for text classification
CN104794169B (en) Subject term extraction method and system based on a sequence labeling model
CN112861990B (en) Topic clustering method and device based on keywords and entities and computer readable storage medium
CN107315738A (en) Innovation degree evaluation method for text information
Gupta et al. A novel hybrid text summarization system for Punjabi text
CN110888991A (en) Sectional semantic annotation method in weak annotation environment
Archchitha et al. Opinion spam detection in online reviews using neural networks
Qiu et al. Query intent recognition based on multi-class features
CN115329085A (en) Social robot classification method and system
Kanapala et al. Passage-based text summarization for legal information retrieval
Al Imran et al. Bnnet: A deep neural network for the identification of satire and fake bangla news
CN109086443A (en) Social media short text on-line talking method based on theme
Özyirmidokuz Mining unstructured Turkish economy news articles
Ezzat et al. TopicAnalyzer: A system for unsupervised multi-label Arabic topic categorization
Clarizia et al. Sentiment analysis in social networks: A methodology based on the latent dirichlet allocation approach
Waghmare et al. Survey paper on sentiment analysis for tourist reviews
Shahbazi et al. Deep Learning Method to Estimate the Focus Time of Paragraph
Ahmed et al. Building multiview analyst profile from multidimensional query logs: from consensual to conflicting preferences
Song et al. Unsupervised learning of word semantic embedding using the deep structured semantic model
Gheni et al. Suggesting new words to extract keywords from title and abstract
Mallek et al. An Unsupervised Approach for Precise Context Identification from Unstructured Text Documents
Sharma et al. Multi-aspect sentiment analysis using domain ontologies
Li et al. A Method of Interest Degree Mining Based on Behavior Data Analysis

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant