CN106997382A - Innovative creative tag automatic labeling method and system based on big data (Google Patents)

Info

Publication number
CN106997382A (application CN201710173029.3A; granted publication CN106997382B)
Authority
CN (China)
Prior art keywords
word, theme, words, text, data file
Prior art date
2017-03-22
Legal status
Granted
Application number
CN201710173029.3A
Other languages
Chinese (zh)
Other versions
CN106997382B (en)
Inventor
Lu Xudong (鹿旭东)
Zhang Panlong (张盘龙)
Chen Zhiyong (陈志勇)
Guo Wei (郭伟)
Cui Lizhen (崔立真)
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
2017-03-22
Filing date
2017-03-22
Publication date
2017-08-01
Application filed by Shandong University on 2017-03-22; priority to CN201710173029.3A.
Publication of CN106997382A: 2017-08-01. Application granted; publication of CN106997382B: 2020-12-01.
Legal status: Active.

Classifications

    • G06F16/951 Indexing; Web crawling techniques (G Physics; G06 Computing; Calculating or counting; G06F Electric digital data processing; G06F16/00 Information retrieval; G06F16/90 Details of database functions; G06F16/95 Retrieval from the web)
    • G06F16/355 Class or cluster creation or modification (G06F16/30 Information retrieval of unstructured textual data; G06F16/35 Clustering; Classification)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a big-data-based innovative creative tag automatic labeling method and system. The method includes: training Word2vector and LDA on the Sogou corpus to obtain the training result sets; segmenting the document data of the pages the user browses and filtering out stop words and useless words; computing, for the pre-processed document data, the labels that come from the document body by combining an improved TextRank algorithm with Word2vector; computing labels for the document data's theme with LDA; and realizing visualization by generating a tag cloud, with all the text-label words marked in the document data, making it convenient for the user to read and to find the key content.

Description

Innovative creative tag automatic labeling method and system based on big data
Technical field
The present invention relates to a big-data-based innovative creative tag automatic labeling method and system.
Background technology
With the rapid development and popularization of the Internet, information has grown explosively, and a huge amount of information has accumulated online. Internet users are no longer mere viewers of Internet content; they also create information of many kinds, which diversifies the forms that online information takes and makes information screening very difficult. Text-based information accounts for a very large share of the information on the Internet. The growth of information volume and the disorder of its structure give people more to consult when searching for information; the coverage of information is ever more complete, touching every aspect of life and bringing great convenience. However, the sheer quantity of information easily leaves people stuck at the stage of not knowing what to choose, and quickly selecting the effective items from massive information is no easy task.
When an enterprise carries out innovation work, it takes big data as the basis of analysis and planning, and it must discriminate and examine the data worth analysing. How to make full use of big data to obtain, quickly and effectively, the data related to the enterprise's themes of interest, to mark the critical data, and to exclude the cluttered, useless information so that the enterprise can focus on the more valuable and important information, is a current difficulty of innovation. Text labeling arose in this context. Text labeling means indexing a text with several words or phrases that are specific and can reflect the text's subject; these words or phrases are usually called labels. By reading these labels a reader can quickly grasp the subject of the text and so decide whether the text is of interest.
Automatic text labeling is an emerging research subject that grew up with the development of the Internet. It derives from information extraction and text classification and combines research methods from directions such as information retrieval and collaborative filtering. The automatic text labeling techniques developed in recent years include socialized (user-based) tagging, multi-label classification annotation, and keyword-extraction labeling.
The above are the main methods of current text labeling. Among them, socialized user-based tagging suffers from the cold-start problem at the initial stage of a service, because there is no past data to provide a reference. Multi-label classification annotation methods are mostly based on supervised learning algorithms and require a large manually labeled data set as the training set; manual labeling is not only time-consuming and laborious but also highly subjective.
Summary of the invention
To remedy the deficiencies of the prior art, the invention provides a big-data-based innovative creative tag automatic labeling method and system. It labels text by keyword extraction, which belongs to the category of unsupervised learning, and therefore has the effect of requiring no manually labeled data set.
The big-data-based innovative creative tag automatic labeling method includes:
Step (1): model training:
the text deep-representation model Word2vector is trained on a corpus; after training, all the words in the corpus and a vector model file with the corresponding vector of every word are obtained, i.e. the trained Word2vector model;
the document topic generation model LDA is trained on the corpus to obtain an LDA result set and the trained LDA model; the LDA result set includes several themes, and each theme includes the words belonging to that theme and the probability with which each word belongs to it;
Step (2): the data file of the webpage the user is currently browsing is segmented with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and stop words are then removed, yielding the pre-processed data file;
Step (3): the text labels and theme labels are generated;
Step (4): the final text labels and theme labels are visualized.
The stop words of step (2) include words whose frequency of use exceeds a set threshold and words without practical meaning.
The words without practical meaning include mood particles, adverbs, prepositions and conjunctions.
The step of removing stop words includes: after word segmentation, the parts of speech are tagged; nouns, verbs and adjectives are retained, the words of the remaining parts of speech are filtered out, and words whose frequency of use exceeds the set threshold are also filtered out.
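By way of illustration, the preprocessing of step (2) can be sketched as follows. This is a minimal sketch, not the patent's own code: the open-source jieba segmenter stands in for ICTCLAS, and the sample stop-word list and frequency threshold are assumptions.

    from collections import Counter
    import jieba.posseg as pseg

    STOPWORDS = {"的", "了", "和", "就", "我"}    # sample stop words; a real list is far larger
    KEEP_POS = ("n", "v", "a")                    # retain nouns, verbs and adjectives only

    def preprocess(text, freq_threshold=0.05):
        # segment and part-of-speech tag, then drop stop words,
        # non-retained parts of speech and single-character words
        words = [w for w, flag in pseg.cut(text)
                 if w not in STOPWORDS and flag.startswith(KEEP_POS) and len(w) >= 2]
        # also drop words whose relative frequency exceeds the set threshold
        freq = Counter(words)
        n = len(words) or 1
        return [w for w in words if freq[w] / n <= freq_threshold]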
The step of step (3) is:
Step (31):Pretreated data file mark is marked herein using the TextRank algorithm of unsupervised learning Label, and using the Word2vector models trained, the correlation between word and word, profit are calculated based on vector model file Correlation between word and word is modified to this paper labels;The final this paper labels of generation;
Step (32):Subject analysis is carried out to pretreated data file using LDA result sets, theme label is generated.
Step (31) includes:
Step (311): the pre-processed data file is read and the information of every word in it is counted; the information of each word includes: the word frequency, the position of the word's first occurrence, the position of the word's last occurrence, and the total number of words;
Step (312): the word weights are computed: the values of the word-frequency factor, the word-location factor and the word-span factor are computed respectively;
the weight m(wi) of word wi is computed as:
m(wi) = tf(wi) * loc(wi) * span(wi); (1)
where tf(wi) is the word-frequency factor of word wi, loc(wi) is its location factor, and span(wi) is its span factor.
The word-frequency factor is computed as:
tf(wi) = fre(wi) / (1 + fre(wi)); (2)
where fre(wi) is the number of occurrences of word wi in the data file.
The word-location factor is computed as:
loc(wi) = (area(wi) - 1) / (area(wi) + 1); (3)
where area(wi) is the positional value of word wi.
Words play different roles depending on their position in the text: the words in the first 10% are the most important for expressing the text's subject, and the words in the first 10%-30% of the text come second. The text data is therefore divided into three regions: the first 10% is the first region, with positional value 50; the first 10%-30% is the second region, with positional value 30; the final region has positional value 20. A word that occurs in several regions takes the maximum value.
The word-span factor is computed as:
span(wi) = (last(wi) - first(wi) + 1) / sum; (4)
where first(wi) is the position at which word wi first appears in the text, last(wi) is the position at which it last appears, and sum is the total number of words in the text.
The word span reflects the word's coverage in the text: the larger the span, the more the word reflects global information. In label extraction, words with a large span can reflect the global theme of the text.
Step (313): the word spacing is computed, taking the sentence as the unit: if two words appear in the same sentence, their co-occurrence count is incremented by 1; the word spacing is the reciprocal of the co-occurrence count, and if the co-occurrence count of two words is 0, the distance of the two words is infinite;
Step (314): the word attraction is computed by substituting the word spacing of step (313) into the attraction quantification formula, which yields the quantified attraction of the two words; if the distance between two words is infinite, the attraction of the two words is 0 and their occurrences do not affect one another;
the attraction quantification formula is:
conn(wi, wj) = m(wi) * m(wj) / r(wi, wj)²; (5)
where m(wi) is the weight of word wi, m(wj) is the weight of word wj, conn(wi, wj) reflects the connection between two words of different weights, and r(wi, wj) is the spacing of words wi and wj;
Step (315): the relevance between words is computed; the cosine value expressing the degree of relevance is computed with the trained Word2vector model.
While training the text deep-representation model Word2vector on the corpus, after the corpus words and the corresponding vector of every word are obtained, all words are clustered with k-means by vector relevance, which yields clusters of highly relevant words. Relevance is determined by the cosine value of two words: the larger the cosine value, the greater the relevance.
Assuming words wi and wj are both n-dimensional vectors, the relevance cos(wi, wj) is computed as:
cos(wi, wj) = Σ_{k=1..n}(wik * wjk) / (sqrt(Σ_{k=1..n} wik²) * sqrt(Σ_{k=1..n} wjk²)); (6)
The improved word relation Conn(wi, wj) is then:
Conn(wi, wj) = conn(wi, wj) * (1 + cos(wi, wj)); (7)
and the improved TextRank formula is:
TextRank(wi) = (1 - d) + d * Σ_{wj ∈ L(wi)} [Conn(wi, wj) / Σ_{wk ∈ L(wj)} Conn(wk, wj)] * TextRank(wj); (8)
where TextRank(wi) and TextRank(wj) denote the importance of words wi and wj, L(wi) is the set of words directly connected to wi, and d is the damping factor.
Step (316): the TextRank value of each word is computed: the TextRank values are initialized to 1, the word-relation results are substituted into the improved TextRank formula, the iteration-termination threshold is set to 0.0001, and the improved TextRank formula is iterated until the result converges, yielding the TextRank value of each word;
Step (317): the words are sorted from high to low by their computed TextRank values;
Step (318): the first 20 words of the ranking are taken as the text labels.
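For concreteness, the whole of step (31) can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: words arrive pre-segmented and filtered as in step (2), sentences is the document as a list of tokenised sentences, w2v is a trained gensim Word2Vec model, and all names are illustrative.

    from collections import Counter
    from itertools import combinations

    def word_weights(words):
        # per-word frequency, first and last positions (steps (311)-(312))
        n = len(words)
        freq = Counter(words)
        first, last = {}, {}
        for i, w in enumerate(words):
            first.setdefault(w, i)
            last[w] = i
        weights = {}
        for w in freq:
            tf = freq[w] / (1 + freq[w])                   # word-frequency factor, formula (2)
            pos = first[w] / n                             # earliest occurrence -> best region
            area = 50 if pos < 0.10 else 30 if pos < 0.30 else 20
            loc = (area - 1) / (area + 1)                  # location factor, formula (3)
            span = (last[w] - first[w] + 1) / n            # span factor, formula (4)
            weights[w] = tf * loc * span                   # weight m(w), formula (1)
        return weights

    def improved_textrank(sentences, w2v, d=0.85, eps=1e-4, topn=20):
        words = [w for s in sentences for w in s]
        if not words:
            return []
        m = word_weights(words)
        cooc = Counter()                                   # sentence-level co-occurrence (313)
        for s in sentences:
            for wi, wj in combinations(set(s), 2):
                cooc[frozenset((wi, wj))] += 1
        conn = {}
        for pair, c in cooc.items():
            wi, wj = tuple(pair)
            r = 1.0 / c                                    # word spacing = 1 / co-occurrence count
            attract = m[wi] * m[wj] / r ** 2               # attraction, formula (5)
            try:
                cos = w2v.wv.similarity(wi, wj)            # cosine relevance, formula (6)
            except KeyError:                               # out-of-vocabulary word
                cos = 0.0
            conn[pair] = attract * (1 + cos)               # improved relation Conn, formula (7)
        neigh = {w: set() for w in m}
        for pair in conn:
            wi, wj = tuple(pair)
            neigh[wi].add(wj)
            neigh[wj].add(wi)
        tr = {w: 1.0 for w in m}                           # step (316): initialise values to 1
        while True:
            new = {}
            for wi in m:
                s = sum(conn[frozenset((wi, wj))]
                        / sum(conn[frozenset((wk, wj))] for wk in neigh[wj])
                        * tr[wj]
                        for wj in neigh[wi])
                new[wi] = (1 - d) + d * s                  # improved TextRank, formula (8)
            converged = max(abs(new[w] - tr[w]) for w in m) < eps
            tr = new
            if converged:                                  # termination threshold 0.0001 reached
                break
        ranked = sorted(tr, key=tr.get, reverse=True)      # step (317)
        return ranked[:topn]                               # step (318): top-20 text labels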
Step (32) includes:
Step (321): the pre-processed data file is read, the total number of words in the text is recorded, and the information of each word in the data is counted;
Step (322): the theme distribution probability of the data file is computed from the LDA result set.
The LDA result set includes several themes; each theme includes the words belonging to that theme and the probability with which each word belongs to it, and all words are sorted by probability from large to small. The pre-processed data file is treated as a sequence [w1, w2, w3, ..., wn], where wi denotes the i-th word and n is the total number of words. The expected number of the data file's words contained in each theme is denoted n̄_Ti. Assuming there are K themes, the probability distribution of the data file over the different themes is obtained by computing the probability p_Ti that the data file belongs to the i-th theme Ti:
p_Ti = n̄_Ti / n; (9)
where n̄_Ti is the expected number of words belonging to the i-th theme Ti; assuming that the probability of word wj belonging to theme Ti is p(wj, Ti), n̄_Ti is computed as:
n̄_Ti = Σ_{j=1..n} p(wj, Ti); (10)
Step (323): the theme of maximum probability is selected and the 5 highest-probability words it contains are taken, constituting the theme label of the text.
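A matching sketch of step (32), assuming the LDA result set is held as a trained gensim LdaModel lda with its Dictionary dct from step (1); the names and parameters are illustrative, not the patent's.

    def theme_label(words, lda, dct, topn=5):
        # expected number of the document's words in each theme:
        # n̄_Ti = Σ_j p(wj, Ti), formula (10)
        expected = [0.0] * lda.num_topics
        for w in words:
            if w not in dct.token2id:
                continue
            for t, p in lda.get_term_topics(dct.token2id[w], minimum_probability=0.0):
                expected[t] += p
        # probability that the file belongs to theme Ti: p_Ti = n̄_Ti / n, formula (9)
        n = len(words) or 1
        probs = [e / n for e in expected]
        best = max(range(lda.num_topics), key=probs.__getitem__)
        # step (323): the 5 highest-probability words of the winning theme
        return [w for w, _ in lda.show_topic(best, topn=topn)]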
Further, the invention also adopts a big-data-based innovative creative tag automatic labeling system, which can automatically add text labels and theme labels to the data file the user is browsing, making it convenient for the user to find the important information of the text and improving reading efficiency.
The big-data-based innovative creative tag automatic labeling system includes:
a model training unit:
the text deep-representation model Word2vector is trained on a corpus; after training, all the words in the corpus and a vector model file with the corresponding vector of every word are obtained, i.e. the trained Word2vector model;
the document topic generation model LDA is trained on the corpus to obtain an LDA result set and the trained LDA model; the LDA result set includes several themes, and each theme includes the words belonging to that theme and the probability with which each word belongs to it;
a data file processing unit: segments the data file of the webpage the user is currently browsing with the ICTCLAS word segmentation system of the Chinese Academy of Sciences and then removes stop words, yielding the pre-processed data file;
a label generation unit: generates the text labels and theme labels;
a visualization unit: visualizes the final text labels and theme labels.
Compared with the prior art, the beneficial effects of the invention are:
the keywords of the data file are obtained with the improved TextRank algorithm; compared with other algorithms, the results have higher accuracy and representativeness; the extracted labels come from the document itself, represent it well, and achieve the effect of accurately expressing the text content;
the theme labels of the text are generated with the LDA model, which overcomes the difficulty that the characteristic words of a text may not be contained in the text itself and better reflects the subject content of the text; combined with the text labels, labels that accurately express both the text content and the theme are realized.
Brief description of the drawings
The accompanying drawings, which form a part of this application, provide a further understanding of the application; the exemplary embodiments of the application and their descriptions explain the application and do not constitute an improper limitation of it.
Fig. 1 is the preprocessing flow chart of the invention;
Fig. 2 is the text-label generation flow chart of the invention;
Fig. 3 is the theme-label generation flow chart of the invention.
Embodiments
It is noted that the following detailed description is exemplary and is intended to provide a further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by a person of ordinary skill in the technical field to which the application belongs.
It should be noted that the terms used here merely describe the embodiments and are not intended to restrict the illustrative embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well; additionally, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
The invention combines an improved TextRank-based text labeling algorithm, Word2vector (a text analysis tool from Google) for computing word relevance, and LDA (a document topic generation model) for extracting document topics, to realize automatic text labeling. The original TextRank algorithm only considers the relations between words in its computation and ignores the characteristic attributes of the words themselves, so it fails to make full use of the text information when extracting keywords. The invention improves this relation: the word weight is first computed from information such as word frequency, word position and word span, and the attraction relation between words is then built from these weights and the word activation force model, replacing the original relation between words. With this improvement, on the one hand the word frequency, position and span of each individual word are fully used; on the other hand, for the relations between words, both the co-occurrence rate within sentences and the relevance between words are considered, the latter computed with Google's Word2vector. The topic of a document may not be contained among the words of the document, in which case it cannot be labeled with phrases from the document content; therefore the topic of the document is determined with LDA, which provides the label of that topic.
The technical scheme is: for the user's query results or browsed pages, automatically tag the data relevant to the intended purpose, remove useless information, and rank by relevance. Against the background of big data, the visualization of data is ever more important; this patent displays the labeling results in the form of a tag cloud and highlights the keywords. With the invention, automatic labeling of data sets is achieved in an unsupervised manner; the labels come from the data file, with little noise and good representativeness. While querying and browsing, the user can preferentially read the automatically marked key content and focus attention on the more important information.
The invention realizes the big-data-based innovative creative automatic labeling method through the following technical scheme, with the following concrete steps:
Step one: train LDA and Word2vector.
Step two: segment the pages the user browses and filter out useless words, as shown in Fig. 1;
Step three: generate labels with the TextRank algorithm combined with LDA, and mark them automatically, as shown in Fig. 2;
Step four: visualize the labels and key content, as shown in Fig. 3;
In step one, LDA and Word2vector are trained on the Sogou corpus.
1. Word2vector is a tool developed by Google. By converting words into vectors, it turns the processing of the training-set content into vector operations in a fixed-dimension vector space and uses the distances between the computed vectors to express the relevance between text words. The larger the training corpus, the better the word-vector representation; training on the Sogou corpus yields a model file containing all the words in the corpus and their corresponding vectors, with which the task of computing relevance between words can be realized.
2. Word2vec is an efficient tool, open-sourced by Google in 2013, that represents words as real-valued vectors. Drawing on ideas from deep learning, it reduces the processing of text content, by training, to vector operations in a K-dimensional vector space, where similarity in the vector space can express semantic similarity of text. The word vectors output by Word2vec can be used for many NLP tasks, such as clustering, finding synonyms, and part-of-speech analysis. Put another way, taking words as features, Word2vec maps those features to a K-dimensional vector space and seeks deeper feature representations for text data.
3. Word2vec uses the Distributed Representation of word vectors, first proposed by Hinton in 1986. Its basic idea is to map, by training, each word to a K-dimensional real vector (K is generally a hyperparameter of the model) and to judge the semantic similarity between words by the distance between their vectors (cosine similarity, Euclidean distance, etc.). It uses a three-layer neural network: input layer - hidden layer - output layer. A core technique is Huffman coding by word frequency, so that the hidden-layer content activated by words of similar frequency is basically consistent; the more frequent a word, the fewer hidden layers it activates, which effectively reduces the computational complexity. One reason for Word2vec's popularity is precisely its efficiency: Mikolov pointed out in his article that an optimized single-machine version can train over a hundred billion words in a day.
4. This three-layer neural network models the language model itself, but it simultaneously obtains a representation of words in vector space, and this side effect is Word2vec's real goal.
5. Compared with classical approaches such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA), Word2vec makes use of the context of words, so its semantic information is richer.
6. LDA (Latent Dirichlet Allocation) is a document topic generation model comprising a three-level structure of words, topics and documents. The generative model holds that each word of an article is obtained through a process of "choosing a topic with a certain probability, and choosing a word from that topic with a certain probability". Document-to-topic follows a multinomial distribution, and topic-to-word follows a multinomial distribution.
LDA is an unsupervised machine-learning technique that can identify the latent topic information in a corpus. It uses the bag-of-words method, treating each document as a word-frequency vector, thereby transforming text information into numerical information that is easy to model. Each document represents a probability distribution over topics, and each topic represents a probability distribution over many words. Training on the Sogou corpus yields several topics and, for each topic, the set of word probabilities; the LDA training result set can then be used to compute the probability distribution of document data over all topics.
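Step one can be sketched with the open-source gensim library standing in for the Word2vector and LDA training; the toy corpus, file names and parameters below are illustrative assumptions (gensim 4.x API), not values from the patent.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, Word2Vec

    # `docs` stands in for the segmented Sogou corpus: one list of words per document
    docs = [["创新", "标签", "标注"], ["文本", "主题", "标签"]]

    # Word2vector training: yields a vector for every word in the corpus
    w2v = Word2Vec(sentences=docs, vector_size=100, window=5, min_count=1, workers=4)
    w2v.save("sogou.w2v")                      # the trained vector model file

    # LDA training: yields themes, each a probability distribution over words
    dct = Dictionary(docs)
    bow = [dct.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=bow, id2word=dct, num_topics=2, passes=5)
    lda.save("sogou.lda")                      # the LDA result set / trained LDA model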
In step two, the ICTCLAS word segmentation system developed by the Institute of Computing Technology of the Chinese Academy of Sciences is used to segment the text data, after which stop words are removed and the parts of speech are filtered.
1. Current Chinese word segmentation algorithms fall into three broad classes: string-matching-based segmentation algorithms, understanding-based segmentation algorithms, and statistics-based segmentation algorithms. Although these algorithms are quite mature, the complexity of the Chinese language itself, the ambiguity of Chinese content and the constant emergence of new words mean that current segmentation systems all combine several segmentation algorithms. Tsinghua University, Peking University, Harbin Institute of Technology, Microsoft Research, the Chinese Academy of Sciences, Hylanda and others have all carried out Chinese word segmentation research; among them, the ICTCLAS segmentation system developed by the Institute of Computing Technology of the Chinese Academy of Sciences stands out the most.
Specifically, the ICTCLAS segmentation system uses a five-layer hidden Markov model. The main segmentation process includes preliminary segmentation, unknown-word recognition, re-segmentation and part-of-speech tagging; preliminary segmentation applies the shortest-path method to a rough segmentation of the Chinese text, and unknown-word recognition handles person names, place names and complex organization names, ensuring segmentation precision as far as possible. Open evaluations by domestic and international authorities show that the system segments quickly and accurately. The API used is as follows:
(1) Initialization: bool ICTCLAS_Init(const char* pszInitDir = NULL);
pszInitDir is the initialization path; returns true on successful initialization, false otherwise.
(2) Exit: bool ICTCLAS_Exit();
releases the memory occupied by the dictionary and clears buffers and other system resources.
(3) File processing: bool ICTCLAS_FileProcess(const char* sSrcFilename, eCodeType eCt, const char* sDsnFilename, int bPOStagged);
sSrcFilename is the path of the source file to analyse, eCodeType is the character encoding of the source file, sDsnFilename is the destination file after segmentation, and bPOStagged indicates whether part-of-speech tagging is required (0 = no, 1 = yes); returns true if the file is segmented successfully, false otherwise.
2. Stop words generally fall into two classes: one class comprises words whose use is so widespread as to be excessively frequent, such as "I" and "just"; the other comprises words that occur very frequently in the text but carry little practical meaning, mainly mood particles, adverbs, prepositions and conjunctions. Removing stop words means removing these two classes from the words that form the text network's nodes, reducing the complexity of the network. The part of speech of a label is usually a noun, verb or adjective, and the word length is generally at least two characters; therefore the segmented text must be part-of-speech tagged, and only words of these three parts of speech are retained.
3. The concrete flow, as shown in Fig. 1, is:
(1) segment the document data with the ICTCLAS segmentation system;
(2) remove the useless stop words from the segmentation result;
(3) part-of-speech tag the result, retain the nouns, verbs and adjectives that can serve as labels, and filter out the remaining words to exclude interference.
In step three, the text data is automatically labeled with the unsupervised TextRank algorithm, which is improved, and Word2vector is used in combination to compute the relevance between words. The text data is then topic-analysed with LDA, and the labels are generated comprehensively.
Specifically, the PageRank algorithm is the sole criterion Google uses to weigh how good a website is; it was proposed by Google founders Larry Page and Sergey Brin in 1998. The algorithm makes full use of the structure of hyperlinks between webpages to rank pages. Its basic idea is to interpret a link from one webpage to another as a vote by the former for the latter: the more times a page is linked, the more votes from other pages it possesses and the more important it is. At the same time, the weight of a voting page's vote depends on the importance of that page itself: if a page is itself important, the pages it links to are comparatively important as well. The PageRank algorithm can be applied to the extraction of keywords and sentences: words or sentences are regarded as webpages, the connections between words or sentences are regarded as page links, the importance of each word or sentence is computed with the algorithm, and the important words or sentences are extracted.
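The voting idea can be illustrated with a short sketch on a toy undirected word graph; the graph and the scores it yields are an invented example, not data from the patent.

    def pagerank(neighbors, d=0.85, iters=50):
        # every node starts with the same score; each iteration redistributes the
        # votes: (1 - d) plus d times the shares received from the node's neighbours
        pr = {v: 1.0 for v in neighbors}
        for _ in range(iters):
            pr = {v: (1 - d) + d * sum(pr[u] / len(neighbors[u]) for u in neighbors[v])
                  for v in neighbors}
        return pr

    graph = {"data": {"mining", "analysis"},
             "mining": {"data"},
             "analysis": {"data"}}
    print(pagerank(graph))   # "data" collects the most votes, hence the highest score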
1. Rada Mihalcea and Paul Tarau proposed the TextRank algorithm in 2004, based on the PageRank algorithm. TextRank is in essence a graph-based algorithm: in it, words or sentences are equated with the nodes of a graph, and the connections between words or sentences with its edges. The text network is represented as DN = (W, R), where W is the set of words forming the text network and R is the set of relations between any two words in W. The connection between words is expressed by their co-occurrence count within a sliding window of specific length.
(1) Similar to the PageRank idea, if a word is directly connected to another word by an edge, the former is considered to have cast a vote for the latter, and the importance of that vote depends in turn on the importance of the voting word itself; thus the importance of a word is jointly decided by the votes it obtains and the importance of the words that vote for it. PageRank considers the probability of a page linking to other pages to be random and equal, so the resulting graph is unweighted. In a text network, however, there are various connections between two words, and the strength of the connection between words must be considered. Assuming conn(wi, wj) denotes the connection between words wi and wj (here, their co-occurrence count within a word window of given length), the TextRank value of word wi is defined as:
TextRank(wi) = (1 - d) + d * Σ_{wj ∈ In(wi)} [conn(wj, wi) / Σ_{wk ∈ Out(wj)} conn(wj, wk)] * TextRank(wj)
where In(wi) denotes the set of words pointing to wi, Out(wj) denotes the set of words pointed to by wj, and d denotes the damping factor, taken as 0.85.
(2) Rada Mihalcea and Paul Tarau confirmed experimentally that mapping text to a directed graph extracts keywords with lower accuracy than mapping text to an undirected graph, which shows there is no directionality between words. The directed-graph TextRank definition is therefore changed to:
TextRank(wi) = (1 - d) + d * Σ_{wj ∈ L(wi)} [conn(wi, wj) / Σ_{wk ∈ L(wj)} conn(wk, wj)] * TextRank(wj)
where L(wi) and L(wj) denote the sets of words directly connected to wi and wj respectively.
2. Improving the TextRank algorithm.
In the TextRank algorithm proposed by Rada Mihalcea and Paul Tarau, the relation between words only considers their co-occurrence count within a certain window length, while the characteristic information of the words themselves in the whole text, such as word frequency, word position and word span, is ignored; in addition, the relevance between words is analysed only from the current text, so the relevance of words is not accurate enough. The invention improves the algorithm in the following three respects: first, the word's own information (including word frequency, word position and word span) is used to compute the word weight; then the closeness of the connection between two words is weighed by the word weights and the frequency of co-occurrence between the words; finally, the relevance between words is computed with Word2vector.
(1) Computing the word weight. The word weight is computed from the word frequency, word position and word span; the weight of word wi is computed as:
m(wi) = tf(wi) * loc(wi) * span(wi)
where m(wi) is the weight of word wi, tf(wi) is its word-frequency factor, loc(wi) its location factor and span(wi) its span factor. The factors are computed as follows:
【1】The word-frequency factor. The higher a word's frequency, the more important the word is in the text. The word-frequency factor is computed with a nonlinear function: assuming word wi occurs fre(wi) times in the text, the word-frequency factor is computed as:
tf(wi) = fre(wi) / (1 + fre(wi))
【2】The word-location factor. Words play different roles depending on their position in the text: the words in the first 10% are the most important for expressing the text's subject, and the words in the first 10%-30% of the text come second. The text data is divided into three regions: the first 10% is the first region, with positional value 50; the first 10%-30% is the second region, with positional value 30; the final region has positional value 20; a word that occurs in several regions takes the maximum value. The positional value of word wi is denoted area(wi), and the factor is computed as:
loc(wi) = (area(wi) - 1) / (area(wi) + 1)
【3】The word-span factor. The word span reflects the word's coverage in the text: the larger the span, the more the word reflects global information. In tag extraction, words with a large span are needed, since they can reflect the global theme of the text. The factor is computed as:
span(wi) = (last(wi) - first(wi) + 1) / sum
where first(wi) and last(wi) denote the positions of the word's first and last occurrence in the text respectively, and sum is the total number of words in the text.
(2) Computing the word relation.
Words activate one another: some words always appear in pairs with other words, and when one word appears it often leads people naturally to expect the other; this effect between words is called word activation force. On the other hand, a word often collocates with more than one word, and the actual collocate must be judged from the specific language environment. The strength with which words activate one another also differs from text to text, so within one text the connections between words can be established from the importance of the words themselves and the activation between words.
The physical meaning of word activation force is similar to gravitation. It was initially defined as follows: suppose words wi and wj occur fre(wi) and fre(wj) times in the corpus respectively, and the frequency of their co-occurrence is co-occur(wi, wj); then the activation force of word wi on word wj is:
af(wi, wj) = [co-occur(wi, wj) / fre(wi)] * [co-occur(wi, wj) / fre(wj)] / d(wi, wj)²
where d(wi, wj) is the average distance between wi and wj when they co-occur.
【1】Comparing with the formula of universal gravitation, it can be seen that in the word activation force formula the first and second terms represent the masses of the two objects and d(wi, wj) represents the distance between the objects. The word activation force reflects the strength of the "attraction" between two words. However, the original formula only considers the words' own frequencies and their co-occurrence count; it does not take the words' other characteristics into account and so cannot make full use of the information in the text.
In a document data file, the word frequency, position and span of a word are intrinsic attributes of the word in that text. Likewise, connections exist between words; by analogy with the formula of universal gravitation, the quantified formula for the "attraction" between words is obtained:
conn(wi, wj) = m(wi) * m(wj) / r(wi, wj)²
where m(wi) and m(wj) are the weights of words wi and wj respectively, and conn(wi, wj) reflects the connection between two words of different weights.
【2】Word2vector computes the relevance between words. In the training process, after the corpus words and their corresponding vectors are obtained, all words are clustered with k-means by their vectors, yielding clusters of highly relevant words. Relevance is determined by the cosine value of two words: the larger the cosine value, the greater the relevance. Assuming words wi, wj are both n-dimensional vectors, the relevance value cos(wi, wj) is computed as:
cos(wi, wj) = Σ_{k=1..n}(wik * wjk) / (sqrt(Σ_{k=1..n} wik²) * sqrt(Σ_{k=1..n} wjk²))
From this, the improved word relation Conn(wi, wj) is obtained:
Conn(wi, wj) = conn(wi, wj) * (1 + cos(wi, wj))
Replacing the conn(wi, wj) above with Conn(wi, wj) gives the improved TextRank formula:
TextRank(wi) = (1 - d) + d * Σ_{wj ∈ L(wi)} [Conn(wi, wj) / Σ_{wk ∈ L(wj)} Conn(wk, wj)] * TextRank(wj)
3. The LDA result set includes several themes; each theme includes the words belonging to that theme and the probability with which each word belongs to it, and all words are sorted by probability from large to small. The processed data file is treated as a sequence [w1, w2, ..., wn], where wi denotes the i-th word and n the total number of words. The expected number of the data file's words contained in each theme is denoted n̄_Ti. Assuming there are K themes T, the probability distribution of the data file over the different themes can be obtained by computing the probability of each theme:
p_Ti = n̄_Ti / n
where n̄_Ti is the expected number of words belonging to theme i; assuming the probability that word wj belongs to theme i is p(wj, Ti), n̄_Ti is computed as:
n̄_Ti = Σ_{j=1..n} p(wj, Ti)
4. The concrete flow:
(1) read the pre-processed data file; record the word frequencies, the positions of first and last occurrence and the total number of words in the text, and count the information of every word in the data;
(2) compute the word weights: the values of the word-frequency factor, the word-location factor and the word-span factor respectively;
(3) compute the word spacing, taking the sentence as the unit: if two words appear in the same sentence, their co-occurrence count is incremented by 1; the word spacing is the reciprocal of the co-occurrence count, and if the co-occurrence count of two words is 0, their distance is infinite;
(4) compute the word attraction: substitute the word spacing of the previous step into the attraction quantification formula to obtain the quantified attraction of the two words; if the distance between two words is infinite, their attraction is 0 and their occurrences do not affect one another;
(5) compute the relevance between words: the cosine value expressing the degree of relevance is computed with Word2vector;
(6) compute the word TextRank values: initialize the TextRank values to 1, substitute the results into the improved TextRank formula, set the iteration-termination threshold to 0.0001, and iterate the formula until the result converges, obtaining the TextRank value of each word;
(7) sort the words from high to low by their computed TextRank values;
(8) take the top 20 words of the ranking as the text labels;
(9) compute the theme distribution probability of the data file with LDA;
(10) select the theme of maximum probability and take the 5 highest-probability words it contains, forming the theme label of the text.
Step 3-1 comprises (1)(2)(3)(4)(5)(6)(7)(8); as shown in Fig. 2, it generates the text labels, whose label data all come from the data file.
Step 3-2 comprises (1)(9)(10); as shown in Fig. 3, it generates the document theme label, whose label data do not necessarily come from the data file.
In step four, the visualization of the document data's labels and key content is realized. The invention uses two kinds of labels: one is the text label, whose content comes from the body of the document; the other is the theme label, whose label data come from the theme of the document data; it can reflect the theme of the document data and also solves the problem that the document body may not contain the theme.
The invention displays the results in tag-cloud form, generated with PyTagCloud, a Python extension library based on the Wordle technique. The generated tag cloud shows different words in different colors; the first five words of the text-label ranking are displayed, with the font size reflecting the word weight: the greater the word weight, the more eye-catching it appears in the tag cloud. In addition, in the document data the 20 text labels are marked in a color different from the other words, so that the user can quickly find the key points while reading the document content.
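A minimal sketch of this display step, assuming the PyTagCloud package named above; the label words, weights and output path are invented illustrative values, not results from the patent.

    from pytagcloud import create_tag_image, make_tags

    # the first five text labels with their TextRank weights (invented values)
    counts = [("big data", 12), ("label", 9), ("TextRank", 7), ("LDA", 6), ("theme", 5)]
    tags = make_tags(counts, maxsize=80)               # font size reflects word weight
    create_tag_image(tags, "labels.png", size=(600, 400))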
The above are only preferred embodiments of the application and are not intended to limit it; for those skilled in the art, the application may have various modifications and variations. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the application shall be included within the scope of protection of the application.

Claims (10)

1. A big-data-based innovative creative tag automatic labeling method, characterized by including:
Step (1): model training:
the text deep-representation model Word2vector is trained on a corpus; after training, all the words in the corpus and a vector model file with the corresponding vector of every word are obtained, i.e. the trained Word2vector model;
the document topic generation model LDA is trained on the corpus to obtain an LDA result set and the trained LDA model; the LDA result set includes several themes, and each theme includes the words belonging to that theme and the probability with which each word belongs to it;
Step (2): the data file of the webpage the user is currently browsing is segmented with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and stop words are then removed, yielding the pre-processed data file;
Step (3): the text labels and theme labels are generated;
Step (4): the final text labels and theme labels are visualized.
2. The big-data-based innovative creative tag automatic labeling method as claimed in claim 1, characterized in that the stop words of step (2) include words whose frequency of use exceeds a set threshold and words without practical meaning;
the words without practical meaning include mood particles, adverbs, prepositions and conjunctions;
the step of removing stop words includes: after word segmentation, the parts of speech are tagged; nouns, verbs and adjectives are retained, the words of the remaining parts of speech are filtered out, and words whose frequency of use exceeds the set threshold are also filtered out.
3. The big-data-based innovative creative tag automatic labeling method as claimed in claim 1, characterized in that the steps of step (3) are:
Step (31): the pre-processed data file is labeled with text labels using the unsupervised TextRank algorithm; using the trained Word2vector model, the relevance between words is computed from the vector model file, and the text labels are corrected using the relevance between words, generating the final text labels;
Step (32): topic analysis is carried out on the pre-processed data file using the LDA result set, and the theme labels are generated.
4. The big-data-based innovative creative tag automatic labeling method as claimed in claim 3, characterized in that step (31) includes:
Step (311): the pre-processed data file is read and the information of every word in it is counted; the information of each word includes: the word frequency, the position of the word's first occurrence, the position of the word's last occurrence, and the total number of words;
Step (312): the word weights are computed: the values of the word-frequency factor, the word-location factor and the word-span factor are computed respectively;
the weight m(wi) of word wi is computed as:
m(wi) = tf(wi) * loc(wi) * span(wi); (1)
where tf(wi) is the word-frequency factor of word wi, loc(wi) is its location factor, and span(wi) is its span factor;
Step (313): the word spacing is computed, taking the sentence as the unit: if two words appear in the same sentence, their co-occurrence count is incremented by 1; the word spacing is the reciprocal of the co-occurrence count, and if the co-occurrence count of two words is 0, the distance of the two words is infinite;
Step (314): the word attraction is computed by substituting the word spacing of step (313) into the attraction quantification formula, which yields the quantified attraction of the two words; if the distance between two words is infinite, the attraction of the two words is 0 and their occurrences do not affect one another;
Step (315): the relevance between words is computed; the cosine value expressing the degree of relevance is computed with the trained Word2vector model;
Step (316): the TextRank value of each word is computed: the TextRank values are initialized to 1, the word-relation results are substituted into the improved TextRank formula, the iteration-termination threshold is set to 0.0001, and the improved TextRank formula is iterated until the result converges, yielding the TextRank value of each word;
Step (317): the words are sorted from high to low by their computed TextRank values;
Step (318): the first 20 words of the ranking are taken as the text labels.
5. The big-data-based innovative creative tag automatic labeling method as claimed in claim 4, characterized in that the word-frequency factor is computed as:
tf(wi) = fre(wi) / (1 + fre(wi)); (2)
where fre(wi) is the number of occurrences of word wi in the data file.
6. The big-data-based innovative creative tag automatic labeling method as claimed in claim 4, characterized in that the word-location factor is computed as:
loc(wi) = (area(wi) - 1) / (area(wi) + 1); (3)
where area(wi) is the positional value of word wi;
words play different roles depending on their position in the text: the words in the first 10% are the most important for expressing the text's subject, and the words in the first 10%-30% of the text come second; the text data is divided into three regions: the first 10% is the first region, with positional value 50; the first 10%-30% is the second region, with positional value 30; the final region has positional value 20; a word that occurs in several regions takes the maximum value.
7. The big-data-based innovative creative tag automatic labeling method as claimed in claim 4, characterized in that the word-span factor is computed as:
span(wi) = (last(wi) - first(wi) + 1) / sum; (4)
where first(wi) is the position at which the word first appears in the text, last(wi) is the position at which the word last appears in the text, and sum is the total number of words in the text;
the word span reflects the word's coverage in the text: the larger the span, the more the word reflects global information; in tag extraction, words with a large span can reflect the global theme of the text;
the attraction quantification formula is:
conn(wi, wj) = m(wi) * m(wj) / r(wi, wj)²; (5)
where m(wi) is the weight of word wi, m(wj) is the weight of word wj, conn(wi, wj) reflects the connection between two words of different weights, and r(wi, wj) is the spacing of words wi and wj.
8. The big-data-based innovative creative tag automatic labeling method as claimed in claim 4, characterized in that:
while training the text deep-representation model Word2vector on the corpus, after the corpus words and the corresponding vector of every word are obtained, all words are clustered with k-means by vector relevance, yielding clusters of highly relevant words; relevance is determined by the cosine value of two words: the larger the cosine value, the greater the relevance;
assuming words wi, wj are both n-dimensional vectors, the relevance cos(wi, wj) is computed as:
cos(wi, wj) = Σ_{k=1..n}(wik * wjk) / (sqrt(Σ_{k=1..n} wik²) * sqrt(Σ_{k=1..n} wjk²)); (6)
the improved word relation Conn(wi, wj) is then:
Conn(wi, wj) = conn(wi, wj) * (1 + cos(wi, wj)); (7)
and the improved TextRank formula is:
TextRank(wi) = (1 - d) + d * Σ_{wj ∈ L(wi)} [Conn(wi, wj) / Σ_{wk ∈ L(wj)} Conn(wk, wj)] * TextRank(wj); (8)
where TextRank(wi) denotes the importance of wi, TextRank(wj) denotes the importance of word wj, L(wi) is the set of words directly connected to wi, and d is the damping factor.
9. The big-data-based innovative creative tag automatic labeling method as claimed in claim 3, characterized in that step (32) includes:
Step (321): the pre-processed data file is read, the total number of words in the text is recorded, and the information of each word in the data is counted;
Step (322): the theme distribution probability of the data file is computed from the LDA result set;
the LDA result set includes several themes; each theme includes the words belonging to that theme and the probability with which each word belongs to it, and all words are sorted by probability from large to small; the pre-processed data file is treated as a sequence [w1, w2, w3, ..., wn], where wi denotes the i-th word and n is the total number of words; the expected number of the data file's words contained in each theme is denoted n̄_Ti; assuming there are K themes, the probability distribution of the data file over the different themes is obtained by computing the probability p_Ti that the data file belongs to the i-th theme Ti:
p_Ti = n̄_Ti / n; (9)
where n̄_Ti is the expected number of words belonging to the i-th theme Ti; assuming that the probability of word wj belonging to theme Ti is p(wj, Ti), n̄_Ti is computed as:
n̄_Ti = Σ_{j=1..n} p(wj, Ti); (10)
Step (323): the theme of maximum probability is selected and the 5 highest-probability words it contains are taken, constituting the theme label of the text.
10. A big-data-based innovative creative tag automatic labeling system, characterized by including:
a model training unit:
the text deep-representation model Word2vector is trained on a corpus; after training, all the words in the corpus and a vector model file with the corresponding vector of every word are obtained, i.e. the trained Word2vector model;
the document topic generation model LDA is trained on the corpus to obtain an LDA result set and the trained LDA model; the LDA result set includes several themes, and each theme includes the words belonging to that theme and the probability with which each word belongs to it;
a data file processing unit: segments the data file of the webpage the user is currently browsing with the ICTCLAS word segmentation system of the Chinese Academy of Sciences and then removes stop words, yielding the pre-processed data file;
a label generation unit: generates the text labels and theme labels;
a visualization unit: visualizes the final text labels and theme labels.
CN201710173029.3A — filed 2017-03-22 — Innovative creative tag automatic labeling method and system based on big data — Active

Publications (2)

CN106997382A — published 2017-08-01
CN106997382B — granted, published 2020-12-01

Family ID: 59431684
Country: CN (China)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164394A (en) * 2012-07-16 2013-06-19 上海大学 Text similarity calculation method based on universal gravitation
CN106021620A (en) * 2016-07-14 2016-10-12 北京邮电大学 Method for automatic detection of power outage events using social media
CN106469187A (en) * 2016-08-29 2017-03-01 东软集团股份有限公司 Keyword extraction method and device
CN106372064A (en) * 2016-11-18 2017-02-01 北京工业大学 Feature word weight calculation method for text mining

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
南江霞: "Research on Automatic Annotation Technology for Chinese Text and Its Application", China Master's Theses Full-text Database, Information Science and Technology Series *
夏天: "Keyword Extraction with Word Vector Clustering Weighted TextRank", Data Analysis and Knowledge Discovery *

Cited By (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019041521A1 (en) * 2017-08-29 2019-03-07 平安科技(深圳)有限公司 Apparatus and method for extracting user keyword, and computer-readable storage medium
CN107861948A (en) * 2017-11-16 2018-03-30 百度在线网络技术(北京)有限公司 Tag extraction method, apparatus, device and medium
CN108415953A (en) * 2018-02-05 2018-08-17 华融融通(北京)科技有限公司 Non-performing asset management knowledge management method based on natural language processing
CN108415953B (en) * 2018-02-05 2021-08-13 华融融通(北京)科技有限公司 Method for managing non-performing asset management knowledge based on natural language processing technology
CN108549626A (en) * 2018-03-02 2018-09-18 广东技术师范学院 Keyword extraction method for MOOCs
WO2019165678A1 (en) * 2018-03-02 2019-09-06 广东技术师范学院 Keyword extraction method for MOOCs
CN108763189A (en) * 2018-04-12 2018-11-06 武汉斗鱼网络科技有限公司 Live broadcast room content label weight calculation method, device and electronic equipment
CN108763189B (en) * 2018-04-12 2022-03-25 武汉斗鱼网络科技有限公司 Live broadcast room content label weight calculation method and device and electronic equipment
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Named entity recognition method, device, equipment and computer-readable storage medium
CN108536679B (en) * 2018-04-13 2022-05-20 腾讯科技(成都)有限公司 Named entity recognition method, device, equipment and computer readable storage medium
CN108959431A (en) * 2018-06-11 2018-12-07 中国科学院上海高等研究院 Automatic label generation method, system, computer-readable storage medium and device
CN108959431B (en) * 2018-06-11 2022-07-05 中国科学院上海高等研究院 Automatic label generation method, system, computer readable storage medium and equipment
CN110738033B (en) * 2018-07-03 2023-09-19 百度在线网络技术(北京)有限公司 Report template generation method, device and storage medium
CN110738033A (en) * 2018-07-03 2020-01-31 百度在线网络技术(北京)有限公司 Report template generation method, device and storage medium
CN109344248A (en) * 2018-07-27 2019-02-15 中山大学 Academic topic life cycle analysis method based on scientific and technological literature abstract clustering
CN109344248B (en) * 2018-07-27 2021-10-22 中山大学 Academic topic life cycle analysis method based on scientific and technological literature abstract clustering
CN108920466A (en) * 2018-07-27 2018-11-30 杭州电子科技大学 Scientific text keyword extraction method based on word2vec and TextRank
CN110807097A (en) * 2018-08-03 2020-02-18 北京京东尚科信息技术有限公司 Method and device for analyzing data
CN109344253A (en) * 2018-09-18 2019-02-15 平安科技(深圳)有限公司 Method, apparatus, computer device and storage medium for adding user tags
CN111125355A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Information processing method and related equipment
CN109710916B (en) * 2018-11-02 2024-02-23 广州财盟科技有限公司 Label extraction method and device, electronic equipment and storage medium
CN109710916A (en) * 2018-11-02 2019-05-03 武汉斗鱼网络科技有限公司 Tag extraction method, apparatus, electronic device and storage medium
CN109614455A (en) * 2018-11-28 2019-04-12 武汉大学 Automatic annotation method and device for geographic information based on deep learning
CN110399606B (en) * 2018-12-06 2023-04-07 国网信息通信产业集团有限公司 Unsupervised electric power document theme generation method and system
CN110399606A (en) * 2018-12-06 2019-11-01 国网信息通信产业集团有限公司 Unsupervised electric power document theme generation method and system
CN109783798A (en) * 2018-12-12 2019-05-21 平安科技(深圳)有限公司 Method, apparatus, terminal and storage medium for adding pictures to text information
CN111382265A (en) * 2018-12-28 2020-07-07 中国移动通信集团贵州有限公司 Search method, apparatus, device and medium
CN109686445A (en) * 2018-12-29 2019-04-26 成都睿码科技有限责任公司 Intelligent diagnosis guiding algorithm based on automatic labeling and multi-model fusion
CN109686445B (en) * 2018-12-29 2023-07-21 成都睿码科技有限责任公司 Intelligent diagnosis guiding algorithm based on automatic label and multi-model fusion
CN109885674A (en) * 2019-02-14 2019-06-14 腾讯科技(深圳)有限公司 Topic label determination and information recommendation method and device
CN109885674B (en) * 2019-02-14 2022-10-25 腾讯科技(深圳)有限公司 Method and device for determining topic labels and recommending information
CN110162592A (en) * 2019-05-24 2019-08-23 东北大学 News keyword extraction method based on gravitation-improved TextRank
CN110263343A (en) * 2019-06-24 2019-09-20 北京理工大学 Keyword extraction method and system based on phrase vectors
CN110347977A (en) * 2019-06-28 2019-10-18 太原理工大学 Automatic news tagging method based on the LDA model
CN110413796A (en) * 2019-07-03 2019-11-05 北京信息科技大学 Domain ontology construction method for typical coal mine dynamic disasters
CN110557504A (en) * 2019-08-30 2019-12-10 Oppo广东移动通信有限公司 Dynamic ringtone update method, device, equipment and medium for intelligent terminal devices
CN110557504B (en) * 2019-08-30 2021-06-04 Oppo广东移动通信有限公司 Dynamic ringtone update method, device, equipment and medium for intelligent terminal devices
CN110717329A (en) * 2019-09-10 2020-01-21 上海开域信息科技有限公司 Method for performing approximate search based on word vectors to rapidly extract advertisement text themes
CN110717329B (en) * 2019-09-10 2023-06-16 上海开域信息科技有限公司 Method for performing approximate search based on word vector to rapidly extract advertisement text theme
CN112559853A (en) * 2019-09-26 2021-03-26 北京沃东天骏信息技术有限公司 User label generation method and device
CN112559853B (en) * 2019-09-26 2024-01-12 北京沃东天骏信息技术有限公司 User tag generation method and device
CN111177321A (en) * 2019-12-27 2020-05-19 东软集团股份有限公司 Method, device and equipment for determining corpus and storage medium
CN111177321B (en) * 2019-12-27 2023-10-20 东软集团股份有限公司 Method, device, equipment and storage medium for determining corpus
CN112270192B (en) * 2020-11-23 2023-12-19 科大国创云网科技有限公司 Semantic recognition method and system based on part-of-speech and stop-word filtering
CN112270192A (en) * 2020-11-23 2021-01-26 科大国创云网科技有限公司 Semantic recognition method and system based on part-of-speech and stop-word filtering
CN112905741B (en) * 2021-02-08 2022-04-12 合肥供水集团有限公司 Water supply user focus mining method considering space-time characteristics
CN112905741A (en) * 2021-02-08 2021-06-04 合肥供水集团有限公司 Water supply user focus mining method considering space-time characteristics
CN112860919A (en) * 2021-02-20 2021-05-28 平安科技(深圳)有限公司 Data labeling method, device and equipment based on generative model and storage medium
CN113761911A (en) * 2021-03-17 2021-12-07 中科天玑数据科技股份有限公司 Domain text labeling method based on weak supervision
CN113128234A (en) * 2021-06-17 2021-07-16 明品云(北京)数据科技有限公司 Method and system for establishing entity recognition model, electronic equipment and medium
CN113128234B (en) * 2021-06-17 2021-11-02 明品云(北京)数据科技有限公司 Method and system for establishing entity recognition model, electronic equipment and medium
CN114661900A (en) * 2022-02-25 2022-06-24 安阳师范学院 Text annotation recommendation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN106997382B (en) 2020-12-01

Similar Documents

Publication Publication Date Title
CN106997382A (en) Innovation intention label automatic marking method and system based on big data
Ray et al. An ensemble-based hotel recommender system using sentiment analysis and aspect categorization of hotel reviews
Yan et al. Network-based bag-of-words model for text classification
CN104794169B (en) Subject term extraction method and system based on a sequence labeling model
CN112861990B (en) Topic clustering method and device based on keywords and entities and computer readable storage medium
CN107315738A (en) Innovation degree evaluation method for text information
Gupta et al. A novel hybrid text summarization system for Punjabi text
CN110888991A (en) Sectional semantic annotation method in weak annotation environment
Archchitha et al. Opinion spam detection in online reviews using neural networks
Qiu et al. Query intent recognition based on multi-class features
CN115329085A (en) Social robot classification method and system
Kanapala et al. Passage-based text summarization for legal information retrieval
Al Imran et al. Bnnet: A deep neural network for the identification of satire and fake bangla news
CN109086443A (en) Social media short text on-line talking method based on theme
Özyirmidokuz Mining unstructured Turkish economy news articles
Ezzat et al. TopicAnalyzer: A system for unsupervised multi-label Arabic topic categorization
Clarizia et al. Sentiment analysis in social networks: A methodology based on the latent dirichlet allocation approach
Waghmare et al. Survey paper on sentiment analysis for tourist reviews
Shahbazi et al. Deep Learning Method to Estimate the Focus Time of Paragraph
Ahmed et al. Building multiview analyst profile from multidimensional query logs: from consensual to conflicting preferences
Song et al. Unsupervised learning of word semantic embedding using the deep structured semantic model
Gheni et al. Suggesting new words to extract keywords from title and abstract
Mallek et al. An Unsupervised Approach for Precise Context Identification from Unstructured Text Documents
Sharma et al. Multi-aspect sentiment analysis using domain ontologies
Li et al. A Method of Interest Degree Mining Based on Behavior Data Analysis

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant