CN106997382B - Innovative creative tag automatic labeling method and system based on big data - Google Patents

Innovative creative tag automatic labeling method and system based on big data

Info

Publication number
CN106997382B
CN106997382B CN201710173029.3A
Authority
CN
China
Prior art keywords
words
word
text
calculating
theme
Prior art date
Legal status
Active
Application number
CN201710173029.3A
Other languages
Chinese (zh)
Other versions
CN106997382A (en)
Inventor
鹿旭东
张盘龙
陈志勇
郭伟
崔立真
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN201710173029.3A priority Critical patent/CN106997382B/en
Publication of CN106997382A publication Critical patent/CN106997382A/en
Application granted granted Critical
Publication of CN106997382B publication Critical patent/CN106997382B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an innovative creative tag automatic labeling method and system based on big data. The method comprises the following steps: train Word2vector and LDA with the Sogou corpus to obtain a training result set; perform word segmentation, stop-word removal and part-of-speech filtering on the document data of the page the user is browsing; combine a modified TextRank algorithm with Word2vector to compute labels drawn from the text data itself; and use LDA on the preprocessed document to derive labels describing the subject matter of the document data. Visualization is realized in the form of a tag cloud, and all label words are marked in the document data, so the user can easily read and find the key content.

Description

Innovative creative tag automatic labeling method and system based on big data
Technical Field
The invention relates to an innovative creative tag automatic labeling method and system based on big data.
Background
With the rapid development and popularization of the internet, information has grown explosively and a huge amount of information has accumulated online. At the same time, internet users are no longer just browsers of content; they also create all kinds of information on the internet, so internet information now takes many forms, which makes information screening much harder. Text-based information accounts for a large proportion of internet information. The growth in volume and the comprehensive coverage of information give people more references when searching, touch every aspect of daily life and greatly facilitate it; however, such a large amount of information also easily leaves people unable to choose, and quickly selecting effective information from the mass of data is not easy.
When enterprises carry out innovation work, big data serves as the basis for analysis and planning, and the valuable data must be identified, viewed and analyzed. How to make full use of big data, quickly and effectively obtain the data relevant to the topics an enterprise cares about, label the key data, eliminate cluttered and useless information, and let the enterprise focus on the more valuable and important information is a difficulty of current innovation work; text labeling arose against this background. Text annotation refers to describing a text with a number of specific words or phrases that reflect its subject; these words or phrases are usually called tags, and by reading them a reader can quickly grasp the subject of the text and decide whether it is of interest.
Automatic text labeling is an emerging research topic that has developed along with the internet. It derives from information extraction and text classification, and combines research methods from information retrieval, collaborative filtering and related directions. The text labeling techniques developed in recent years include user-based social tagging, multi-label classification labeling and keyword-extraction labeling.
These are the main text labeling methods at present. User-based social tagging suffers from the cold-start problem in the initial stage of a service, because there is no past data to draw on; multi-label classification methods are mostly supervised learning algorithms that need a large amount of manually labeled data as training sets, and manual labeling is time-consuming, labor-intensive and highly subjective.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides an innovative creative tag automatic labeling method and system based on big data. Texts are labeled with a keyword-extraction approach, which belongs to unsupervised learning and avoids manually labeled data sets.
An innovative creative tag automatic labeling method based on big data comprises the following steps:
step (1): model training:
training a text depth representation model Word2vector by using a corpus, and obtaining all words and vector model files corresponding to all words in the corpus after training to obtain a well-trained Word2vector model;
training a document theme generation model LDA by using a corpus to obtain an LDA result set and a trained LDA model, wherein the LDA result set comprises a plurality of themes, and each theme comprises words belonging to the theme and the probability of the words belonging to the theme;
step (2): performing word segmentation on the data document of the page the user is currently browsing with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and then removing stop words, to obtain a preprocessed data document;
step (3): generating a text label and a subject label;
step (4): realizing visualization of the final text label and subject label.
The stop words in step (2) include words whose usage frequency exceeds a set threshold and words without practical meaning.
The words without practical meaning include modal particles, adverbs, prepositions and conjunctions.
The step of removing stop words comprises: after word segmentation, part-of-speech tagging is performed; nouns, verbs and adjectives are retained, words of other parts of speech are filtered out, and words whose usage frequency exceeds the set threshold are also filtered out.
The step (3) comprises the following steps:
step (31): labeling text labels on the preprocessed data documents by using a TextRank algorithm of unsupervised learning, calculating the correlation between words based on a vector model file by using a trained Word2vector model, and modifying the text labels by using the correlation between the words; generating a final text label;
step (32): and performing theme analysis on the preprocessed data document by using the LDA result set to generate a theme label.
The step (31) comprises:
step (311): reading the preprocessed data document, and counting the information of each word in the data document; the information of each word includes: word frequency, position of first appearance of word, position of last appearance of word and total number of words;
a step (312): calculating the word weight: respectively calculating the values of the word frequency factor, the word position factor and the word span factor;
The weight m(w_i) of word w_i is calculated as:
m(w_i) = tf(w_i) * loc(w_i) * span(w_i);    (1)
where tf(w_i) is the word-frequency factor of word w_i, loc(w_i) is its position factor, and span(w_i) is its span factor.
The word-frequency factor tf(w_i) (Equation (2)) is a nonlinear function of fre(w_i), where fre(w_i) denotes the number of occurrences of the word w_i in the data document.
The word position factor loc(w_i) (Equation (3)) is computed from area(w_i), the position value of the word w_i.
Words at different positions in the text play different roles: words in the first 10% of the text are the most important for expressing the text theme, and words in the 10%-30% range are the next most important. The text data is therefore divided into three regions: the first 10% forms the first region with position value 50, the 10%-30% range forms the second region with position value 30, and the remaining region has position value 20; a word that appears in more than one region takes the maximum value.
The word span factor span(w_i) (Equation (4)) is computed from first(w_i), the position of the first occurrence of the word in the text, last(w_i), the position of its last occurrence, and sum, the total number of words contained in the text.
The word span reflects the coverage of the word in the text: the larger the span, the greater the contribution to reflecting global information. In label extraction, words with a large span can reflect the global theme of the text.
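As a concrete illustration of Equation (1), the following Python sketch computes the word weight from the three factors. Because Equations (2)-(4) are only described qualitatively above, the tf_factor, loc_factor and span_factor bodies below are assumptions chosen to match that description, not the patent's exact formulas.

```python
def tf_factor(freq):
    # word-frequency factor tf(w_i): a nonlinear (saturating) function of the
    # occurrence count fre(w_i); the exact form of Equation (2) is assumed here
    return freq / (freq + 1.0)

def loc_factor(first_pos, total_words):
    # position factor loc(w_i): region values 50 / 30 / 20 as described above;
    # using the first occurrence automatically gives the maximum region value
    ratio = first_pos / total_words
    if ratio <= 0.10:
        return 50
    if ratio <= 0.30:
        return 30
    return 20

def span_factor(first_pos, last_pos, total_words):
    # span factor span(w_i): coverage of the word across the text,
    # assumed here to be (last - first + 1) / sum
    return (last_pos - first_pos + 1) / total_words

def word_weight(freq, first_pos, last_pos, total_words):
    # Equation (1): m(w_i) = tf(w_i) * loc(w_i) * span(w_i)
    return (tf_factor(freq)
            * loc_factor(first_pos, total_words)
            * span_factor(first_pos, last_pos, total_words))
```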
Step (313): calculating word spacing, taking a sentence as a unit, if two words appear in one sentence at the same time, adding 1 to the co-occurrence times of the two words, wherein the word spacing is the reciprocal of the co-occurrence times, and if the co-occurrence times of the two words is 0, the distance between the two words is infinite;
step (314): calculating the word attraction force, and substituting the word space in the step (313) into an attraction force quantization formula of the words to obtain the attraction force quantization expression of the two words; if the distance between the two words is infinite, the attractive force of the two words is 0, and the two words are not influenced by each other if the two words appear or not;
Word attraction quantization formula:
conn(w_i, w_j) = m(w_i) * m(w_j) / r(w_i, w_j)^2;    (5)
where m(w_i) is the weight of word w_i, m(w_j) is the weight of word w_j, conn(w_i, w_j) reflects the relation between the two words with different weights, and r(w_i, w_j) is the spacing between word w_i and word w_j;
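A minimal sketch of steps (313) and (314), assuming the weights m(w_i) come from the word_weight sketch above and the spacing is the reciprocal of the sentence-level co-occurrence count:

```python
import math

def word_spacing(co_occurrence_count):
    # step (313): spacing r(w_i, w_j) is the reciprocal of the number of sentences
    # in which both words occur; zero co-occurrence means infinite distance
    return math.inf if co_occurrence_count == 0 else 1.0 / co_occurrence_count

def attraction(m_i, m_j, r_ij):
    # step (314) / Equation (5): conn(w_i, w_j) = m(w_i) * m(w_j) / r(w_i, w_j)^2
    if math.isinf(r_ij):
        return 0.0          # the words never co-occur, so they do not attract each other
    return m_i * m_j / (r_ij ** 2)
```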
step (315): calculating the correlation between words: a cosine value representing the magnitude of the correlation is calculated with the trained Word2vector model.
In the process of training the text depth representation model Word2vector with the corpus, after the corpus words and their corresponding vectors are obtained, all words are clustered with k-means according to vector correlation, yielding clusters composed of highly correlated words. The correlation of two words is determined by the cosine of their vectors; the larger the cosine value, the stronger the correlation.
Assume that the words w_i and w_j are both n-dimensional vectors; their correlation cos(w_i, w_j) is calculated as:
cos(w_i, w_j) = ( Σ_{k=1}^{n} w_{ik} * w_{jk} ) / ( sqrt(Σ_{k=1}^{n} w_{ik}^2) * sqrt(Σ_{k=1}^{n} w_{jk}^2) );    (6)
The improved word relation Conn(w_i, w_j) is then obtained:
Conn(w_i, w_j) = conn(w_i, w_j) * (1 + cos(w_i, w_j));    (7)
The improved TextRank formula is obtained:
TextRank(w_i) = (1 - d) + d * Σ_{w_j ∈ L(w_i)} [ Conn(w_j, w_i) / Σ_{w_k ∈ L(w_j)} Conn(w_j, w_k) ] * TextRank(w_j);    (8)
where TextRank(w_i) denotes the importance of w_i, TextRank(w_j) denotes the importance of w_j, d is the damping factor, and L(w_i) and L(w_j) are the sets of words directly connected to w_i and w_j, respectively;
step (316): calculating the TextRank value of each word: initialize every TextRank value to 1, substitute the word relation results into the improved TextRank formula, set the iteration termination threshold to 0.0001, and iterate the improved TextRank formula until the results converge, thereby obtaining the TextRank value of each word;
step (317): sorting the words from high to low by their TextRank values;
step (318): selecting the top 20 words in the ranking as the text labels.
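A minimal Python sketch of steps (316)-(318), assuming the improved relations Conn(w_i, w_j) of Equation (7) have already been computed and are passed in as a symmetric dictionary; all names and the data layout are illustrative.

```python
def textrank_labels(words, conn_improved, d=0.85, tol=1e-4, max_iter=200, top_n=20):
    """Iterate the improved TextRank formula (Equation (8)) until convergence.

    words         -- list of candidate words
    conn_improved -- dict mapping (w_j, w_i) to Conn(w_j, w_i); assumed symmetric
    """
    score = {w: 1.0 for w in words}                       # step (316): initialise to 1
    # denominator of Equation (8): sum of Conn over each word's neighbours
    out_sum = {w: sum(conn_improved.get((w, v), 0.0) for v in words if v != w)
               for w in words}
    for _ in range(max_iter):
        max_delta = 0.0
        for wi in words:
            s = sum(conn_improved.get((wj, wi), 0.0) / out_sum[wj] * score[wj]
                    for wj in words
                    if wj != wi and out_sum[wj] > 0)
            new_score = (1 - d) + d * s
            max_delta = max(max_delta, abs(new_score - score[wi]))
            score[wi] = new_score
        if max_delta < tol:                               # termination threshold 0.0001
            break
    ranked = sorted(words, key=score.get, reverse=True)   # step (317)
    return ranked[:top_n]                                 # step (318): top-20 text labels
```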
Said step (32) comprises:
step (321): reading the preprocessed data document, recording the total number of text words, and counting the information of each word in the data;
step (322): calculating the theme distribution probability of the data document through the LDA result set;
the LDA result set contains a number of topics; each topic comprises the words belonging to it and the probability of each word belonging to it, and within each topic all words are sorted by probability value from large to small. The preprocessed data document is treated as a sequence [w_1, w_2, w_3, ..., w_n], where w_i denotes the ith word and n is the total number of words. The expected number of the document's words belonging to each topic is denoted E_{T_i}; if there are K topics, the probability distribution of the data document over the different topics is [p_{T_1}, p_{T_2}, ..., p_{T_K}].
The probability p_{T_i} that the data document belongs to the ith topic T_i is calculated as:
p_{T_i} = E_{T_i} / Σ_{k=1}^{K} E_{T_k};    (9)
where E_{T_i} denotes the expected number of words belonging to the ith topic T_i. If word w_j belongs to the ith topic T_i with probability p(w_j, T_i), then E_{T_i} is calculated as:
E_{T_i} = Σ_{j=1}^{n} p(w_j, T_i);    (10)
step (323): selecting the topic with the highest probability, and taking the 5 highest-probability words it contains to form the theme label.
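A minimal Python sketch of steps (321)-(323), assuming the LDA result set is available as one dictionary of word probabilities per topic and applying Equations (9) and (10) as reconstructed above; the names are illustrative.

```python
def theme_label(doc_words, topic_word_probs, top_n=5):
    """Select the theme label of a preprocessed document.

    doc_words        -- the document as a word sequence [w_1, ..., w_n]
    topic_word_probs -- list with one dict {word: p(word, topic)} per topic
    """
    # Equation (10): E_Ti = sum over the document's words of p(w_j, T_i)
    expectations = [sum(probs.get(w, 0.0) for w in doc_words)
                    for probs in topic_word_probs]
    # Equation (9): normalise the expectations into the topic distribution p_Ti
    total = sum(expectations) or 1.0
    topic_probs = [e / total for e in expectations]
    # step (323): the most probable topic supplies its top-5 words as the label
    best = max(range(len(topic_probs)), key=topic_probs.__getitem__)
    best_words = sorted(topic_word_probs[best],
                        key=topic_word_probs[best].get, reverse=True)
    return best_words[:top_n]
```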
Furthermore, the invention also provides, as a technical scheme, an innovative creative tag automatic labeling system based on big data, which automatically adds text labels and theme labels to the data documents browsed by the user, making it easy for the user to find the important information of the text and improving reading efficiency.
Innovative creative tag automatic labeling system based on big data comprises:
a model training unit:
training a text depth representation model Word2vector by using a corpus, and obtaining all words and vector model files corresponding to all words in the corpus after training to obtain a well-trained Word2vector model;
training a document theme generation model LDA by using a corpus to obtain an LDA result set and a trained LDA model, wherein the LDA result set comprises a plurality of themes, and each theme comprises words belonging to the theme and the probability of the words belonging to the theme;
a data document processing unit: performing word segmentation on the data document of the page the user is currently browsing with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and then removing stop words, to obtain a preprocessed data document;
a label generation unit: generating a text label and a subject label;
a visualization unit: visualization of the final text label and the subject label is achieved.
Compared with the prior art, the invention has the following beneficial effects:
the improved TextRank algorithm is used to obtain the keywords of the data document; compared with other algorithms, the results are more accurate and more representative, and since the extracted labels come from the document itself, they represent it well and accurately express the text content;
the LDA model is used to generate the subject label of the text, which solves the difficulty that the subject words of a text may not be contained in the text; the subject content is better reflected, and combined with the text labels this yields labels that accurately express both the text content and its theme.
drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is the preprocessing flow chart of the present invention;
FIG. 2 is the text label generation flow chart of the present invention;
FIG. 3 is a flowchart of the subject label generation of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The invention integrates an improved TextRank-based text labeling algorithm, Word2vector (a text analysis tool from Google) for calculating the correlation between words, and LDA (a document topic generation model) for extracting document topics, to realize automatic labeling of texts. The original TextRank algorithm only considers the relations between words during its calculation and ignores the feature attributes of the words themselves, so it cannot make full use of the text information when extracting keywords. The invention improves on this: first, the word weight is calculated from information such as word frequency, word position and word span; then this weight, together with a word activation force model, is used to establish an attraction relation between words that replaces the original word relation. With this improvement, information such as word frequency, word position and word span is fully exploited for individual words, while for the relations between words both the co-occurrence rate of words within sentences and the correlation between words are considered; the correlation is calculated with Word2vector, provided by Google. The topic of a document may not appear in the document's text and therefore cannot be labeled with words drawn from the document content, so LDA is used to determine the topic of the document and to provide a label for that topic.
The technical scheme of the invention is as follows: the data required for related creative work is automatically labeled in the user's query results or browsed pages, cluttered information is removed, and the results are sorted by relevance. Against the big-data background, data visualization is increasingly important; this patent presents the labeling results in the form of a tag cloud and highlights the keywords. With the invention, a data set can be labeled automatically in an unsupervised manner; the labels come from the data document itself, so the noise is low and the representativeness is good. While querying and browsing, the user can read the automatically labeled key content first and focus attention on the more important information.
The invention realizes the innovative creative automatic labeling method based on big data through the following technical scheme, with the following specific steps:
Step one: train LDA and Word2vector with a corpus.
Step two: perform word segmentation on the user's browsed page and filter out the useless words, as shown in FIG. 1;
Step three: generate labels by combining the TextRank algorithm with LDA and label automatically, as shown in FIG. 2;
Step four: visualize the tags and the key content, as shown in FIG. 3;
In step one, the Sogou corpus is used to train LDA and Word2vector.
Word2vector is a tool developed by Google that converts words into vectors, transforming the processing of the training-set content into vector operations in a fixed-dimension vector space; the computed distance between vectors represents the correlation between text words. The larger the training corpus, the better the word vectors express the words. Training on the Sogou corpus yields a model file containing all the words of the corpus and their corresponding vectors, with which the correlation between words can be calculated.
Word2vec is an efficient tool open-sourced by Google in 2013 for representing words as real-valued vectors. Using ideas from deep learning, it simplifies the processing of text content, through training, into vector operations in a K-dimensional vector space, and similarity in the vector space can represent similarity in text semantics. The word vectors output by Word2vec can be used for many NLP tasks, such as clustering, finding synonyms and part-of-speech analysis. From another point of view, taking words as features, Word2vec maps the features into a K-dimensional vector space and can find deeper feature representations for text data.
Word2vec uses the Distributed Representation of word vectors. The Distributed Representation was first proposed by Hinton in 1986. Its basic idea is to map each word, through training, to a K-dimensional real-valued vector (K is generally a hyper-parameter of the model) and to judge the semantic similarity between words by the distance between them (such as cosine similarity or Euclidean distance). Huffman coding is applied according to word frequency, so that the hidden-layer activations of words with similar frequencies are basically consistent and words with higher frequency activate fewer hidden-layer nodes, which effectively reduces the computational complexity. Word2vec is popular precisely because of its efficiency; Mikolov indicates in the paper that an optimized single-machine version can train on billions of words per day.
4. The three-layer neural network models a language model, but the representation of words in vector space that it obtains as a by-product is the real goal of Word2vec.
5. Compared with classical methods such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA), Word2vec exploits the context of words, so its semantic information is richer.
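For illustration, training such a model on a segmented corpus could look like the following sketch using the gensim library; gensim is an assumption here, since the patent itself only names the Word2vector tool and the Sogou corpus, and the toy sentences are placeholders.

```python
from gensim.models import Word2Vec

# sentences: the segmented corpus, one list of tokens per sentence/document
sentences = [["企业", "创新", "大数据"], ["文本", "标签", "抽取"]]   # toy examples

# vector_size is the dimension K discussed above (older gensim versions use size=)
model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, workers=4)

model.save("word2vec.model")                   # the "vector model file" of step one
sim = model.wv.similarity("文本", "标签")        # cosine correlation cos(w_i, w_j)
vec = model.wv["文本"]                           # the K-dimensional vector of a word
```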
LDA (Latent Dirichlet Allocation) is a document topic generation model with a three-layer structure of words, topics and documents. This generative model assumes that each word of an article is obtained by first choosing a topic with a certain probability and then choosing a word from that topic with a certain probability. Document-to-topic follows a multinomial distribution, and topic-to-word follows a multinomial distribution.
LDA is an unsupervised machine learning technique that can be used to identify the latent topic information in a corpus. It adopts the bag-of-words approach, treating each document as a word-frequency vector and thereby converting text information into numerical information that is easy to model. Each document is represented as a probability distribution over topics, and each topic as a probability distribution over words. Training with the Sogou corpus yields a set of topics and the probabilities of the words within each topic, and with this LDA training result set the probability distribution of a data document over all topics can be calculated.
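A corresponding sketch of the LDA training step, again using gensim as an assumed implementation; the topic-word probabilities it exposes form the "LDA result set" used later, and the toy documents are placeholders.

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [["企业", "创新", "大数据"], ["文本", "主题", "标签"]]    # segmented training documents

dictionary = corpora.Dictionary(docs)                      # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]         # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10)

# the "LDA result set": for every topic, its words and their probabilities
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=5))      # top-5 (word, probability) pairs
```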
In step two, the ICTCLAS word segmentation system developed by the Institute of Computing Technology of the Chinese Academy of Sciences is used to segment the text data, and then stop words are removed and parts of speech are filtered.
1. Current Chinese word segmentation algorithms fall mainly into three categories. Although several segmentation algorithms are mature, because of the complexity of the Chinese language, the ambiguity of Chinese text and the constant appearance of new words, current segmentation systems use several algorithms in combination. Chinese word segmentation research has been carried out by Tsinghua University, Peking University, Harbin Institute of Technology, Microsoft Research China, Hylanda and others, among which the ICTCLAS word segmentation system developed by the Institute of Computing Technology of the Chinese Academy of Sciences is the most prominent.
Specifically, the ICTCLAS word segmentation system is based on a five-layer hidden Markov model. Its main segmentation process comprises preliminary segmentation, unregistered-word recognition, re-segmentation and part-of-speech tagging; the preliminary segmentation uses a shortest-path method to roughly segment the Chinese text, and unregistered-word recognition handles person names, place names and complex organization names, so that segmentation precision is ensured as far as possible. Public evaluations by national and international authorities show that the system segments quickly and accurately. The following APIs are used:
(1) Initialization: bool ICTCLAS_Init(const char* pszInitDir = NULL);
pszInitDir is the initialization path. Initialization returns true on success, otherwise false.
(2) Exit: bool ICTCLAS_Exit();
releases the memory occupied by the dictionaries and clears the temporary buffers and other system resources.
(3) File processing: bool ICTCLAS_FileProcess(const char* sSrcFilename, eCodeType eCT, const char* sDsnFilename, int bPOStagged);
sSrcFilename is the path of the source file to be analyzed, eCT is the character encoding of the source file, sDsnFilename is the result file after word segmentation, and bPOStagged indicates whether part-of-speech tagging is required (0 for no, 1 for yes). Returns true if the file is segmented successfully, otherwise false.
2. Stop words fall into two categories: words that are used very widely and frequently, such as "I" and "just", and words that occur frequently in the text but carry little practical meaning, mainly auxiliary words, adverbs, prepositions and conjunctions, such as "of", "at" and "and". Removing stop words means removing these two kinds of words from the words that would otherwise form the nodes of the text network, which reduces the complexity of the network. Labels are generally nouns, verbs or adjectives, and their length is generally at least two characters, so the segmentation result is tagged with parts of speech and only words of these three parts of speech are retained.
3. The specific flow of preprocessing is shown in FIG. 1 (a minimal code sketch follows the list):
(1) segment the document data with the ICTCLAS word segmentation system;
(2) apply stop-word removal to the segmentation result and discard the useless stop words;
(3) tag the result with parts of speech, keep the nouns, verbs and adjectives that can serve as labels, and filter out the other words to eliminate interference.
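The following Python sketch mirrors this flow. It substitutes the jieba segmenter for the ICTCLAS C API listed above purely for illustration; the stop-word list, the kept part-of-speech prefixes and the file path are assumptions.

```python
import jieba.posseg as pseg

KEEP_POS_PREFIXES = ("n", "v", "a")      # nouns, verbs, adjectives

def preprocess(text, stopwords):
    """Word segmentation + stop-word removal + part-of-speech filtering."""
    kept = []
    for token in pseg.cut(text):                          # (1) segmentation with POS tags
        word, flag = token.word, token.flag
        if word in stopwords or len(word) < 2:            # (2) stop words / short words
            continue
        if flag.startswith(KEEP_POS_PREFIXES):            # (3) keep nouns/verbs/adjectives
            kept.append(word)
    return kept

# usage sketch: stop words loaded from a plain file, one word per line
# stopwords = set(open("stopwords.txt", encoding="utf-8").read().split())
# tokens = preprocess(document_text, stopwords)
```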
In step three, the unsupervised TextRank algorithm is used to label the text data automatically; the algorithm is improved, and Word2vector is used to calculate the correlation between words. LDA then performs topic analysis on the text data, and the labels are generated from both.
Specifically, the PageRank algorithm is the criterion used by Google to measure the quality of a web site, and was proposed in 1998 by the Google founders Larry Page and Sergey Brin. The algorithm makes full use of the hyperlink structure between web pages to rank them; its basic idea is to treat a link from one page to another as a vote by the former for the latter. The more often a page is linked, the more votes it receives from other pages and the more important it is. At the same time, the weight of a voting page's vote depends on that page's own importance: if a page is important, the pages it links to are relatively important as well. The PageRank algorithm can be applied to the extraction of keywords and sentences: words or sentences are regarded as web pages, the links between them as link relations between pages, their importance is computed with the algorithm, and the important words or sentences are extracted.
The TextRank algorithm was proposed by Rada Mihalcea and Paul Tarau in 2004 on the basis of the PageRank algorithm. In essence, TextRank is a graph-based algorithm in which words or sentences correspond to the nodes of a graph and the links between words or sentences correspond to its edges; a text network is represented as DN = (W, R), where W is the set of words that make up the text network and R is the set of relations between any two words in W. The association between words is represented by the number of times they co-occur within a sliding window of a certain length.
(1) Similar to the idea of PageRank, if a word is directly connected to another word by an edge, the former is considered to cast a vote for the latter, and the importance of the voting word depends on its own importance, so the importance of a word is determined by the number of votes it obtains and the importance of the words voting for it. In PageRank, the probability of linking from one web page to another is considered uniformly random, so the resulting graph is unweighted. In a text network, however, there are multiple associations between two words, so the strength of the association between words must be taken into account. Suppose conn(w_i, w_j) denotes the relation between w_i and w_j (here, the number of co-occurrences of the two within a word window of a given length); the TextRank value of word w_i is then defined as:
TextRank(w_i) = (1 - d) + d * Σ_{w_j ∈ In(w_i)} [ conn(w_j, w_i) / Σ_{w_k ∈ Out(w_j)} conn(w_j, w_k) ] * TextRank(w_j)
where In(w_i) denotes the set of words pointing to w_i, Out(w_j) denotes the set of words pointed to by w_j, and d is the damping factor, set to 0.85.
(2) Rada Mihalcea and Paul Tarau showed experimentally that mapping text into a directed graph gives lower keyword-extraction accuracy than mapping it into an undirected graph, which indicates that there is no directionality between words. The TextRank definition for the directed graph is therefore changed to:
TextRank(w_i) = (1 - d) + d * Σ_{w_j ∈ L(w_i)} [ conn(w_j, w_i) / Σ_{w_k ∈ L(w_j)} conn(w_j, w_k) ] * TextRank(w_j)
where L(w_i) and L(w_j) denote the sets of words directly connected to w_i and w_j, respectively.
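As an illustration of how such an undirected, weighted word graph can be built, the sliding-window co-occurrence counting could be sketched as follows; the window length and the data layout are assumptions.

```python
from collections import defaultdict

def cooccurrence_edges(tokens, window=5):
    """Undirected word graph: the edge weight is the number of times two words
    co-occur within a sliding window of the given length over the token sequence."""
    edges = defaultdict(int)
    for i, w_i in enumerate(tokens):
        for w_j in tokens[i + 1:i + window]:     # words within the window after w_i
            if w_i != w_j:
                edges[frozenset((w_i, w_j))] += 1
    return edges
```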
2. Improving the TextRank algorithm.
The word relation in the TextRank algorithm proposed by Rada Mihalcea and Paul Tarau only considers the number of co-occurrences of words within a window of a given length; it ignores feature information of the words in the whole text, such as word frequency, word position and word span, and it analyses the correlation between words only from the current text, so the word correlation is not accurate enough. The invention improves the algorithm in the following three respects: first, the word weight is calculated from information about the word (word frequency, word position and word span); then the closeness of the connection between words is measured by the word weights and the co-occurrence frequency between words; finally, the correlation between words is calculated with Word2vector.
(1) Calculating the word weight. The word weight is computed from the word frequency, word position and word span; the weight of word w_i is:
m(w_i) = tf(w_i) * loc(w_i) * span(w_i)
where m(w_i) is the weight of word w_i, tf(w_i) is its word-frequency factor, loc(w_i) its position factor, and span(w_i) its span factor. The factors are calculated as follows:
【1】 Word-frequency factor. The higher a word's frequency, the more important the word is in the text. The word-frequency factor is computed with a nonlinear function of fre(w_i), the number of occurrences of word w_i in the text (Equation (2)).
【2】 Word position factor. Words at different positions in the text play different roles: words in the first 10% of the text are the most important for expressing the text theme, and words in the 10%-30% range are the next most important. The text data is divided into three regions: the first 10% is the first region with position value 50, the 10%-30% range is the second region with position value 30, and the last region has position value 20; a word appearing in more than one region takes the maximum value. The position value of word w_i is denoted area(w_i), from which the position factor loc(w_i) is computed (Equation (3)).
【3】 Word span factor. The word span reflects the coverage of the word in the text: the larger the span, the greater the contribution to reflecting global information. Label extraction needs words with a large span, since they can reflect the global theme of the text. The span factor is computed from first(w_i) and last(w_i), the positions of the first and last occurrences of the word in the text, and sum, the total number of words contained in the text (Equation (4)).
(2) Calculating the word relation.
Words activate one another: some words always appear in pairs with other words, and when one appears, people naturally think of the other. This mutual activation between words is called the word activation force. On the other hand, a word is often collocated with more than one other word, and the right one must be judged from the specific language environment. The strength of mutual activation between words differs from text to text, so within one text the relation between words can be established from the importance of the words and the activation between them.
The physical meaning of the word activation force is similar to gravitation, and its original definition is as follows: suppose words w_i and w_j occur fre(w_i) and fre(w_j) times in the corpus, their co-occurrence count is co-occur(w_i, w_j), and d(w_i, w_j) is the average distance between the two words when they co-occur; the activation force of word w_i on word w_j is then:
af(w_i, w_j) = ( co-occur(w_i, w_j) / fre(w_i) ) * ( co-occur(w_i, w_j) / fre(w_j) ) / d(w_i, w_j)^2
【1】 By analogy with the formula of universal gravitation, in the word activation force formula the first and second terms play the role of the masses of two objects, and d(w_i, w_j) plays the role of the distance between them. The word activation force reflects the strength of the "attraction" between the two words. However, the original formula only considers the words' individual frequencies and their co-occurrence count; it takes no other features of the words into account and cannot make full use of the information in the text.
In one document, the word frequency, position and span of a word are inherent attributes of the word in that document. Similarly, relations exist between words; by analogy with the law of universal gravitation, the "attraction" between two words is quantified as:
conn(w_i, w_j) = m(w_i) * m(w_j) / r(w_i, w_j)^2
where m(w_i) and m(w_j) are the weights of words w_i and w_j respectively, and conn(w_i, w_j) reflects the link between the two words with different weights.
【2】 Word2vector calculates the word-to-word correlation. During training, after the corpus words and their corresponding vectors are obtained, all words are clustered with k-means according to their vectors, yielding clusters composed of highly correlated words. The correlation is determined by the cosine of the two word vectors; the larger the cosine value, the stronger the correlation. Suppose the words w_i and w_j are both n-dimensional vectors; their correlation cos(w_i, w_j) is:
cos(w_i, w_j) = ( Σ_{k=1}^{n} w_{ik} * w_{jk} ) / ( sqrt(Σ_{k=1}^{n} w_{ik}^2) * sqrt(Σ_{k=1}^{n} w_{jk}^2) )
The improved word relation Conn(w_i, w_j) can thus be obtained:
Conn(w_i, w_j) = conn(w_i, w_j) * (1 + cos(w_i, w_j))
Replacing conn(w_i, w_j) above with Conn(w_i, w_j) gives the improved TextRank formula:
TextRank(w_i) = (1 - d) + d * Σ_{w_j ∈ L(w_i)} [ Conn(w_j, w_i) / Σ_{w_k ∈ L(w_j)} Conn(w_j, w_k) ] * TextRank(w_j)
3. The LDA result set comprises a number of topics; each topic comprises the words belonging to it and the probability of each word belonging to it, and within each topic all words are sorted by probability value from large to small. The processed data document is treated as a sequence [w_1, w_2, ..., w_n], where w_i denotes the ith word and n is the total number of words. Let E_{T_i} denote the expected number of the document's words belonging to each topic; if there are K topics T, the probability distribution of the data document over the different topics is [p_{T_1}, p_{T_2}, ..., p_{T_K}], where the probability of each topic is calculated as
p_{T_i} = E_{T_i} / Σ_{k=1}^{K} E_{T_k}
Here E_{T_i} is the expected number of words belonging to topic T_i; assuming word w_j belongs to topic i with probability p(w_j, T_i), E_{T_i} is calculated as
E_{T_i} = Σ_{j=1}^{n} p(w_j, T_i)
4. The specific process comprises the following steps:
(1) read the preprocessed data document, record the word frequencies, the positions of the first and last occurrence of each word and the total number of words, and collect the statistics of every word in the data;
(2) calculate the word weights: compute the word-frequency factor, word position factor and word span factor respectively;
(3) calculate the word spacing: taking the sentence as the unit, if two words appear in the same sentence their co-occurrence count is increased by 1; the word spacing is the reciprocal of the co-occurrence count, and if the co-occurrence count of two words is 0 the distance between them is infinite;
(4) calculate the word attraction: substitute the word spacing from the previous step into the attraction quantization formula to obtain the quantified attraction between the two words; if the distance between two words is infinite, their attraction is 0 and whether either word appears has no influence on the other;
(5) calculate the correlation between words: compute with Word2vector the cosine value that represents the magnitude of the correlation;
(6) calculate the TextRank value of each word: initialize every TextRank value to 1, substitute the computed relations into the improved TextRank formula, set the iteration termination threshold to 0.0001, and iterate the formula until the results converge, giving the TextRank value of each word;
(7) sort the words from high to low by their TextRank values;
(8) select the top 20 words in the ranking as the text labels;
(9) calculate the topic distribution probability of the data document with LDA;
(10) select the topic with the highest probability, and take its 5 highest-probability words to form the theme label.
Among these, step 3-1 comprises (1) (2) (3) (4) (5) (6) (7) (8); as shown in FIG. 2, it generates the text labels, and the label data all come from the data document.
Step 3-2 comprises (1) (9) (10); as shown in FIG. 3, it generates the subject label of the document, and the label data do not necessarily come from the data document.
In step four, the visualization of the document data labels and the key content is realized. The invention uses two kinds of labels: text labels, whose content comes from the text data, and theme labels, whose data come from the theme of the document data; the latter reflect the theme of the document and also solve the problem that the document data may not contain its own theme words.
The invention generates the presentation in the form of a tag cloud using PyTagCloud, a Python extension library that implements Wordle-style tag clouds. The generated tag cloud shows different words in different colors; it displays the first five ranked words, and the font size of a word reflects its weight: the larger the weight, the more prominent the word appears in the cloud. In addition, in the document data the 20 text labels are marked in a color different from the other characters, so the user can quickly find the key points when reading the document content.
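A minimal sketch of this visualization step with PyTagCloud; the word weights, output path and font name are illustrative, and a font supporting Chinese characters would need to be registered separately.

```python
from pytagcloud import create_tag_image, make_tags

# (word, weight) pairs, e.g. top-ranked label words and their TextRank values
counts = [("innovation", 90), ("big data", 70), ("label", 55),
          ("topic", 40), ("TextRank", 30)]

tags = make_tags(counts, maxsize=80)                 # scale weights to font sizes
create_tag_image(tags, "tag_cloud.png", size=(900, 600), fontname="Lobster")
```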
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (7)

1. An innovative creative label automatic labeling method based on big data is characterized by comprising the following steps:
step (1): model training:
training a text depth representation model Word2vector by using a corpus, and obtaining all words and vector model files corresponding to all words in the corpus after training to obtain a well-trained Word2vector model;
training a document theme generation model LDA by using a corpus to obtain an LDA result set and a trained LDA model, wherein the LDA result set comprises a plurality of themes, and each theme comprises words belonging to the theme and the probability of the words belonging to the theme;
step (2): performing word segmentation on the data document of the page the user is currently browsing with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and then removing stop words, to obtain a preprocessed data document;
step (3): generating a text label and a subject label;
step (4): realizing visualization of the final text label and theme label;
the step (3) comprises the following steps:
step (31): labeling text labels on the preprocessed data documents by using a TextRank algorithm of unsupervised learning, calculating the correlation between words based on a vector model file by using a trained Word2vector model, and modifying the text labels by using the correlation between the words; generating a final text label;
step (32): performing theme analysis on the preprocessed data document by using an LDA result set to generate a theme label;
the step (31) comprises:
step (311): reading the preprocessed data document, and counting the information of each word in the data document; the information of each word includes: word frequency, position of first appearance of word, position of last appearance of word and total number of words;
a step (312): calculating the word weight: respectively calculating the values of the word frequency factor, the word position factor and the word span factor;
the weight m(w_i) of word w_i is calculated as:
m(w_i) = tf(w_i) * loc(w_i) * span(w_i);    (1)
where tf(w_i) is the word-frequency factor of word w_i, loc(w_i) is its position factor, and span(w_i) is its span factor;
step (313): calculating word spacing, taking a sentence as a unit, if two words appear in one sentence at the same time, adding 1 to the co-occurrence times of the two words, wherein the word spacing is the reciprocal of the co-occurrence times, and if the co-occurrence times of the two words is 0, the distance between the two words is infinite;
step (314): calculating the word attraction: substituting the word spacing from step (313) into the word attraction quantization formula to obtain the quantified attraction between the two words; if the distance between two words is infinite, their attraction is 0, and whether either word appears has no influence on the other;
step (315): calculating the correlation between words: a cosine value representing the magnitude of the correlation is calculated with the trained Word2vector model;
step (316): calculating the TextRank value of each word: initialize every TextRank value to 1, substitute the word relation results into the improved TextRank formula, set the iteration termination threshold to 0.0001, and iterate the improved TextRank formula until the results converge, thereby obtaining the TextRank value of each word;
step (317): sorting the words from high to low by their TextRank values;
step (318): selecting the top 20 words in the ranking as the text labels;
in the process of training the text depth representation model Word2vector with the corpus, after the corpus words and their corresponding vectors are obtained, all words are clustered with k-means according to vector correlation, yielding clusters composed of highly correlated words; the correlation is determined by calculating the cosine value of the two words, and the larger the cosine value, the stronger the correlation;
assume that the words w_i and w_j are both n-dimensional vectors; their correlation cos(w_i, w_j) is calculated as:
cos(w_i, w_j) = ( Σ_{k=1}^{n} w_{ik} * w_{jk} ) / ( sqrt(Σ_{k=1}^{n} w_{ik}^2) * sqrt(Σ_{k=1}^{n} w_{jk}^2) );    (6)
the improved word relation Conn(w_i, w_j) is then obtained:
Conn(w_i, w_j) = conn(w_i, w_j) * (1 + cos(w_i, w_j));    (7)
the word attraction quantization formula is:
conn(w_i, w_j) = m(w_i) * m(w_j) / r(w_i, w_j)^2;    (5)
where m(w_i) is the weight of word w_i, m(w_j) is the weight of word w_j, conn(w_i, w_j) reflects the relation between the two words with different weights, and r(w_i, w_j) is the spacing between word w_i and word w_j;
the improved TextRank formula is obtained:
TextRank(w_i) = (1 - d) + d * Σ_{w_j ∈ L(w_i)} [ Conn(w_j, w_i) / Σ_{w_k ∈ L(w_j)} Conn(w_j, w_k) ] * TextRank(w_j);    (8)
where TextRank(w_i) denotes the importance of w_i, TextRank(w_j) denotes the importance of w_j, d is the damping factor, and L(w_i) and L(w_j) are the sets of words directly connected to w_i and w_j, respectively.
2. The big data-based innovative creative tag automatic labeling method of claim 1, wherein the stop words of step (2) include words whose frequency of use crosses a set threshold and words without actual meaning;
the words without practical meaning include modal particles, auxiliary words, prepositions and conjunctions;
the step of removing stop words comprises: after word segmentation processing, part of speech is labeled, nouns, verbs and adjectives are reserved, words of other part of speech are filtered, and meanwhile, words with the use frequency exceeding a set threshold value need to be filtered.
3. The method as claimed in claim 1, wherein the word-frequency factor tf(w_i) is calculated as a nonlinear function of fre(w_i) (Equation (2)), where fre(w_i) denotes the number of occurrences of the word w_i in the data document.
4. The method as claimed in claim 1, wherein the word position factor loc(w_i) is calculated from area(w_i), the position value of the word w_i (Equation (3));
words at different positions in the text play different roles: words in the first 10% of the text are the most important for expressing the text theme, and words in the 10%-30% range are the next most important; the text data is divided into three regions, where the first 10% is the first region with position value 50, the 10%-30% range is the second region with position value 30, and the last region has position value 20; a word appearing in more than one region takes the maximum value.
5. The big data-based innovative creative tag automatic labeling method of claim 1, characterized in that the word span factor span(w_i) is calculated from first(w_i), the position of the first occurrence of the word in the text, last(w_i), the position of its last occurrence, and sum, the total number of words contained in the text (Equation (4));
the word span reflects the coverage of a word in the text, and the larger the span, the greater the contribution to reflecting global information; in label extraction, words with a large span can reflect the global theme of the text.
6. The big data based innovative creative tag auto-tagging method of claim 1, characterized in that,
said step (32) comprises:
step (321): reading the preprocessed data document, recording the total number of text words, and counting the information of each word in the data;
step (322): calculating the theme distribution probability of the data document through the LDA result set;
the LDA result set contains a number of topics; each topic comprises the words belonging to it and the probability of each word belonging to it, and within each topic all words are sorted by probability value from large to small; the preprocessed data document is treated as a sequence [w_1, w_2, w_3, ..., w_n], where w_i denotes the ith word and n is the total number of words; the expected number of the document's words belonging to each topic is E_{T_i};
if there are K topics, the probability distribution of the data document over the different topics is [p_{T_1}, p_{T_2}, ..., p_{T_K}];
the probability p_{T_i} that the data document belongs to the ith topic T_i is calculated as:
p_{T_i} = E_{T_i} / Σ_{k=1}^{K} E_{T_k};    (9)
where E_{T_i} denotes the expected number of words belonging to the ith topic T_i; if word w_j belongs to the ith topic T_i with probability p(w_j, T_i), then E_{T_i} is calculated as:
E_{T_i} = Σ_{j=1}^{n} p(w_j, T_i);    (10)
step (323): and selecting a theme with the highest probability, and taking 5 words from the words contained in the theme according to the probability from high to low to form the theme label.
7. An innovative creative tag automatic labeling system based on big data, characterized by comprising:
a model training unit:
training a text depth representation model Word2vector by using a corpus, and obtaining all words and vector model files corresponding to all words in the corpus after training to obtain a well-trained Word2vector model;
training a document theme generation model LDA by using a corpus to obtain an LDA result set and a trained LDA model, wherein the LDA result set comprises a plurality of themes, and each theme comprises words belonging to the theme and the probability of the words belonging to the theme;
a data document processing unit: performing word segmentation on the data document of the page the user is currently browsing with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and then removing stop words, to obtain a preprocessed data document;
a label generation unit: generating a text label and a subject label;
a visualization unit: the visualization of the final text label and the theme label is realized;
the label generating unit is as follows:
step (31): labeling text labels on the preprocessed data documents by using a TextRank algorithm of unsupervised learning, calculating the correlation between words based on a vector model file by using a trained Word2vector model, and modifying the text labels by using the correlation between the words; generating a final text label;
step (32): performing theme analysis on the preprocessed data document by using an LDA result set to generate a theme label;
the step (31) comprises:
step (311): reading the preprocessed data document, and counting the information of each word in the data document; the information of each word includes: word frequency, position of first appearance of word, position of last appearance of word and total number of words;
a step (312): calculating the word weight: respectively calculating the values of the word frequency factor, the word position factor and the word span factor;
the weight m(w_i) of word w_i is calculated as:
m(w_i) = tf(w_i) * loc(w_i) * span(w_i);    (1)
where tf(w_i) is the word-frequency factor of word w_i, loc(w_i) is its position factor, and span(w_i) is its span factor;
step (313): calculating word spacing: taking the sentence as the unit, whenever two words appear in the same sentence their co-occurrence count is increased by 1; the word spacing is the reciprocal of the co-occurrence count, and if two words never co-occur their distance is infinite;
step (314): calculating word attraction: the word spacing from step (313) is substituted into the word attraction quantization formula to obtain the quantified attraction between the two words; if the distance between two words is infinite, their attraction is 0 and the two words do not influence each other regardless of whether they appear;
step (315): calculating the correlation between words: the trained Word2vector model is used to compute a cosine value representing the magnitude of the correlation;
step (316): calculating the TextRank value of each word: the TextRank values are initialized to 1, the word relationship results are substituted into the improved TextRank formula, the iteration termination threshold is set to 0.0001, and iteration with the improved formula continues until the results converge, yielding the TextRank value of each word;
step (317): sorting the words from high to low by their TextRank values;
step (318): selecting the top 20 words of the sorted result as the text labels;
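The following sketch illustrates steps (311)-(313); since this section does not spell out the exact forms of the word frequency, position, and span factors, the factor forms used here are assumptions for illustration only:

from collections import defaultdict

def word_statistics(sentences):
    """Steps (311)-(312): word frequency, first/last position, total word count."""
    stats, total, pos = {}, 0, 0
    for sent in sentences:
        for w in sent:
            info = stats.setdefault(w, {"tf": 0, "first": pos, "last": pos})
            info["tf"] += 1
            info["last"] = pos
            pos += 1
            total += 1
    return stats, total

def word_weight(info, total):
    """m(w) = tf(w) * loc(w) * span(w); the factor forms below are assumed."""
    tf = info["tf"] / total                             # word frequency factor
    loc = 1.0 + 1.0 / (1.0 + info["first"])             # earlier first appearance weighs more
    span = (info["last"] - info["first"] + 1) / total   # word span factor
    return tf * loc * span

def word_spacing(sentences):
    """Step (313): spacing = 1 / co-occurrence count; never co-occurring pairs are omitted (infinite distance)."""
    cooc = defaultdict(int)
    for sent in sentences:
        uniq = sorted(set(sent))
        for i, a in enumerate(uniq):
            for b in uniq[i + 1:]:
                cooc[(a, b)] += 1
    return {pair: 1.0 / n for pair, n in cooc.items()}

sentences = [["big", "data", "label"], ["label", "theme", "data"]]
stats, total = word_statistics(sentences)
weights = {w: word_weight(info, total) for w, info in stats.items()}
spacing = word_spacing(sentences)
print(weights)
print(spacing)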
in the process of training the text deep representation model Word2vector on the corpus, once the vectors corresponding to all corpus words are obtained, k-means clustering is performed on all words according to vector correlation, yielding clusters of highly correlated words; the correlation between two words is determined by the cosine of their vectors, and the larger the cosine value, the greater the correlation;
assuming the words w_i and w_j are both represented by n-dimensional vectors, their correlation cos(w_i, w_j) is calculated as:
cos(w_i, w_j) = ( Σ_{k=1..n} w_{ik} * w_{jk} ) / ( sqrt(Σ_{k=1..n} w_{ik}^2) * sqrt(Σ_{k=1..n} w_{jk}^2) );
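A brief sketch of the clustering described above, assuming scikit-learn's KMeans over the gensim word vectors; normalizing the vectors so that Euclidean distance tracks cosine similarity is an implementation choice, not something this description prescribes:

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

w2v = Word2Vec.load("word2vec.model")
words = list(w2v.wv.index_to_key)
vectors = np.array([w2v.wv[w] for w in words])

# L2-normalize so that Euclidean k-means groups vectors with high cosine similarity.
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

kmeans = KMeans(n_clusters=min(10, len(words)), n_init=10, random_state=0).fit(vectors)
clusters = {}
for w, c in zip(words, kmeans.labels_):
    clusters.setdefault(c, []).append(w)
print(clusters)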
the improved word relationship Conn(w_i, w_j) is then obtained by the formula:
Conn(w_i, w_j) = conn(w_i, w_j) * (1 + cos(w_i, w_j)); (7)
the word attraction quantization formula is:
conn(w_i, w_j) = m(w_i) * m(w_j) / r(w_i, w_j)^2; (5)
where m(w_i) is the weight of word w_i, m(w_j) is the weight of word w_j, conn(w_i, w_j) reflects the relationship between two words of different weights, and r(w_i, w_j) denotes the spacing between word w_i and word w_j;
the improved TextRank formula is obtained:
TextRank(w_i) = (1 - d) + d * Σ_{w_j ∈ In(w_i)} [ Conn(w_j, w_i) / Σ_{w_k ∈ Out(w_j)} Conn(w_j, w_k) ] * TextRank(w_j)
where d is the damping factor, TextRank(w_i) denotes the importance of w_i, and TextRank(w_j) denotes the importance of w_j.
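Putting formulas (5) and (7) together with the iteration of step (316), a compact, non-authoritative sketch of the improved TextRank computation follows; the damping factor 0.85 is the usual TextRank choice and is assumed here, and in practice the cosine could come from the trained Word2vector model (e.g. gensim's wv.similarity), while the toy weights and spacings below are made up:

def improved_textrank(words, weights, spacing, cosine, d=0.85, tol=1e-4, max_iter=100):
    """Iterate TextRank over edges weighted by Conn(w_i, w_j) until convergence."""
    # Edge weights: conn = m_i * m_j / r^2 (formula 5), Conn = conn * (1 + cos) (formula 7).
    conn = {}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            r = spacing.get((min(a, b), max(a, b)))   # None means the pair never co-occurs
            if r is None:
                continue                              # infinite distance -> attraction 0, no edge
            base = weights[a] * weights[b] / (r ** 2)
            conn[(a, b)] = conn[(b, a)] = base * (1 + cosine(a, b))

    rank = {w: 1.0 for w in words}                    # step (316): initialize TextRank to 1
    out_sum = {w: sum(v for (x, _), v in conn.items() if x == w) for w in words}
    for _ in range(max_iter):
        new = {}
        for w in words:
            s = sum(conn[(u, w)] / out_sum[u] * rank[u]
                    for u in words if (u, w) in conn and out_sum[u] > 0)
            new[w] = (1 - d) + d * s
        converged = max(abs(new[w] - rank[w]) for w in words) < tol
        rank = new
        if converged:
            break
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with made-up weights, spacings, and a constant cosine of 0.5.
words = ["big", "data", "label", "theme"]
weights = {"big": 0.4, "data": 0.6, "label": 0.8, "theme": 0.3}
spacing = {("big", "data"): 0.5, ("data", "label"): 1.0, ("label", "theme"): 1.0}
print(improved_textrank(words, weights, spacing, cosine=lambda a, b: 0.5)[:20])

The top entries of the sorted result then correspond to the 20 text labels selected in step (318).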
CN201710173029.3A 2017-03-22 2017-03-22 Innovative creative tag automatic labeling method and system based on big data Active CN106997382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710173029.3A CN106997382B (en) 2017-03-22 2017-03-22 Innovative creative tag automatic labeling method and system based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710173029.3A CN106997382B (en) 2017-03-22 2017-03-22 Innovative creative tag automatic labeling method and system based on big data

Publications (2)

Publication Number Publication Date
CN106997382A CN106997382A (en) 2017-08-01
CN106997382B true CN106997382B (en) 2020-12-01

Family

ID=59431684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710173029.3A Active CN106997382B (en) 2017-03-22 2017-03-22 Innovative creative tag automatic labeling method and system based on big data

Country Status (1)

Country Link
CN (1) CN106997382B (en)

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704503A (en) * 2017-08-29 2018-02-16 平安科技(深圳)有限公司 User's keyword extracting device, method and computer-readable recording medium
CN107861948B (en) * 2017-11-16 2021-09-17 百度在线网络技术(北京)有限公司 Label extraction method, device, equipment and medium
CN108415953B (en) * 2018-02-05 2021-08-13 华融融通(北京)科技有限公司 Method for managing bad asset management knowledge based on natural language processing technology
CN108549626B (en) * 2018-03-02 2020-11-20 广东技术师范学院 Keyword extraction method for admiration lessons
CN108763189B (en) * 2018-04-12 2022-03-25 武汉斗鱼网络科技有限公司 Live broadcast room content label weight calculation method and device and electronic equipment
CN108536679B (en) * 2018-04-13 2022-05-20 腾讯科技(成都)有限公司 Named entity recognition method, device, equipment and computer readable storage medium
CN108959431B (en) * 2018-06-11 2022-07-05 中国科学院上海高等研究院 Automatic label generation method, system, computer readable storage medium and equipment
CN110738033B (en) * 2018-07-03 2023-09-19 百度在线网络技术(北京)有限公司 Report template generation method, device and storage medium
CN109344248B (en) * 2018-07-27 2021-10-22 中山大学 Academic topic life cycle analysis method based on scientific and technological literature abstract clustering
CN108920466A (en) * 2018-07-27 2018-11-30 杭州电子科技大学 A kind of scientific text keyword extracting method based on word2vec and TextRank
CN110807097A (en) * 2018-08-03 2020-02-18 北京京东尚科信息技术有限公司 Method and device for analyzing data
CN109344253A (en) * 2018-09-18 2019-02-15 平安科技(深圳)有限公司 Add method, apparatus, computer equipment and the storage medium of user tag
CN111125355A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Information processing method and related equipment
CN109710916B (en) * 2018-11-02 2024-02-23 广州财盟科技有限公司 Label extraction method and device, electronic equipment and storage medium
CN109614455B (en) * 2018-11-28 2020-12-01 武汉大学 Deep learning-based automatic labeling method and device for geographic information
CN110399606B (en) * 2018-12-06 2023-04-07 国网信息通信产业集团有限公司 Unsupervised electric power document theme generation method and system
CN109783798A (en) * 2018-12-12 2019-05-21 平安科技(深圳)有限公司 Method, apparatus, terminal and the storage medium of text information addition picture
CN111382265B (en) * 2018-12-28 2023-09-19 中国移动通信集团贵州有限公司 Searching method, device, equipment and medium
CN109686445B (en) * 2018-12-29 2023-07-21 成都睿码科技有限责任公司 Intelligent diagnosis guiding algorithm based on automatic label and multi-model fusion
CN109885674B (en) * 2019-02-14 2022-10-25 腾讯科技(深圳)有限公司 Method and device for determining and recommending information of subject label
CN110162592A (en) * 2019-05-24 2019-08-23 东北大学 A kind of news keyword extracting method based on the improved TextRank of gravitation
CN110263343B (en) * 2019-06-24 2021-06-15 北京理工大学 Phrase vector-based keyword extraction method and system
CN110347977A (en) * 2019-06-28 2019-10-18 太原理工大学 A kind of news automated tag method based on LDA model
CN110413796A (en) * 2019-07-03 2019-11-05 北京信息科技大学 A kind of coal mine typical power disaster Methodologies for Building Domain Ontology
CN110557504B (en) * 2019-08-30 2021-06-04 Oppo广东移动通信有限公司 Dynamic update method, device, equipment and medium for ring of intelligent terminal equipment
CN110717329B (en) * 2019-09-10 2023-06-16 上海开域信息科技有限公司 Method for performing approximate search based on word vector to rapidly extract advertisement text theme
CN112559853B (en) * 2019-09-26 2024-01-12 北京沃东天骏信息技术有限公司 User tag generation method and device
CN111177321B (en) * 2019-12-27 2023-10-20 东软集团股份有限公司 Method, device, equipment and storage medium for determining corpus
CN112270192B (en) * 2020-11-23 2023-12-19 科大国创云网科技有限公司 Semantic recognition method and system based on part of speech and deactivated word filtering
CN112905741B (en) * 2021-02-08 2022-04-12 合肥供水集团有限公司 Water supply user focus mining method considering space-time characteristics
CN113761911A (en) * 2021-03-17 2021-12-07 中科天玑数据科技股份有限公司 Domain text labeling method based on weak supervision
CN113128234B (en) * 2021-06-17 2021-11-02 明品云(北京)数据科技有限公司 Method and system for establishing entity recognition model, electronic equipment and medium
CN114661900A (en) * 2022-02-25 2022-06-24 安阳师范学院 Text annotation recommendation method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164394A (en) * 2012-07-16 2013-06-19 上海大学 Text similarity calculation method based on universal gravitation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021620A (en) * 2016-07-14 2016-10-12 北京邮电大学 Method for realizing automatic detection for power failure event by utilizing social contact media
CN106469187B (en) * 2016-08-29 2019-12-03 东软集团股份有限公司 The extracting method and device of keyword
CN106372064B (en) * 2016-11-18 2019-04-19 北京工业大学 A kind of term weight function calculation method of text mining

Also Published As

Publication number Publication date
CN106997382A (en) 2017-08-01

Similar Documents

Publication Publication Date Title
CN106997382B (en) Innovative creative tag automatic labeling method and system based on big data
US9201957B2 (en) Method to build a document semantic model
Nguyen et al. Keyphrase extraction in scientific publications
KR101136007B1 (en) System and method for anaylyzing document sentiment
CN110888991B (en) Sectional type semantic annotation method under weak annotation environment
El-Shishtawy et al. Arabic keyphrase extraction using linguistic knowledge and machine learning techniques
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
Gupta et al. A novel hybrid text summarization system for Punjabi text
Alami et al. Hybrid method for text summarization based on statistical and semantic treatment
Turdakov Word sense disambiguation methods
Lynn et al. An improved method of automatic text summarization for web contents using lexical chain with semantic-related terms
CN112036178A (en) Distribution network entity related semantic search method
CN114706972A (en) Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression
Gopan et al. Comparative study on different approaches in keyword extraction
CN112711666B (en) Futures label extraction method and device
Tahrat et al. Text2geo: from textual data to geospatial information
Varghese et al. Lexical and semantic analysis of sacred texts using machine learning and natural language processing
US8862459B2 (en) Generating Chinese language banners
Liu et al. Keyword extraction using PageRank on synonym networks
Pokharana et al. A Review on diverse algorithms used in the context of Plagiarism Detection
CN114265936A (en) Method for realizing text mining of science and technology project
CN112800243A (en) Project budget analysis method and system based on knowledge graph
Wang Query Segmentation and Tagging
Shaban A semantic graph model for text representation and matching in document mining
Gheni et al. Suggesting new words to extract keywords from title and abstract

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant