CN106997382B - Innovative creative tag automatic labeling method and system based on big data - Google Patents

Innovative creative tag automatic labeling method and system based on big data

Info

Publication number
CN106997382B
CN106997382B CN201710173029.3A
Authority
CN
China
Prior art keywords
words
word
text
calculating
theme
Prior art date
Legal status
Active
Application number
CN201710173029.3A
Other languages
Chinese (zh)
Other versions
CN106997382A (en)
Inventor
鹿旭东
张盘龙
陈志勇
郭伟
崔立真
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN201710173029.3A priority Critical patent/CN106997382B/en
Publication of CN106997382A publication Critical patent/CN106997382A/en
Application granted granted Critical
Publication of CN106997382B publication Critical patent/CN106997382B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an innovative creative tag automatic labeling method and system based on big data. The method comprises the following steps: train Word2vector and LDA with the Sogou corpus to obtain a training result set; perform word segmentation, stop-word removal and part-of-speech filtering on the document data of the page the user is browsing; combine a modified TextRank algorithm with Word2vector to compute labels drawn from the text data itself; and use LDA on the preprocessed document to derive labels describing the subject matter of the document data. Visualization is realized in the form of a tag cloud, and all label words are marked in the document data, so the user can easily read and find the key content.

Description

Innovative creative tag automatic labeling method and system based on big data
Technical Field
The invention relates to an innovative creative tag automatic labeling method and system based on big data.
Background
With the rapid development and popularization of the internet, information has grown explosively and a huge amount of information has accumulated online. At the same time, internet users are no longer just browsers of content; they also create all kinds of information on the internet, so internet information now takes many forms, which makes information screening much harder. Text-based information accounts for a large proportion of internet information. The growth in volume and the comprehensive coverage of information give people more references when searching, touch every aspect of daily life and greatly facilitate it; however, such a large amount of information also easily leaves people unable to choose, and quickly selecting effective information from the mass of data is not easy.
When enterprises carry out innovation work, big data serves as the basis for analysis and planning, and the valuable data must be identified, viewed and analyzed. How to make full use of big data, quickly and effectively obtain the data relevant to the topics an enterprise cares about, label the key data, eliminate cluttered and useless information, and let the enterprise focus on the more valuable and important information is a difficulty of current innovation work; text labeling arose against this background. Text annotation refers to describing a text with a number of specific words or phrases that reflect its subject; these words or phrases are usually called tags, and by reading them a reader can quickly grasp the subject of the text and decide whether it is of interest.
Automatic text labeling is an emerging research topic that has developed along with the internet. It derives from information extraction and text classification, and combines research methods from information retrieval, collaborative filtering and related directions. The text labeling techniques developed in recent years include user-based social tagging, multi-label classification labeling and keyword-extraction labeling.
These are the main text labeling methods at present. User-based social tagging suffers from the cold-start problem in the initial stage of a service, because there is no past data to draw on; multi-label classification methods are mostly supervised learning algorithms that need a large amount of manually labeled data as training sets, and manual labeling is time-consuming, labor-intensive and highly subjective.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides an innovative creative tag automatic labeling method and system based on big data. Texts are labeled with a keyword-extraction approach, which belongs to unsupervised learning and avoids manually labeled data sets.
An innovative creative tag automatic labeling method based on big data comprises the following steps:
step (1): model training:
training a text depth representation model Word2vector by using a corpus, and obtaining all words and vector model files corresponding to all words in the corpus after training to obtain a well-trained Word2vector model;
training a document theme generation model LDA by using a corpus to obtain an LDA result set and a trained LDA model, wherein the LDA result set comprises a plurality of themes, and each theme comprises words belonging to the theme and the probability of the words belonging to the theme;
step (2): performing word segmentation on the data document of the page the user is currently browsing with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and then removing stop words, to obtain a preprocessed data document;
step (3): generating a text label and a subject label;
step (4): realizing visualization of the final text label and subject label.
The stop words in step (2) include words whose usage frequency exceeds a set threshold and words without practical meaning.
The words without practical meaning include modal particles, adverbs, prepositions and conjunctions.
The step of removing stop words comprises: after word segmentation, part-of-speech tagging is performed; nouns, verbs and adjectives are retained, words of other parts of speech are filtered out, and words whose usage frequency exceeds the set threshold are also filtered out.
The step (3) comprises the following steps:
step (31): labeling text labels on the preprocessed data documents by using a TextRank algorithm of unsupervised learning, calculating the correlation between words based on a vector model file by using a trained Word2vector model, and modifying the text labels by using the correlation between the words; generating a final text label;
step (32): and performing theme analysis on the preprocessed data document by using the LDA result set to generate a theme label.
The step (31) comprises:
step (311): reading the preprocessed data document, and counting the information of each word in the data document; the information of each word includes: word frequency, position of first appearance of word, position of last appearance of word and total number of words;
a step (312): calculating the word weight: respectively calculating the values of the word frequency factor, the word position factor and the word span factor;
The weight m(w_i) of word w_i is calculated as:
m(w_i) = tf(w_i) * loc(w_i) * span(w_i);    (1)
where tf(w_i) is the word-frequency factor of word w_i, loc(w_i) is its position factor, and span(w_i) is its span factor.
The word-frequency factor tf(w_i) (Equation (2)) is a nonlinear function of fre(w_i), where fre(w_i) denotes the number of occurrences of the word w_i in the data document.
The word position factor loc(w_i) (Equation (3)) is computed from area(w_i), the position value of the word w_i.
Words at different positions in the text play different roles: words in the first 10% of the text are the most important for expressing the text theme, and words in the 10%-30% range are the next most important. The text data is therefore divided into three regions: the first 10% forms the first region with position value 50, the 10%-30% range forms the second region with position value 30, and the remaining region has position value 20; a word that appears in more than one region takes the maximum value.
The word span factor span(w_i) (Equation (4)) is computed from first(w_i), the position of the first occurrence of the word in the text, last(w_i), the position of its last occurrence, and sum, the total number of words contained in the text.
The word span reflects the coverage of the word in the text: the larger the span, the greater the contribution to reflecting global information. In label extraction, words with a large span can reflect the global theme of the text.
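As a concrete illustration of Equation (1), the following Python sketch computes the word weight from the three factors. Because Equations (2)-(4) are only described qualitatively above, the tf_factor, loc_factor and span_factor bodies below are assumptions chosen to match that description, not the patent's exact formulas.

```python
def tf_factor(freq):
    # word-frequency factor tf(w_i): a nonlinear (saturating) function of the
    # occurrence count fre(w_i); the exact form of Equation (2) is assumed here
    return freq / (freq + 1.0)

def loc_factor(first_pos, total_words):
    # position factor loc(w_i): region values 50 / 30 / 20 as described above;
    # using the first occurrence automatically gives the maximum region value
    ratio = first_pos / total_words
    if ratio <= 0.10:
        return 50
    if ratio <= 0.30:
        return 30
    return 20

def span_factor(first_pos, last_pos, total_words):
    # span factor span(w_i): coverage of the word across the text,
    # assumed here to be (last - first + 1) / sum
    return (last_pos - first_pos + 1) / total_words

def word_weight(freq, first_pos, last_pos, total_words):
    # Equation (1): m(w_i) = tf(w_i) * loc(w_i) * span(w_i)
    return (tf_factor(freq)
            * loc_factor(first_pos, total_words)
            * span_factor(first_pos, last_pos, total_words))
```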
Step (313): calculating word spacing, taking a sentence as a unit, if two words appear in one sentence at the same time, adding 1 to the co-occurrence times of the two words, wherein the word spacing is the reciprocal of the co-occurrence times, and if the co-occurrence times of the two words is 0, the distance between the two words is infinite;
step (314): calculating the word attraction force, and substituting the word space in the step (313) into an attraction force quantization formula of the words to obtain the attraction force quantization expression of the two words; if the distance between the two words is infinite, the attractive force of the two words is 0, and the two words are not influenced by each other if the two words appear or not;
Word attraction quantization formula:
conn(w_i, w_j) = m(w_i) * m(w_j) / r(w_i, w_j)^2;    (5)
where m(w_i) is the weight of word w_i, m(w_j) is the weight of word w_j, conn(w_i, w_j) reflects the relation between the two words with different weights, and r(w_i, w_j) is the spacing between word w_i and word w_j;
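A minimal sketch of steps (313) and (314), assuming the weights m(w_i) come from the word_weight sketch above and the spacing is the reciprocal of the sentence-level co-occurrence count:

```python
import math

def word_spacing(co_occurrence_count):
    # step (313): spacing r(w_i, w_j) is the reciprocal of the number of sentences
    # in which both words occur; zero co-occurrence means infinite distance
    return math.inf if co_occurrence_count == 0 else 1.0 / co_occurrence_count

def attraction(m_i, m_j, r_ij):
    # step (314) / Equation (5): conn(w_i, w_j) = m(w_i) * m(w_j) / r(w_i, w_j)^2
    if math.isinf(r_ij):
        return 0.0          # the words never co-occur, so they do not attract each other
    return m_i * m_j / (r_ij ** 2)
```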
step (315): calculating the correlation between words: a cosine value representing the magnitude of the correlation is calculated with the trained Word2vector model.
In the process of training the text depth representation model Word2vector with the corpus, after the corpus words and their corresponding vectors are obtained, all words are clustered with k-means according to vector correlation, yielding clusters composed of highly correlated words. The correlation of two words is determined by the cosine of their vectors; the larger the cosine value, the stronger the correlation.
Assume that the words w_i and w_j are both n-dimensional vectors; their correlation cos(w_i, w_j) is calculated as:
cos(w_i, w_j) = ( Σ_{k=1}^{n} w_{ik} * w_{jk} ) / ( sqrt(Σ_{k=1}^{n} w_{ik}^2) * sqrt(Σ_{k=1}^{n} w_{jk}^2) );    (6)
The improved word relation Conn(w_i, w_j) is then obtained:
Conn(w_i, w_j) = conn(w_i, w_j) * (1 + cos(w_i, w_j));    (7)
The improved TextRank formula is obtained:
TextRank(w_i) = (1 - d) + d * Σ_{w_j ∈ L(w_i)} [ Conn(w_j, w_i) / Σ_{w_k ∈ L(w_j)} Conn(w_j, w_k) ] * TextRank(w_j);    (8)
where TextRank(w_i) denotes the importance of w_i, TextRank(w_j) denotes the importance of w_j, d is the damping factor, and L(w_i) and L(w_j) are the sets of words directly connected to w_i and w_j, respectively;
step (316): calculating the TextRank value of each word: initialize every TextRank value to 1, substitute the word relation results into the improved TextRank formula, set the iteration termination threshold to 0.0001, and iterate the improved TextRank formula until the results converge, thereby obtaining the TextRank value of each word;
step (317): sorting the words from high to low by their TextRank values;
step (318): selecting the top 20 words in the ranking as the text labels.
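A minimal Python sketch of steps (316)-(318), assuming the improved relations Conn(w_i, w_j) of Equation (7) have already been computed and are passed in as a symmetric dictionary; all names and the data layout are illustrative.

```python
def textrank_labels(words, conn_improved, d=0.85, tol=1e-4, max_iter=200, top_n=20):
    """Iterate the improved TextRank formula (Equation (8)) until convergence.

    words         -- list of candidate words
    conn_improved -- dict mapping (w_j, w_i) to Conn(w_j, w_i); assumed symmetric
    """
    score = {w: 1.0 for w in words}                       # step (316): initialise to 1
    # denominator of Equation (8): sum of Conn over each word's neighbours
    out_sum = {w: sum(conn_improved.get((w, v), 0.0) for v in words if v != w)
               for w in words}
    for _ in range(max_iter):
        max_delta = 0.0
        for wi in words:
            s = sum(conn_improved.get((wj, wi), 0.0) / out_sum[wj] * score[wj]
                    for wj in words
                    if wj != wi and out_sum[wj] > 0)
            new_score = (1 - d) + d * s
            max_delta = max(max_delta, abs(new_score - score[wi]))
            score[wi] = new_score
        if max_delta < tol:                               # termination threshold 0.0001
            break
    ranked = sorted(words, key=score.get, reverse=True)   # step (317)
    return ranked[:top_n]                                 # step (318): top-20 text labels
```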
Said step (32) comprises:
step (321): reading the preprocessed data document, recording the total number of text words, and counting the information of each word in the data;
step (322): calculating the theme distribution probability of the data document through the LDA result set;
the LDA result set contains a number of topics; each topic comprises the words belonging to it and the probability of each word belonging to it, and within each topic all words are sorted by probability value from large to small. The preprocessed data document is treated as a sequence [w_1, w_2, w_3, ..., w_n], where w_i denotes the ith word and n is the total number of words. The expected number of the document's words belonging to each topic is denoted E_{T_i}; if there are K topics, the probability distribution of the data document over the different topics is [p_{T_1}, p_{T_2}, ..., p_{T_K}].
The probability p_{T_i} that the data document belongs to the ith topic T_i is calculated as:
p_{T_i} = E_{T_i} / Σ_{k=1}^{K} E_{T_k};    (9)
where E_{T_i} denotes the expected number of words belonging to the ith topic T_i. If word w_j belongs to the ith topic T_i with probability p(w_j, T_i), then E_{T_i} is calculated as:
E_{T_i} = Σ_{j=1}^{n} p(w_j, T_i);    (10)
step (323): selecting the topic with the highest probability, and taking the 5 highest-probability words it contains to form the theme label.
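A minimal Python sketch of steps (321)-(323), assuming the LDA result set is available as one dictionary of word probabilities per topic and applying Equations (9) and (10) as reconstructed above; the names are illustrative.

```python
def theme_label(doc_words, topic_word_probs, top_n=5):
    """Select the theme label of a preprocessed document.

    doc_words        -- the document as a word sequence [w_1, ..., w_n]
    topic_word_probs -- list with one dict {word: p(word, topic)} per topic
    """
    # Equation (10): E_Ti = sum over the document's words of p(w_j, T_i)
    expectations = [sum(probs.get(w, 0.0) for w in doc_words)
                    for probs in topic_word_probs]
    # Equation (9): normalise the expectations into the topic distribution p_Ti
    total = sum(expectations) or 1.0
    topic_probs = [e / total for e in expectations]
    # step (323): the most probable topic supplies its top-5 words as the label
    best = max(range(len(topic_probs)), key=topic_probs.__getitem__)
    best_words = sorted(topic_word_probs[best],
                        key=topic_word_probs[best].get, reverse=True)
    return best_words[:top_n]
```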
Furthermore, the invention also provides, as a technical scheme, an innovative creative tag automatic labeling system based on big data, which automatically adds text labels and theme labels to the data documents browsed by the user, making it easy for the user to find the important information of the text and improving reading efficiency.
Innovative creative tag automatic labeling system based on big data comprises:
a model training unit:
training a text depth representation model Word2vector by using a corpus, and obtaining all words and vector model files corresponding to all words in the corpus after training to obtain a well-trained Word2vector model;
training a document theme generation model LDA by using a corpus to obtain an LDA result set and a trained LDA model, wherein the LDA result set comprises a plurality of themes, and each theme comprises words belonging to the theme and the probability of the words belonging to the theme;
a data document processing unit: performing word segmentation on the data document of the page the user is currently browsing with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and then removing stop words, to obtain a preprocessed data document;
a label generation unit: generating a text label and a subject label;
a visualization unit: visualization of the final text label and the subject label is achieved.
Compared with the prior art, the invention has the following beneficial effects:
the improved TextRank algorithm is used to obtain the keywords of the data document; compared with other algorithms, the results are more accurate and more representative, and since the extracted labels come from the document itself, they represent it well and accurately express the text content;
the LDA model is used to generate the subject label of the text, which solves the difficulty that the subject words of a text may not be contained in the text; the subject content is better reflected, and combined with the text labels this yields labels that accurately express both the text content and its theme.
drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is the preprocessing flow chart of the present invention;
FIG. 2 is the text label generation flow chart of the present invention;
FIG. 3 is a flowchart of the subject label generation of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The invention integrates an improved TextRank-based text labeling algorithm, Word2vector (a text analysis tool from Google) for calculating the correlation between words, and LDA (a document topic generation model) for extracting document topics, to realize automatic labeling of texts. The original TextRank algorithm only considers the relations between words during its calculation and ignores the feature attributes of the words themselves, so it cannot make full use of the text information when extracting keywords. The invention improves on this: first, the word weight is calculated from information such as word frequency, word position and word span; then this weight, together with a word activation force model, is used to establish an attraction relation between words that replaces the original word relation. With this improvement, information such as word frequency, word position and word span is fully exploited for individual words, while for the relations between words both the co-occurrence rate of words within sentences and the correlation between words are considered; the correlation is calculated with Word2vector, provided by Google. The topic of a document may not appear in the document's text and therefore cannot be labeled with words drawn from the document content, so LDA is used to determine the topic of the document and to provide a label for that topic.
The technical scheme of the invention is as follows: the data required for related creative work is automatically labeled in the user's query results or browsed pages, cluttered information is removed, and the results are sorted by relevance. Against the big-data background, data visualization is increasingly important; this patent presents the labeling results in the form of a tag cloud and highlights the keywords. With the invention, a data set can be labeled automatically in an unsupervised manner; the labels come from the data document itself, so the noise is low and the representativeness is good. While querying and browsing, the user can read the automatically labeled key content first and focus attention on the more important information.
The invention realizes the innovative creative automatic labeling method based on big data through the following technical scheme, with the following specific steps:
Step one: train LDA and Word2vector with a corpus.
Step two: perform word segmentation on the user's browsed page and filter out the useless words, as shown in FIG. 1;
Step three: generate labels by combining the TextRank algorithm with LDA and label automatically, as shown in FIG. 2;
Step four: visualize the tags and the key content, as shown in FIG. 3;
In step one, the Sogou corpus is used to train LDA and Word2vector.
Word2vector is a tool developed by Google that converts words into vectors, transforming the processing of the training-set content into vector operations in a fixed-dimension vector space; the computed distance between vectors represents the correlation between text words. The larger the training corpus, the better the word vectors express the words. Training on the Sogou corpus yields a model file containing all the words of the corpus and their corresponding vectors, with which the correlation between words can be calculated.
Word2vec is an efficient tool open-sourced by Google in 2013 for representing words as real-valued vectors. Using ideas from deep learning, it simplifies the processing of text content, through training, into vector operations in a K-dimensional vector space, and similarity in the vector space can represent similarity in text semantics. The word vectors output by Word2vec can be used for many NLP tasks, such as clustering, finding synonyms and part-of-speech analysis. From another point of view, taking words as features, Word2vec maps the features into a K-dimensional vector space and can find deeper feature representations for text data.
Word2vec uses the Distributed Representation of word vectors. The Distributed Representation was first proposed by Hinton in 1986. Its basic idea is to map each word, through training, to a K-dimensional real-valued vector (K is generally a hyper-parameter of the model) and to judge the semantic similarity between words by the distance between them (such as cosine similarity or Euclidean distance). Huffman coding is applied according to word frequency, so that the hidden-layer activations of words with similar frequencies are basically consistent and words with higher frequency activate fewer hidden-layer nodes, which effectively reduces the computational complexity. Word2vec is popular precisely because of its efficiency; Mikolov indicates in the paper that an optimized single-machine version can train on billions of words per day.
4. The three-layer neural network models a language model, but the representation of words in vector space that it obtains as a by-product is the real goal of Word2vec.
5. Compared with classical methods such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA), Word2vec exploits the context of words, so its semantic information is richer.
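For illustration, training such a model on a segmented corpus could look like the following sketch using the gensim library; gensim is an assumption here, since the patent itself only names the Word2vector tool and the Sogou corpus, and the toy sentences are placeholders.

```python
from gensim.models import Word2Vec

# sentences: the segmented corpus, one list of tokens per sentence/document
sentences = [["企业", "创新", "大数据"], ["文本", "标签", "抽取"]]   # toy examples

# vector_size is the dimension K discussed above (older gensim versions use size=)
model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, workers=4)

model.save("word2vec.model")                   # the "vector model file" of step one
sim = model.wv.similarity("文本", "标签")        # cosine correlation cos(w_i, w_j)
vec = model.wv["文本"]                           # the K-dimensional vector of a word
```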
LDA (Latent Dirichlet Allocation) is a document topic generation model with a three-layer structure of words, topics and documents. This generative model assumes that each word of an article is obtained by first choosing a topic with a certain probability and then choosing a word from that topic with a certain probability. Document-to-topic follows a multinomial distribution, and topic-to-word follows a multinomial distribution.
LDA is an unsupervised machine learning technique that can be used to identify the latent topic information in a corpus. It adopts the bag-of-words approach, treating each document as a word-frequency vector and thereby converting text information into numerical information that is easy to model. Each document is represented as a probability distribution over topics, and each topic as a probability distribution over words. Training with the Sogou corpus yields a set of topics and the probabilities of the words within each topic, and with this LDA training result set the probability distribution of a data document over all topics can be calculated.
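A corresponding sketch of the LDA training step, again using gensim as an assumed implementation; the topic-word probabilities it exposes form the "LDA result set" used later, and the toy documents are placeholders.

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [["企业", "创新", "大数据"], ["文本", "主题", "标签"]]    # segmented training documents

dictionary = corpora.Dictionary(docs)                      # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]         # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10)

# the "LDA result set": for every topic, its words and their probabilities
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=5))      # top-5 (word, probability) pairs
```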
In step two, the ICTCLAS word segmentation system developed by the Institute of Computing Technology of the Chinese Academy of Sciences is used to segment the text data, and then stop words are removed and parts of speech are filtered.
1. Current Chinese word segmentation algorithms fall mainly into three categories. Although several segmentation algorithms are mature, because of the complexity of the Chinese language, the ambiguity of Chinese text and the constant appearance of new words, current segmentation systems use several algorithms in combination. Chinese word segmentation research has been carried out by Tsinghua University, Peking University, Harbin Institute of Technology, Microsoft Research China, Hylanda and others, among which the ICTCLAS word segmentation system developed by the Institute of Computing Technology of the Chinese Academy of Sciences is the most prominent.
Specifically, the ICTCLAS word segmentation system is based on a five-layer hidden Markov model. Its main segmentation process comprises preliminary segmentation, unregistered-word recognition, re-segmentation and part-of-speech tagging; the preliminary segmentation uses a shortest-path method to roughly segment the Chinese text, and unregistered-word recognition handles person names, place names and complex organization names, so that segmentation precision is ensured as far as possible. Public evaluations by national and international authorities show that the system segments quickly and accurately. The following APIs are used:
(1) Initialization: bool ICTCLAS_Init(const char* pszInitDir = NULL);
pszInitDir is the initialization path. Initialization returns true on success, otherwise false.
(2) Exit: bool ICTCLAS_Exit();
releases the memory occupied by the dictionaries and clears the temporary buffers and other system resources.
(3) File processing: bool ICTCLAS_FileProcess(const char* sSrcFilename, eCodeType eCT, const char* sDsnFilename, int bPOStagged);
sSrcFilename is the path of the source file to be analyzed, eCT is the character encoding of the source file, sDsnFilename is the result file after word segmentation, and bPOStagged indicates whether part-of-speech tagging is required (0 for no, 1 for yes). Returns true if the file is segmented successfully, otherwise false.
2. Stop words fall into two categories: words that are used very widely and frequently, such as "I" and "just", and words that occur frequently in the text but carry little practical meaning, mainly auxiliary words, adverbs, prepositions and conjunctions, such as "of", "at" and "and". Removing stop words means removing these two kinds of words from the words that would otherwise form the nodes of the text network, which reduces the complexity of the network. Labels are generally nouns, verbs or adjectives, and their length is generally at least two characters, so the segmentation result is tagged with parts of speech and only words of these three parts of speech are retained.
3. The specific flow of preprocessing is shown in FIG. 1 (a minimal code sketch follows the list):
(1) segment the document data with the ICTCLAS word segmentation system;
(2) apply stop-word removal to the segmentation result and discard the useless stop words;
(3) tag the result with parts of speech, keep the nouns, verbs and adjectives that can serve as labels, and filter out the other words to eliminate interference.
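The following Python sketch mirrors this flow. It substitutes the jieba segmenter for the ICTCLAS C API listed above purely for illustration; the stop-word list, the kept part-of-speech prefixes and the file path are assumptions.

```python
import jieba.posseg as pseg

KEEP_POS_PREFIXES = ("n", "v", "a")      # nouns, verbs, adjectives

def preprocess(text, stopwords):
    """Word segmentation + stop-word removal + part-of-speech filtering."""
    kept = []
    for token in pseg.cut(text):                          # (1) segmentation with POS tags
        word, flag = token.word, token.flag
        if word in stopwords or len(word) < 2:            # (2) stop words / short words
            continue
        if flag.startswith(KEEP_POS_PREFIXES):            # (3) keep nouns/verbs/adjectives
            kept.append(word)
    return kept

# usage sketch: stop words loaded from a plain file, one word per line
# stopwords = set(open("stopwords.txt", encoding="utf-8").read().split())
# tokens = preprocess(document_text, stopwords)
```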
In step three, the unsupervised TextRank algorithm is used to label the text data automatically; the algorithm is improved, and Word2vector is used to calculate the correlation between words. LDA then performs topic analysis on the text data, and the labels are generated from both.
Specifically, the PageRank algorithm is the criterion used by Google to measure the quality of a web site, and was proposed in 1998 by the Google founders Larry Page and Sergey Brin. The algorithm makes full use of the hyperlink structure between web pages to rank them; its basic idea is to treat a link from one page to another as a vote by the former for the latter. The more often a page is linked, the more votes it receives from other pages and the more important it is. At the same time, the weight of a voting page's vote depends on that page's own importance: if a page is important, the pages it links to are relatively important as well. The PageRank algorithm can be applied to the extraction of keywords and sentences: words or sentences are regarded as web pages, the links between them as link relations between pages, their importance is computed with the algorithm, and the important words or sentences are extracted.
The TextRank algorithm was proposed by Rada Mihalcea and Paul Tarau in 2004 on the basis of the PageRank algorithm. In essence, TextRank is a graph-based algorithm in which words or sentences correspond to the nodes of a graph and the links between words or sentences correspond to its edges; a text network is represented as DN = (W, R), where W is the set of words that make up the text network and R is the set of relations between any two words in W. The association between words is represented by the number of times they co-occur within a sliding window of a certain length.
(1) Similar to the idea of PageRank, if a word is directly connected to another word by an edge, the former is considered to cast a vote for the latter, and the importance of the voting word depends on its own importance, so the importance of a word is determined by the number of votes it obtains and the importance of the words voting for it. In PageRank, the probability of linking from one web page to another is considered uniformly random, so the resulting graph is unweighted. In a text network, however, there are multiple associations between two words, so the strength of the association between words must be taken into account. Suppose conn(w_i, w_j) denotes the relation between w_i and w_j (here, the number of co-occurrences of the two within a word window of a given length); the TextRank value of word w_i is then defined as:
TextRank(w_i) = (1 - d) + d * Σ_{w_j ∈ In(w_i)} [ conn(w_j, w_i) / Σ_{w_k ∈ Out(w_j)} conn(w_j, w_k) ] * TextRank(w_j)
where In(w_i) denotes the set of words pointing to w_i, Out(w_j) denotes the set of words pointed to by w_j, and d is the damping factor, set to 0.85.
(2) Rada Mihalcea and Paul Tarau showed experimentally that mapping text into a directed graph gives lower keyword-extraction accuracy than mapping it into an undirected graph, which indicates that there is no directionality between words. The TextRank definition for the directed graph is therefore changed to:
TextRank(w_i) = (1 - d) + d * Σ_{w_j ∈ L(w_i)} [ conn(w_j, w_i) / Σ_{w_k ∈ L(w_j)} conn(w_j, w_k) ] * TextRank(w_j)
where L(w_i) and L(w_j) denote the sets of words directly connected to w_i and w_j, respectively.
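As an illustration of how such an undirected, weighted word graph can be built, the sliding-window co-occurrence counting could be sketched as follows; the window length and the data layout are assumptions.

```python
from collections import defaultdict

def cooccurrence_edges(tokens, window=5):
    """Undirected word graph: the edge weight is the number of times two words
    co-occur within a sliding window of the given length over the token sequence."""
    edges = defaultdict(int)
    for i, w_i in enumerate(tokens):
        for w_j in tokens[i + 1:i + window]:     # words within the window after w_i
            if w_i != w_j:
                edges[frozenset((w_i, w_j))] += 1
    return edges
```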
2. Improving the TextRank algorithm.
The word relation in the TextRank algorithm proposed by Rada Mihalcea and Paul Tarau only considers the number of co-occurrences of words within a window of a given length; it ignores feature information of the words in the whole text, such as word frequency, word position and word span, and it analyses the correlation between words only from the current text, so the word correlation is not accurate enough. The invention improves the algorithm in the following three respects: first, the word weight is calculated from information about the word (word frequency, word position and word span); then the closeness of the connection between words is measured by the word weights and the co-occurrence frequency between words; finally, the correlation between words is calculated with Word2vector.
(1) Calculating the word weight. The word weight is computed from the word frequency, word position and word span; the weight of word w_i is:
m(w_i) = tf(w_i) * loc(w_i) * span(w_i)
where m(w_i) is the weight of word w_i, tf(w_i) is its word-frequency factor, loc(w_i) its position factor, and span(w_i) its span factor. The factors are calculated as follows:
【1】 Word-frequency factor. The higher a word's frequency, the more important the word is in the text. The word-frequency factor is computed with a nonlinear function of fre(w_i), the number of occurrences of word w_i in the text (Equation (2)).
【2】 Word position factor. Words at different positions in the text play different roles: words in the first 10% of the text are the most important for expressing the text theme, and words in the 10%-30% range are the next most important. The text data is divided into three regions: the first 10% is the first region with position value 50, the 10%-30% range is the second region with position value 30, and the last region has position value 20; a word appearing in more than one region takes the maximum value. The position value of word w_i is denoted area(w_i), from which the position factor loc(w_i) is computed (Equation (3)).
【3】 Word span factor. The word span reflects the coverage of the word in the text: the larger the span, the greater the contribution to reflecting global information. Label extraction needs words with a large span, since they can reflect the global theme of the text. The span factor is computed from first(w_i) and last(w_i), the positions of the first and last occurrences of the word in the text, and sum, the total number of words contained in the text (Equation (4)).
(2) Calculating the word relation.
Words activate one another: some words always appear in pairs with other words, and when one appears, people naturally think of the other. This mutual activation between words is called the word activation force. On the other hand, a word is often collocated with more than one other word, and the right one must be judged from the specific language environment. The strength of mutual activation between words differs from text to text, so within one text the relation between words can be established from the importance of the words and the activation between them.
The physical meaning of the word activation force is similar to gravitation, and its original definition is as follows: suppose words w_i and w_j occur fre(w_i) and fre(w_j) times in the corpus, their co-occurrence count is co-occur(w_i, w_j), and d(w_i, w_j) is the average distance between the two words when they co-occur; the activation force of word w_i on word w_j is then:
af(w_i, w_j) = ( co-occur(w_i, w_j) / fre(w_i) ) * ( co-occur(w_i, w_j) / fre(w_j) ) / d(w_i, w_j)^2
【1】 By analogy with the formula of universal gravitation, in the word activation force formula the first and second terms play the role of the masses of two objects, and d(w_i, w_j) plays the role of the distance between them. The word activation force reflects the strength of the "attraction" between the two words. However, the original formula only considers the words' individual frequencies and their co-occurrence count; it takes no other features of the words into account and cannot make full use of the information in the text.
In one document, the word frequency, position and span of a word are inherent attributes of the word in that document. Similarly, relations exist between words; by analogy with the law of universal gravitation, the "attraction" between two words is quantified as:
conn(w_i, w_j) = m(w_i) * m(w_j) / r(w_i, w_j)^2
where m(w_i) and m(w_j) are the weights of words w_i and w_j respectively, and conn(w_i, w_j) reflects the link between the two words with different weights.
【2】 Word2vector calculates the word-to-word correlation. During training, after the corpus words and their corresponding vectors are obtained, all words are clustered with k-means according to their vectors, yielding clusters composed of highly correlated words. The correlation is determined by the cosine of the two word vectors; the larger the cosine value, the stronger the correlation. Suppose the words w_i and w_j are both n-dimensional vectors; their correlation cos(w_i, w_j) is:
cos(w_i, w_j) = ( Σ_{k=1}^{n} w_{ik} * w_{jk} ) / ( sqrt(Σ_{k=1}^{n} w_{ik}^2) * sqrt(Σ_{k=1}^{n} w_{jk}^2) )
The improved word relation Conn(w_i, w_j) can thus be obtained:
Conn(w_i, w_j) = conn(w_i, w_j) * (1 + cos(w_i, w_j))
Replacing conn(w_i, w_j) above with Conn(w_i, w_j) gives the improved TextRank formula:
TextRank(w_i) = (1 - d) + d * Σ_{w_j ∈ L(w_i)} [ Conn(w_j, w_i) / Σ_{w_k ∈ L(w_j)} Conn(w_j, w_k) ] * TextRank(w_j)
3. The LDA result set comprises a number of topics; each topic comprises the words belonging to it and the probability of each word belonging to it, and within each topic all words are sorted by probability value from large to small. The processed data document is treated as a sequence [w_1, w_2, ..., w_n], where w_i denotes the ith word and n is the total number of words. Let E_{T_i} denote the expected number of the document's words belonging to each topic; if there are K topics T, the probability distribution of the data document over the different topics is [p_{T_1}, p_{T_2}, ..., p_{T_K}], where the probability of each topic is calculated as
p_{T_i} = E_{T_i} / Σ_{k=1}^{K} E_{T_k}
Here E_{T_i} is the expected number of words belonging to topic T_i; assuming word w_j belongs to topic i with probability p(w_j, T_i), E_{T_i} is calculated as
E_{T_i} = Σ_{j=1}^{n} p(w_j, T_i)
4. The specific process comprises the following steps:
(1) read the preprocessed data document, record the word frequencies, the positions of the first and last occurrence of each word and the total number of words, and collect the statistics of every word in the data;
(2) calculate the word weights: compute the word-frequency factor, word position factor and word span factor respectively;
(3) calculate the word spacing: taking the sentence as the unit, if two words appear in the same sentence their co-occurrence count is increased by 1; the word spacing is the reciprocal of the co-occurrence count, and if the co-occurrence count of two words is 0 the distance between them is infinite;
(4) calculate the word attraction: substitute the word spacing from the previous step into the attraction quantization formula to obtain the quantified attraction between the two words; if the distance between two words is infinite, their attraction is 0 and whether either word appears has no influence on the other;
(5) calculate the correlation between words: compute with Word2vector the cosine value that represents the magnitude of the correlation;
(6) calculate the TextRank value of each word: initialize every TextRank value to 1, substitute the computed relations into the improved TextRank formula, set the iteration termination threshold to 0.0001, and iterate the formula until the results converge, giving the TextRank value of each word;
(7) sort the words from high to low by their TextRank values;
(8) select the top 20 words in the ranking as the text labels;
(9) calculate the topic distribution probability of the data document with LDA;
(10) select the topic with the highest probability, and take its 5 highest-probability words to form the theme label.
Among these, step 3-1 comprises (1) (2) (3) (4) (5) (6) (7) (8); as shown in FIG. 2, it generates the text labels, and the label data all come from the data document.
Step 3-2 comprises (1) (9) (10); as shown in FIG. 3, it generates the subject label of the document, and the label data do not necessarily come from the data document.
In step four, the visualization of the document data labels and the key content is realized. The invention uses two kinds of labels: text labels, whose content comes from the text data, and theme labels, whose data come from the theme of the document data; the latter reflect the theme of the document and also solve the problem that the document data may not contain its own theme words.
The invention generates the presentation in the form of a tag cloud using PyTagCloud, a Python extension library that implements Wordle-style tag clouds. The generated tag cloud shows different words in different colors; it displays the first five ranked words, and the font size of a word reflects its weight: the larger the weight, the more prominent the word appears in the cloud. In addition, in the document data the 20 text labels are marked in a color different from the other characters, so the user can quickly find the key points when reading the document content.
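A minimal sketch of this visualization step with PyTagCloud; the word weights, output path and font name are illustrative, and a font supporting Chinese characters would need to be registered separately.

```python
from pytagcloud import create_tag_image, make_tags

# (word, weight) pairs, e.g. top-ranked label words and their TextRank values
counts = [("innovation", 90), ("big data", 70), ("label", 55),
          ("topic", 40), ("TextRank", 30)]

tags = make_tags(counts, maxsize=80)                 # scale weights to font sizes
create_tag_image(tags, "tag_cloud.png", size=(900, 600), fontname="Lobster")
```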
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (7)

1. An innovative creative label automatic labeling method based on big data is characterized by comprising the following steps:
step (1): model training:
training a text depth representation model Word2vector by using a corpus, and obtaining all words and vector model files corresponding to all words in the corpus after training to obtain a well-trained Word2vector model;
training a document theme generation model LDA by using a corpus to obtain an LDA result set and a trained LDA model, wherein the LDA result set comprises a plurality of themes, and each theme comprises words belonging to the theme and the probability of the words belonging to the theme;
step (2): performing word segmentation on the data document of the page the user is currently browsing with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and then removing stop words, to obtain a preprocessed data document;
step (3): generating a text label and a subject label;
step (4): realizing visualization of the final text label and theme label;
the step (3) comprises the following steps:
step (31): labeling text labels on the preprocessed data documents by using a TextRank algorithm of unsupervised learning, calculating the correlation between words based on a vector model file by using a trained Word2vector model, and modifying the text labels by using the correlation between the words; generating a final text label;
step (32): performing theme analysis on the preprocessed data document by using an LDA result set to generate a theme label;
the step (31) comprises:
step (311): reading the preprocessed data document, and counting the information of each word in the data document; the information of each word includes: word frequency, position of first appearance of word, position of last appearance of word and total number of words;
a step (312): calculating the word weight: respectively calculating the values of the word frequency factor, the word position factor and the word span factor;
the weight m(w_i) of word w_i is calculated as:
m(w_i) = tf(w_i) * loc(w_i) * span(w_i);    (1)
where tf(w_i) is the word-frequency factor of word w_i, loc(w_i) is its position factor, and span(w_i) is its span factor;
step (313): calculating word spacing, taking a sentence as a unit, if two words appear in one sentence at the same time, adding 1 to the co-occurrence times of the two words, wherein the word spacing is the reciprocal of the co-occurrence times, and if the co-occurrence times of the two words is 0, the distance between the two words is infinite;
step (314): calculating the word attraction: substituting the word spacing from step (313) into the word attraction quantization formula to obtain the quantified attraction between the two words; if the distance between two words is infinite, their attraction is 0, and whether either word appears has no influence on the other;
step (315): calculating the correlation between words: a cosine value representing the magnitude of the correlation is calculated with the trained Word2vector model;
step (316): calculating the TextRank value of each word: initialize every TextRank value to 1, substitute the word relation results into the improved TextRank formula, set the iteration termination threshold to 0.0001, and iterate the improved TextRank formula until the results converge, thereby obtaining the TextRank value of each word;
step (317): sorting the words from high to low by their TextRank values;
step (318): selecting the top 20 words in the ranking as the text labels;
in the process of training the text depth representation model Word2vector with the corpus, after the corpus words and their corresponding vectors are obtained, all words are clustered with k-means according to vector correlation, yielding clusters composed of highly correlated words; the correlation is determined by calculating the cosine value of the two words, and the larger the cosine value, the stronger the correlation;
assume that the words w_i and w_j are both n-dimensional vectors; their correlation cos(w_i, w_j) is calculated as:
cos(w_i, w_j) = ( Σ_{k=1}^{n} w_{ik} * w_{jk} ) / ( sqrt(Σ_{k=1}^{n} w_{ik}^2) * sqrt(Σ_{k=1}^{n} w_{jk}^2) );    (6)
the improved word relation Conn(w_i, w_j) is then obtained:
Conn(w_i, w_j) = conn(w_i, w_j) * (1 + cos(w_i, w_j));    (7)
the word attraction quantization formula is:
conn(w_i, w_j) = m(w_i) * m(w_j) / r(w_i, w_j)^2;    (5)
where m(w_i) is the weight of word w_i, m(w_j) is the weight of word w_j, conn(w_i, w_j) reflects the relation between the two words with different weights, and r(w_i, w_j) is the spacing between word w_i and word w_j;
the improved TextRank formula is obtained:
TextRank(w_i) = (1 - d) + d * Σ_{w_j ∈ L(w_i)} [ Conn(w_j, w_i) / Σ_{w_k ∈ L(w_j)} Conn(w_j, w_k) ] * TextRank(w_j);    (8)
where TextRank(w_i) denotes the importance of w_i, TextRank(w_j) denotes the importance of w_j, d is the damping factor, and L(w_i) and L(w_j) are the sets of words directly connected to w_i and w_j, respectively.
2. The big data-based innovative creative tag automatic labeling method of claim 1, wherein the stop words of step (2) include words whose frequency of use crosses a set threshold and words without actual meaning;
the words without practical meaning include modal particles, auxiliary words, prepositions and conjunctions;
the step of removing stop words comprises: after word segmentation processing, part of speech is labeled, nouns, verbs and adjectives are reserved, words of other part of speech are filtered, and meanwhile, words with the use frequency exceeding a set threshold value need to be filtered.
3. The method as claimed in claim 1, wherein the word-frequency factor tf(w_i) is calculated as a nonlinear function of fre(w_i) (Equation (2)), where fre(w_i) denotes the number of occurrences of the word w_i in the data document.
4. The method as claimed in claim 1, wherein the word position factor loc(w_i) is calculated from area(w_i), the position value of the word w_i (Equation (3));
words at different positions in the text play different roles: words in the first 10% of the text are the most important for expressing the text theme, and words in the 10%-30% range are the next most important; the text data is divided into three regions, where the first 10% is the first region with position value 50, the 10%-30% range is the second region with position value 30, and the last region has position value 20; a word appearing in more than one region takes the maximum value.
5. The big data-based innovative creative tag automatic labeling method of claim 1, characterized in that the word span factor span(w_i) is calculated from first(w_i), the position of the first occurrence of the word in the text, last(w_i), the position of its last occurrence, and sum, the total number of words contained in the text (Equation (4));
the word span reflects the coverage of a word in the text, and the larger the span, the greater the contribution to reflecting global information; in label extraction, words with a large span can reflect the global theme of the text.
6. The big data based innovative creative tag auto-tagging method of claim 1, characterized in that,
said step (32) comprises:
step (321): reading the preprocessed data document, recording the total number of text words, and counting the information of each word in the data;
step (322): calculating the theme distribution probability of the data document through the LDA result set;
the LDA result set contains a number of topics; each topic comprises the words belonging to it and the probability of each word belonging to it, and within each topic all words are sorted by probability value from large to small; the preprocessed data document is treated as a sequence [w_1, w_2, w_3, ..., w_n], where w_i denotes the ith word and n is the total number of words; the expected number of the document's words belonging to each topic is E_{T_i};
if there are K topics, the probability distribution of the data document over the different topics is [p_{T_1}, p_{T_2}, ..., p_{T_K}];
the probability p_{T_i} that the data document belongs to the ith topic T_i is calculated as:
p_{T_i} = E_{T_i} / Σ_{k=1}^{K} E_{T_k};    (9)
where E_{T_i} denotes the expected number of words belonging to the ith topic T_i; if word w_j belongs to the ith topic T_i with probability p(w_j, T_i), then E_{T_i} is calculated as:
E_{T_i} = Σ_{j=1}^{n} p(w_j, T_i);    (10)
step (323): and selecting a theme with the highest probability, and taking 5 words from the words contained in the theme according to the probability from high to low to form the theme label.
7. An innovative creative tag automatic labeling system based on big data, characterized by comprising:
a model training unit:
training a text depth representation model Word2vector by using a corpus, and obtaining all words and vector model files corresponding to all words in the corpus after training to obtain a well-trained Word2vector model;
training a document theme generation model LDA by using a corpus to obtain an LDA result set and a trained LDA model, wherein the LDA result set comprises a plurality of themes, and each theme comprises words belonging to the theme and the probability of the words belonging to the theme;
a data document processing unit: performing word segmentation on the data document of the page the user is currently browsing with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and then removing stop words, to obtain a preprocessed data document;
a label generation unit: generating a text label and a subject label;
a visualization unit: the visualization of the final text label and the theme label is realized;
the label generating unit is as follows:
step (31): labeling text labels on the preprocessed data documents by using a TextRank algorithm of unsupervised learning, calculating the correlation between words based on a vector model file by using a trained Word2vector model, and modifying the text labels by using the correlation between the words; generating a final text label;
step (32): performing theme analysis on the preprocessed data document by using an LDA result set to generate a theme label;
the step (31) comprises:
step (311): reading the preprocessed data document, and counting the information of each word in the data document; the information of each word includes: word frequency, position of first appearance of word, position of last appearance of word and total number of words;
a step (312): calculating the word weight: respectively calculating the values of the word frequency factor, the word position factor and the word span factor;
the weight m(w_i) of word w_i is calculated as:
m(w_i) = tf(w_i) * loc(w_i) * span(w_i);    (1)
where tf(w_i) is the word-frequency factor of word w_i, loc(w_i) is its position factor, and span(w_i) is its span factor;
step (313): calculating word spacing: taking the sentence as the unit, whenever two words appear in the same sentence their co-occurrence count is increased by 1; the word spacing is the reciprocal of the co-occurrence count, and if two words never co-occur their distance is infinite;
step (314): calculating word attraction: the word spacing from step (313) is substituted into the word attraction quantization formula to obtain the quantified attraction between the two words; if the distance between two words is infinite, their attraction is 0 and the two words do not influence each other regardless of whether they appear;
step (315): calculating the correlation between words: the trained Word2vector model is used to compute a cosine value representing the magnitude of the correlation;
step (316): calculating the TextRank value of each word: the TextRank values are initialized to 1, the word relationship results are substituted into the improved TextRank formula, the iteration termination threshold is set to 0.0001, and iteration with the improved formula continues until the results converge, yielding the TextRank value of each word;
step (317): sorting the words from high to low by their TextRank values;
step (318): selecting the top 20 words of the sorted result as the text labels;
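The following sketch illustrates steps (311)-(313); since this section does not spell out the exact forms of the word frequency, position, and span factors, the factor forms used here are assumptions for illustration only:

from collections import defaultdict

def word_statistics(sentences):
    """Steps (311)-(312): word frequency, first/last position, total word count."""
    stats, total, pos = {}, 0, 0
    for sent in sentences:
        for w in sent:
            info = stats.setdefault(w, {"tf": 0, "first": pos, "last": pos})
            info["tf"] += 1
            info["last"] = pos
            pos += 1
            total += 1
    return stats, total

def word_weight(info, total):
    """m(w) = tf(w) * loc(w) * span(w); the factor forms below are assumed."""
    tf = info["tf"] / total                             # word frequency factor
    loc = 1.0 + 1.0 / (1.0 + info["first"])             # earlier first appearance weighs more
    span = (info["last"] - info["first"] + 1) / total   # word span factor
    return tf * loc * span

def word_spacing(sentences):
    """Step (313): spacing = 1 / co-occurrence count; never co-occurring pairs are omitted (infinite distance)."""
    cooc = defaultdict(int)
    for sent in sentences:
        uniq = sorted(set(sent))
        for i, a in enumerate(uniq):
            for b in uniq[i + 1:]:
                cooc[(a, b)] += 1
    return {pair: 1.0 / n for pair, n in cooc.items()}

sentences = [["big", "data", "label"], ["label", "theme", "data"]]
stats, total = word_statistics(sentences)
weights = {w: word_weight(info, total) for w, info in stats.items()}
spacing = word_spacing(sentences)
print(weights)
print(spacing)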
in the process of training the text deep representation model Word2vector on the corpus, once the vectors corresponding to all corpus words are obtained, k-means clustering is performed on all words according to vector correlation, yielding clusters of highly correlated words; the correlation between two words is determined by the cosine of their vectors, and the larger the cosine value, the greater the correlation;
assuming the words w_i and w_j are both represented by n-dimensional vectors, their correlation cos(w_i, w_j) is calculated as:
cos(w_i, w_j) = ( Σ_{k=1..n} w_{ik} * w_{jk} ) / ( sqrt(Σ_{k=1..n} w_{ik}^2) * sqrt(Σ_{k=1..n} w_{jk}^2) );
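A brief sketch of the clustering described above, assuming scikit-learn's KMeans over the gensim word vectors; normalizing the vectors so that Euclidean distance tracks cosine similarity is an implementation choice, not something this description prescribes:

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

w2v = Word2Vec.load("word2vec.model")
words = list(w2v.wv.index_to_key)
vectors = np.array([w2v.wv[w] for w in words])

# L2-normalize so that Euclidean k-means groups vectors with high cosine similarity.
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

kmeans = KMeans(n_clusters=min(10, len(words)), n_init=10, random_state=0).fit(vectors)
clusters = {}
for w, c in zip(words, kmeans.labels_):
    clusters.setdefault(c, []).append(w)
print(clusters)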
the improved word relationship Conn(w_i, w_j) is then obtained by the formula:
Conn(w_i, w_j) = conn(w_i, w_j) * (1 + cos(w_i, w_j)); (7)
the word attraction quantization formula is:
conn(w_i, w_j) = m(w_i) * m(w_j) / r(w_i, w_j)^2; (5)
where m(w_i) is the weight of word w_i, m(w_j) is the weight of word w_j, conn(w_i, w_j) reflects the relationship between two words of different weights, and r(w_i, w_j) denotes the spacing between word w_i and word w_j;
the improved TextRank formula is obtained:
TextRank(w_i) = (1 - d) + d * Σ_{w_j ∈ In(w_i)} [ Conn(w_j, w_i) / Σ_{w_k ∈ Out(w_j)} Conn(w_j, w_k) ] * TextRank(w_j)
where d is the damping factor, TextRank(w_i) denotes the importance of w_i, and TextRank(w_j) denotes the importance of w_j.
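Putting formulas (5) and (7) together with the iteration of step (316), a compact, non-authoritative sketch of the improved TextRank computation follows; the damping factor 0.85 is the usual TextRank choice and is assumed here, and in practice the cosine could come from the trained Word2vector model (e.g. gensim's wv.similarity), while the toy weights and spacings below are made up:

def improved_textrank(words, weights, spacing, cosine, d=0.85, tol=1e-4, max_iter=100):
    """Iterate TextRank over edges weighted by Conn(w_i, w_j) until convergence."""
    # Edge weights: conn = m_i * m_j / r^2 (formula 5), Conn = conn * (1 + cos) (formula 7).
    conn = {}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            r = spacing.get((min(a, b), max(a, b)))   # None means the pair never co-occurs
            if r is None:
                continue                              # infinite distance -> attraction 0, no edge
            base = weights[a] * weights[b] / (r ** 2)
            conn[(a, b)] = conn[(b, a)] = base * (1 + cosine(a, b))

    rank = {w: 1.0 for w in words}                    # step (316): initialize TextRank to 1
    out_sum = {w: sum(v for (x, _), v in conn.items() if x == w) for w in words}
    for _ in range(max_iter):
        new = {}
        for w in words:
            s = sum(conn[(u, w)] / out_sum[u] * rank[u]
                    for u in words if (u, w) in conn and out_sum[u] > 0)
            new[w] = (1 - d) + d * s
        converged = max(abs(new[w] - rank[w]) for w in words) < tol
        rank = new
        if converged:
            break
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with made-up weights, spacings, and a constant cosine of 0.5.
words = ["big", "data", "label", "theme"]
weights = {"big": 0.4, "data": 0.6, "label": 0.8, "theme": 0.3}
spacing = {("big", "data"): 0.5, ("data", "label"): 1.0, ("label", "theme"): 1.0}
print(improved_textrank(words, weights, spacing, cosine=lambda a, b: 0.5)[:20])

The top entries of the sorted result then correspond to the 20 text labels selected in step (318).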
CN201710173029.3A 2017-03-22 2017-03-22 Innovative creative tag automatic labeling method and system based on big data Active CN106997382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710173029.3A CN106997382B (en) 2017-03-22 2017-03-22 Innovative creative tag automatic labeling method and system based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710173029.3A CN106997382B (en) 2017-03-22 2017-03-22 Innovative creative tag automatic labeling method and system based on big data

Publications (2)

Publication Number Publication Date
CN106997382A CN106997382A (en) 2017-08-01
CN106997382B true CN106997382B (en) 2020-12-01

Family

ID=59431684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710173029.3A Active CN106997382B (en) 2017-03-22 2017-03-22 Innovative creative tag automatic labeling method and system based on big data

Country Status (1)

Country Link
CN (1) CN106997382B (en)

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704503A (en) * 2017-08-29 2018-02-16 平安科技(深圳)有限公司 User's keyword extracting device, method and computer-readable recording medium
CN107861948B (en) * 2017-11-16 2021-09-17 百度在线网络技术(北京)有限公司 Label extraction method, device, equipment and medium
CN108415953B (en) * 2018-02-05 2021-08-13 华融融通(北京)科技有限公司 Method for managing bad asset management knowledge based on natural language processing technology
CN108549626B (en) * 2018-03-02 2020-11-20 广东技术师范学院 Keyword extraction method for admiration lessons
CN108763189B (en) * 2018-04-12 2022-03-25 武汉斗鱼网络科技有限公司 Live broadcast room content label weight calculation method and device and electronic equipment
CN108536679B (en) * 2018-04-13 2022-05-20 腾讯科技(成都)有限公司 Named entity recognition method, device, equipment and computer readable storage medium
CN108959431B (en) * 2018-06-11 2022-07-05 中国科学院上海高等研究院 Automatic label generation method, system, computer readable storage medium and equipment
CN110738033B (en) * 2018-07-03 2023-09-19 百度在线网络技术(北京)有限公司 Report template generation method, device and storage medium
CN109344248B (en) * 2018-07-27 2021-10-22 中山大学 Academic topic life cycle analysis method based on scientific and technological literature abstract clustering
CN108920466A (en) * 2018-07-27 2018-11-30 杭州电子科技大学 A kind of scientific text keyword extracting method based on word2vec and TextRank
CN110807097A (en) * 2018-08-03 2020-02-18 北京京东尚科信息技术有限公司 Method and device for analyzing data
CN109344253A (en) * 2018-09-18 2019-02-15 平安科技(深圳)有限公司 Add method, apparatus, computer equipment and the storage medium of user tag
CN111125355A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Information processing method and related equipment
CN109710916B (en) * 2018-11-02 2024-02-23 广州财盟科技有限公司 Label extraction method and device, electronic equipment and storage medium
CN109614455B (en) * 2018-11-28 2020-12-01 武汉大学 Deep learning-based automatic labeling method and device for geographic information
CN110399606B (en) * 2018-12-06 2023-04-07 国网信息通信产业集团有限公司 Unsupervised electric power document theme generation method and system
CN109783798A (en) * 2018-12-12 2019-05-21 平安科技(深圳)有限公司 Method, apparatus, terminal and the storage medium of text information addition picture
CN111382265B (en) * 2018-12-28 2023-09-19 中国移动通信集团贵州有限公司 Searching method, device, equipment and medium
CN109686445B (en) * 2018-12-29 2023-07-21 成都睿码科技有限责任公司 Intelligent diagnosis guiding algorithm based on automatic label and multi-model fusion
CN109885674B (en) * 2019-02-14 2022-10-25 腾讯科技(深圳)有限公司 Method and device for determining and recommending information of subject label
CN110162592A (en) * 2019-05-24 2019-08-23 东北大学 A kind of news keyword extracting method based on the improved TextRank of gravitation
CN110263343B (en) * 2019-06-24 2021-06-15 北京理工大学 Phrase vector-based keyword extraction method and system
CN110347977A (en) * 2019-06-28 2019-10-18 太原理工大学 A kind of news automated tag method based on LDA model
CN110413796A (en) * 2019-07-03 2019-11-05 北京信息科技大学 A kind of coal mine typical power disaster Methodologies for Building Domain Ontology
CN110557504B (en) * 2019-08-30 2021-06-04 Oppo广东移动通信有限公司 Dynamic update method, device, equipment and medium for ring of intelligent terminal equipment
CN110717329B (en) * 2019-09-10 2023-06-16 上海开域信息科技有限公司 Method for performing approximate search based on word vector to rapidly extract advertisement text theme
CN112559853B (en) * 2019-09-26 2024-01-12 北京沃东天骏信息技术有限公司 User tag generation method and device
CN111177321B (en) * 2019-12-27 2023-10-20 东软集团股份有限公司 Method, device, equipment and storage medium for determining corpus
CN112270192B (en) * 2020-11-23 2023-12-19 科大国创云网科技有限公司 Semantic recognition method and system based on part of speech and deactivated word filtering
CN112905741B (en) * 2021-02-08 2022-04-12 合肥供水集团有限公司 Water supply user focus mining method considering space-time characteristics
CN113761911A (en) * 2021-03-17 2021-12-07 中科天玑数据科技股份有限公司 Domain text labeling method based on weak supervision
CN113128234B (en) * 2021-06-17 2021-11-02 明品云(北京)数据科技有限公司 Method and system for establishing entity recognition model, electronic equipment and medium
CN114661900A (en) * 2022-02-25 2022-06-24 安阳师范学院 Text annotation recommendation method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164394A (en) * 2012-07-16 2013-06-19 上海大学 Text similarity calculation method based on universal gravitation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021620A (en) * 2016-07-14 2016-10-12 北京邮电大学 Method for realizing automatic detection for power failure event by utilizing social contact media
CN106469187B (en) * 2016-08-29 2019-12-03 东软集团股份有限公司 The extracting method and device of keyword
CN106372064B (en) * 2016-11-18 2019-04-19 北京工业大学 A kind of term weight function calculation method of text mining

Also Published As

Publication number Publication date
CN106997382A (en) 2017-08-01

Similar Documents

Publication Publication Date Title
CN106997382B (en) Innovative creative tag automatic labeling method and system based on big data
US9201957B2 (en) Method to build a document semantic model
Nguyen et al. Keyphrase extraction in scientific publications
KR101136007B1 (en) System and method for anaylyzing document sentiment
CN110888991B (en) Sectional type semantic annotation method under weak annotation environment
El-Shishtawy et al. Arabic keyphrase extraction using linguistic knowledge and machine learning techniques
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
Gupta et al. A novel hybrid text summarization system for Punjabi text
Alami et al. Hybrid method for text summarization based on statistical and semantic treatment
Turdakov Word sense disambiguation methods
Lynn et al. An improved method of automatic text summarization for web contents using lexical chain with semantic-related terms
CN112036178A (en) Distribution network entity related semantic search method
CN114706972A (en) Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression
Gopan et al. Comparative study on different approaches in keyword extraction
CN112711666B (en) Futures label extraction method and device
Tahrat et al. Text2geo: from textual data to geospatial information
Varghese et al. Lexical and semantic analysis of sacred texts using machine learning and natural language processing
US8862459B2 (en) Generating Chinese language banners
Liu et al. Keyword extraction using PageRank on synonym networks
Pokharana et al. A Review on diverse algorithms used in the context of Plagiarism Detection
CN114265936A (en) Method for realizing text mining of science and technology project
CN112800243A (en) Project budget analysis method and system based on knowledge graph
Wang Query Segmentation and Tagging
Shaban A semantic graph model for text representation and matching in document mining
Gheni et al. Suggesting new words to extract keywords from title and abstract

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant