CN110362680B

CN110362680B - Soft-advertisement detection and advertisement extraction method based on graph network structure analysis

Info

Publication number: CN110362680B
Application number: CN201910515297.8A
Authority: CN
Inventors: 王晨旭; 梁潇
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2019-06-14
Filing date: 2019-06-14
Publication date: 2021-07-13
Anticipated expiration: 2039-06-14
Also published as: CN110362680A

Abstract

The invention discloses a soft-advertisement detection and advertisement extraction method based on graph network structure analysis. An article is converted into a sentence graph network structure and a word graph network structure, built with sentences and words as nodes, respectively. Network structure attribute features are then extracted from the two graph network structures and combined with the TF-IDF feature vector of each article, and an SVM classifier is trained on the labels marked in a corpus to identify and detect soft-advertisement articles. When the advertisement content is extracted, both the semantic importance of each word and the influence of its node in the word graph network structure are considered, so that the advertisement content is extracted accurately, which effectively helps the social network platform supervise and manage soft advertisements.

Description

Soft-advertisement detection and advertisement extraction method based on graph network structure analysis

Technical Field

The invention belongs to the field of text classification and advertisement content extraction, and particularly relates to a soft-advertisement detection and advertisement extraction method based on graph network structure analysis.

Background

With the rapid development of internet technology, social network platforms have greatly changed and enriched our lives. In August 2012, WeChat launched the subscription function of the WeChat Official Accounts platform, through which an account can send articles, pictures and other information to its subscribers. As the influence of the platform grew, some official accounts gained thousands of followers and built close relationships with their audience. In recent years, many enterprises and merchants have taken note of this and developed a new marketing method, soft-text advertising (soft advertisement), using the WeChat Official Accounts platform for brand promotion. Soft-advertisement content is usually embedded in the article in a hidden, roundabout way, reaches the user directly, and is highly readable. However, some soft advertisements spread false information that misleads consumers and even harms their economic interests. Moreover, soft advertisements can degrade the reading experience and further damage the ecosystem of the platform. Accurately detecting soft advertisements and extracting their advertisement content is therefore necessary for the supervision and management of the platform, and effectively protects the consumer rights of readers.

Guo's method (Y. Guo and M. Iwaihara, "Detection of text-based advertising and promotion in Wikipedia by deep learning method") is a plain-text-based method for automatically detecting advertisements and promotional articles in Wikipedia. Each article is first converted into a document vector using an improved deep learning method, and a supervised SVM classifier is then trained on the document vectors for prediction. However, the method considers only the plain text and ignores the characteristics of advertisements and promotional articles, which leaves considerable room for improving detection accuracy.

Bhosale's method (S. Bhosale, H. Vinicombe, and R. Mooney, "Detecting promotional content in Wikipedia," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1851-1857) detects promotional articles in Wikipedia and improves identification accuracy by combining n-gram features with features of a PCFG language model. Although this method considers two kinds of features, it still does not consider features related to topic relevance.

Zhang's method (X. Zhang, S. Zhu, and W. Liang, "Detecting spam and promoting campaigns in the Twitter social network," in 2012 IEEE 12th International Conference on Data Mining, IEEE, 2012, pp. 1194-1199) proposes a scalable framework for detecting marketing campaigns and spam. Accounts that publish similar marketing and spam URLs are first linked, candidate campaigns that may exist for spam or promotional purposes are then extracted, and finally their intent is distinguished. The method measures the similarity between the URLs posted by accounts according to the characteristics of the URLs, and distinguishes the characteristics of different marketing campaigns using machine learning. However, the method focuses mainly on URL characteristics and is not suitable for identifying and detecting soft advertisements published on the WeChat Official Accounts platform.

Disclosure of Invention

In order to overcome the problems in the prior art, the invention aims to provide a soft-advertisement detection and advertisement extraction method based on graph network structure analysis.

In order to achieve the purpose, the technical scheme of the invention is as follows:

a soft-advertisement detection and advertisement extraction method based on graph network structure analysis comprises the following steps:

step 1: calculating TF-IDF feature vectors;

first, each article in a corpus is segmented into words; the TF-IDF value of each word is then calculated, and the TF-IDF feature vector of each article is obtained from the TF-IDF values of its words;

step 2: constructing a sentence graph network structure;

step 3: constructing a word graph network structure;

step 4: extracting graph network structure attribute features for training and detection;

community detection and division are carried out on the sentence graph network structure and the word graph network structure; the graph network structure attribute features of each article are then extracted and combined with its TF-IDF feature vector to obtain a combined feature vector; the soft-advertisement articles and normal articles labeled in the corpus are taken as the data set, a supervised SVM classifier is trained on the combined feature vectors, and the classifier is used to predict the class of unlabeled articles.

A further improvement of the invention is that, in step 1, the TF-IDF value of each word is calculated using the following formula:

wherein tfidf(w) is the TF-IDF value of word w, #(w, A) denotes the word frequency of word w in article A, N_w is the number of words in article A, |D| is the number of articles in the corpus, and I(w, A) is an indicator function.

A further improvement of the invention is that the value of the indicator function I(w, A) is 1 when word w is in article A, and 0 otherwise.

The invention is further improved in that the specific process of step 2 is as follows: first, each article is segmented into sentences; the sentences are used as nodes and the semantic similarity between each sentence pair is used as the edge weight to construct the sentence graph network structure of the article;

wherein the weight of the edge is calculated using the following formula:

wherein #(w_t, s_i) denotes the word frequency of word w_t in sentence s_i, and i and j denote the order in which the sentences appear in the article; when the weight between node s_i and node s_j is greater than or equal to the threshold α, an edge is constructed, thereby constructing the sentence graph network structure of the article.

A further development of the invention is that the threshold α is 0.1.

The invention is further improved in that the specific process of step 3 is as follows: words are used as nodes and the co-occurrence frequency between words is used as the edge weight to construct the word graph network structure of an article;

wherein the weight of the edge is calculated as follows:

wherein the formula uses the word frequency of word w_i in article A, and I((w_i, w_j), s_t) is an indicator function: when word w_i and word w_j co-occur in sentence s_t, the value of I((w_i, w_j), s_t) is 1, otherwise it is 0; if the weight is not 0, an edge is constructed, thereby constructing the word graph network structure.

A further improvement of the invention is that the community attributes among the graph network structure attribute features of each article include a community association degree feature (Community-Connectivity), which is calculated using the following formula:

where the indices i and j denote different communities, n_ij denotes the number of edges connecting community i and community j, n_i denotes the number of nodes in community i, and n_j denotes the number of nodes in community j.
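The formula itself is not reproduced in this text, so the following Python sketch is only one plausible reading of the definitions above. It assumes that each pair of communities contributes the ratio n_ij / (n_i * n_j) and that the article-level feature is the mean over all community pairs; both the ratio and the averaging are assumptions, and the function name `community_connectivity` is hypothetical.

```python
from itertools import combinations


def community_connectivity(edges, membership):
    """Mean of n_ij / (n_i * n_j) over all community pairs (assumed form).

    edges: iterable of (u, v) node pairs.
    membership: dict mapping node -> integer community id.
    """
    # n_i: number of nodes in each community.
    sizes = {}
    for community in membership.values():
        sizes[community] = sizes.get(community, 0) + 1

    # n_ij: number of edges whose endpoints lie in different communities.
    between = {}
    for u, v in edges:
        c_u, c_v = membership[u], membership[v]
        if c_u != c_v:
            key = (min(c_u, c_v), max(c_u, c_v))
            between[key] = between.get(key, 0) + 1

    pairs = list(combinations(sorted(sizes), 2))
    if not pairs:
        return 0.0  # a single community has no inter-community links
    total = sum(between.get(p, 0) / (sizes[p[0]] * sizes[p[1]]) for p in pairs)
    return total / len(pairs)


membership = {"a": 0, "b": 0, "c": 1, "d": 1}
edges = [("a", "b"), ("c", "d"), ("b", "c")]
connectivity = community_connectivity(edges, membership)
```

In this toy graph one edge links two communities of two nodes each, so the assumed ratio is 1 / (2 * 2). Under this reading, a higher value means the communities (topics) of an article are more tightly linked.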

A further improvement of the invention is that the method also comprises step 5: extracting the advertisement content of the soft advertisement.

A further improvement of the invention is that the specific process of step 5 is as follows: the importance score of each word in each community of the word graph network structure is first calculated; the words in each community are then sorted in descending order of importance, and the advertisement content in the article is extracted.

A further improvement of the present invention is that the importance scores for the terms in each community in the term graph network structure are calculated using the following formula:

wherein Score(w_i) is the importance score of word w_i within its community in the word graph network structure, tfidf(w_i) denotes the TF-IDF value of word w_i, the degree term denotes the degree of the node representing word w_i, and N_c denotes the number of nodes in the community containing word w_i.

Compared with the prior art, the invention has the following beneficial effects: two text network structures, a sentence graph network structure and a word graph network structure, are constructed for each article; graph network structure attribute features are extracted from them and combined with the traditional TF-IDF feature vector for training and classification, which yields more accurate detection; the invention adopts a feature (Community-Connectivity) that measures the degree of topic association between communities, which clearly exposes the difference between soft-advertisement articles and normal articles in the topic association of their communities and improves accuracy.

Furthermore, the method for measuring the importance of word nodes within a community can accurately extract the advertisement content in soft-advertisement articles and helps the social platform better manage and supervise such articles.

Furthermore, when the importance of the word nodes in the community is measured, the semantic importance degree of the words and the influence of the nodes in the community are considered, so that the importance scores of the words in each community in the word graph network structure are calculated, and then the advertisement content in the article is extracted after the ranking is carried out according to the scores.

Drawings

FIG. 1 compares the Community-Connectivity attribute feature of the SentenceGraph graph network structure for a soft-advertisement article and a normal article.

FIG. 2 compares the Community-Connectivity attribute feature of the WordGraph graph network structure for a soft-advertisement article and a normal article.

FIG. 3 compares the distribution of the Community-Connectivity attribute feature of the SentenceGraph graph network structure for 200 soft-advertisement articles and 200 normal articles.

FIG. 4 compares the distribution of the Community-Connectivity attribute feature of the WordGraph graph network structure for 200 soft-advertisement articles and 200 normal articles.

FIG. 5 is a schematic diagram of the advertisement content extraction result for a soft-advertisement article; the green nodes in the box are the extracted advertisement content.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings.

The invention provides a soft-advertisement detection and advertisement extraction method based on graph network structure analysis, which comprises the following steps:

In the invention, TF-IDF denotes term frequency-inverse document frequency.

Step 1: calculating TF-IDF feature vectors: first, each article in the corpus is segmented into words; the TF-IDF value of each word is then calculated, and the TF-IDF feature vector of each article is obtained from the TF-IDF values of its words;

the TF-IDF value is calculated as follows:

wherein tfidf(w) is the TF-IDF value of word w, #(w, A) denotes the word frequency of word w in article A, N_w is the number of words in article A, |D| is the number of articles in the corpus, and I(w, A) is an indicator function that is 1 when word w is in article A and 0 otherwise. Once the TF-IDF value of each word has been calculated, the TF-IDF feature vector of the article is obtained.
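As a rough illustration of step 1, the Python sketch below computes per-article TF-IDF values from the quantities defined above. Since the formula is not reproduced in this text, it assumes the common form (#(w, A) / N_w) * log(|D| / sum of I(w, A) over all articles); the logarithm base, the absence of smoothing, and the function name `tfidf_vectors` are all assumptions. The articles are assumed to be already segmented into word lists.

```python
import math
from collections import Counter


def tfidf_vectors(corpus):
    """corpus: list of articles, each a list of already-segmented words."""
    n_docs = len(corpus)  # |D|

    # Document frequency: sum over articles of the indicator I(w, A).
    doc_freq = Counter()
    for article in corpus:
        doc_freq.update(set(article))

    vectors = []
    for article in corpus:
        counts = Counter(article)  # #(w, A)
        n_words = len(article)     # N_w
        vectors.append({
            w: (c / n_words) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return vectors


corpus = [
    ["buy", "this", "cream", "cream"],
    ["weather", "this", "week"],
]
vecs = tfidf_vectors(corpus)
```

With this unsmoothed form, a word that appears in every article (here "this") gets a TF-IDF value of zero, while a word concentrated in one article (here "cream") gets a positive value.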

Step 2: constructing the SentenceGraph (sentence graph network structure): after each article in the corpus has been segmented into words in step 1, each article is segmented into sentences, and the sentence graph network structure of an article is constructed with the sentences as nodes and the semantic similarity between each sentence pair as the edge weight; the specific process is as follows:

The weight of an edge in the SentenceGraph (sentence graph network structure) is calculated as follows:

wherein #(w_t, s_i) denotes the word frequency of word w_t in sentence s_i, and i and j denote the order in which the sentences appear in the article.

When the weight between node s_i and node s_j is greater than or equal to the threshold α, an edge is constructed, thereby constructing the sentence graph network structure of the article. Here, α is set to 0.1.
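The thresholding rule of step 2 can be sketched as follows. Because the edge-weight formula is not reproduced in this text, the sketch substitutes cosine similarity over the word frequencies #(w_t, s_i) as the similarity measure; that choice, and the function names `cosine_similarity` and `build_sentence_graph`, are assumptions made for illustration only. The threshold default of 0.1 comes from the description.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(s_i, s_j):
    """Cosine similarity of two sentences given as word lists (assumed measure)."""
    f_i, f_j = Counter(s_i), Counter(s_j)  # #(w_t, s_i) and #(w_t, s_j)
    dot = sum(f_i[w] * f_j[w] for w in f_i.keys() & f_j.keys())
    norm_i = math.sqrt(sum(v * v for v in f_i.values()))
    norm_j = math.sqrt(sum(v * v for v in f_j.values()))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0


def build_sentence_graph(sentences, alpha=0.1):
    """Return {(i, j): weight} for sentence pairs whose weight is >= alpha."""
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        weight = cosine_similarity(sentences[i], sentences[j])
        if weight >= alpha:
            edges[(i, j)] = weight
    return edges


sentences = [
    ["new", "cream", "smooth", "skin"],
    ["cream", "skin", "care"],
    ["weather", "rain", "today"],
]
graph = build_sentence_graph(sentences)
```

Here sentences 0 and 1 share two words and are linked, while sentence 2 shares no words with the others and remains an isolated node. Nodes are identified by their order of appearance in the article.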

Step 3: constructing the WordGraph (word graph network structure): after each article has been segmented into sentences in step 2 and into words in step 1, the word graph network structure of an article is constructed with the words as nodes and the co-occurrence frequency between words as the edge weight;

The weight of an edge in the WordGraph (word graph network structure) is calculated as follows:

wherein the formula uses the word frequency of word w_i in article A, and I((w_i, w_j), s_t) is an indicator function: when word w_i and word w_j co-occur in sentence s_t, the value of I((w_i, w_j), s_t) is 1, otherwise it is 0. If the weight is 0, no edge is constructed. If the weight is not 0, an edge is constructed, thereby constructing the WordGraph (word graph network structure).
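A minimal Python sketch of the co-occurrence counting behind step 3 is given below. It uses the raw number of sentences in which two words co-occur, that is, the sum of I((w_i, w_j), s_t) over sentences, as the edge weight. The formula in the original also involves word frequencies in the article, and that normalization is not reproduced here because the formula is not available in this text; the function name `build_word_graph` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_word_graph(sentences):
    """Return {(w_i, w_j): count} with one entry per co-occurring word pair.

    sentences: list of sentences, each a list of already-segmented words.
    Pairs are stored in sorted order so each undirected edge appears once.
    """
    edges = Counter()
    for sentence in sentences:
        # Each distinct pair counts at most once per sentence (indicator I).
        for w_i, w_j in combinations(sorted(set(sentence)), 2):
            edges[(w_i, w_j)] += 1
    return dict(edges)


sentences = [
    ["cream", "skin", "care"],
    ["cream", "skin"],
    ["rain", "today"],
]
word_graph = build_word_graph(sentences)
```

Word pairs that never share a sentence have weight 0 and therefore no entry in the result, which matches the rule that an edge is constructed only for a nonzero weight.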

Step 4: extracting graph network structure attribute features for training and detection: based on step 2 and step 3, community detection and division are first carried out on the sentence graph network structure and the word graph network structure; the graph network structure attribute features of each article (see Table 1) are then extracted and combined with the TF-IDF feature vector from step 1 to obtain a combined feature vector. The soft-advertisement articles and normal articles labeled in the corpus are taken as the data set, a supervised SVM classifier is trained on the combined feature vectors, and the classifier is used to predict the class of unlabeled articles;
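How the two kinds of features are combined is not spelled out in this text. The Python sketch below assumes plain concatenation of a dense TF-IDF vector (ordered by a fixed vocabulary) with the graph attribute features (ordered by a fixed list of names); the function name `combine_features` and the feature names in the example are hypothetical.

```python
def combine_features(tfidf_vector, vocabulary, graph_features, feature_names):
    """Concatenate TF-IDF values and graph attribute features (assumed scheme).

    tfidf_vector: dict word -> TF-IDF value for one article.
    vocabulary: fixed word order shared by every article in the corpus.
    graph_features: dict feature name -> numeric value for the same article.
    feature_names: fixed feature order shared by every article.
    """
    dense_tfidf = [tfidf_vector.get(w, 0.0) for w in vocabulary]
    dense_graph = [graph_features[name] for name in feature_names]
    return dense_tfidf + dense_graph


vocabulary = ["buy", "cream", "weather"]
feature_names = ["sentence_connectivity", "word_connectivity"]
combined = combine_features(
    {"cream": 0.35, "buy": 0.17},
    vocabulary,
    {"sentence_connectivity": 0.25, "word_connectivity": 0.1},
    feature_names,
)
```

The resulting rows, one per labeled article, could then be passed to any SVM implementation, for example `sklearn.svm.SVC` from scikit-learn; the original does not name a specific library or kernel.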

Table 1 lists the graph network structure attribute features extracted by the invention.

Table 1. Brief description of the graph network structure attribute features

The Community-Connectivity (community association degree) feature in Table 1 is calculated as follows:

FIG. 1 compares the Community-Connectivity attribute feature of the SentenceGraph graph network structure for a soft-advertisement article and a normal article. FIG. 2 compares the same feature of the WordGraph graph network structure for a soft-advertisement article and a normal article. FIG. 3 compares the distribution of the Community-Connectivity attribute feature of the SentenceGraph graph network structure for 200 soft-advertisement articles and 200 normal articles. FIG. 4 compares the corresponding distribution for the WordGraph graph network structure. FIG. 1 and FIG. 2 show that the median Community-Connectivity of the normal article is significantly greater than that of the soft-advertisement article, and FIG. 3 and FIG. 4 show that most normal articles have a greater Community-Connectivity value than the soft-advertisement articles, indicating that the topics of the communities in normal articles are much more closely related than those in soft-advertisement articles.

Step 5: extracting the advertisement content of the soft advertisement: the importance score of each word in each community of the word graph network structure is calculated; the words in each community are then sorted in descending order of importance, and the advertisement content in the article is extracted;

when extracting the advertisement content, both the semantic importance of a word and the influence of its node within the community are considered. Specifically, the importance score of each word in each community of the word graph network structure is calculated as follows:

wherein Score(w_i) is the importance score of word w_i within its community in the word graph network structure, tfidf(w_i) denotes the TF-IDF value of word w_i, the degree term denotes the degree of the node representing word w_i, and N_c denotes the number of nodes in the community containing word w_i. FIG. 5 is a schematic diagram of the advertisement content extraction result for a soft-advertisement article; the nodes in the box are the extracted advertisement content.
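The ranking in step 5 can be sketched in Python as follows. The score formula is not reproduced in this text, so the sketch assumes Score(w_i) = tfidf(w_i) * degree(w_i) / N_c, which combines the three quantities defined above in the simplest way; this product form and the function name `rank_community_words` are assumptions.

```python
def rank_community_words(tfidf, degree, communities):
    """Rank the words of each community by an assumed importance score.

    tfidf: dict word -> TF-IDF value.
    degree: dict word -> degree of the word's node in the word graph.
    communities: dict community id -> list of words in that community.
    """
    ranked = {}
    for community, words in communities.items():
        n_c = len(words)  # N_c: number of nodes in the community
        scores = {w: tfidf[w] * degree[w] / n_c for w in words}
        # Descending order: most important words first.
        ranked[community] = sorted(scores, key=scores.get, reverse=True)
    return ranked


tfidf = {"cream": 0.4, "skin": 0.2, "buy": 0.3, "rain": 0.5}
degree = {"cream": 3, "skin": 2, "buy": 1, "rain": 1}
communities = {0: ["cream", "skin", "buy"], 1: ["rain"]}
ranking = rank_community_words(tfidf, degree, communities)
```

In the example, "skin" outranks "buy" despite a lower TF-IDF value because its node has a higher degree, which illustrates how node influence and semantic importance are weighed together.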

The invention converts an article into a sentence graph network structure and a word graph network structure, built with sentences and words as nodes, respectively. Network structure attribute features are then extracted from the two graph network structures and combined with the TF-IDF feature vector of each article, and an SVM classifier is trained on the labels marked in a corpus to identify and detect soft-advertisement articles. When the advertisement content is extracted, both the semantic importance of each word and the influence of its node in the word graph network structure are considered, so that the advertisement content is extracted accurately, which effectively helps the social network platform supervise and manage soft advertisements.

According to the invention, two text network structures, a sentence graph network structure and a word graph network structure, are constructed for each article; graph network structure attribute features are extracted from them and combined with the traditional TF-IDF feature vector for training and classification, which yields more accurate detection. The invention defines a feature (Community-Connectivity) that measures the degree of topic association between communities and clearly exposes the difference between soft-advertisement articles and normal articles in the topic association of their communities. The invention also provides a method for measuring the importance of word nodes within a community which, by combining the semantic importance of words with the influence of their nodes in the graph network structure, can accurately extract the advertisement content in soft-advertisement articles and helps the social platform better manage and supervise such articles.

Claims

1. A soft-advertisement detection and advertisement extraction method based on graph network structure analysis, characterized by comprising the following steps:

step 1: calculating TF-IDF feature vectors;

step 2: constructing a sentence graph network structure;

step 3: constructing a word graph network structure;

2. The soft-advertisement detection and advertisement extraction method based on graph network structure analysis according to claim 1, wherein in step 1 the TF-IDF value of each word is calculated using the following formula:

3. The method of claim 2, wherein the value of the indicator function I(w, A) is 1 when word w is in article A, and 0 otherwise.

4. The soft-advertisement detection and advertisement extraction method based on graph network structure analysis according to claim 1, wherein the specific process of step 2 is as follows: first, each article is segmented into sentences; the sentences are used as nodes and the semantic similarity between each sentence pair is used as the edge weight to construct the sentence graph network structure of the article;

wherein the weight of the edge is calculated using the following formula:

wherein #(w_t, s_i) denotes the word frequency of word w_t in sentence s_i, and i and j denote the order in which the sentences appear in the article; when the weight between node s_i and node s_j is greater than or equal to the threshold α, an edge is constructed, thereby constructing the sentence graph network structure of the article.

5. The method of claim 4, wherein the threshold α is 0.1.

6. The soft-advertisement detection and advertisement extraction method based on graph network structure analysis according to claim 1, wherein the specific process of step 3 is as follows: words are used as nodes and the co-occurrence frequency between words is used as the edge weight to construct the word graph network structure of an article;

wherein the weight of the edge is calculated as follows:

wherein the formula uses the word frequency of word w_i in article A, and I((w_i, w_j), s_t) is an indicator function: when word w_i and word w_j co-occur in sentence s_t, the value of I((w_i, w_j), s_t) is 1, otherwise it is 0; if the weight is not 0, an edge is constructed, thereby constructing the word graph network structure.

7. The soft-advertisement detection and advertisement extraction method based on graph network structure analysis according to claim 1, wherein the community attributes among the graph network structure attribute features of each article include a community association degree feature, which is calculated using the following formula:

8. The soft-advertisement detection and advertisement extraction method based on graph network structure analysis according to claim 1, further comprising step 5: extracting the advertisement content of the soft advertisement.

9. The soft-advertisement detection and advertisement extraction method based on graph network structure analysis according to claim 8, wherein the specific process of step 5 is as follows: the importance score of each word in each community of the word graph network structure is first calculated; the words in each community are then sorted in descending order of importance, and the advertisement content in the article is extracted.

10. The method of claim 9, wherein the importance score of each word in each community of the word graph network structure is calculated using the following formula:

wherein the degree term denotes the degree of the node representing word w_i, and N_c denotes the number of nodes in the community containing word w_i.