CN112527964B

CN112527964B - Microblog abstract generation method based on multi-mode manifold learning and social network characteristics

Info

Publication number: CN112527964B
Application number: CN202011503521.0A
Authority: CN
Inventors: 夏书银; 曹洋洋; 陈子忠
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2022-07-01
Anticipated expiration: 2040-12-18
Also published as: CN112527964A

Abstract

The invention discloses a microblog abstract generation method based on multi-modal manifold learning and social network characteristics, comprising the following steps: acquiring the microblog set of a user on specific topics together with the user interaction information; constructing an intra-document relation matrix and a cross-document relation matrix; calculating the microblog significance from these matrices; calculating the social recognition from the user interaction information; and combining the microblog significance with the social recognition to obtain the final microblog significance, then selecting the several sentences with the highest significance as the abstract. The invention improves the manifold learning method commonly used in multi-document abstracts and integrates social network information, making better use of the sentence relations across documents on different topics and within the same document. It also adopts the maximal marginal relevance (MMR) algorithm to reduce redundant information, taking both the coverage and the diversity of the abstract into account.

Description

Microblog abstract generation method based on multi-mode manifold learning and social network characteristics

Technical Field

The invention relates to automatic text summarization technology in natural language processing, and in particular to the automatic generation of abstracts of users' microblog posts based on multi-modal manifold learning and social network characteristics.

Background

The rapid development of social media such as Twitter and microblogs provides people with a large amount of information, but also raises the cost of finding the information that matters. Research on microblog abstracts, which compress and summarize massive volumes of microblog content, has therefore become necessary. At present, the main research approaches to microblog abstracts are: (1) traditional summarization methods: SumBasic, TextRank, LexRank, Centroid, and Data Reconstruction; (2) methods that use static and dynamic social network data, such as praise counts, forwarding counts and user influence, to summarize what people are discussing under a given topic. Most recent work combines the two kinds of material. Some approaches compute microblog significance from static social network information; for example, (Li et al, 2012) combine the number of times a microblog is forwarded with the follower-following information that represents the user's influence. Others rely on dynamic social network information; for example, Heyu et al. consider people's social network relationships and propose a social network abstract algorithm with lower redundancy. (Duan Y et al, 2012) combine static and dynamic information: they sort microblogs by publication time, weight each post by the influence of the user and the content quality of the post, and then calculate sentence significance. There is also research based on the time sequence of microblogs; for example, (Nichols et al, 2012), when summarizing an event, exploit the characteristic that the curve of the number of posts peaks when something happens, and use microblog timestamps as nodes for detecting such occurrences.

Disclosure of Invention

Existing research usually summarizes a hot topic or an event within a certain period; when it is applied to summarizing an individual user's posts, the results are not ideal. In addition, some algorithms, such as Data Reconstruction, suffer from high complexity. The invention improves the manifold learning method commonly used in multi-document abstracts and integrates social network information, making better use of the sentence relations across documents on different topics and within the same document. It also uses MMR to reduce redundant information and takes both the coverage and the diversity of the abstract into account.

The technical scheme adopted by the invention is as follows: a microblog abstract generating method based on multi-modal manifold learning and social network characteristics comprises the following steps:

step one, acquiring the microblog set of a user on specific topics, together with the user interaction information;

step two, constructing the text relation matrix within a single document and the text relation matrix across documents;

step three, calculating the microblog significance from the matrices of step two;

step four, calculating the social recognition from the user interaction information;

step five, combining the microblog significance with the social recognition to obtain the final microblog significance, and, to account for redundancy, selecting the several sentences with the highest microblog significance as the abstract under the maximal marginal relevance (MMR) strategy.

Specifically, acquiring the microblog set of a user on specific topics in step one comprises: counting the word frequencies of the nouns in all acquired microblog texts and taking the top n nouns as hot topic words; then screening users by these topic words, retaining a user's posts if the number of posts relating to the n topics exceeds k; and merging the user's posts on each topic into one sample.

The technical scheme further comprises cleaning the microblog set of the specific topics, specifically removing hashtags, @-mentions, URLs and the numbers at the tail of each microblog, and removing microblogs containing fewer than m words.

The user interaction information comprises the praise count, forwarding count and comment count of each of the user's microblogs. Each count is extracted with a regular expression and is set to 0 if it cannot be extracted.
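The cleaning and extraction steps can be sketched as follows. This is a minimal illustration, not the patented implementation: the regular expressions, the `likes[12] reposts[3] comments[0]` metadata layout, and the function names are all hypothetical, and whitespace tokenisation stands in for whatever word segmenter is actually used.

```python
import re

# Hypothetical patterns; real microblog exports may format these fields differently.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#[^#\s]+#?")      # matches "#topic#" and "#topic"
MENTION_RE = re.compile(r"@\S+")
TAIL_NUMBER_RE = re.compile(r"\s*\d+\s*$")  # number at the very end of the post

# Hypothetical metadata layout such as "likes[12] reposts[3] comments[0]".
COUNT_PATTERNS = {
    "likes": re.compile(r"likes\[(\d+)\]"),
    "reposts": re.compile(r"reposts\[(\d+)\]"),
    "comments": re.compile(r"comments\[(\d+)\]"),
}


def clean_post(text):
    """Remove URLs, hashtags, @-mentions and a trailing number from one post."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = TAIL_NUMBER_RE.sub("", text)
    return " ".join(text.split())


def filter_short(posts, m):
    """Keep posts with at least m words (whitespace tokens, as an assumption)."""
    return [p for p in posts if len(p.split()) >= m]


def extract_counts(raw):
    """Extract interaction counts; a count that is not found is set to 0."""
    counts = {}
    for name, pattern in COUNT_PATTERNS.items():
        match = pattern.search(raw)
        counts[name] = int(match.group(1)) if match else 0
    return counts
```

For Chinese text, `filter_short` would need a proper word segmenter in place of `str.split`.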

Specifically, in step two, the text relation matrix within the same document, $W^a$, is defined by

$$W^a_{ij} = \begin{cases} \cos(x_i, x_j), & \text{if microblogs } i \text{ and } j \text{ belong to the same topic} \\ 0, & \text{otherwise} \end{cases}$$

where $x_i$ is the TF-IDF encoding of a single microblog and $\cos(x_i, x_j)$ is the cosine similarity of $x_i$ and $x_j$.

The text relation matrix across documents, $W^b$, is defined by

$$W^b_{ij} = \begin{cases} \cos(x_i, x_j), & \text{if } i \text{ and } j \text{ belong to different documents, or one of } i, j \text{ is } 0 \\ 0, & \text{otherwise} \end{cases}$$
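As an illustration of the two relation matrices described above (called `w_a` and `w_b` here), the following sketch builds them from tokenised posts. The TF-IDF variant, the function names, and the choice to leave the diagonal at zero are assumptions. The "one of i, j is 0" clause, which concerns the topic description sentence of the original algorithm, is omitted because the posts here carry no topic sentence.

```python
import math
from collections import Counter


def tfidf(posts):
    """Encode tokenised posts as sparse TF-IDF dicts (one common variant)."""
    n = len(posts)
    df = Counter(term for post in posts for term in set(post))
    return [
        {t: (c / len(post)) * math.log(n / df[t]) for t, c in Counter(post).items()}
        for post in posts
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def relation_matrices(vectors, topic_of):
    """Return (w_a, w_b): same-topic and cross-topic similarity matrices."""
    n = len(vectors)
    w_a = [[0.0] * n for _ in range(n)]
    w_b = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: no self-links
            sim = cosine(vectors[i], vectors[j])
            if topic_of[i] == topic_of[j]:
                w_a[i][j] = sim
            else:
                w_b[i][j] = sim
    return w_a, w_b
```

Each pair of posts contributes to exactly one of the two matrices, which is what lets the two modalities be ranked separately.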

Specifically, the microblog significance of step three is calculated by the following formulas:

$$Q_a(f) = \mu \cdot f^{T}(I - S_a)f + (1-\mu)\cdot(f-y)^{T}(f-y)$$

$$Q_b(f) = \eta \cdot f^{T}(I - S_b)f + (1-\eta)\cdot(f-y)^{T}(f-y)$$

where $Q_a(f)$ and $Q_b(f)$ are the loss functions, $I$ is the identity matrix, and $f$ is a vector holding the significance score of each sentence. The minimizer of $Q_a$, $f_a^{*} = \arg\min_f Q_a(f)$, accounts for the sentence relations within the same document; the minimizer of $Q_b$, $f_b^{*} = \arg\min_f Q_b(f)$, accounts for the sentence relations across different documents. $y = [y_0, y_1, y_2, \ldots, y_n]^{T}$ is defined over a given set of data points, in which the first point is the topic description sentence and the remaining $n$ points are all sentences in the documents (the data points to be ranked). $\mu$ and $\eta$ are the information smoothness constraints that balance the topic information against the text information in the same-document and cross-document modalities, respectively. $\lambda$ is the information smoothness constraint between the two modalities, balancing same-document information against cross-document information.
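Because each loss is quadratic in $f$, its minimizer has a closed form. The short derivation below is a sketch under stated assumptions: $S_a$ is symmetric, $0 < \mu < 1$, and the first term of $Q_a$ is read as $\mu \cdot f^{T}(I - S_a)f$ by analogy with $Q_b$. The same steps with $\eta$ and $S_b$ give $f_b^{*}$.

```latex
\begin{aligned}
\nabla_f Q_a(f) &= 2\mu\,(I - S_a)\,f + 2(1-\mu)\,(f - y) = 0 \\
\Rightarrow\quad (I - \mu S_a)\,f &= (1-\mu)\,y \\
\Rightarrow\quad f_a^{*} &= (1-\mu)\,(I - \mu S_a)^{-1}\,y
\end{aligned}
```

For a symmetrically normalized $S_a$ the eigenvalues lie in $[-1, 1]$, so $I - \mu S_a$ is invertible whenever $\mu < 1$.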

Specifically, the social recognition in step four is calculated by the formula

$$R_i = \alpha \cdot c_i + \beta \cdot re_i + \gamma \cdot l_i$$

where $c_i$, $re_i$ and $l_i$ are, respectively, the min-max normalized values of the praise count, forwarding count and comment count of the $i$-th microblog, and $\alpha$, $\beta$, $\gamma$ are hyperparameters satisfying $\alpha + \beta + \gamma = 1$.
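A minimal sketch of this calculation, assuming the counts arrive as three parallel lists. The function names are illustrative, and mapping a constant column to zeros is an assumption, since min-max normalization is undefined when all values are equal.

```python
def min_max(values):
    """Min-max normalise a list of counts to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # assumption: a constant column carries no signal
    return [(v - lo) / (hi - lo) for v in values]


def social_recognition(praises, forwards, comments, alpha, beta, gamma):
    """R_i = alpha*c_i + beta*re_i + gamma*l_i on min-max normalised counts."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    c, re_, l = min_max(praises), min_max(forwards), min_max(comments)
    return [alpha * ci + beta * ri + gamma * li for ci, ri, li in zip(c, re_, l)]
```

Because the three weights sum to 1 and every normalized count lies in [0, 1], each R_i also lies in [0, 1].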

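Further, the final microblog significance is

$$\mathrm{RankScore} = \omega \cdot f^{*} + (1-\omega)\cdot R$$

where $\omega$ is an adjustable hyperparameter with $0 < \omega < 1$, $R$ is the social recognition from step four, and RankScore is the final microblog significance.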
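The weighted combination can be sketched in a few lines. The function name is hypothetical, and the sketch assumes `f_star` and `recognition` are parallel lists with one entry per microblog.

```python
def rank_scores(f_star, recognition, omega):
    """RankScore_i = omega * f*_i + (1 - omega) * R_i for every microblog i."""
    if not 0.0 < omega < 1.0:
        raise ValueError("omega must lie strictly between 0 and 1")
    if len(f_star) != len(recognition):
        raise ValueError("f_star and recognition must have the same length")
    return [omega * f + (1.0 - omega) * r for f, r in zip(f_star, recognition)]
```

The text does not say whether f* is rescaled before mixing; if its range differs greatly from that of R, the effective balance set by ω shifts accordingly.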

In order to keep the redundancy of the selected abstract as small as possible, the scheme further comprises a redundancy removal step, specifically as follows:

1) initialize set A and set B: $A = \emptyset$, $B = \{x_i \mid i = 1, 2, \ldots, n\}$, where A is the set that stores the abstract microblogs, B is the set of candidate microblogs sorted by microblog significance score, $x_i$ is the $i$-th microblog, and $n$ is the total number of microblogs. The significance score of each microblog is $S_i = \mathrm{RankScore}_i$;

2) sort the microblogs in set B by significance score;

3) take the first element $x_i$ from set B; if $x_i$ satisfies

$$\operatorname{sim}(x_i, s) < \varepsilon \quad \text{for all } s \in A,$$

where $s$ denotes a microblog in set A, then move $x_i$ from set B to set A, where $\varepsilon$ is a hyperparameter representing the similarity threshold; otherwise delete $x_i$ from set B;

4) repeat step 3) until $B = \emptyset$ or the number of microblogs in set A reaches the expected abstract length.
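The redundancy removal loop can be sketched as below. Several details are assumptions: the acceptance test is taken to be "similarity to every microblog already in A is strictly below ε", the loop stops when B is empty or A is full, and `jaccard` is a hypothetical stand-in similarity (cosine similarity on TF-IDF vectors would match the rest of the method more closely).

```python
def jaccard(a, b):
    """Hypothetical token-set similarity used only for this illustration."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def select_abstract(posts, scores, epsilon, max_len, similarity=jaccard):
    """Greedy selection: return indices of the posts chosen for the abstract."""
    # Set B: candidate indices sorted by descending significance score.
    b = sorted(range(len(posts)), key=lambda i: scores[i], reverse=True)
    a = []  # Set A: indices chosen so far.
    while b and len(a) < max_len:
        i = b.pop(0)  # take the first (highest-scoring) candidate
        if all(similarity(posts[i], posts[s]) < epsilon for s in a):
            a.append(i)  # sufficiently different from everything already in A
        # otherwise the candidate is simply dropped
    return a
```

This follows the threshold rule as described in the steps, which differs from the classic MMR formulation that re-ranks candidates by a weighted difference of relevance and redundancy.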

The invention has the following beneficial technical effects:

1. The multi-modal manifold learning algorithm is improved so that it can be applied to multi-topic microblog text abstracts that have no topic sentence. Specifically, when there is no topic sentence, the first sentence is given the same status as the other microblog sentences: in the original algorithm $y = [y_0, y_1, y_2, \ldots, y_n]^{T}$ with $y_0 = 1$ and $y_i = 0$ for the remaining points; this is modified to $y_0 = y_i = 1$. The status of the first sentence is then fully equal to that of the other sentences, and the topic sentence disappears. The algorithm can therefore be applied to the microblog data set, the generation of the microblog abstract is not disturbed by "topic sentence" information, and the consistency and complementarity of information across multiple documents are taken into account.

2. Social network interaction information of each microblog, such as the praise count, comment count and forwarding count, is integrated into the abstract algorithm so as to obtain an abstract with high information coverage, novelty and summarization. When a user publishes a microblog, the amount of interaction from the user's friends and from other people who browse it reflects how much attention the microblog receives and how far its content is recognized. In general, the attention and recognition a post receives indicate, to some degree, its information coverage, novelty and summarization, and a text abstract aims precisely to select sentences that are strong in these respects. Integrating the interaction information into the algorithm can therefore yield an abstract with better information coverage, novelty and summarization.

In conclusion, because a user's microblog posts cover multiple topics, the method develops a microblog abstract generation algorithm that takes into account social recognition as well as the consistency and complementarity of information across different documents. An abstract with better information coverage, novelty and summarization is thereby obtained. This is the advantage of the present invention.

Drawings

FIG. 1 is a schematic process flow diagram of the present invention.

Detailed Description

Referring to FIG. 1, when generating a multi-topic text abstract, the method selects microblogs for the abstract according to both social recognition and microblog significance, and controls the similarity among the microblogs in the final abstract, so that the coverage, diversity and social recognition of the generated microblog abstract are all taken into account.

Considering the diversity of topics in a user's posts, the task can be regarded as a multi-document abstract task. Multi-modal manifold learning is a widely used multi-document abstract method that jointly considers the full-text topic, intra-document importance and inter-document importance. Because a user's posts carry no predefined topic sentence, the invention improves the multi-modal manifold learning algorithm so that it can be applied to microblog abstracts without topic sentences. Meanwhile, microblogs contain social network interaction information, such as the forwarding count, praise count and comment count of each microblog, and a post with high social recognition is more likely to belong in the abstract. The invention therefore designs a microblog abstract method that combines social network interaction information with multi-modal manifold learning, with the following specific steps:

1. Preparing data: because public microblog corpora and abstract corpora are lacking, the original data are user microblog data obtained through a public microblog API. In total 500 users were collected, with no more than 1000 microblogs per user. To protect user privacy, all user ids are replaced with numbers. The word frequencies of the nouns in all microblogs are counted, and the top n nouns are taken as hot topic words. Users are then screened by these topic words: if the number of a user's posts relating to the n topics exceeds k, the posts are retained, and the user's posts on each topic are merged into one sample. After the microblog set of a user on specific topics is obtained, the data are cleaned further: noisy information such as hashtags, @-mentions, URLs and the numbers at the tail of each microblog is removed first, and then microblogs containing fewer than m words are removed. The praise count, forwarding count and comment count of each microblog are extracted with regular expressions; a count that cannot be extracted is set to 0.
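The topic-word selection and user screening can be sketched as follows. The sketch assumes the nouns of each post have already been extracted by an upstream tokenizer and part-of-speech tagger, and it reads "exceeds k" as "the user has more than k posts that mention at least one hot topic word"; both points, and the function names, are assumptions.

```python
from collections import Counter, defaultdict


def hot_topic_words(posts_nouns, n):
    """Return the n most frequent nouns over all posts."""
    counts = Counter(noun for nouns in posts_nouns for noun in nouns)
    return [word for word, _ in counts.most_common(n)]


def group_user_posts(user_posts, topics, k):
    """Group one user's posts by topic; return {} if the user is screened out.

    user_posts is a list of (text, nouns) pairs for a single user.
    """
    by_topic = defaultdict(list)
    related = 0  # number of posts that mention at least one topic word
    for text, nouns in user_posts:
        hits = [t for t in topics if t in nouns]
        if hits:
            related += 1
        for t in hits:
            by_topic[t].append(text)  # a post may land in several topics
    return dict(by_topic) if related > k else {}
```

Each value of the returned dictionary corresponds to one sample, that is, all of a user's posts on one topic.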

2. The main idea of multi-modal manifold learning in text abstracts is as follows: the sentence relations in a multi-document abstract can be divided into relations within the same document and relations between texts in different documents, which reflect, respectively, how well a sentence covers the information of its own document and of the full text. Because these two relations differ, the relation between two sentences can be represented by two matrices. The two kinds of information are combined to obtain the final microblog significance, and the several sentences with the highest significance are selected as the abstract.

Encoding the microblog relation matrices: let $W^a$ denote the relation matrix between sentences within the same document, where the $x_i$ denote all sentences in the documents (the data points to be ranked) and each $x_i$ is the TF-IDF encoding of a single microblog. If microblogs $i$ and $j$ belong to the same topic, $W^a_{ij}$ is the cosine similarity of $x_i$ and $x_j$; otherwise $W^a_{ij} = 0$. Similarly, let $W^b$ denote the text relation matrix across documents: if $i$ and $j$ belong to different documents, or if one of $i$, $j$ is 0, $W^b_{ij}$ is the cosine similarity of $x_i$ and $x_j$; otherwise $W^b_{ij} = 0$. $W^a$ and $W^b$ are then normalized into $S_a$ and $S_b$ by

$$S_x = (D^x)^{-1/2}\, W^x\, (D^x)^{-1/2}, \quad x \in \{a, b\},$$

where $D^x$ is the diagonal matrix whose $(i, i)$ entry is the sum of the $i$-th row of $W^x$.
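The symmetric normalization can be sketched in plain Python on a list-of-lists matrix. The function name is illustrative, and leaving an all-zero row at zero (rather than dividing by zero) is an assumption made for isolated microblogs.

```python
import math


def normalize(w):
    """Return S = D^(-1/2) W D^(-1/2), where D_ii is the sum of row i of W."""
    degrees = [sum(row) for row in w]
    # Assumption: a row with zero degree stays all-zero instead of dividing by 0.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    n = len(w)
    return [
        [inv_sqrt[i] * w[i][j] * inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]
```

When W is symmetric, S is symmetric as well, which is the property the closed-form and iterative solutions of the ranking loss rely on.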

The microblog significance: the sentence significance score is first calculated in each modality, and the information of the two modalities is then combined. The calculation formulas are as follows:

$$Q_a(f) = \mu \cdot f^{T}(I - S_a)f + (1-\mu)\cdot(f-y)^{T}(f-y)$$

$$Q_b(f) = \eta \cdot f^{T}(I - S_b)f + (1-\eta)\cdot(f-y)^{T}(f-y)$$

$Q_a(f)$ and $Q_b(f)$ are the loss functions, $I$ is the identity matrix, and $f$ is a vector holding the significance score of each sentence. When $Q_a(f)$ reaches its minimum, i.e. when $f = f_a^{*}$, $f$ accounts for the sentence relations within the same document. When $Q_b(f)$ reaches its minimum, i.e. when $f = f_b^{*}$, $f$ accounts for the sentence relations across different documents. $y = [y_0, y_1, y_2, \ldots, y_n]^{T}$ is defined over a given set of data points, in which the first point is the topic description sentence and the remaining $n$ points are all sentences in the documents (the data points to be ranked). $\mu$ and $\eta$ are the information smoothness constraints that balance the topic information against the text information. $\lambda$ is the information smoothness constraint between the two modalities of the same document and different documents.
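One way to minimize either loss without a matrix inverse is the fixed-point iteration f ← w·S·f + (1 − w)·y, whose fixed point satisfies (I − w·S)f = (1 − w)·y. The sketch below assumes S is symmetrically normalized and 0 < w < 1, with w standing for μ or η. The text defines λ but does not show how the two modalities are combined, so the linear fusion in `fuse` is an assumption, and all function names are illustrative.

```python
def manifold_rank(s, y, weight, iters=200, tol=1e-10):
    """Minimise weight*f^T(I-S)f + (1-weight)*(f-y)^T(f-y) by iteration."""
    f = list(y)
    for _ in range(iters):
        new = [
            weight * sum(sij * fj for sij, fj in zip(row, f)) + (1 - weight) * yi
            for row, yi in zip(s, y)
        ]
        if max(abs(a - b) for a, b in zip(new, f)) < tol:
            return new  # converged
        f = new
    return f


def fuse(f_a, f_b, lam):
    """Assumed linear fusion: f* = lam * f_a* + (1 - lam) * f_b*."""
    return [lam * a + (1 - lam) * b for a, b in zip(f_a, f_b)]
```

With the modified prior in which every entry of y equals 1, no sentence is privileged as a topic sentence and the scores are driven by the graph structure alone.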

Social recognition: if a microblog has more forwards, praises and comments than other microblogs, its relative recognition within the document is considered higher than that of the others. The calculation formula is:

$$R_i = \alpha \cdot c_i + \beta \cdot re_i + \gamma \cdot l_i$$

The social recognition and the microblog significance information are then combined, and the final microblog significance is:

$$\mathrm{RankScore} = \omega \cdot f^{*} + (1-\omega)\cdot R$$

$\omega$ is a hyperparameter representing the smoothness constraint between the two modalities of social recognition and microblog significance, $R$ is the social recognition, and RankScore is the final microblog significance.

Redundancy penalty strategy: to keep the redundancy of the selected abstract as small as possible, our MMR (maximal marginal relevance) strategy is as follows:

1) Initialize the sets $A = \emptyset$ and $B = \{x_i \mid i = 1, 2, \ldots, n\}$, where A is the set that stores the abstract microblogs, B is the set of candidate microblogs sorted by microblog significance score, $x_i$ is the $i$-th microblog, and $n$ is the total number of microblogs. The significance score of each microblog is calculated by the formula $S_i = \mathrm{RankScore}_i$.

2) Sort the microblogs in set B by significance score.

3) Take the first element $x_i$ from set B. If $x_i$ satisfies

$$\operatorname{sim}(x_i, s) < \varepsilon \quad \text{for all } s \in A,$$

where $s$ denotes a microblog in set A, then move $x_i$ from set B to set A, where $\varepsilon$ is a hyperparameter representing the similarity threshold. Otherwise delete $x_i$ from set B.

4) Repeat step 3) until $B = \emptyset$ or the number of microblogs in set A reaches the expected abstract length.

Claims

1. The microblog abstract generation method based on the multi-mode manifold learning and social network characteristics is characterized by comprising the following steps of:

relation matrix within the same document, $W^a$:

if microblogs $i$ and $j$ belong to the same topic, then $W^a_{ij}$ is the cosine similarity of $x_i$ and $x_j$; otherwise $W^a_{ij} = 0$, where $x_i$ is the TF-IDF encoding of a single microblog;

text relation matrix across documents, $W^b$:

if $i$ and $j$ belong to different documents, or if one of $i$, $j$ is 0, then $W^b_{ij}$ is the cosine similarity of $x_i$ and $x_j$; otherwise $W^b_{ij} = 0$;

step three, calculating the microblog significance from the matrices of step two by the following formulas:

$$Q_a(f) = \mu \cdot f^{T}(I - S_a)f + (1-\mu)\cdot(f-y)^{T}(f-y)$$

$$Q_b(f) = \eta \cdot f^{T}(I - S_b)f + (1-\eta)\cdot(f-y)^{T}(f-y)$$

where $Q_a(f)$ and $Q_b(f)$ are the loss functions, $I$ is the identity matrix, and $f$ is a vector holding the significance score of each sentence; when $Q_a(f)$ reaches its minimum at $f = f_a^{*}$, $f$ accounts for the sentence relations within the same document; when $Q_b(f)$ reaches its minimum at $f = f_b^{*}$, $f$ accounts for the sentence relations across different documents; $y = [y_0, y_1, y_2, \ldots, y_n]^{T}$ is defined over a given set of data points, in which the first point is the topic description sentence and the remaining $n$ points are all sentences in the documents; $\mu$ and $\eta$ are the information smoothness constraints that balance the topic information against the text information, respectively; and $\lambda$ is the information smoothness constraint between the two modalities of the same document and different documents;

and step five, integrating the microblog significance and the social identity information to obtain the final microblog significance, and selecting a plurality of sentences with the highest microblog significance under the MMR strategy to be abstracts in consideration of redundancy.

2. The method for generating the microblog digest based on the multi-modal manifold learning and social network features according to claim 1, wherein the method comprises the following steps: the step of acquiring the microblog set of the specific topic of the user comprises the steps of counting word frequencies of nouns in all acquired microblog texts, screening top n topic nouns as hot topic words, screening the user through prior topic words, if the speeches published by the user relate to the n topics and exceed k, reserving the speeches, and integrating the speeches of the user on each class into a sample.

3. The method for generating the microblog digest based on the multi-modal manifold learning and social network features according to claim 2, wherein the method comprises the following steps: the method further comprises the step of cleaning the microblog set of the specific topic, specifically, removing hashtags, @-mentions, URLs and the numbers at the tail of each microblog, and removing microblogs containing fewer than m words.

4. The method for generating the microblog digest based on the multi-modal manifold learning and social network features according to claim 1, wherein the method comprises the following steps: the user interaction information comprises the number of praise, forwarding and comment of the user microblog, is extracted through a regular expression, and is set to be 0 if the user interaction information is not extracted.

5. The method for generating the microblog digest based on the multi-modal manifold learning and social network features according to claim 1, wherein the method comprises the following steps: step four, the calculation formula of the social recognition degree is

R_i＝α·c_i+β·re_i+γ·l_i

6. The method for generating the microblog abstract based on the multi-modal manifold learning and the social network features according to any one of claims 1 to 5, wherein the method comprises the following steps: the final microblog significance is

RankScore＝ω·f^*+(1-ω)·R

ω is an adjustable hyper-parameter, where 0 < ω < 1, R represents the social recognition, and RankScore represents the final microblog significance.

7. The method for generating the microblog digest based on the multi-modal manifold learning and social network features according to claim 6, wherein the method comprises the following steps: the method further comprises a redundancy removing step which specifically comprises the following steps:

1) initializing set A and set B: $A = \emptyset$, $B = \{x_i \mid i = 1, 2, \ldots, n\}$, wherein A represents the set for storing the abstract microblogs, B represents the set of candidate microblogs sorted by microblog significance score, $x_i$ represents the $i$-th microblog, and $n$ represents the total number of microblogs, the significance score of each microblog being calculated by $S_i = \mathrm{RankScore}_i$;

2) sorting the microblogs in set B by significance score;

3) taking the first element $x_i$ from set B; if $x_i$ satisfies

$$\operatorname{sim}(x_i, s) < \varepsilon \quad \text{for all } s \in A,$$

wherein $s$ represents a microblog in set A, then moving $x_i$ from set B to set A, wherein $\varepsilon$ is a hyperparameter representing the similarity threshold; otherwise deleting $x_i$ from set B;

4) repeating step 3) until $B = \emptyset$ or the number of microblogs in set A reaches the expected abstract length.