CN111694955A

CN111694955A - Early dispute message detection method and system for social platform

Info

Publication number: CN111694955A
Application number: CN202010382894.0A
Authority: CN
Inventors: 曹娟; 卢名彦; 谢添; 刘浩远; 郭俊波
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2020-05-08
Filing date: 2020-05-08
Publication date: 2020-09-22
Anticipated expiration: 2040-05-08
Also published as: CN111694955B

Abstract

The invention provides a method and a system for detecting early dispute messages on a social platform. The method comprises: collecting all messages under a preset topic on a social platform, labeling each message as disputed or not according to its comment information, extracting multi-dimensional dispute features of the labeled messages as training data, and training a gradient boosting tree model on the training data to obtain a dispute message detection model; and acquiring a message to be published from the social platform as a message to be detected, and inputting the multi-dimensional dispute features of the message to be detected into the dispute message detection model to obtain a dispute message detection result for that message. The invention can thus obtain the degree of dispute of messages awaiting review for publication on the social platform.

Description

Early dispute message detection method and system for social platform

Technical Field

The invention relates to the field of big data analysis and the technical field of information mining, in particular to an early dispute message detection method and system for a social platform.

Background

With the rapid development of the Internet and the wide use of various network communication tools, the ways in which people socialize have changed profoundly. Social media, represented by microblogs and Twitter, have become important channels through which the public acquires information, thanks to characteristics such as openness and immediacy. The rich content of social media leads users to prefer it as a source of information, and user participation in turn enriches that content, forming a virtuous cycle.

The rapid development of social media and the virtuous cycle it embodies make it convenient for people to acquire information and socialize, but social media still has many problems. At present, disputed messages appear on social media one after another and may arise in any field. Such messages often involve multiple opposing sides holding different viewpoints, each forming a group that engages in heated debate over the disputed message. The spread and escalation on the Internet of controversial topics such as Brexit in the United Kingdom and the presidential election in the United States has led to isolation and misunderstanding between different social strata. Some seriously disputed messages even endanger national ideological security and urgently need to be supervised. Timely detection is therefore required before a disputed message escalates, to prevent the situation from deteriorating further. The invention provides an early dispute message detection method based on a microblog platform, which aims to predict whether a message will trigger disputed discussion when it has just been published and has not yet received any comments.

Through investigation, no mature early dispute message detection method exists at present.

Disclosure of Invention

The present invention aims to address the detection of early dispute messages. Specifically, the invention provides a method for detecting an early dispute message of a social platform, which comprises the following steps:

step 1, collecting all messages under a preset topic on a social platform, labeling each message as disputed or not according to its comment information, extracting multi-dimensional dispute features of the labeled messages as training data, and training a gradient boosting tree model on the training data to obtain a dispute message detection model;

step 2, acquiring a message to be published from the social platform as a message to be detected, and inputting the multi-dimensional dispute features of the message to be detected into the dispute message detection model to obtain a dispute message detection result for the message to be detected.

In the method for detecting early dispute messages of the social platform, step 1 comprises:

step 11, collecting hot topics published within a preset time period, collecting all messages and comments under the hot topics by using a web crawler, assigning a label to each message according to the degree of dispute among the viewpoints contained in its comments, extracting multi-dimensional dispute features of each message, and combining them with the label of each message to obtain the training data for training the gradient boosting tree model.

In the method for detecting early dispute messages of the social platform, the multi-dimensional dispute features comprise:

the number of microblogs posted by the user who published the message, and/or the number of followers of that user, and/or the number of accounts followed by that user, and/or the number of characters in the message, and/or the number of words in the message, and/or the number of commas in the message, and/or the number of exclamation marks in the message, and/or the number of periods in the message, and/or the number of question marks in the message, and/or the number of ellipses in the message, and/or the ratio of the number of exclamation marks to the number of characters in the message, and/or the ratio of the number of periods to the number of characters in the message, and/or the ratio of the number of question marks to the number of characters in the message, and/or the ratio of the number of ellipses to the number of characters in the message, and/or the average word length of the message, and/or the length of the longest run of exclamation marks in the message, and/or the length of the longest run of question marks in the message, and/or the length of the longest run of commas in the message, and/or the length of the longest run of periods in the message, and/or the length of the longest run of ellipses in the message, and/or the number of pronouns in the message, and/or the number of quantifiers in the message, and/or the number of Arabic numerals in the message, and/or the number of negation words in the message and their ratio to the number of words in the message, and/or the number of strong-degree words in the message and their ratio to the number of words in the message, and/or the number of weak-degree words in the message and their ratio to the number of words in the message, and/or the number of uncertain-degree words in the message and their ratio to the number of words in the message, and/or the number of transition words in the message and their ratio to the number of words in the message, and/or the numbers of first-, second- and third-person pronouns in the message and their ratios to all pronouns in the message, and/or the number of person, place and organization names in the message and their ratio to all words in the message, and/or the sentiment polarity and sentiment value of the message.

In the method for detecting early dispute messages of the social platform, step 2 comprises:

step 21, the dispute message detection model scores the message to be detected according to its multi-dimensional dispute features, and a message to be detected whose score is higher than a threshold is selected as a dispute message.

The method for detecting the early dispute message of the social platform is characterized in that the social platform is a microblog platform.

The invention also provides a system for detecting early dispute messages of a social platform, which comprises:

module 1, which collects all messages under a preset topic on a social platform, labels each message as disputed or not according to its comment information, extracts multi-dimensional dispute features of the labeled messages as training data, and trains a gradient boosting tree model on the training data to obtain a dispute message detection model;

module 2, which acquires a message to be published from the social platform as a message to be detected, and inputs the multi-dimensional dispute features of the message to be detected into the dispute message detection model to obtain a dispute message detection result for the message to be detected.

The early dispute message detection system of the social platform, wherein the module 1 comprises:

module 11, which collects hot topics published within a preset time period, collects all messages and comments under the hot topics by using a web crawler, assigns a label to each message according to the degree of dispute among the viewpoints contained in its comments, extracts multi-dimensional dispute features of each message, and combines them with the label of each message to obtain the training data for training the gradient boosting tree model.

In the early dispute message detection system of the social platform, the multi-dimensional dispute features comprise the same features as enumerated above for the method.

The early dispute message detection system of the social platform, wherein the module 2 comprises:

module 21, in which the dispute message detection model scores the message to be detected according to its multi-dimensional dispute features, and a message to be detected whose score is higher than a threshold is selected as a dispute message.

The early dispute message detection system of the social platform is characterized in that the social platform is a microblog platform.

As the above scheme shows, the advantage of the invention is that the degree of dispute of messages awaiting review for publication on the social platform can be obtained.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The invention comprises the following steps. Candidate message collection: hot topics are identified manually, and a web crawler collects the microblog content under these topics as candidate messages for detection. Multi-dimensional dispute feature extraction: for each collected microblog text, a number of features are extracted along two dimensions, the user and the text, for dispute message detection. Early dispute message detection: the extracted multi-dimensional features are classified using a supervised learning method; each message is scored by a pre-trained scoring model, and messages whose score is higher than a certain threshold are selected as dispute messages.

In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

First, candidate message collection

All microblogs under the hot topics appearing on social media are collected and monitored to serve as candidate microblogs for early dispute message detection.

Second, multi-dimensional dispute feature extraction

After the candidate messages are collected, multi-dimensional dispute features are extracted for each candidate message from two aspects, the user and the message text, as listed below:

1. Number of microblogs posted by the user who published the message

2. Number of followers of the user who published the message

3. Number of accounts followed by the user who published the message

4. Number of characters in the message

5. Number of words in the message

6. Number of commas in the message

7. Number of exclamation marks in the message

8. Number of periods in the message

9. Number of question marks in the message

10. Number of ellipses in the message

11. Ratio of the number of exclamation marks to the number of characters in the message

12. Ratio of the number of periods to the number of characters in the message

13. Ratio of the number of question marks to the number of characters in the message

14. Ratio of the number of ellipses to the number of characters in the message

15. Average word length of the message

16. Length of the longest run of exclamation marks in the message

17. Length of the longest run of question marks in the message

18. Length of the longest run of commas in the message

19. Length of the longest run of periods in the message

20. Length of the longest run of ellipses in the message

21. Number of pronouns in the message

22. Number of quantifiers in the message

23. Number of Arabic numerals in the message

24. Number of negation words in the message and their ratio to the number of words in the message

25. Number of strong-degree words in the message and their ratio to the number of words in the message

26. Number of weak-degree words in the message and their ratio to the number of words in the message

27. Number of uncertain-degree words in the message and their ratio to the number of words in the message

28. Number of transition words in the message and their ratio to the number of words in the message

29. Numbers of first-, second- and third-person pronouns in the message and their ratios to all pronouns in the message

30. Number of person, place and organization names in the message and their ratio to all words in the message

31. Sentiment polarity and sentiment value of the message, calculated using the TengSen sentiment value calculation API
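As an illustration only, a few of the user and punctuation features listed above (items 1 to 9, 11 to 13 and 16 to 19) can be sketched in plain Python. The function names, the dictionary keys, the example user counts, and the choice to treat half-width and full-width marks as equivalent are all assumptions made for this sketch and are not taken from the patent. Ellipsis features, word-level features and the sentiment API are omitted because they depend on tokenization and external services that the text does not specify.

```python
def longest_run(text, marks):
    """Length of the longest run of consecutive characters drawn from `marks`."""
    best = current = 0
    for ch in text:
        current = current + 1 if ch in marks else 0
        best = max(best, current)
    return best


# Half-width and full-width forms of each mark (an assumption about the input).
MARKS = {"comma": ",，", "exclamation": "!！", "period": ".。", "question": "?？"}
# The feature list gives ratios for these marks but not for commas.
RATIO_MARKS = {"exclamation", "period", "question"}


def extract_features(text, user):
    """Build a partial feature dictionary from message text and user counts.

    `user` is a hypothetical dict with keys n_posts, n_followers, n_following.
    """
    n_chars = len(text)
    feats = {
        "n_posts": user["n_posts"],
        "n_followers": user["n_followers"],
        "n_following": user["n_following"],
        "n_chars": n_chars,
    }
    for name, chars in MARKS.items():
        count = sum(1 for ch in text if ch in chars)
        feats["n_" + name] = count
        feats["longest_" + name] = longest_run(text, chars)
        if name in RATIO_MARKS:
            feats["ratio_" + name] = count / n_chars if n_chars else 0.0
    return feats


# Invented example message and user counts, for illustration only.
example = extract_features(
    "Really?! No way!!! Is it true??",
    {"n_posts": 120, "n_followers": 45, "n_following": 80},
)
```

Here `example["n_exclamation"]` is 4 and `example["longest_exclamation"]` is 3, since the longest unbroken sequence of exclamation marks is "!!!".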

Third, early dispute message detection

Early dispute detection refers to determining, when a message has just been published and has not yet received any comments, whether it will trigger disputed discussion; a message that triggers disputed discussion is a dispute message, and otherwise it is a non-dispute message. After the multi-dimensional dispute features are extracted from the candidate messages, the invention uses the Light GBT gradient boosting tree model to score them, thereby achieving early detection of dispute messages. The gradient boosting tree model therefore needs to be trained in advance.

Training the gradient boosting tree model requires positive and negative samples, which this patent collects as follows.

1. First, hot topics published more than one month ago are collected, so that each topic has had sufficient time to develop and accumulate enough comments.

2. All messages and comments on these trending topics are collected using a web crawler.

3. Whether each message is a dispute message is judged by manually reviewing the collected messages and comments. If the comments on a message contain both supporting and opposing opinions, and the two are present in comparable proportions, the message is a dispute message; otherwise, it is a non-dispute message. The result of the manual review is the true label of the message.

4. The multi-dimensional dispute features of each message are extracted and combined with the true labels obtained by manual review to obtain the positive and negative samples for training the gradient boosting tree.
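The labeling rule in item 3 is applied by human reviewers, but its logic can be sketched in code to make the criterion concrete. The sketch below assumes each comment has already been manually tagged with a stance string ("support", "oppose", or anything else for neutral). The function name and the cutoff `min_minority_share`, which stands in for "comparable proportions", are hypothetical; the patent does not give a numeric value.

```python
from collections import Counter


def label_message(stances, min_minority_share=0.3):
    """Return 1 (dispute) or 0 (non-dispute) from per-comment stance tags.

    `min_minority_share` is an assumed cutoff for "comparable proportions":
    the smaller camp must hold at least this share of the stance-bearing
    comments.
    """
    counts = Counter(s for s in stances if s in ("support", "oppose"))
    support, oppose = counts["support"], counts["oppose"]
    total = support + oppose
    if total == 0:
        return 0  # no stance-bearing comments, so no observable dispute
    minority_share = min(support, oppose) / total
    return 1 if minority_share >= min_minority_share else 0


# Invented stance tags, for illustration only.
balanced = label_message(["support", "oppose", "support", "oppose", "neutral"])
one_sided = label_message(["support"] * 9 + ["oppose"])
```

With these inputs, `balanced` is 1 because the two camps are evenly split, while `one_sided` is 0 because the opposing camp holds only 10% of the stance-bearing comments.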

The gradient boosting tree model is trained on the positive and negative samples thus obtained; the trained model then scores all candidate messages, and messages whose score is higher than a certain threshold are selected as dispute messages.
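The train, score and threshold flow described above can be illustrated with a deliberately small, pure-Python gradient boosting classifier built from one-split regression stumps. This is a toy stand-in written for this illustration, not the gradient boosting tree library used by the invention. The feature vectors, labels, number of rounds, learning rate and the 0.5 threshold are all invented for the sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Find the single-feature split that minimizes squared error on residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lmean) ** 2 for r in left)
            sse += sum((r - rmean) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    if best is None:  # every feature is constant: fall back to a constant stump
        mean = sum(residuals) / len(residuals)
        return (0, 0.0, mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, thr, left_value, right_value = stump
    return left_value if row[j] <= thr else right_value


def train(X, y, n_rounds=20, lr=0.5):
    """Boost stumps on the negative gradient of the log-loss (y - p)."""
    stumps, raw = [], [0.0] * len(X)
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, raw)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        raw = [s + lr * stump_predict(stump, row) for s, row in zip(raw, X)]
    return {"lr": lr, "stumps": stumps}


def score(model, row):
    """Dispute score in (0, 1) for one feature vector."""
    raw = sum(model["lr"] * stump_predict(s, row) for s in model["stumps"])
    return sigmoid(raw)


# Invented training data: [n_exclamation, n_question] with 1 = dispute message.
X = [[0, 0], [1, 0], [0, 1], [3, 2], [4, 1], [2, 3]]
y = [0, 0, 0, 1, 1, 1]
model = train(X, y)

THRESHOLD = 0.5  # assumed value; the text only says "a certain threshold"
candidates = [[5, 0], [0, 4]]
flagged = [row for row in candidates if score(model, row) > THRESHOLD]
```

In practice the threshold would be tuned on held-out labeled data, since raising it trades fewer false alarms for more missed dispute messages.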

The following are system examples corresponding to the above method examples, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the above-described embodiments.

Claims

1. A method for detecting early dispute messages of a social platform is characterized by comprising the following steps:

2. The method for detecting early dispute messages of a social platform as claimed in claim 1, wherein the step 1 comprises:

3. The method for detecting early disputed messages on a social platform as claimed in claim 1 or 2 wherein the multidimensional disputed characteristics comprise:

4. The method for detecting early dispute messages of a social platform according to claim 1 or 2, wherein the step 2 comprises:

5. The method for detecting early dispute messages of a social platform as claimed in claim 1 or 2, wherein the social platform is a microblog platform.

6. An early dispute message detection system of a social platform, comprising:

7. The system for early dispute message detection of a social platform as claimed in claim 1, wherein the module 1 comprises:

8. An early dispute message detection system for a social platform as claimed in claim 6 or 7 wherein the multi-dimensional dispute feature comprises:

9. An early dispute message detection system for a social platform as claimed in claim 6 or 7 wherein the module 2 comprises:

10. The system for early dispute message detection of a social platform of claim 6 or 7, wherein the social platform is a microblog platform.