CN110867250B - A social media self-harm behavior detection method based on strong robust feature selection - Google Patents

Info

Publication number
CN110867250B
Authority
CN
China
Prior art keywords
post
self
posts
social media
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911033392.0A
Other languages
Chinese (zh)
Other versions
CN110867250A (en
Inventor
罗敏楠
董怡翔
郑庆华
秦涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201911033392.0A
Publication of CN110867250A
Application granted
Publication of CN110867250B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a social media self-harm behavior detection method based on strongly robust feature selection, comprising: 1) acquiring multi-dimensional heterogeneous information from online social media websites; 2) extracting features from the data in four aspects (text, user, time and picture) and constructing a self-harm content dataset and a normal content dataset; 3) constructing a supervised self-harm detection model based on strongly robust feature selection, using an $l_{2,1}$-norm loss function and regularization term; 4) extracting features from the target data to be detected and performing self-harm detection with the constructed model. Compared with traditional self-harm detection, the disclosed social-media-oriented method can reach self-harming individuals more widely, explore their behavior patterns in greater depth, and discover self-harm behavior more efficiently and promptly, which gives it an advantage in practical application.

Description

A social media self-harm behavior detection method based on strong robust feature selection

Technical Field

The invention belongs to the field of social media data mining, and in particular relates to a social media self-harm behavior detection method based on strongly robust feature selection.

Background

In recent years, self-harm has become a major challenge in public health. Detecting self-harm in society promptly and effectively is a practical requirement for meeting this challenge. Traditional detection strategies, which rely on the self-harming individuals themselves and on their family and friends, are difficult to carry out and inefficient, so a new detection strategy is urgently needed. As online social media has become widespread, more and more people post their thoughts and record their lives there, which makes it possible to detect self-harm behavior through social media. Compared with traditional methods, social-media-based detection can find more self-harm behavior, and find it more efficiently.

A large body of work already studies the health of Internet users using various kinds of online data, mainly from social media. One prior-art method detects psychological stress from heart rate and microblog posts in order to find a target individual's stress intervals and stressor events. First, it detects abnormalities in the individual's heart rate, which reflect the level of nervous-system tension during the test period. Next, it detects abnormal intervals in the individual's microblog activity, that is, abnormalities in how often the user publishes positive microblogs during the test period. Finally, it matches the heart-rate abnormalities against the posting abnormalities to determine the stress intervals, and uses the microblog data to find the stressor events.

Another prior-art method gives early warning of psychological crises among social media users. First, it acquires the text data published by users on social media and preprocesses it into a dataset of words. Next, it quantifies the sentiment of the text from word-frequency statistics of negative words, yielding a sentiment feature vector for the text each user has published. Finally, it feeds the feature vector into a neural network to obtain the user's negative-emotion intensity and rates the user's psychological state.

Both of these social-media-based methods use only homogeneous information sources and do not exploit the rich heterogeneous information available on social media for comprehensive data mining. Their data mining algorithms are also too simple: they can neither fully mine the valuable information in social media data nor cope with the noisy, complex data encountered in practice.

Summary of the Invention

The purpose of the invention is to provide a social media self-harm behavior detection method based on strongly robust feature selection, so as to solve the above problems.

To achieve this purpose, the invention adopts the following technical solution:

A social media self-harm behavior detection method based on strongly robust feature selection, comprising the following steps:

Step 1, social media data collection: taking the historical data of online social media websites as the data source, acquire the text information, user behavior information, time information and picture information of self-harm related posts and non-self-harm posts, obtaining a post set consisting of a number of posts; the post set consisting of n posts is denoted $P=\{p_1,p_2,\dots,p_n\}$;

Step 2, data feature extraction and dataset construction: for each post $p_i$ ($i=1,2,\dots,n$) obtained in data collection, extract the features of its 4 heterogeneous information sources to obtain the post feature vector $f_i=\{w_i,u_i,t_i,p_i\}$, where $w_i$ denotes the text features, $u_i$ the user behavior features, $t_i$ the time features of the post, and $p_i$ the picture features of the post; from these, construct a self-harm post dataset and a normal post dataset respectively;

Step 3, establishing the self-harm detection model: draw training samples from the datasets constructed in step 2, and construct and train a supervised self-harm detection model based on the objective function of strongly robust feature selection;

Step 4, self-harm content detection: for a target post p to be detected, construct its feature vector f using the feature extraction method of step 2, then input f into the detection model trained in step 3, which performs feature selection and at the same time judges whether the post is self-harm related.

Further, in the social media data collection of step 1, topic crawling of self-harm related posts and non-self-harm posts is carried out through the tag information of social media posts, using a web crawler or the application programming interface provided by the social media platform. The main content acquired for each post includes:

(1) Text information: the title of the post, its list of hashtags, its body text, and the text of all its comments;

(2) User behavior information: the posting user's total number of posts, the time the user joined the social media platform, the number of accounts the user follows, and the user's number of followers;

(3) Time information: the time the post was published and the time the pictures in the post were taken;

(4) Picture information: all pictures attached to the post.

Further, the feature extraction and dataset construction of step 2 mainly include:

(1) Text features: the part-of-speech distribution feature, obtained by computing the proportion of each part of speech in the text of each post; the readability feature, obtained by computing a readability index of the text with readability formulas from linguistics; the sentiment feature, obtained by using text sentiment analysis to judge whether the sentiment of the post is positive, neutral or negative; and the word vector representation, obtained by using a deep model to compute a vector representation of the text of each post. These features are expressed as $w=\{w_{ling},w_{read},w_{sent},w_{vec}\}$;

(2) User behavior features: the user's average posting volume, computed from the user's total number of posts and the time the user has been using the platform; the average reply rate of the user's posts, computed from the user's total number of posts and the number of posts that received replies; and the user's following count and follower count. These features can be expressed as $u=\{u_{post},u_{rep},u_{fol},u_{fan}\}$;

(3) Time features: each day is divided by hour into 24 time slots, and the slots containing the post's publication time and the attached picture's capture time are recorded. These features can be expressed as $t=\{t_{post},t_{pic}\}$;

(4) Picture features: the color pattern of the picture is characterized, and the color information is used to quantify the emotional dimensions of the picture; local features of the picture are extracted with image processing algorithms, and a neural network is used to produce a representation of the picture. These features can be expressed as $p=\{p_{col},p_{sent},p_{local},p_{net}\}$.
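As one illustration of the readability feature $w_{read}$ described above, the following is a minimal sketch of the Flesch reading-ease score (one of the formulas the embodiment names later). The function names and the vowel-group syllable counter are assumptions made for this example only; the patent does not specify how syllables, words or sentences are counted.

```python
import re


def count_syllables(word):
    """Crude syllable estimate (assumption): count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch reading ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0  # nothing to score
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))
```

Higher scores indicate easier text. A production system would likely use a pronunciation dictionary rather than the vowel-group heuristic.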

Further, establishing the self-harm detection model in step 3 uses an efficient and strongly robust feature selection method. First, $Y\in\{0,1\}^{n\times 2}$ denotes the label information available in the training data: for a post $p_i$ in $P$, $\{Y_{i1}=1,Y_{i2}=0\}$ means the post is a self-harm post, and conversely $\{Y_{i1}=0,Y_{i2}=1\}$ means the post is a normal post;

Then, $X\in\mathbb{R}^{n\times(l_1+l_2+l_3+l_4)}$ denotes the data matrix of the training data, where $l_i$ is the number of features extracted from the i-th heterogeneous information source;

Finally, strongly robust feature selection is achieved by using the $l_{2,1}$ norm for both the loss function and the regularization term; the supervised model trains a coefficient matrix $W\in\mathbb{R}^{(l_1+l_2+l_3+l_4)\times 2}$ that maps the data matrix X to the label matrix Y, by solving:

$$\min_{W}\ \|XW-Y\|_{2,1}+\alpha\|W\|_{2,1}$$

where $\alpha$ is the parameter of the regularization term. The specific training process is:

(1) Construct the matrix $A=[X\ \ \alpha I]\in\mathbb{R}^{n\times m}$, where $I\in\mathbb{R}^{n\times n}$ is the identity matrix and $m=n+l_1+l_2+l_3+l_4$; initialize the matrix $D\in\mathbb{R}^{m\times m}$ to the identity matrix, and set the termination threshold for training convergence to $\epsilon$;

(2) Compute $U=D^{-1}A^{T}(AD^{-1}A^{T})^{-1}Y$;

(3) Update the diagonal matrix D, whose diagonal elements are $d_{ii}=1/(2\|u_i\|_2)$, where $u_i$ is the i-th row of U;

(4) Form $W=(u_1,u_2,\dots,u_{m-n})$ from the first $m-n$ rows of U, and check whether the decrease of the objective function is less than $\epsilon$; if not, return to (2) and continue training; otherwise, stop training and save the coefficient matrix W.
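For reference, the $l_{2,1}$ norm used in the objective is the sum of the Euclidean norms of a matrix's rows. The block below states that standard definition and then sketches why steps (1) to (4) solve the objective. The derivation is not given in the text; it assumes $A=[X\ \ \alpha I]$ and introduces a hypothetical residual block $E=(Y-XW)/\alpha$ for illustration.

```latex
% Standard definition, for M with rows m_1, ..., m_r:
\|M\|_{2,1} \;=\; \sum_{i=1}^{r} \|m_i\|_2
           \;=\; \sum_{i=1}^{r} \sqrt{\sum_{j} M_{ij}^{2}}

% Assumed reformulation: stack U = [W; E] with E = (Y - XW)/\alpha, so that
% AU = XW + \alpha E = Y and \alpha\|U\|_{2,1} = \|XW - Y\|_{2,1} + \alpha\|W\|_{2,1}:
\min_{W}\ \|XW-Y\|_{2,1} + \alpha\|W\|_{2,1}
\quad\Longleftrightarrow\quad
\min_{U}\ \|U\|_{2,1} \quad \text{s.t.}\quad AU = Y

% For fixed D, step (2) is the closed-form minimizer of tr(U^T D U) s.t. AU = Y,
% and step (3) reweights each row by d_{ii} = 1/(2\|u_i\|_2).
```

Rows of W whose norm is driven toward zero correspond to features that the model effectively discards, which is how the regularization term performs feature selection.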

Further, in the self-harm content detection of step 4, the target post p to be detected is mapped through the learned coefficient matrix W to its indicator vector $y\in\mathbb{R}^2$; when $y_1>y_2$, the post is judged to be self-harm content; otherwise it is judged to be normal content, that is, non-self-harm content.

Compared with the prior art, the invention has the following technical effects:

(1) Because of trust issues, self-harming individuals find it hard to find someone reliable to turn to in the physical world, and so tend to confide and seek help on social media, which offers relative anonymity; this allows the invention to reach self-harming individuals more widely;

(2) Traditional research on self-harm suffers from small samples and long follow-up periods, whereas the large volume of social data accumulated through the spread of social media contains many self-harm cases; this allows the invention to explore and understand the behavior patterns of self-harming individuals in greater depth;

(3) Because self-harm is concealed, traditional discovery through the individual's relatives and friends is difficult and delayed, whereas the mathematical self-harm detection model built on social media data allows the invention to discover self-harm more promptly and effectively.

Description of the Drawings

Figure 1 is a block diagram of the social media self-harm behavior detection method based on strongly robust feature selection according to the invention.

Figure 2 is a flow chart of the data collection process.

Figure 3 is a flow chart of the feature analysis process.

Figure 4 is a flow chart of the detection model training process.

Figure 5 is a flow chart of the self-harm detection process.

Figure 6 is an example of self-harm related posts on social media.

Detailed Description

Embodiments of the invention are described in detail below with reference to the drawings and examples. The embodiments described here serve only to explain the invention and do not limit it. In addition, the technical features of the embodiments may be combined with one another where they do not conflict.

The implementation of the invention comprises a data collection process, a feature analysis process, a model building process and a self-harm detection process. Figure 1 is a block diagram of the social media self-harm behavior detection method based on strongly robust feature selection according to the invention.

1. Data collection process

Figure 6 shows an example of online social media data. The data acquisition procedure is as follows:

(1) Using crawler technology, perform topic crawling based on the tags of each social media post. When crawling self-harm related content, self-harm related tags such as "selfharm", "selfinjury" and "suicide" can be used; when crawling normal posts, the target pages can be crawled without any topic restriction and in excess of the required quantity;

(2) For self-harm related posts, the sets of posts returned by different tags may overlap, so the self-harm related posts are deduplicated. Then, to guard against users who do not self-harm but happen to post self-harm related content, users with fewer than 5 posts in the set of users to whom the self-harm posts belong are removed, together with their posts;

(3) For normal content posts, filter by tag to remove posts carrying self-harm related tags. Then sample randomly according to the required number of normal posts.
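The cleaning rules in (2) and (3) can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the post record layout (`id`, `user`, `tags` keys), the function names and the fixed random seed are all assumptions; only the example tags and the 5-post threshold come from the text.

```python
import random
from collections import Counter

SELF_HARM_TAGS = {"selfharm", "selfinjury", "suicide"}  # example tags from the text
MIN_POSTS_PER_USER = 5                                   # threshold from the text


def clean_self_harm_posts(posts):
    """Deduplicate by post id, then drop users with fewer than 5 remaining posts."""
    unique = {}
    for post in posts:
        unique.setdefault(post["id"], post)  # keep the first copy of each post
    per_user = Counter(p["user"] for p in unique.values())
    return [p for p in unique.values()
            if per_user[p["user"]] >= MIN_POSTS_PER_USER]


def clean_normal_posts(posts, sample_size, seed=0):
    """Remove posts carrying any self-harm tag, then sample the required number."""
    kept = [p for p in posts
            if not SELF_HARM_TAGS & {t.lower() for t in p["tags"]}]
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    return rng.sample(kept, min(sample_size, len(kept)))
```

Crawling "in excess of the required quantity" matters here: the tag filter shrinks the normal pool, so the sample can only reach the target size if the crawl over-collected.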

The above steps are shown in Figure 2 and yield the post set $P=\{p_1,p_2,\dots,p_n\}$. The crawled content, illustrated in Figure 6, includes:

(1) Text information: the title of the post, its list of hashtags, its body text, and the text of all its comments;

(2) User behavior information: the posting user's total number of posts, the time the user joined the social media platform, the number of accounts the user follows, and the user's number of followers;

(3) Time information: the time the post was published and the time the pictures in the post were taken;

(4) Picture information: all pictures attached to the post.

2. Feature analysis process

Feature analysis and extraction are performed on the posts obtained in the data collection process. For each post $p_i$ ($i=1,2,\dots,n$), the features of its 4 heterogeneous information sources are extracted to obtain the post feature vector $f_i=\{w_i,u_i,t_i,p_i\}$. The main steps are:

(1) Text feature extraction: for the part-of-speech distribution feature, compute the proportions of nouns, verbs, adjectives and adverbs in the text of each post, for example with a social-media-oriented text analysis tool such as CMUTweetTagger; for the readability feature, compute readability indices of the text using readability formulas from linguistics, such as the Flesch, Linsear Write, Fog and Dale-Chall readability formulas; for the sentiment feature, use text sentiment analysis to judge whether the sentiment of the post is positive, neutral or negative, which can be computed with the MPQA corpus; for the word vector representation, use the word2vec model to compute a vector representation of the text of each post. These features are expressed as $w_i=\{w_{ling},w_{read},w_{sent},w_{vec}\}$;

(2) User behavior feature extraction: compute the user's average posting volume from the user's total number of posts and the time the user has been using the platform; compute the average reply rate of the user's posts from the user's total number of posts and the number of posts that received replies; add the user's following count and follower count. These features can be expressed as $u_i=\{u_{post},u_{rep},u_{fol},u_{fan}\}$;

(3) Time feature extraction: divide each day by hour into 24 time slots and record the slots containing the post's publication time and the attached picture's capture time; each can be represented as a vector in $\{0,1\}^{24}$. These features can be expressed as $t_i=\{t_{post},t_{pic}\}$;

(4) Picture feature extraction: the color pattern of the picture is characterized in the cylindrical-coordinate HSV color space, giving the picture's hue, saturation and brightness. The color information is also used to quantify the emotional dimensions of the picture, with the calculation formula:

(formula image BDA0002250782010000071, not reproduced in this text)

In addition, the SURF, LBP and GIST algorithms from image processing are used to extract local features of the picture, and an AlexNet neural network pre-trained on the ImageNet dataset is used to extract features from the picture. These features can be expressed as $p_i=\{p_{col},p_{sent},p_{local},p_{net}\}$.

A flow chart of this process is shown in Figure 3. After this, the self-harm post dataset and the normal post dataset can be constructed respectively.
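The simpler features in this process can be sketched with the Python standard library alone. The following is an illustrative sketch under stated assumptions: the function names are hypothetical, the average posting volume is taken per day (the text does not give the unit), and the picture is summarized by its mean HSV values, which the text does not prescribe.

```python
import colorsys
from datetime import datetime


def user_behavior_features(total_posts, days_on_platform, posts_with_replies,
                           following, followers):
    """u = {u_post, u_rep, u_fol, u_fan}; per-day averaging is an assumption."""
    avg_posts = total_posts / max(days_on_platform, 1)
    reply_rate = posts_with_replies / total_posts if total_posts else 0.0
    return [avg_posts, reply_rate, following, followers]


def hour_one_hot(ts):
    """Encode the hour of a datetime as a vector in {0,1}^24."""
    vec = [0] * 24
    vec[ts.hour] = 1
    return vec


def time_features(posted_at, photo_taken_at):
    """t = {t_post, t_pic}: two concatenated 24-slot indicator vectors."""
    return hour_one_hot(posted_at) + hour_one_hot(photo_taken_at)


def mean_hsv(rgb_pixels):
    """Mean hue, saturation and value (each in [0, 1]) of (r, g, b) pixels in 0..255.

    Note: averaging hue arithmetically ignores that hue is circular; this is a
    simplification for illustration.
    """
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in rgb_pixels]
    n = len(hsv)
    return tuple(sum(px[k] for px in hsv) / n for k in range(3))
```

For example, a post published at 23:05 with a photo taken at 00:30 sets slot 23 of the first vector and slot 0 of the second.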

3. Model building process

The self-harm label information in the training dataset is defined as $Y\in\{0,1\}^{n\times 2}$: for a post $p_i$ in $P$, $\{Y_{i1}=1,Y_{i2}=0\}$ means the post is a self-harm post; conversely, $\{Y_{i1}=0,Y_{i2}=1\}$ means the post is a normal post. Using the data matrix $X\in\mathbb{R}^{n\times(l_1+l_2+l_3+l_4)}$ formed from the feature vectors of the training data ($l_i$ is the number of features extracted from the i-th heterogeneous information source), the supervised model based on strongly robust feature selection trains a coefficient matrix $W\in\mathbb{R}^{(l_1+l_2+l_3+l_4)\times 2}$ that maps the data matrix X to the label matrix Y, by solving:

$$\min_{W}\ \|XW-Y\|_{2,1}+\alpha\|W\|_{2,1}$$

where $\alpha$ is the parameter of the regularization term. The specific training process is:

(1) Construct the matrix $A=[X\ \ \alpha I]\in\mathbb{R}^{n\times m}$, where $I\in\mathbb{R}^{n\times n}$ is the identity matrix and $m=n+l_1+l_2+l_3+l_4$; initialize the matrix $D\in\mathbb{R}^{m\times m}$ to the identity matrix, and set the termination condition for training convergence to $\|XW-Y\|_{2,1}+\alpha\|W\|_{2,1}<\epsilon$;

(2) Compute $U=D^{-1}A^{T}(AD^{-1}A^{T})^{-1}Y$;

(3) Update the diagonal matrix D, whose diagonal elements are $d_{ii}=1/(2\|u_i\|_2)$, where $u_i$ is the i-th row of U;

(4) Form $W=(u_1,u_2,\dots,u_{m-n})$ from the first $m-n$ rows of U, and check whether the termination condition holds; if not, return to (2) and continue training; otherwise, stop training and save the coefficient matrix W.

A flow chart of the training process of the above supervised model is shown in Figure 4.
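The iteration in steps (1) to (4) can be sketched in NumPy as below. This is a sketch under stated assumptions, not the patent's implementation: it takes $A=[X\ \ \alpha I]$, stops when the decrease of the objective falls below `eps` (the criterion given in the summary) or after a hypothetical `max_iter` cap, and floors tiny row norms at a hypothetical `floor` value so that $D^{-1}$ stays positive. The function names are illustrative.

```python
import numpy as np


def l21_norm(M):
    """Sum of the Euclidean norms of the rows of M."""
    return float(np.linalg.norm(M, axis=1).sum())


def objective(X, Y, W, alpha):
    """||XW - Y||_{2,1} + alpha * ||W||_{2,1}."""
    return l21_norm(X @ W - Y) + alpha * l21_norm(W)


def train(X, Y, alpha=1.0, eps=1e-6, max_iter=100, floor=1e-8):
    n, d = X.shape
    A = np.hstack([X, alpha * np.eye(n)])   # (1) A = [X, alpha*I], shape n x (n+d)
    d_inv = np.ones(n + d)                  # diagonal of D^{-1}; D starts as identity
    W = np.zeros((d, Y.shape[1]))
    prev = np.inf
    for _ in range(max_iter):
        AD = A * d_inv                      # A @ D^{-1}: scale each column of A
        # (2) U = D^{-1} A^T (A D^{-1} A^T)^{-1} Y
        U = d_inv[:, None] * (A.T @ np.linalg.solve(AD @ A.T, Y))
        # (3) d_ii = 1 / (2 ||u_i||_2), stored as its inverse 2 ||u_i||_2
        d_inv = 2.0 * np.maximum(np.linalg.norm(U, axis=1), floor)
        W = U[:d]                           # (4) first m - n = d rows of U
        cur = objective(X, Y, W, alpha)
        if prev - cur < eps:                # stop once the objective stops decreasing
            break
        prev = cur
    return W
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical choice made for this sketch. After training, features whose row in W has a near-zero norm contribute little to the prediction and can be treated as unselected.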

4. Self-harm detection process

For a target post p to be detected, its feature vector f is constructed using the feature extraction method of the feature analysis process and then input into the detection model trained in the model building process, which judges whether the post is self-harm related. Mapping through the coefficient matrix W of the supervised self-harm detection model gives the indicator vector $y\in\mathbb{R}^2$ of the target post p. When $y_1>y_2$, the post is judged to be self-harm content; otherwise, it is judged to be normal content. A flow chart of the detection process is shown in Figure 5.
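The decision rule can be sketched in a few lines. The function name and the string labels are hypothetical, and the coefficient matrix in the usage example is a toy value, not a trained model.

```python
def detect(f, W):
    """Map feature vector f (length d) through W (d rows, 2 columns) to y = (y1, y2),
    then apply the rule: self-harm if y1 > y2, otherwise normal."""
    y1 = sum(fj * row[0] for fj, row in zip(f, W))
    y2 = sum(fj * row[1] for fj, row in zip(f, W))
    return "self-harm" if y1 > y2 else "normal"


# Toy usage with a hypothetical 2-feature coefficient matrix.
toy_W = [[1.0, 0.0],
         [0.0, 1.0]]
label = detect([0.9, 0.1], toy_W)
```

Note that a tie ($y_1=y_2$) falls into the "normal" branch, as the rule in the text specifies.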

Claims (1)

1. A social media self-disabling behavior detection method based on strong robustness feature selection is characterized by comprising the following steps:
step 1, social media data acquisition: taking historical data of a network social media website as a data source, and acquiring text information, user behavior information, time information and picture information of self-disabled related posts and non-self-disabled posts to obtain a post set consisting of a plurality of posts; post set composed of n posts
Figure FDA0003780948520000011
Step 2, data feature extraction and data set construction: for posts p obtained from data collection i Wherein i =1,2, ·, n; extracting the characteristics of 4 heterogeneous information sources to obtain a post characteristic vector f i ={w i ,u i ,t i ,p i In which w i Representing a text feature, u i Representing a user behavior feature, t i Representing the temporal characteristics of the post, p i Representing picture characteristics of the posts, and respectively constructing a self-disabled post data set and a normal post data set;
step 3, establishing a self-residual detection model: extracting training samples from the data set constructed in the step 2, and constructing and training a supervised self-residual detection model based on a target function selected by the strong robustness characteristics;
step 4, self-disabled content detection: constructing a feature vector f of a target post p to be detected according to the feature extraction method in the step 2, inputting the feature vector f into the detection model trained in the step 3 for feature selection, and judging whether the target post p is a self-mutilation related post;
in step 1, in social media data acquisition, topic crawling of self-harm related posts and non-self-harm posts is performed according to the tag information of different social media posts, using a web crawler or an application program interface provided by the social media platform, and the main content acquired for each post comprises:
(1) Text information: acquiring the title, the topic tag (hashtag) word list, the body text, and all comment texts of the post;
(2) User behavior information: acquiring the total number of posts of the posting user, the time at which the user joined the social media platform, and the user's following count and follower count;
(3) Time information: acquiring the publishing time of the post and the shooting time of the pictures in the post;
(4) Picture information: acquiring all pictures attached to the post;
in step 2, feature extraction and data set construction mainly comprise:
(1) Text features: the part-of-speech distribution feature $w_{ling}$, computed as the proportion of each part of speech in the text content of each post; the readability feature $w_{read}$, computed as a readability index of the text using readability formulas from linguistics; the sentiment feature $w_{sent}$, obtained by using text sentiment analysis to judge whether the sentiment of the post is positive, neutral or negative; and the word vector representation $w_{vec}$, computed as a vector representation of the text of each post using a deep model; the text features are expressed as $w = \{w_{ling}, w_{read}, w_{sent}, w_{vec}\}$;
(2) User behavior features: calculating the user's average posting volume $u_{post}$ from the user's total number of posts and the length of time the user has used the social media platform; calculating the average reply rate $u_{rep}$ of the user's posts from the user's total number of posts and the number of posts that received replies; together with the user's following count $u_{fol}$ and follower count $u_{fan}$, the user behavior features are expressed as $u = \{u_{post}, u_{rep}, u_{fol}, u_{fan}\}$;
(3) Time features: dividing each day into 24 time periods by hour, and recording the time period $t_{post}$ in which the post was published and the time period $t_{pic}$ in which the attached picture was taken; the time features are expressed as $t = \{t_{post}, t_{pic}\}$;
(4) Picture features: representing the color pattern $p_{col}$ of the picture, and using the color information to quantitatively analyze the emotional dimension $p_{sent}$ of the picture; extracting the local features $p_{local}$ of the picture using image processing algorithms, and representing the picture with a neural network as $p_{net}$; the picture features are expressed as $p = \{p_{col}, p_{sent}, p_{local}, p_{net}\}$;
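As a rough illustration of the two simplest feature groups above, the following Python sketch computes the time features and the user behavior features. It assumes the time period is the hour of day (0 to 23) and that the average posting volume is measured per day on the platform; neither unit is fixed by the text. The function and parameter names are hypothetical.

```python
from datetime import datetime


def time_features(posted_at: datetime, photo_taken_at: datetime) -> dict:
    """Map both timestamps to one of 24 hourly slots (assumed to be the hour, 0-23)."""
    return {"t_post": posted_at.hour, "t_pic": photo_taken_at.hour}


def user_behavior_features(total_posts: int, days_on_platform: float,
                           posts_with_replies: int, following: int, fans: int) -> dict:
    """Build u = {u_post, u_rep, u_fol, u_fan} from raw account statistics."""
    # Average posting volume: total posts divided by time on the platform
    # (per-day units are an assumption of this sketch).
    u_post = total_posts / days_on_platform if days_on_platform > 0 else 0.0
    # Average reply rate: share of the user's posts that received replies.
    u_rep = posts_with_replies / total_posts if total_posts > 0 else 0.0
    return {"u_post": u_post, "u_rep": u_rep, "u_fol": following, "u_fan": fans}


# Invented example values.
t = time_features(datetime(2019, 10, 28, 23, 15), datetime(2019, 10, 28, 2, 5))
u = user_behavior_features(total_posts=120, days_on_platform=60,
                           posts_with_replies=30, following=10, fans=20)
print(t)
print(u)
```

The text and picture features depend on external models (sentiment analysis, word embeddings, neural networks) and are left out of this sketch.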
in step 3, in establishing the self-harm detection model, an efficient and strongly robust feature selection method is used: first, $Y \in \mathbb{R}^{n \times 2}$ is used to represent the annotation information available in the training data, where, for a post $p_i$, when $\{Y_{i1} = 1, Y_{i2} = 0\}$ the post is a self-harm content post, and when $\{Y_{i1} = 0, Y_{i2} = 1\}$ the post is a normal post;
then, $X \in \mathbb{R}^{d \times n}$ is used to represent the data matrix of the training data, where $d = \sum_{i=1}^{4} d_i$ and $d_i$ is the number of features extracted from the i-th heterogeneous information source;
finally, strong robust feature selection is achieved by using a loss function and a regularization term that are both based on the $\ell_{2,1}$ norm; the constructed supervised model trains a coefficient matrix $W \in \mathbb{R}^{d \times 2}$ that maps the data matrix X to the annotation matrix Y, with the training objective:
$$\min_{W} \; \|X^{T} W - Y\|_{2,1} + \gamma \|W\|_{2,1}$$
where $\gamma$ is the parameter of the regularization term; the specific training process is:
(1) Constructing the matrix $A = [X^{T} \;\; \gamma I] \in \mathbb{R}^{n \times m}$, where $I \in \mathbb{R}^{n \times n}$ is the identity matrix and $m = n + d$; at the same time, initializing the matrix $D \in \mathbb{R}^{m \times m}$ as the identity matrix, and setting a termination threshold $\epsilon$ for convergence of the training process;
(2) Computing $U = D^{-1} A^{T} (A D^{-1} A^{T})^{-1} Y$;
(3) Updating the diagonal matrix D, whose diagonal elements are $D_{ii} = 1/(2\|u_i\|_2)$, where $u_i$ is the i-th row of U;
(4) Forming $W = (u_1, u_2, \ldots, u_{m-n})$ and judging whether the decrease of the objective function is less than $\epsilon$; if not, returning to step (2) to continue training; otherwise, ending training and saving the coefficient matrix W;
in step 4, in self-harm content detection, the target post p to be detected is mapped through the learned coefficient matrix W to obtain an indicator vector $y \in \mathbb{R}^2$; when $y_1 > y_2$, the post is judged to be self-harm content; otherwise, the post is judged to be normal content, where normal content is non-self-harm content.
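The training loop of step 3 can be sketched with NumPy as follows. This is an illustrative sketch, not the patented implementation. Because the formulas appear only as images on the source page, it assumes the commonly used iteratively reweighted scheme for joint ℓ2,1-norm minimization: A = [X^T, γI], U = D^{-1} A^T (A D^{-1} A^T)^{-1} Y, and D_ii = 1/(2‖u_i‖). The small `smooth` constant, the default parameter values, the function names and the toy data are additions of this sketch.

```python
import numpy as np


def l21_norm(M: np.ndarray) -> float:
    """l2,1 norm: the sum of the Euclidean norms of the rows of M."""
    return float(np.linalg.norm(M, axis=1).sum())


def train_coefficients(X, Y, gamma=0.5, eps=1e-6, max_iter=100, smooth=1e-8):
    """Iteratively reweighted sketch of the training loop.

    X: (d, n) data matrix, one column per post.
    Y: (n, 2) label matrix with rows (1, 0) or (0, 1).
    Returns the (d, 2) coefficient matrix W and the objective history.
    """
    d, n = X.shape
    m = n + d
    A = np.hstack([X.T, gamma * np.eye(n)])  # (1) A = [X^T, gamma * I], shape (n, m)
    d_inv = np.ones(m)                       # diagonal of D^{-1}; D starts as identity
    W = np.zeros((d, Y.shape[1]))
    history = []
    for _ in range(max_iter):
        AD = A * d_inv                       # A D^{-1}: scales column j by d_inv[j]
        # (2) U = D^{-1} A^T (A D^{-1} A^T)^{-1} Y, via a linear solve
        U = AD.T @ np.linalg.solve(AD @ A.T, Y)
        row_norms = np.linalg.norm(U, axis=1)
        # (3) D_ii = 1 / (2 ||u_i||); `smooth` avoids division by zero (assumption)
        d_inv = 2.0 * np.sqrt(row_norms ** 2 + smooth)
        W = U[:d]                            # (4) W is the first m - n = d rows of U
        history.append(l21_norm(X.T @ W - Y) + gamma * l21_norm(W))
        if len(history) > 1 and history[-2] - history[-1] < eps:
            break                            # objective decrease fell below eps
    return W, history


# Toy data: features 0 and 1 separate the two classes, feature 2 is noise.
X_toy = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 1],
                  [0.3, -0.2, 0.1, -0.1, 0.2, -0.3]], dtype=float)
Y_toy = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)

W, history = train_coefficients(X_toy, Y_toy)
scores = X_toy.T @ W
print(scores[:, 0] > scores[:, 1])       # per-post decision y1 > y2
print(np.linalg.norm(W, axis=1))         # row norms; small rows mark unselected features
```

Rows of W with near-zero norm correspond to features the model has effectively discarded, which is how the ℓ2,1 regularization term performs feature selection.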
CN201911033392.0A 2019-10-28 2019-10-28 A social media self-harm behavior detection method based on strong robust feature selection Active CN110867250B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911033392.0A CN110867250B (en) 2019-10-28 2019-10-28 A social media self-harm behavior detection method based on strong robust feature selection

Publications (2)

Publication Number Publication Date
CN110867250A (en) 2020-03-06
CN110867250B (en) 2022-10-25

Family

ID=69653491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911033392.0A Active CN110867250B (en) 2019-10-28 2019-10-28 A social media self-harm behavior detection method based on strong robust feature selection

Country Status (1)

Country Link
CN (1) CN110867250B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145524A (en) * 2017-04-12 2017-09-08 Tsinghua University Suicide risk detection method and system based on microblogs and fuzzy cognitive maps
CN109903851A (en) * 2019-01-24 2019-06-18 Jinan University An automatic observation technique for abnormal psychological changes based on social networks
CN110263620A (en) * 2019-05-06 2019-09-20 Hangzhou Dianzi University An age estimation method based on L2,1 partial label learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
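Latent Suicide Risk Detection on Microblog via Suicide-Oriented Word Embeddings and Layered Attention; Lei Cao et al.; arXiv:1910.12038v1; 2019-10-26; pp. 1-11 *
Suicidal ideation detection based on linguistic features of Chinese microblogs; Xu Lipeng et al.; Journal of North University of China (Natural Science Edition); 2019-06-11; Vol. 40, No. 4; pp. 350-357 *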

Also Published As

Publication number Publication date
CN110867250A (en) 2020-03-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant