CN110867250B

CN110867250B - A social media self-harm behavior detection method based on strong robust feature selection

Info

Publication number: CN110867250B
Application number: CN201911033392.0A
Authority: CN
Inventors: 罗敏楠; 董怡翔; 郑庆华; 秦涛
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2019-10-28
Filing date: 2019-10-28
Publication date: 2022-10-25
Anticipated expiration: 2039-10-28
Also published as: CN110867250A

Abstract

The invention discloses a social media self-harm behavior detection method based on strongly robust feature selection, comprising: 1) obtaining multi-dimensional heterogeneous information from online social media websites; 2) performing feature extraction on each information source to construct a dataset of self-harm content and a dataset of normal content; 3) constructing a supervised self-harm detection model based on strongly robust feature selection, using a loss function and a regularization term that are both l_2,1 norms; 4) performing feature extraction on the target data to be detected and applying the constructed detection model for self-harm detection. Compared with traditional self-harm detection methods, the disclosed social-media-oriented method can reach self-harm subjects more widely, explore their behavior patterns more deeply, and discover self-harm behaviors more efficiently and promptly.

Description

A social media self-harm behavior detection method based on strongly robust feature selection

Technical Field

The invention belongs to the field of social media data mining, and in particular relates to a social media self-harm behavior detection method based on strongly robust feature selection.

Background

In recent years, self-harm has become a major public health challenge. Detecting self-harm in society promptly and effectively is a practical requirement for meeting this challenge. Traditional discovery strategies, which rely on the self-harm subjects themselves and on their family and friends, are difficult to carry out and inefficient, so a new detection strategy is urgently needed. With the spread of online social media, more and more people post their thoughts and record their lives there, which makes it possible to detect self-harm behavior from social media data. Compared with traditional methods, social-media-based detection can discover more self-harm behaviors, and do so more efficiently.

A large body of work already uses network data, mainly from social media, to study the health of network users. One prior-art method detects psychological stress from heart rate and social media microblogs in order to discover the stress intervals and stressor events of a target individual. It first detects abnormal heart rate, which reflects the tension of the individual's nervous system during the test period; it then detects abnormal intervals in the individual's microblogs to find anomalies in the frequency of positive microblog posts during the test period; finally, it matches the heart-rate anomalies against the posting anomalies to determine the stress intervals, and discovers the stressor events from the microblog data.

Another prior-art method provides early warning of psychological crises among social media users. It first acquires the text data that users publish on social media and preprocesses it into a dataset of words; it then quantifies the sentiment of the text from the frequency of negative words, producing a sentiment feature vector for the user's posts; finally, it feeds the feature vector into a neural network to obtain the user's negative sentiment intensity and rates the user's psychological state.

Both of these social-media-based methods use only homogeneous information sources and do not exploit the rich heterogeneous information on social media for comprehensive data mining. In addition, their data mining algorithms are too simple: they cannot fully extract the valuable information in media data, nor cope with the noisy, complex data found in practical applications.

Summary of the Invention

The purpose of the present invention is to provide a social media self-harm behavior detection method based on strongly robust feature selection that solves the above problems.

To achieve this purpose, the invention adopts the following technical solution:

A social media self-harm behavior detection method based on strongly robust feature selection, comprising the following steps:

Step 1, social media data collection: taking the historical data of online social media websites as the data source, obtain the text information, user behavior information, time information, and picture information of self-harm related posts and non-self-harm posts, yielding a post set consisting of n posts p_i (i = 1, 2, …, n);

Step 2, data feature extraction and dataset construction: for each post p_i (i = 1, 2, …, n) obtained from data collection, extract the features of its 4 heterogeneous information sources to obtain the post feature vector f_i = {w_i, u_i, t_i, p_i}, where w_i represents the text features, u_i the user behavior features, t_i the temporal features of the post, and p_i the picture features of the post; from these, construct a self-harm post dataset and a normal post dataset;

Step 3, self-harm detection model establishment: draw training samples from the datasets constructed in step 2, and build and train a supervised self-harm detection model using the objective function of strongly robust feature selection;

Step 4, self-harm content detection: for a target post p to be detected, construct its feature vector f with the feature extraction method of step 2, then input f into the detection model trained in step 3, which performs feature selection and determines whether the post is self-harm related.

Further, in the social media data collection of step 1, self-harm related posts and non-self-harm posts are crawled by topic using the tag information of social media posts, through a web crawler or the application programming interface provided by the social media platform. The main content obtained for each post includes:

(1) Text information: the title of the post, its list of hashtags, its body text, and the text of all its comments;

(2) User behavior information: the total number of posts by the posting user, the time at which the user joined the social media platform, the number of accounts the user follows, and the number of followers the user has;

(3) Time information: the publishing time of the post and the shooting time of the pictures in the post;

(4) Picture information: all pictures attached to the post.

Further, the feature extraction and dataset construction of step 2 mainly include:

(1) Text features: the part-of-speech distribution, computed as the proportion of each part of speech in the text of the post; the readability, computed as a readability index using readability formulas from linguistics; the sentiment polarity, determined by text sentiment analysis as positive, neutral, or negative; and the word-vector representation, computed for the text of each post with a deep model. These features are expressed as w = {w_ling, w_read, w_sent, w_vec};

(2) User behavior features: the user's average posting volume, computed from the user's total number of posts and the time the user has spent on the platform; the average reply rate of the user's posts, computed from the total number of posts and the number of posts that received replies; and the user's following count and follower count. These features are expressed as u = {u_post, u_rep, u_fol, u_fan};

(3) Temporal features: each day is divided by hour into 24 time periods, and the periods containing the publishing time of the post and the shooting time of the attached picture are recorded. These features are expressed as t = {t_post, t_pic};

(4) Picture features: the color pattern of the picture is characterized, and the color information is used to quantify the emotional dimensions of the picture; local features are extracted with image processing algorithms, and a neural network produces a representation of the picture. These features are expressed as p = {p_col, p_sent, p_local, p_net}.

Further, the self-harm detection model of step 3 is established with an efficient and strongly robust feature selection method. First, a label matrix Y, with one row per training post and two columns, represents the annotation information available in the training data: for post p_i in the post set, {Y_i1 = 1, Y_i2 = 0} marks a self-harm post and {Y_i1 = 0, Y_i2 = 1} marks a normal post;

Then, a data matrix X represents the training data, where l_i is the number of features extracted from the i-th heterogeneous information source;

Finally, strongly robust feature selection is achieved by using the l_2,1 norm for both the loss function and the regularization term. The supervised model trains a coefficient matrix W that maps the data matrix X to the label matrix Y, with the training objective

$\min_{W} \|XW - Y\|_{2,1} + \alpha\|W\|_{2,1}$

where α is the parameter of the regularization term. The specific training process is:

(1) Construct the matrix A (its defining formula, which involves an identity matrix, is not reproduced in the source); initialize the matrix D as an identity matrix; and set the termination threshold for convergence of the training process to ε;

(2) Compute $U = D^{-1}A^{T}(AD^{-1}A^{T})^{-1}Y$;

(3) Update the diagonal matrix D, whose diagonal elements are $d_{ii} = 1/(2\|u_i\|_2)$, where u_i is the i-th row of U;

(4) Construct $W = (u_1, u_2, \ldots, u_{m-n})$ and check whether the decrease of the objective function is less than ε; if it is not, return to (2) and continue training; otherwise, stop training and save the coefficient matrix W.
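The formulas for the matrices in step (1) are missing from this text, but the update in step (2) and the row selection in step (4) constrain them tightly. The derivation below is a hedged reconstruction, not text from the patent. It assumes n training posts, d = l_1 + l_2 + l_3 + l_4 features, m = n + d, and the objective given by the termination condition in the detailed description; the symbols E, I_n, and Λ are introduced here for illustration only.

```latex
\begin{aligned}
&\min_{W}\ \|XW - Y\|_{2,1} + \alpha\|W\|_{2,1},
\qquad \|M\|_{2,1} = \sum_{i}\|m_i\|_2 \quad (\text{sum of row norms}) \\[4pt]
&\text{Let } E = \tfrac{1}{\alpha}(Y - XW),\quad
A = [\,X \ \ \alpha I_n\,] \in \mathbb{R}^{n\times m},\quad
U = \begin{bmatrix} W \\ E \end{bmatrix} \in \mathbb{R}^{m\times 2}. \\[4pt]
&\text{Then } AU = XW + \alpha E = Y
\ \text{ and }\ \alpha\|U\|_{2,1} = \alpha\|W\|_{2,1} + \|XW - Y\|_{2,1}, \\
&\text{so the problem is equivalent to } \min_{U}\ \|U\|_{2,1}\ \ \text{s.t.}\ \ AU = Y. \\[4pt]
&\text{Stationarity of } \|U\|_{2,1} - \operatorname{Tr}\!\big(\Lambda^{T}(AU - Y)\big)
\text{ gives } 2DU = A^{T}\Lambda,\quad d_{ii} = \frac{1}{2\|u_i\|_2}, \\
&\text{and substituting into } AU = Y \text{ yields }
U = D^{-1}A^{T}\big(AD^{-1}A^{T}\big)^{-1}Y .
\end{aligned}
```

Under these assumptions the first m − n = d rows of U are exactly W, which agrees with the construction in step (4), and D depends on U, which is why steps (2) and (3) alternate until the objective stops decreasing.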

Further, in the self-harm content detection of step 4, the target post p to be detected is mapped through the learned coefficient matrix W to its indicator vector y ∈ R^2. When y_1 > y_2, the post is judged to be self-harm content; otherwise, it is judged to be normal content, that is, non-self-harm content.

Compared with the prior art, the present invention has the following technical effects:

(1) Because of trust issues, self-harm subjects find it hard to find reliable people to turn to in the physical world, and are therefore more inclined to confide and seek help on relatively anonymous social media, which allows the invention to reach self-harm subjects more widely;

(2) Traditional research on self-harm behavior suffers from small samples and long follow-up periods, whereas the large volume of social data accumulated through the spread of social media contains many self-harm cases, which allows the invention to explore and understand the behavior patterns of self-harm subjects more deeply;

(3) Because self-harm behavior is concealed, traditional discovery through the relatives and friends of the self-harm subject is difficult and delayed, whereas the mathematical self-harm detection model built on social media data allows the invention to discover self-harm behavior more promptly and effectively.

Brief Description of the Drawings

Figure 1 is a block diagram of the social media self-harm behavior detection method based on strongly robust feature selection according to the present invention.

Figure 2 is a flow chart of the data collection process.

Figure 3 is a flow chart of the feature analysis process.

Figure 4 is a flow chart of the detection model training process.

Figure 5 is a flow chart of the self-harm detection process.

Figure 6 is an example of a self-harm related post on social media.

Detailed Description

The embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples. The embodiments described here serve only to explain the invention and do not limit it. In addition, the technical features of the embodiments may be combined with one another wherever they do not conflict.

The implementation of the invention comprises a data collection process, a feature analysis process, a model building process, and a self-harm detection process. Figure 1 is a block diagram of the social media self-harm behavior detection method based on strongly robust feature selection according to the present invention.

1. Data collection process

Figure 6 shows an example of online social media data. The data acquisition procedure is as follows:

(1) Using crawler technology, crawl posts by topic according to the tags attached to each social media post. Self-harm related content can be crawled with self-harm related tags such as "selfharm", "selfinjury", and "suicide"; normal posts can be crawled from the target pages without a topic filter, collecting more posts than are ultimately needed;

(2) For self-harm related posts, the post sets returned by different tags may overlap, so the posts are deduplicated. Then, to guard against non-self-harm users who only occasionally post self-harm related content, users with fewer than 5 posts in the user set of the self-harm posts are removed together with their posts;

(3) For normal content posts, filter by tag to remove posts carrying self-harm related tags. Then randomly sample as many normal posts as are required.
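The cleaning rules in steps (2) and (3) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the patent's implementation: the post record layout (`id`, `user`, `tags` keys), the function names, and the reading of "fewer than 5 posts" as a count of a user's crawled self-harm-tagged posts are all hypothetical.

```python
import random
from collections import Counter

# Example tags named in the text; a real tag list would be longer.
SELF_HARM_TAGS = {"selfharm", "selfinjury", "suicide"}


def clean_self_harm_posts(posts, min_posts_per_user=5):
    """Deduplicate tag-crawled posts, then drop users with too few posts."""
    unique = {}
    for post in posts:
        # The same post can be returned by several tags; keep the first copy.
        unique.setdefault(post["id"], post)
    per_user = Counter(p["user"] for p in unique.values())
    return [p for p in unique.values() if per_user[p["user"]] >= min_posts_per_user]


def sample_normal_posts(posts, k, seed=0):
    """Remove posts carrying self-harm tags, then draw up to k at random."""
    clean = [p for p in posts
             if not SELF_HARM_TAGS & {t.lower() for t in p["tags"]}]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(clean, min(k, len(clean)))
```

Deduplicating before counting matters here: otherwise a single post returned under several tags could push an occasional poster over the threshold.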

The flow of the above steps is shown in Figure 2 and yields the post set. The crawled content, illustrated in Figure 6, comprises the text information, user behavior information, time information, and picture information listed in step 1.

2. Feature analysis process

Features are analyzed and extracted from the posts obtained in the data collection process. For each post p_i (i = 1, 2, …, n), the features of its 4 heterogeneous information sources are extracted to obtain the post feature vector f_i = {w_i, u_i, t_i, p_i}. The main steps are:

(1) Text feature extraction: the part-of-speech distribution is the proportion of nouns, verbs, adjectives, and adverbs in the text of each post, and can be computed with a social-media-oriented text analysis tool such as CMUTweetTagger; the readability is a readability index computed with readability formulas from linguistics, for example the Flesch, Linsear Write, Fog, and Dale-Chall formulas; the sentiment polarity is determined by text sentiment analysis as positive, neutral, or negative, and can be computed with the MPQA corpus; the word-vector representation of the text of each post is computed with the word2vec model. These features are expressed as w_i = {w_ling, w_read, w_sent, w_vec};

(2) User behavior feature extraction: the user's average posting volume is computed from the user's total number of posts and the time the user has spent on the platform; the average reply rate of the user's posts is computed from the total number of posts and the number of posts that received replies; the user's following count and follower count are added. These features are expressed as u_i = {u_post, u_rep, u_fol, u_fan};

(3) Temporal feature extraction: each day is divided by hour into 24 time periods, and the periods containing the publishing time of the post and the shooting time of the attached picture are recorded; each can be represented by a vector in {0,1}^24. These features are expressed as t_i = {t_post, t_pic};

(4) Picture feature extraction: the color pattern of the picture is characterized in the cylindrical-coordinate HSV color space, giving the hue, saturation, and brightness of the picture. The color information is also used to quantify the emotional dimensions of the picture; the calculation formula is not reproduced in the source.

In addition, the SURF, LBP, and GIST image processing algorithms are used to extract local features of the picture, and an AlexNet neural network pre-trained on the ImageNet dataset is used to extract picture features. These features are expressed as p_i = {p_col, p_sent, p_local, p_net}.
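As a rough illustration of the HSV color characterization in item (4), the sketch below converts RGB pixels with the standard-library `colorsys` module and summarizes them. The choice of summary (a coarse hue histogram plus mean saturation and mean brightness), the six hue bins, and the function name are assumptions made for this example; the source does not say how p_col is aggregated, and the emotional-dimension formula is omitted because it is not given.

```python
import colorsys


def hsv_color_features(pixels, hue_bins=6):
    """Summarise (r, g, b) pixels with 0-255 channels in HSV space."""
    hist = [0] * hue_bins
    sat_sum = val_sum = 0.0
    count = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        # Hue is an angle scaled to [0, 1); grey pixels report hue 0.
        hist[min(int(h * hue_bins), hue_bins - 1)] += 1
        sat_sum += s
        val_sum += v
        count += 1
    if count == 0:
        raise ValueError("at least one pixel is required")
    return {
        "hue_hist": [c / count for c in hist],
        "saturation": sat_sum / count,
        "brightness": val_sum / count,
    }
```

A histogram is used for hue rather than a mean because hue is circular, so averaging it directly would place red (near both 0 and 1) in the middle of the spectrum.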

The flow of this process is shown in Figure 3. The self-harm post dataset and the normal post dataset can then be constructed.
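The user behavior features u_i and temporal features t_i are simple enough to sketch directly. The code below is a minimal illustration under stated assumptions: the averaging unit (posts per day), the argument names, and the concatenation of the two 24-slot vectors into one 48-element list are choices made for this example, not details given in the source.

```python
from datetime import datetime


def hour_one_hot(timestamp):
    """Encode the hour of a timestamp as a vector in {0,1}^24."""
    vec = [0] * 24
    vec[timestamp.hour] = 1
    return vec


def time_features(post_time, picture_time):
    """t = {t_post, t_pic}: two 24-slot vectors, concatenated."""
    return hour_one_hot(post_time) + hour_one_hot(picture_time)


def user_features(total_posts, posts_with_replies, days_on_platform,
                  following, followers):
    """u = {u_post, u_rep, u_fol, u_fan} for one posting user."""
    days = max(days_on_platform, 1)  # avoid dividing by zero for new accounts
    avg_posts = total_posts / days
    reply_rate = posts_with_replies / total_posts if total_posts else 0.0
    return [avg_posts, reply_rate, following, followers]
```

For instance, `user_features(100, 25, 50, 10, 20)` describes a user who posts twice a day on average and receives replies on a quarter of their posts.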

3. Model building process

The self-harm annotation information of the training dataset is defined as the label matrix Y: for post p_i in the post set, {Y_i1 = 1, Y_i2 = 0} marks a self-harm post and {Y_i1 = 0, Y_i2 = 1} marks a normal post. The feature vectors of the training data form the data matrix X (l_i is the number of features extracted from the i-th heterogeneous information source). The supervised model based on strongly robust feature selection trains a coefficient matrix W that maps the data matrix X to the label matrix Y, with the training objective

$\min_{W} \|XW - Y\|_{2,1} + \alpha\|W\|_{2,1}$

where α is the regularization parameter. The specific training process is:

(1) Construct the matrix A (its defining formula, which involves an identity matrix, is not reproduced in the source); initialize the matrix D as an identity matrix; and set the termination condition of training to be that the decrease of the objective $\|XW - Y\|_{2,1} + \alpha\|W\|_{2,1}$ is less than ε;

(2) Compute $U = D^{-1}A^{T}(AD^{-1}A^{T})^{-1}Y$;

(3) Update the diagonal matrix D, whose diagonal elements are $d_{ii} = 1/(2\|u_i\|_2)$, where u_i is the i-th row of U;

(4) Construct $W = (u_1, u_2, \ldots, u_{m-n})$ and check whether the termination condition holds; if it does not, return to (2) and continue training; otherwise, stop training and save the coefficient matrix W.
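The training loop can be sketched in plain Python as follows. This is a hedged illustration, not the patent's code: it assumes A = [X, αI_n] (a form consistent with the update rule and with W being the first m − n rows of U, but not stated in this text), stores only the diagonal of D⁻¹, and adds a small floor on row norms and a `max_iter` cap for numerical safety. A practical implementation would use a linear algebra library instead of the hand-written `solve`.

```python
import math


def matmul(P, Q):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def l21_norm(M):
    """l_2,1 norm: the sum of the Euclidean norms of the rows of M."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in M)


def solve(M, B):
    """Solve M Z = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def objective(X, Y, W, alpha):
    """||XW - Y||_2,1 + alpha * ||W||_2,1."""
    residual = [[p - y for p, y in zip(prow, yrow)]
                for prow, yrow in zip(matmul(X, W), Y)]
    return l21_norm(residual) + alpha * l21_norm(W)


def train(X, Y, alpha=0.1, eps=1e-6, max_iter=100):
    """Iteratively reweighted solver for the l_2,1 objective (steps 1 to 4)."""
    n, d = len(X), len(X[0])
    # Step (1), assumed form: A = [X, alpha * I_n], an n x (d + n) matrix.
    A = [list(X[i]) + [alpha if j == i else 0.0 for j in range(n)]
         for i in range(n)]
    At = [list(col) for col in zip(*A)]
    d_inv = [1.0] * (d + n)  # diagonal of D^{-1}; D starts as the identity
    previous, W = float("inf"), None
    for _ in range(max_iter):
        # Step (2): U = D^{-1} A^T (A D^{-1} A^T)^{-1} Y
        DinvAt = [[d_inv[k] * v for v in At[k]] for k in range(d + n)]
        U = matmul(DinvAt, solve(matmul(A, DinvAt), Y))
        # Step (3): d_ii = 1 / (2 ||u_i||_2), so 1 / d_ii = 2 ||u_i||_2.
        d_inv = [2.0 * max(math.sqrt(sum(v * v for v in row)), 1e-12)
                 for row in U]
        # Step (4): W is the first d = m - n rows of U; stop on small decrease.
        W = U[:d]
        current = objective(X, Y, W, alpha)
        if previous - current < eps:
            break
        previous = current
    return W
```

Rows of W whose norm shrinks toward zero correspond to features the model has effectively discarded, which is how the same training run performs feature selection.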

The training flow of the above supervised model is shown in Figure 4.

4. Self-harm detection process

For a target post p to be detected, its feature vector f is constructed with the feature extraction method of the feature analysis process and then input into the detection model obtained in the model building process, which determines whether the post is self-harm related. Mapping through the coefficient matrix W of the supervised self-harm detection model gives the indicator vector y ∈ R^2 of the target post p. When y_1 > y_2, the post is judged to be self-harm content; otherwise, it is judged to be normal content. The flow of the detection process is shown in Figure 5.
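The decision rule amounts to one vector-matrix product and one comparison. The sketch below assumes the mapping is y = fW with f as a row vector, which matches the XW form of the training objective; the function name and the example coefficients are hypothetical.

```python
def detect_self_harm(f, W):
    """Map feature vector f through W and apply the y_1 > y_2 rule.

    f is a list of d feature values; W is a d x 2 coefficient matrix
    stored as a list of rows. Returns (y, is_self_harm).
    """
    if len(f) != len(W):
        raise ValueError("feature vector and coefficient matrix disagree on d")
    y = [sum(f[k] * W[k][j] for k in range(len(f))) for j in range(2)]
    return y, y[0] > y[1]


# Hypothetical coefficients: only the first feature carries weight.
W_example = [[0.5, -0.5], [0.0, 0.0]]
y, flagged = detect_self_harm([1.0, 3.0], W_example)
```

In this example the second row of `W_example` is zero, so the second feature has no influence on the decision, mirroring a feature that training has deselected.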

Claims

1. A social media self-harm behavior detection method based on strongly robust feature selection, characterized by comprising the following steps:

step 1, social media data collection: taking historical data of an online social media website as the data source, acquiring the text information, user behavior information, time information, and picture information of self-harm related posts and non-self-harm posts to obtain a post set consisting of n posts;

step 2, data feature extraction and dataset construction: for each post p_i obtained from data collection, where i = 1, 2, …, n, extracting the features of its 4 heterogeneous information sources to obtain a post feature vector f_i = {w_i, u_i, t_i, p_i}, wherein w_i represents the text features, u_i represents the user behavior features, t_i represents the temporal features of the post, and p_i represents the picture features of the post, and constructing a self-harm post dataset and a normal post dataset respectively;

step 3, self-harm detection model establishment: drawing training samples from the datasets constructed in step 2, and constructing and training a supervised self-harm detection model based on the objective function of strongly robust feature selection;

step 4, self-harm content detection: constructing a feature vector f of a target post p to be detected according to the feature extraction method of step 2, inputting the feature vector f into the detection model trained in step 3 for feature selection, and judging whether the target post p is a self-harm related post;

wherein, in the social media data collection of step 1, self-harm related posts and non-self-harm posts are crawled by topic using the tag information of social media posts, through a web crawler or an application programming interface provided by the social media platform, and the main content acquired for each post comprises:

(1) text information: the title of the post, its list of hashtags, its body text, and the text of all its comments;

(2) user behavior information: the total number of posts by the posting user, the time at which the user joined the social media platform, and the user's following count and follower count;

(3) time information: the publishing time of the post and the shooting time of the pictures in the post;

(4) picture information: all pictures attached to the post;

in the feature extraction and dataset construction of step 2, the method mainly comprises:

(1) text features: the part-of-speech distribution feature w_ling, computed as the proportion of each part of speech in the text of each post; the readability feature w_read, computed as a readability index using a readability formula from linguistics; the sentiment polarity feature w_sent, determined by text sentiment analysis as positive, neutral, or negative; and the word-vector representation w_vec, computed for the text of each post with a deep model; these features are expressed as w = {w_ling, w_read, w_sent, w_vec};

(2) user behavior features: the average posting volume u_post of the user, computed from the user's total number of posts and the time the user has spent on the social media platform; the average reply rate u_rep of the user's posts, computed from the user's total number of posts and the number of posts that received replies; and the user's following count u_fol and follower count u_fan; these features are expressed as u = {u_post, u_rep, u_fol, u_fan};

(3) temporal features: dividing each day by hour into 24 time periods, and recording the time period t_post of the publishing time of the post and the time period t_pic of the shooting time of the attached picture; these features are expressed as t = {t_post, t_pic};

(4) picture features: characterizing the color pattern p_col of the picture, and using the color information to quantify the emotional dimensions p_sent of the picture; extracting the local features p_local of the picture with image processing algorithms, and representing the picture with a neural network to obtain p_net; these features are expressed as p = {p_col, p_sent, p_local, p_net};

in the self-harm detection model establishment of step 3, an efficient and strongly robust feature selection method is used: first, a label matrix Y is used to represent the annotation information available in the training data, wherein, for post p_i in the post set, when {Y_i1 = 1, Y_i2 = 0} the post is a self-harm content post, and when {Y_i1 = 0, Y_i2 = 1} the post is a normal post;

then, a data matrix X is used to represent the training data, wherein l_i is the number of features extracted from the i-th heterogeneous information source;

finally, strongly robust feature selection is achieved by using a loss function and a regularization term that are both l_2,1 norms; the constructed supervised model trains a coefficient matrix W that maps the data matrix X to the label matrix Y, the training objective being:

$\min_{W} \|XW - Y\|_{2,1} + \alpha\|W\|_{2,1}$

wherein α is the parameter of the regularization term, and the specific training process is:

(1) constructing the matrix A (its defining formula, which involves an identity matrix, is not reproduced in the source), initializing the matrix D as an identity matrix, and setting the termination threshold for convergence of the training process to ε;

(2) computing $U = D^{-1}A^{T}(AD^{-1}A^{T})^{-1}Y$;

(3) updating the diagonal matrix D, whose diagonal elements are $d_{ii} = 1/(2\|u_i\|_2)$, wherein u_i is the i-th row of U;

(4) constructing $W = (u_1, u_2, \ldots, u_{m-n})$ and judging whether the decrease of the objective function is less than ε; if not, returning to (2) to continue training; otherwise, stopping training and saving the coefficient matrix W;

in the self-harm content detection of step 4, the target post p to be detected is mapped through the learned coefficient matrix W to obtain an indicator vector y ∈ R^2, and when y_1 > y_2 the post is judged to be self-harm content; otherwise, the post is judged to be normal content, wherein normal content is non-self-harm content.