CN102521220A

CN102521220A - Method for recognizing network suicide note

Info

Publication number: CN102521220A
Application number: CN201110386606XA
Authority: CN
Inventors: 王泰; 徐薇; 李隆; 刘三女牙
Original assignee: Huazhong Normal University
Current assignee: Huazhong Normal University
Priority date: 2011-11-29
Filing date: 2011-11-29
Publication date: 2012-06-27
Anticipated expiration: 2031-11-29
Also published as: CN102521220B

Abstract

The invention provides a method for automatically recognizing suicide notes appearing on the Internet. It belongs to the technical fields of Chinese text information processing and applied psychology, and achieves the automatic discovery of network suicide notes. The method binds core words to feature sentences and is divided into two stages: feature extraction and feature recognition. Core words are extracted; the suicidal tendency value of a text under test is calculated from factors such as the maximum similarity between the sentences containing the core words and the feature sentences of those core words; the text is then judged to be a suicide note or not. With this method, network suicide notes can be recognized automatically, early warning can be given about individuals in psychological crisis, and a basis is provided for intervention and treatment by psychological counseling and guidance departments. The method is simple and easy to implement, avoids the negative effects of word segmentation defects, is highly compatible with newly added samples, and has high recognition accuracy and a low omission rate.

Description

A method for recognizing network suicide notes

Technical field

The invention belongs to the technical fields of Chinese text information processing and applied psychology, and specifically relates to a method for recognizing network suicide notes.

Background art

Suicide has become the leading cause of death among people aged 15 to 34 in China. According to research statistics, the deceased left last words or a suicide note in 28.1% of suicide cases. In recent years, some netizens have posted their last words on the Internet before attempting suicide. In these cases, thanks to the timely intervention of concerned netizens and the police, the tragedy was ultimately averted.

It follows that developing a method for automatically recognizing network suicide notes is of real practical significance for rescuing, in time, people who are contemplating suicide.

Although research on suicide notes is abundant, it concentrates mainly on tracing, through the notes, the factors that led to the suicide. Research on the automatic classification of suicide notes is still at an early stage worldwide. A method for automatically recognizing suicide notes posted on the Internet was first proposed only in 2007: Yen-Pei Huang, Tiong Goh, Chern Li Liew, "Hunting Suicide Notes in Web 2.0 - Preliminary Findings," in Proc. of IEEE 9th Int'l Symp. on Multimedia, 2007, pp. 517-521. That method scores a candidate text according to the frequency of occurrence of keywords or phrases; the higher the score, the stronger the suspicion of suicide. Although the method is very simple, its accuracy is low. In 2008 and 2009, at two consecutive workshops on biomedical natural language processing, researchers from the children's medical center of the University of Cincinnati in the United States and from Nicolaus Copernicus University in Poland successively proposed recognizing suicide notes with a supervised machine learning method (sequential minimal optimization) and an unsupervised machine learning method (sequential information bottleneck), which significantly improved accuracy.

At present, no published literature in China reports results on the automatic classification of Chinese suicide notes. Automatic classification methods designed for Western languages such as English cannot simply be transplanted to Chinese suicide notes, for three reasons. First, unlike English, where words are naturally separated by spaces, the characters within a Chinese clause are written contiguously; extracting keywords automatically without introducing ambiguity remains difficult, even though fairly mature Chinese automatic word segmentation components exist. Second, Chinese expression is more implicit: whereas English notes often state "suicide" or "killed myself" bluntly, Chinese notes tend to use words or phrases such as "death" or "leaving this world". Third, if only high-frequency words such as "death" and "world" are used as the basis for recognition, a sports news item such as "the Chinese men's football team was drawn into the group of death in the qualifying round of the South Africa World Cup" could be misjudged as a suicide note.

The shortcoming of the prior art is that machine recognition does not draw deeply on the regularities of human reading. In general, a person reading a text goes through a bottom-up and then a top-down cognitive process. Bottom-up, the reader first understands the words and then combines them into sentences; the meaning of a sentence is more complete and more specific than that of a word. Top-down, after reading the whole text the reader, guided by context and reflection, forms a judgment of which sentences are important and, in particular, retains a deep memory of certain words in the important sentences.

Summary of the invention

In view of the above shortcomings of the prior art, and considering that a suicide note is a type of text that describes certain fixed and specific ideas, the invention proposes a method for recognizing network suicide notes in which core words are bound to feature sentences. The method is simple, avoids the negative effects of word segmentation defects, is highly compatible with newly added samples, and has high recognition accuracy and a low omission rate.

Specifically, the method for recognizing network suicide notes of the invention is divided into two stages: feature extraction and feature recognition.

The feature extraction stage is divided into three steps, as shown in Fig. 1.

Step 1: From a sufficient number of collected suicide note samples, select the sentences that best embody the author's suicidal idea; that is, sentences without which the note could only be regarded as a confession or complaint that vents emotion. The selected sentences are called feature sentences. If the idea is expressed in one clause of a sentence, only that clause is taken.

Step 2: From each feature sentence, select a core word that expresses the author's suicidal idea; only one core word is selected per feature sentence. Feature sentences that share the same core word are then placed in the feature sentence library of that core word. A synonym B of a core word A is also regarded as a core word, and the feature sentences containing synonym B are likewise placed in the feature sentence library of core word A.
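The grouping described in this step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_feature_library`, the `(core_word, sentence)` input format and the `synonyms` mapping from a synonym B to its core word A are all assumptions introduced here.

```python
from collections import defaultdict

def build_feature_library(labelled_sentences, synonyms=None):
    """Group feature sentences by core word.

    labelled_sentences: iterable of (core_word, feature_sentence) pairs,
        one hand-picked core word per feature sentence (hypothetical format).
    synonyms: optional dict mapping a synonym B to its core word A, so that
        sentences labelled with B land in the library of A.
    """
    synonyms = synonyms or {}
    library = defaultdict(list)
    for core_word, sentence in labelled_sentences:
        canonical = synonyms.get(core_word, core_word)
        library[canonical].append(sentence)
    return dict(library)
```

For example, with `synonyms={"depart": "leave"}`, a sentence labelled "depart" is stored under "leave", so both words share one feature sentence library.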

Step 3: Select as few core words as possible to cover as many suicide note samples as possible. In the first round, choose the core word that covers the most samples, that is, the core word contained in the largest number of samples. In each later round, choose the core word that covers the most of the remaining samples; if more than one core word qualifies, choose the one with the highest frequency of occurrence. Repeat this process until the cumulative number of covered samples exceeds 95% of the total number of samples. This process yields the "core word - feature sentence library" lookup table.

The feature recognition stage consists of two steps, as shown in Fig. 2.

Step 1: Scan the text under test. If no core word appears, the text is judged not to be a suicide note. If a core word appears, proceed to Step 2.
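The selection rule in the third step of feature extraction is a greedy covering procedure. A minimal sketch is given here, assuming the occurrences of each candidate word have already been recorded as sets of sample numbers; the names `select_core_words`, `occurrences`, `frequency` and `total_samples` are illustrative and do not come from the patent.

```python
def select_core_words(occurrences, frequency, total_samples, coverage=0.95):
    """Greedily pick core words until more than `coverage` of samples are covered.

    occurrences: dict mapping candidate word -> set of sample ids containing it.
    frequency: dict mapping candidate word -> total occurrence count,
        used only to break ties between equally good candidates.
    total_samples: number of samples, including any that no candidate covers.
    """
    covered = set()
    chosen = []
    candidates = dict(occurrences)
    while len(covered) <= coverage * total_samples and candidates:
        # Rank by newly covered samples first, then by frequency of occurrence.
        best = max(
            candidates,
            key=lambda w: (len(candidates[w] - covered), frequency.get(w, 0)),
        )
        gain = candidates.pop(best) - covered
        if not gain:
            break  # no remaining candidate covers any uncovered sample
        chosen.append(best)
        covered |= gain
    return chosen, covered
```

The loop stops either when coverage exceeds the 95% target or when no remaining candidate adds a new sample, which guards against samples that contain none of the candidate words.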

Step 2: Suppose core words occur N times in the text under test T, and denote the j-th occurrence by W_j, j = 1, 2, 3, ..., N, where N is a natural number.

Extract from T the clause S_j that contains W_j, and compute the sentence similarity A(S_j, C(W_j, i)) between the sentence under test S_j and each feature sentence C(W_j, i) of W_j, where i = 1, 2, ..., L(W_j) and L(W_j) is the number of feature sentences corresponding to W_j in the "core word - feature sentence library" lookup table.

The suicidal tendency value of the sentence under test S_j is

M(S_{j}) = \max_{i = 1, 2, \ldots, L(W_{j})} A(S_{j}, C(W_{j}, i)).

The suicidal tendency value of the sample under test T is

M(T) = \frac{1}{N} \sum_{j = 1}^{N} M(S_{j}).

M(T) is then compared with a set threshold to decide whether the text is a suicide note: if M(T) is greater than or equal to the threshold, the text under test is judged to be a suicide note; if M(T) is less than the threshold, it is judged not to be a suicide note.
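The two recognition steps and the formulas for M(S_j) and M(T) can be sketched as follows. This is an illustration under stated assumptions: the clause delimiters, the function names, and the convention that `library` is keyed by every core word to scan for (synonyms included) are choices made here, and the `similarity` function A is passed in rather than defined.

```python
import re

# Assumed clause delimiters; the patent does not list them.
CLAUSE_DELIMITERS = r"[，,。.；;！!？?\n]"

def split_clauses(text):
    """Split a text into non-empty clauses."""
    return [c.strip() for c in re.split(CLAUSE_DELIMITERS, text) if c.strip()]

def tendency_value(text, library, similarity):
    """Return M(T), or None when no core word occurs in the text.

    library: dict mapping core word -> non-empty list of feature sentences.
    similarity: function A(s1, s2) returning a number.
    """
    scores = []
    for clause in split_clauses(text):
        for word, features in library.items():
            # Each occurrence W_j contributes one M(S_j) for its clause S_j.
            for _ in range(clause.count(word)):
                scores.append(max(similarity(clause, f) for f in features))
    if not scores:
        return None
    return sum(scores) / len(scores)

def is_suicide_note(text, library, similarity, threshold=0.425):
    """Step 1: no core word means not a note. Step 2: compare M(T) to the threshold."""
    m = tendency_value(text, library, similarity)
    return m is not None and m >= threshold
```

Because M(T) is a mean over all core-word occurrences, a single strongly matching clause can be diluted by several weakly matching ones; this follows directly from the averaging formula above.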

When computing the similarity A(S_1, S_2) of two sentences S_1 and S_2, the "character matching degree" and the "character-string matching degree" are computed separately and then combined by linear weighting to obtain the sentence similarity. The character matching degree, the character-string matching degree and the sentence similarity are computed as follows.

Character matching degree

Character-string matching degree (a character string is a run of consecutive characters with no separator in between)

Sentence similarity

Sentence similarity = β × character matching degree + α × character-string matching degree

Here β = 0.5, α = 0.7, and the threshold is 0.425.
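The linear weighting can be illustrated with a short sketch. The formulas for the two matching degrees are not reproduced in this text, so the sketch takes them as already computed inputs; the function name and parameter names are illustrative only.

```python
def sentence_similarity(char_match, string_match, beta=0.5, alpha=0.7):
    """Linear weighting of the two matching degrees with the stated weights."""
    return beta * char_match + alpha * string_match

# Hypothetical matching degrees for one clause/feature-sentence pair.
score = sentence_similarity(char_match=0.6, string_match=0.2)
meets_threshold = score >= 0.425
```

With these hypothetical inputs the weighted sum is 0.5 × 0.6 + 0.7 × 0.2, which is about 0.44 and so clears the 0.425 threshold. Note that the weights sum to 1.2, so if both matching degrees lie in [0, 1] (an assumption) the similarity can exceed 1.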

During testing, if missed samples are found, those samples, together with other newly collected suicide note samples, are fed back into the feature extraction stage, so as to further reduce the omission rate of the method on new samples under test.

The method for recognizing network suicide notes of the invention automatically recognizes network suicide notes by binding core words to feature sentences. It can give early warning about individuals in psychological crisis and provides a basis for psychological counseling and guidance departments to carry out intervention and treatment. The invention is simple and easy to implement, avoids the negative effects of word segmentation defects, is highly compatible with newly added samples, and has high recognition accuracy and a low omission rate.

Description of drawings

Fig. 1 is a flow chart of the steps of the feature extraction stage of the method of the invention.

Fig. 2 is a flow chart of the steps of the feature recognition stage of the method of the invention.

Embodiment

The invention is further described below with reference to the drawings and an embodiment.

First, 52 suicide notes were collected from the Internet and verified against formally published newspapers and periodicals and against well-known forums that have a review mechanism, to confirm that the events actually occurred. The sources of 25 of these suicide note samples are listed in Table 1.

Table 1. Sources of some of the suicide note samples

Of these 52 suicide notes, 33 were used as training samples; the remaining 19, together with another 29 network texts that are depressive in tone but are not suicide notes, were used as test samples.

The feature extraction stage is carried out in three steps, as shown in Fig. 1.

Step 1: From the 33 suicide note training samples, select the sentences that best embody the author's suicidal idea; that is, sentences without which the note could only be regarded as a confession or complaint that vents emotion. The selected sentences are called feature sentences. If the idea is expressed in one clause of a sentence, only that clause is taken.

Step 3: Select as few core words as possible to cover as many suicide note samples as possible. In the first round, choose the core word that covers the most samples, that is, the core word contained in the largest number of samples. In each later round, choose the core word that covers the most of the remaining samples; if more than one core word qualifies, choose the one with the highest frequency of occurrence. Repeat this process until the cumulative number of covered samples exceeds 95% of the total number of samples. This process yields the "core word - feature sentence library" lookup table from the training samples, as shown in Table 2.

Table 2. Core word - feature sentence library lookup table

When Step 3 of the feature extraction stage is carried out, a table such as Table 3 can be drawn up.

Table 3. State record during Step 3 of the feature extraction stage

Sample	Leave	Tired out	Desperate	I'm sorry	Walk	Extremely	Live	Next life
1	1	1
2			1	1	1
3					1
4						1
5					1	1
6				1			1
7							1
8
9
10						1	1
11			1		1
12	1			1	1
13			1	1
14				1	1
15			1
16				1
17					1
18				1
19			1		1
20	1

The top row of the table lists the candidate core words that occur in the samples, and the leftmost column lists the sample numbers. A "1" at the intersection of a row and a column indicates that the candidate word of that column occurs in the sample of that row. For example, a "1" in the second row and second column indicates that the candidate word "Leave" occurs in the sample numbered in that row. When selecting core words from the candidate words, first select the column with the most "1" entries and take that word as a core word; then remove the samples that contain it, find the word with the most "1" entries among the remaining samples as the next core word, and so on.
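A state record of this kind can be built mechanically from the sample texts. The sketch below is illustrative: it uses plain substring tests to decide whether a candidate word occurs in a sample, and the function names are assumptions introduced here.

```python
def occurrence_table(samples, candidates):
    """Return one row per sample; rows[i][k] is 1 when candidates[k] occurs in samples[i]."""
    return [[1 if word in text else 0 for word in candidates] for text in samples]

def column_counts(table):
    """Count the 1 entries in each column, i.e. the samples containing each candidate."""
    return [sum(column) for column in zip(*table)]

def first_core_word(samples, candidates):
    """Pick the candidate whose column holds the most 1 entries (first round only)."""
    counts = column_counts(occurrence_table(samples, candidates))
    return candidates[counts.index(max(counts))]
```

Later rounds repeat the same count on the rows that remain after removing the samples covered by the chosen word.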

The feature recognition stage is carried out in two steps, as shown in Fig. 2.

Step 1: Scan the test sample. If no core word appears, the sample is judged not to be a suicide note. If a core word appears, proceed to Step 2.

Step 2: Suppose core words occur N times in the test sample T, and denote the j-th occurrence by W_j, j = 1, 2, 3, ..., N.

The suicidal tendency value of the sentence under test S_j is

M(S_{j}) = \max_{i = 1, 2, \ldots, L(W_{j})} A(S_{j}, C(W_{j}, i)).

The suicidal tendency value of the test sample T is

M(T) = \frac{1}{N} \sum_{j = 1}^{N} M(S_{j}).

Character matching degree

Sentence similarity

Repeated trials showed that the recognition rate on the training samples is best with β = 0.5, α = 0.7 and a threshold of 0.425. When the recognition method is applied to the test samples, if missed samples are found, those samples, together with other newly collected suicide note samples, are fed back into the feature extraction stage, so as to further reduce the omission rate of the method on new samples under test.
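The text does not say how the repeated trials were organized. One plausible way to run them is an exhaustive grid search, sketched below as an assumption rather than as the procedure actually used; `tune`, `score` and the labelled-sample format are hypothetical names introduced for illustration.

```python
from itertools import product

def tune(samples, score, betas, alphas, thresholds):
    """Grid search for the (beta, alpha, threshold) with the best recognition rate.

    samples: list of (text, is_note) pairs with known labels.
    score: function (text, beta, alpha) -> M(T), or None if no core word occurs.
    Returns (best_rate, beta, alpha, threshold); the first best combination wins ties.
    """
    best = None
    for beta, alpha in product(betas, alphas):
        scored = [(score(text, beta, alpha), label) for text, label in samples]
        for threshold in thresholds:
            correct = sum(
                (m is not None and m >= threshold) == label for m, label in scored
            )
            rate = correct / len(samples)
            if best is None or rate > best[0]:
                best = (rate, beta, alpha, threshold)
    return best
```

Including non-note texts among the labelled samples matters here: if every sample is a suicide note, the lowest threshold always gives the best rate.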

Claims

1. A method for recognizing network suicide notes, characterized in that the method consists of two stages, feature extraction and feature recognition;

The feature extraction stage is used to obtain the "core word - feature sentence library" lookup table required by the feature recognition stage. In this stage, the clauses that best embody the author's suicidal idea are first selected from a sufficient number of collected suicide note samples and are called feature sentences; a core word that expresses the author's suicidal idea is then selected from each feature sentence, only one core word per feature sentence; feature sentences with the same core word are placed in the feature sentence library of that core word; a synonym B of a core word A is also regarded as a core word, and the feature sentences containing synonym B are likewise placed in the feature sentence library of core word A; finally, a heuristic algorithm is used to select as few core words as possible to cover as many suicide note samples as possible, thereby establishing the "core word - feature sentence library" lookup table;

The feature recognition stage is used to judge, according to the "core word - feature sentence library" lookup table, whether a text under test is a suicide note. Specifically, if no core word appears in the text, it is judged not to be a suicide note; otherwise, every clause in which a core word appears is compared with the feature sentences corresponding to that core word in the lookup table; the maximum sentence similarity obtained in the comparison is taken as the suicidal tendency value of that sentence under test; the mean of the suicidal tendency values of all sentences under test is the suicidal tendency value of the text under test; finally, this value is compared with a set threshold to judge whether the text is a suicide note.

2. The method for recognizing network suicide notes according to claim 1, characterized in that, when the similarity of two sentences is computed in the feature recognition stage, the character matching degree and the character-string matching degree are computed separately and then linearly combined to obtain the similarity of the two sentences.

3. The method for recognizing network suicide notes according to claim 1, characterized in that the specific steps of the feature extraction stage are as follows:

Step 1: From a sufficient number of collected suicide note samples, select the sentences that best embody the author's suicidal idea; that is, sentences without which the note could only be regarded as a confession or complaint that vents emotion; the selected sentences are called feature sentences; if the idea is expressed in one clause of a sentence, only that clause is taken;

Step 2: From each feature sentence, select a core word that expresses the author's suicidal idea, only one core word per feature sentence; feature sentences that share the same core word are then placed in the feature sentence library of that core word; a synonym B of a core word A is also regarded as a core word, and the feature sentences containing synonym B are likewise placed in the feature sentence library of core word A;

4. The method for recognizing network suicide notes according to claim 1, characterized in that the specific steps of the feature recognition stage are as follows:

Step 1: Scan the text under test; if no core word appears, the text is judged not to be a suicide note; if a core word appears, proceed to Step 2;

Step 2: Suppose core words occur N times in the text under test T, and denote the j-th occurrence by W_j, j = 1, 2, 3, ..., N, where N is a natural number;

Extract from T the clause S_j that contains W_j, and compute the sentence similarity A(S_j, C(W_j, i)) between the sentence under test S_j and each feature sentence C(W_j, i) of W_j, where i = 1, 2, ..., L(W_j) and L(W_j) is the number of feature sentences corresponding to W_j in the "core word - feature sentence library" lookup table;

The suicidal tendency value of the sentence under test S_j is

M(S_{j}) = \max_{i = 1, 2, \ldots, L(W_{j})} A(S_{j}, C(W_{j}, i));

The suicidal tendency value of the sample under test T is

M(T) = \frac{1}{N} \sum_{j = 1}^{N} M(S_{j});

M(T) is then compared with a set threshold to decide whether the text is a suicide note: if M(T) is greater than or equal to the threshold, the text under test is judged to be a suicide note; if M(T) is less than the threshold, it is judged not to be a suicide note;

When computing the similarity A(S_1, S_2) of two sentences S_1 and S_2, the "character matching degree" and the "character-string matching degree" are computed separately and then combined by linear weighting to obtain the sentence similarity; the character matching degree, the character-string matching degree and the sentence similarity are computed as follows

Character matching degree

Sentence similarity

Here β = 0.5, α = 0.7, and the threshold is 0.425.