CN105373528A - Method and device for analyzing sensitivity of text contents - Google Patents

Method and device for analyzing sensitivity of text contents

Info

Publication number
CN105373528A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510509318.7A
Other languages
Chinese (zh)
Other versions
CN105373528B (en
Inventor
秦玉芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XINHUA NETWORK CO Ltd
Original Assignee
XINHUA NETWORK CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XINHUA NETWORK CO Ltd filed Critical XINHUA NETWORK CO Ltd
Priority to CN201510509318.7A priority Critical patent/CN105373528B/en
Publication of CN105373528A publication Critical patent/CN105373528A/en
Application granted granted Critical
Publication of CN105373528B publication Critical patent/CN105373528B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for analyzing the sensitivity of text content. The method comprises the following steps: labeling each sensitive word with a sensitivity; obtaining the text content currently to be audited; performing word segmentation on the text content to obtain a word group; searching the obtained word group for sensitive words; when a sensitive word is found, marking it and recording the position of its first character within the word group length; dividing the word group length into N sub-intervals according to the highest sensitivity level X permitted for the text content, where $N = 2^X$; calculating the sensitivity $p_i$ of each sub-interval by a formula as shown in the specification; and calculating the sensitivity E of the text content by the formula as shown in the specification. With the method and the device, a manuscript dispatching system can analyze the sensitivity of text content automatically, so editors no longer need to analyze the sensitivity of all text content to be released. This greatly reduces the editors' workload and improves the efficiency of manuscript release.

Description

A method and device for analyzing the sensitivity of text content
Technical field
The present invention relates to the technical field of text information processing and, more particularly, to a method and device for analyzing the sensitivity of text content.
Background
Obtaining Internet news through news portal websites has become the main way people get information in daily life. News portal websites publish each news item mainly as original content or as a reprint.
To ensure the quality of the news articles that a news portal website publishes, an editor must review the sensitivity of each article before it is released. If the sensitivity of the reviewed article is low, it can be published directly; if the sensitivity is high, it is published only after the editor has revised it.
Today, with information developing so rapidly, deciding whether an article can be published by manually reviewing its sensitivity consumes a great deal of human resources and is inefficient.
Summary of the invention
In view of this, the present invention provides a method and device for analyzing the sensitivity of text content, to solve the prior-art problem that manually reviewing the sensitivity of articles to be released consumes a great deal of human resources and is inefficient. The technical solution is as follows:
According to one aspect of the present invention, a method for analyzing the sensitivity of text content is provided, in which each sensitive word is labeled with a sensitivity in advance. The method comprises:
Obtaining the text content currently to be audited;
Performing word segmentation on the text content to obtain a word group, the word group comprising at least one word;
Searching the obtained word group for sensitive words;
When a sensitive word is found, marking the found sensitive word and recording the position of its first character within the word group length; the word group length is the number of all words in the word group;
Dividing the word group length into N sub-intervals according to the highest sensitivity level X allowed for the text content, where $N = 2^X$ and N and X are positive integers;
Calculating the sensitivity $p_i$ of each sub-interval by formula, where i is a positive integer less than or equal to N that denotes the i-th sub-interval; $e_{smooth}$ is the smoothing factor of the entropy, is greater than 0, and prevents $p_i$ from being 0 when a sub-interval contains no sensitive word; M is the number of sensitive words in the sub-interval; and $W_{mlevel}$ is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval;
Calculating the sensitivity E of the text content using the formulas $P_i = \frac{p_i}{\sum_{i=1}^{N} p_i}$ and $E = -\sum_{i=1}^{N} (P_i \times \log_2 P_i)$.
Preferably, after performing word segmentation on the text content to obtain a word group, the method further comprises:
Removing the stop words from the word group obtained by word segmentation.
Preferably, searching the obtained word group for sensitive words comprises:
Comparing the words in the word group one by one with the words in a sensitive word dictionary, the sensitive word dictionary being used to store sensitive words.
Preferably, the highest sensitivity level X allowed for the text content equals 5.
According to another aspect of the present invention, a device for analyzing the sensitivity of text content is also provided, comprising:
A sensitivity labeling unit, configured to label each sensitive word with a sensitivity;
An acquiring unit, configured to obtain the text content currently to be audited;
A word segmentation unit, configured to perform word segmentation on the text content to obtain a word group, the word group comprising at least one word;
A search unit, configured to search the obtained word group for sensitive words;
A marking and recording unit, configured to, when the search unit finds a sensitive word, mark the found sensitive word and record the position of its first character within the word group length; the word group length is the number of all words in the word group;
A sub-interval division unit, configured to divide the word group length into N sub-intervals according to the highest sensitivity level X allowed for the text content, where $N = 2^X$ and N and X are positive integers;
A first computing unit, configured to calculate the sensitivity $p_i$ of each sub-interval by formula, where i is a positive integer less than or equal to N that denotes the i-th sub-interval; $e_{smooth}$ is the smoothing factor of the entropy, is greater than 0, and prevents $p_i$ from being 0 when a sub-interval contains no sensitive word; M is the number of sensitive words in the sub-interval; and $W_{mlevel}$ is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval;
A second computing unit, configured to calculate the sensitivity E of the text content using the formulas $P_i = \frac{p_i}{\sum_{i=1}^{N} p_i}$ and $E = -\sum_{i=1}^{N} (P_i \times \log_2 P_i)$.
Preferably, the device further comprises:
A stop word processing unit, configured to remove the stop words from the word group obtained by word segmentation.
Preferably, the search unit is specifically configured to compare the words in the word group one by one with the words in a sensitive word dictionary, the sensitive word dictionary being used to store sensitive words.
Preferably, the highest sensitivity level X allowed for the text content equals 5.
With the above technical solution, in the method for analyzing the sensitivity of text content provided by the present invention, each sensitive word is labeled with a sensitivity in advance, and the method specifically comprises: obtaining the text content currently to be audited; performing word segmentation on the text content to obtain a word group, the word group comprising at least one word; searching the obtained word group for sensitive words; when a sensitive word is found, marking the found sensitive word and recording the position of its first character within the word group length; the word group length is the number of all Chinese characters in the word group;
Dividing the word group length into N sub-intervals according to the highest sensitivity level X allowed for the text content, where $N = 2^X$ and N and X are positive integers;
Calculating the sensitivity $p_i$ of each sub-interval by formula, where i is a positive integer less than or equal to N that denotes the i-th sub-interval; $e_{smooth}$ is the smoothing factor of the entropy, is greater than 0, and prevents $p_i$ from being 0 when a sub-interval contains no sensitive word; M is the number of sensitive words in the sub-interval; and $W_{mlevel}$ is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval;
Finally, calculating the sensitivity E of the text content using the formulas $P_i = \frac{p_i}{\sum_{i=1}^{N} p_i}$ and $E = -\sum_{i=1}^{N} (P_i \times \log_2 P_i)$.
Therefore, the present invention enables the manuscript dispatching system to analyze the sensitivity of text content automatically. When the system finds that the sensitivity of text content to be released is low, it publishes the text content directly; when it finds that the sensitivity is high, it forwards the content to an editor or flags it, and the editor performs further review and editing. Editors therefore no longer need to analyze the sensitivity of all text content to be released, which greatly reduces their workload and saves considerable human resources, and the automated processing of the manuscript dispatching system greatly improves the efficiency of manuscript release.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are merely embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for analyzing the sensitivity of text content provided by the present invention;
Fig. 2 is another flowchart of the method for analyzing the sensitivity of text content provided by the present invention;
Fig. 3 is a structural diagram of a device for analyzing the sensitivity of text content provided by the present invention;
Fig. 4 is another structural diagram of the device for analyzing the sensitivity of text content provided by the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, which shows a flowchart of a method for analyzing the sensitivity of text content provided by the present invention, the method comprises:
Step 101: label each sensitive word with a sensitivity in advance.
In the present invention, sensitive words are words of an unhealthy or uncivil nature, and also include special sensitive words that some websites define for their own circumstances and that apply only to those sites. To determine which words are sensitive, a sensitive word dictionary that records the sensitive words is generally set up, and a word is judged to be sensitive by checking whether it appears in the dictionary.
Accordingly, the present invention can label in advance the sensitivity of every sensitive word recorded in the sensitive word dictionary. For example, sensitive word A is labeled in advance with a sensitivity of 0.1, sensitive word B with 0.2, sensitive word C with 0.3, and so on; the invention labels sensitive words of different natures and different degrees of sensitivity with their respective sensitivities.
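The labeling in step 101 can be pictured as a simple mapping from each sensitive word to its sensitivity. The following is a minimal sketch, assuming a plain in-memory Python dictionary; the names `SENSITIVE_WORDS` and `sensitivity_of` are hypothetical and the values are the illustrative ones from the example above.

```python
# Hypothetical sensitive word dictionary: word -> pre-labeled sensitivity.
# The entries A, B, C and their values mirror the example in the text.
SENSITIVE_WORDS = {"A": 0.1, "B": 0.2, "C": 0.3}


def sensitivity_of(word, dictionary=SENSITIVE_WORDS):
    """Return the labeled sensitivity of a word, or None if it is not sensitive."""
    return dictionary.get(word)


print(sensitivity_of("B"))  # labeled sensitive word
print(sensitivity_of("Z"))  # not in the dictionary
```

Keeping the labels in one mapping means the later lookup step and the sensitivity calculation can share the same structure.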
Step 102: obtain the text content currently to be audited.
For example, when a news article is about to be released, that article is determined to be the text currently to be audited, and the invention first obtains the text content of the article.
Here, text content specifically means the textual content of the news article.
Step 103: perform word segmentation on the text content to obtain a word group, the word group comprising at least one word.
In this embodiment, the invention performs word segmentation on the text content and obtains a word group comprising multiple words. Preferably, after performing word segmentation on the text content to obtain a word group, the method may further comprise, as shown in Fig. 2, step 1031: removing the stop words from the word group obtained by word segmentation. The word group finally obtained is then the word group with the stop words removed.
Here, stop words are words that carry little meaning and certain function words, such as punctuation marks and common function words, for example "是" ("is").
As a concrete example, take the text: "On a midsummer evening the blue waves ripple by the small lake and a gentle breeze blows softly. Zhang Xiaoming and his wife Sun Xiaohong meet Zhao Xiaosi and his wife Li Xiaowu by the lake. The two couples chat warmly, greet each other, then stroll up the steps and go visiting as guests." After word segmentation and stop-word removal, the invention obtains the word group: "midsummer evening small lakeside blue waves rippling gentle breeze softly Zhang Xiaoming and his wife Sun Xiaohong at lakeside meet Zhao Xiaosi and his wife Li Xiaowu two couples of friends chat warmly greet each other then stroll up steps come visit as guests play". Evidently, the invention has removed the stop words, including the punctuation marks "," and ".".
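Step 1031 amounts to filtering the segmented tokens against a stop-word set. The sketch below assumes the text has already been segmented by some tokenizer (not shown); the stop-word set, the token list and the function name `remove_stop_words` are all hypothetical and are not taken from the source.

```python
# Hypothetical stop-word set: punctuation plus a few common function words.
STOP_WORDS = {"，", "。", "的", "是"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the word group with every stop word dropped, order preserved."""
    return [t for t in tokens if t not in stop_words]


# Illustrative, already segmented tokens (an assumption, not the source's data).
tokens = ["盛夏", "傍晚", "的", "小湖边", "碧波荡漾", "，", "清风徐徐", "。"]
word_group = remove_stop_words(tokens)
print(word_group)
```

Filtering before the search step keeps stop words from occupying positions in the word group length.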
Step 104: search the obtained word group for sensitive words.
Continuing the example above, the invention searches the obtained word group for sensitive words in sequence.
Specifically, the invention compares the words in the word group one by one with the words in the sensitive word dictionary. When a word in the word group matches a word in the sensitive word dictionary, that word is determined to be a sensitive word. The sensitive word dictionary is used to store sensitive words.
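The one-by-one comparison in step 104 can be sketched as an exact-match membership test. This is a minimal illustration under the assumption that the dictionary is a Python mapping keyed by word; `find_sensitive_words` and the sample data are hypothetical.

```python
def find_sensitive_words(word_group, dictionary):
    """Compare each word with the dictionary entries and keep exact matches."""
    return [word for word in word_group if word in dictionary]


# Hypothetical dictionary and word group for illustration only.
dictionary = {"张小明": 0.1, "孙小红": 0.2}
word_group = ["清风徐徐", "张小明", "和", "孙小红"]
print(find_sensitive_words(word_group, dictionary))
```

A hash-based lookup replaces a literal pairwise scan of the dictionary but yields the same matches.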
Step 105: when a sensitive word is found, mark the found sensitive word and record the position of its first character within the word group length; the word group length is the number of all words in the word group.
In the present invention, when a sensitive word is found, the found sensitive word is marked and, at the same time, the position of its first character within the word group length is recorded. The word group length is the number of all words in the word group.
Continuing the example above, take the word group "midsummer evening small lakeside blue waves rippling gentle breeze softly Zhang Xiaoming and his wife Sun Xiaohong at lakeside meet Zhao Xiaosi and his wife Li Xiaowu two couples of friends chat warmly greet each other then stroll up steps come visit as guests play", and suppose that "Zhang Xiaoming", "Sun Xiaohong", "Zhao Xiaosi" and "his wife Li Xiaowu" are sensitive words. When "Zhang Xiaoming" is found, it is marked and the position of the character "Zhang" within the word group length is recorded. This word group contains 68 characters in total (counted in the original Chinese), so the word group length is 68, and the position of "Zhang" within the word group length is 15. Likewise, when "Sun Xiaohong" is found, it is marked and the position 22 of the character "Sun" is recorded; when "Zhao Xiaosi" is found, it is marked and the position 30 of the character "Zhao" is recorded; and when "his wife Li Xiaowu" is found, it is marked and the position 34 of its first character "his" is recorded.
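Recording first-character positions in step 105 can be sketched by walking the word group while keeping a running character offset. The sketch assumes 0-based character positions (consistent with the sub-interval ranges that start at 0 later in the text) and uses a short hypothetical token list rather than the source's full example; `locate_sensitive_words` is an illustrative name.

```python
def locate_sensitive_words(word_group, dictionary):
    """Return (hits, length): hits are (word, first_char_position, sensitivity)
    tuples, and length is the word group length counted in characters."""
    hits = []
    offset = 0  # 0-based position of the current word's first character
    for word in word_group:
        if word in dictionary:
            hits.append((word, offset, dictionary[word]))
        offset += len(word)
    return hits, offset


# Hypothetical data for illustration only.
dictionary = {"张小明": 0.1, "孙小红": 0.2}
word_group = ["盛夏", "傍晚", "小湖边", "碧波荡漾", "清风徐徐",
              "张小明", "和", "其", "夫人", "孙小红"]
hits, length = locate_sensitive_words(word_group, dictionary)
print(hits, length)
```

Only the first character's position is stored, because that alone decides which sub-interval a sensitive word is counted in.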
Step 106: divide the word group length into N sub-intervals according to the highest sensitivity level X allowed for the text content, where $N = 2^X$ and N and X are positive integers.
In practice, each manuscript to be released has a corresponding highest sensitivity level X, where X is a number greater than 0. In theory X can be set arbitrarily, but in general the highest sensitivity level X allowed for the text content is set to 5. The invention is therefore described with X equal to 5.
When X equals 5, the invention divides the obtained word group length into N sub-intervals, $N = 2^X$, that is, $N = 2^5 = 32$.
In this embodiment, each sub-interval generally contains at least one word. Of course, if the word group is too short, that is, the word group contains few words while the number N of sub-intervals is large, some sub-intervals may contain no words. For a sub-interval with no words, the invention sets its sensitivity $p_i$ to a preset value.
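The division in step 106 can be sketched with integer arithmetic, which avoids floating-point boundary issues. The half-open convention (a position exactly on a boundary belongs to the later sub-interval) and the 0-based indices are assumptions of this sketch, as are the function names.

```python
def num_intervals(x):
    """Number of sub-intervals N = 2**X for a highest sensitivity level X."""
    if x < 1:
        raise ValueError("X must be a positive integer")
    return 2 ** x


def interval_index(position, length, n):
    """0-based index of the sub-interval holding a 0-based character position.

    Each sub-interval spans length / n characters; this sketch treats the
    sub-intervals as half-open, which is an assumption.
    """
    if not 0 <= position < length:
        raise ValueError("position must lie inside the word group")
    return position * n // length


n = num_intervals(5)
print(n, interval_index(15, 68, n))
```

With a word group of length 68 and 16 sub-intervals, `interval_index(15, 68, 16)` evaluates 240 // 68, so position 15 falls in the sub-interval with 0-based index 3.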
Step 107: calculate the sensitivity $p_i$ of each sub-interval by formula.
Here, i is a positive integer less than or equal to N that denotes the i-th sub-interval; $e_{smooth}$ is the smoothing factor of the entropy, is greater than 0, and prevents $p_i$ from being 0 when a sub-interval contains no sensitive word; M is the number of sensitive words in the sub-interval; and $W_{mlevel}$ is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval.
After the division into N sub-intervals, the invention calculates the sensitivity $p_i$ of each sub-interval.
Note that when dividing the word group, the invention divides all the words in the word group evenly. The content of a resulting sub-interval is therefore not necessarily made up of complete words; for example, the 4th sub-interval may contain the end of "rippling" followed by "gentle breeze", the 5th sub-interval may contain "softly Zhang", and the 6th sub-interval may contain "Xiaoming and".
When calculating the sensitivity $p_i$ of each sub-interval, the invention sums the sensitivities of the sensitive words whose first characters fall in that sub-interval. For example, the 5th sub-interval contains "softly Zhang", and the first character "Zhang" of the sensitive word "Zhang Xiaoming" lies in it, so the sensitivity $p_i$ of this sub-interval equals the sensitivity of "Zhang Xiaoming". For the 6th sub-interval, "Xiaoming and", the first character "Zhang" of the sensitive word lies in the 5th sub-interval, so although "Xiaoming" appears in the 6th sub-interval, the sensitivity of "Zhang Xiaoming" is not counted again when calculating $p_i$ for the 6th sub-interval.
Of course, the boundaries between sub-intervals do not necessarily fall on integers. For example, if the word group of length 68 above is divided into 16 sub-intervals, each sub-interval covers 4.25 characters. The first sub-interval then covers positions 0 to 4.25, the second 4.26 to 8.50, the third 8.51 to 12.75, and so on. In this case too, the invention only needs to determine which sub-interval the first character of a sensitive word falls in, and thus which sub-interval's sensitivity $p_i$ the word contributes to.
For a sub-interval that contains no sensitive word, the sensitivity $p_i$ equals $e_{smooth}$.
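The per-sub-interval calculation can be sketched as follows. The formula itself is not reproduced in this text, so the sketch assumes the additive form $p_i = e_{smooth} + \sum_{m=1}^{M} W_{mlevel}$, which agrees with the stated definitions (a sub-interval with no sensitive word gets $e_{smooth}$, and sensitivities are summed per sub-interval) but remains an assumption. The default value of `e_smooth`, the half-open sub-interval convention and the function name are likewise hypothetical.

```python
def interval_sensitivities(hits, length, n, e_smooth=0.01):
    """Return [p_1, ..., p_N] under the ASSUMED additive form
    p_i = e_smooth + sum of the sensitivities of the sensitive words
    whose first character falls in sub-interval i.

    hits: iterable of (word, first_char_position, sensitivity), 0-based positions.
    """
    if e_smooth <= 0:
        raise ValueError("e_smooth must be greater than 0")
    p = [e_smooth] * n  # sub-intervals without sensitive words keep e_smooth
    for _word, position, weight in hits:
        if length > 0:
            p[position * n // length] += weight
    return p


# Hypothetical hits: two sensitive words at positions 15 and 22 of 68 characters.
p = interval_sensitivities([("w1", 15, 0.1), ("w2", 22, 0.2)], length=68, n=16)
print(p)
```

Because each word is binned by its first character only, a word that straddles a boundary is counted once, in the sub-interval where it starts.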
Step 108: calculate the sensitivity E of the text content using the formulas $P_i = \frac{p_i}{\sum_{i=1}^{N} p_i}$ and $E = -\sum_{i=1}^{N} (P_i \times \log_2 P_i)$.
After calculating the sensitivity $p_i$ of each sub-interval, the invention first calculates, for each sub-interval in turn, $P_i = \frac{p_i}{\sum_{i=1}^{N} p_i}$, and then calculates the sensitivity E of the text content using the formula $E = -\sum_{i=1}^{N} (P_i \times \log_2 P_i)$.
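The two formulas of step 108 normalize the sub-interval sensitivities into a distribution and take its base-2 entropy. A minimal sketch, assuming every $p_i$ is strictly positive (which the smoothing factor guarantees); the function name `text_sensitivity` is hypothetical.

```python
import math


def text_sensitivity(p):
    """Return E = -sum(P_i * log2(P_i)) with P_i = p_i / sum(p)."""
    if not p or any(v <= 0 for v in p):
        raise ValueError("every p_i must be greater than 0")
    total = sum(p)
    probabilities = [v / total for v in p]
    return -sum(q * math.log2(q) for q in probabilities)


# All 32 sub-intervals equal, e.g. a text with no sensitive words.
print(text_sensitivity([0.01] * 32))
```

Under these formulas, a text whose sub-intervals all have the same $p_i$ (for instance one with no sensitive words, where every $p_i = e_{smooth}$) yields the maximum value $E = \log_2 N = X$, and any uneven distribution yields a smaller E.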
Thus, with the above technical solution, in the method for analyzing the sensitivity of text content provided by the present invention, each sensitive word is labeled with a sensitivity in advance, and the method then comprises: obtaining the text content currently to be audited; performing word segmentation on the text content to obtain a word group, the word group comprising at least one word; searching the obtained word group for sensitive words; when a sensitive word is found, marking the found sensitive word and recording the position of its first character within the word group length; the word group length is the number of all Chinese characters in the word group;
Dividing the word group length into N sub-intervals according to the highest sensitivity level X allowed for the text content, where $N = 2^X$ and N and X are positive integers;
Calculating the sensitivity $p_i$ of each sub-interval by formula, where i is a positive integer less than or equal to N that denotes the i-th sub-interval; $e_{smooth}$ is the smoothing factor of the entropy, is greater than 0, and prevents $p_i$ from being 0 when a sub-interval contains no sensitive word; M is the number of sensitive words in the sub-interval; and $W_{mlevel}$ is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval;
Finally, calculating the sensitivity E of the text content using the formulas $P_i = \frac{p_i}{\sum_{i=1}^{N} p_i}$ and $E = -\sum_{i=1}^{N} (P_i \times \log_2 P_i)$.
Therefore, the present invention enables the manuscript dispatching system to analyze the sensitivity of text content automatically. When the system finds that the sensitivity of text content to be released is low, it publishes the text content directly; when it finds that the sensitivity is high, it forwards the content to an editor or flags it, and the editor performs further review and editing. Editors therefore no longer need to analyze the sensitivity of all text content to be released, which greatly reduces their workload and saves considerable human resources, and the automated processing of the manuscript dispatching system greatly improves the efficiency of manuscript release.
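The method as a whole can be sketched end to end. This is an illustration under several stated assumptions, not the patented implementation: the input is already segmented into tokens, positions are 0-based character offsets, sub-intervals are half-open, and $p_i$ takes the additive form $e_{smooth}$ plus the summed sensitivities. The function name `analyze` and all sample data are hypothetical.

```python
import math


def analyze(tokens, dictionary, stop_words, x=5, e_smooth=0.01):
    """Return the sensitivity E of an already segmented text."""
    # Step 1031: drop stop words from the segmented word group.
    word_group = [t for t in tokens if t not in stop_words]
    # Word group length, counted in characters.
    length = sum(len(w) for w in word_group)
    # Step 106: N = 2**X sub-intervals.
    n = 2 ** x
    # Step 107 (assumed additive form): start every p_i at e_smooth.
    p = [e_smooth] * n
    offset = 0
    for word in word_group:
        # Steps 104-105: dictionary lookup and first-character position.
        if word in dictionary:
            p[offset * n // length] += dictionary[word]
        offset += len(word)
    # Step 108: normalize and take the base-2 entropy.
    total = sum(p)
    return -sum((v / total) * math.log2(v / total) for v in p)


# Hypothetical inputs for illustration only.
dictionary = {"张小明": 0.1}
tokens = ["清风徐徐", "。", "张小明", "和", "夫人"]
print(analyze(tokens, dictionary, stop_words={"。"}))
```

How the resulting E is compared against a release threshold is left to the dispatching system and is not part of this sketch.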
Based on the method for analyzing the sensitivity of text content described above, the present invention also provides a device for analyzing the sensitivity of text content, comprising: a sensitivity labeling unit 100, an acquiring unit 200, a word segmentation unit 300, a search unit 400, a marking and recording unit 500, a sub-interval division unit 600, a first computing unit 700 and a second computing unit 800. Specifically:
The sensitivity labeling unit 100 is configured to label each sensitive word with a sensitivity.
In the present invention, sensitive words are words of an unhealthy or uncivil nature, and also include special sensitive words that some websites define for their own circumstances and that apply only to those sites. To determine which words are sensitive, a sensitive word dictionary that records the sensitive words is generally set up, and a word is judged to be sensitive by checking whether it appears in the dictionary.
Accordingly, the sensitivity labeling unit 100 can label in advance the sensitivity of every sensitive word recorded in the sensitive word dictionary. For example, sensitive word A is labeled in advance with a sensitivity of 0.1, sensitive word B with 0.2, sensitive word C with 0.3, and so on; the invention labels sensitive words of different natures and different degrees of sensitivity with their respective sensitivities.
The acquiring unit 200 is configured to obtain the text content currently to be audited.
For example, when a news article is about to be released, that article is determined to be the text currently to be audited, and the invention first uses the acquiring unit 200 to obtain the text content of the article.
Here, text content specifically means the textual content of the news article.
The word segmentation unit 300 is configured to perform word segmentation on the text content to obtain a word group, the word group comprising at least one word.
In this embodiment, the word segmentation unit 300 performs word segmentation on the text content and obtains a word group comprising multiple words. Preferably, the device for analyzing the sensitivity of text content may further comprise, after the word segmentation unit 300 and as shown in Fig. 4:
A stop word processing unit 900, configured to remove the stop words from the word group obtained by word segmentation.
The word group finally obtained is then the word group with the stop words removed.
Here, stop words are words that carry little meaning and certain function words, such as punctuation marks and common function words, for example "是" ("is").
As a concrete example, take the text: "On a midsummer evening the blue waves ripple by the small lake and a gentle breeze blows softly. Zhang Xiaoming and his wife Sun Xiaohong meet Zhao Xiaosi and his wife Li Xiaowu by the lake. The two couples chat warmly, greet each other, then stroll up the steps and go visiting as guests." The word segmentation unit 300 performs word segmentation on this text content to obtain a word group, and after the stop word processing unit 900 removes the stop words, the resulting word group is: "midsummer evening small lakeside blue waves rippling gentle breeze softly Zhang Xiaoming and his wife Sun Xiaohong at lakeside meet Zhao Xiaosi and his wife Li Xiaowu two couples of friends chat warmly greet each other then stroll up steps come visit as guests play". Evidently, the invention has removed the stop words, including the punctuation marks "," and ".".
Search unit 400, for searching sensitive word from the described words group obtained.
Particularly, search unit 400 specifically for, the words in described words group is compared with the words in sensitive word dictionary one by one; Described sensitive word dictionary is for storing sensitive word.
Label record unit 500, for when described in search unit 400 find sensitive word time, the described sensitive word found is marked, records the position of lead-in in words group length of described sensitive word; Described words group length is the number of all words in described words group.
Still for above-mentioned, for words group " the overflowing fresh breeze of lakelet at dusk in midsummer limit ripples blow slowly Zhang little Ming and grandson madam little red lakeside meet Zhao little Si and its madam Lee little 5 two concerning friend Mr. and Mrs cordiality exchange that greeting each other strolls subsequently and pick up rank come little dummy act as a guest play ", suppose its " Zhang little Ming " of comprising, " Sun little Hong ", " Zhao little Si ", " its madam Lee is little by five " is sensitive word, so when searching unit 400 and finding " Zhang little Ming ", " Zhang little Ming " marks by label record unit 500, and record the position of " opening " word in this words group length, and obviously, this words group comprises 68 words altogether, namely this words group length is 68, now " opening " position of word in this words group length is 15.In like manner, when searching unit 400 and finding " Sun little Hong ", " Sun little Hong " marks by label record unit 500, and record the position 22 of " grandson " word in this words group length, search unit 400 when finding " Zhao little Si ", " Zhao little Si " marks by label record unit 500, and record the position 30 of " Zhao " word in this words group length, and search unit 400 when finding " its madam Lee is little by five ", " its madam Lee is little by five " marks by label record unit 500, and record the position 34 of " its " word in this words group length.
By stages division unit 600, for the most high sensitive grade X allowed according to described content of text, is divided into N number of by stages, N=2 by described words group length x; N, X are positive integer.
In actual application, every part for the contribution issued have corresponding most high sensitive grade X, the X arranged be greater than 0 number.Under theoretical case, this X value can be arranged arbitrarily, but in the ordinary course of things, the most high sensitive grade X generally arranging that described content of text allows equals 5.Therefore, the present invention equals 5 for most high sensitive grade X and is described.
When X equals 5, the present invention divides the obtained word-group length into N sub-intervals, with $N = 2^X = 2^5 = 32$.
In this embodiment, each sub-interval generally contains at least one character. If the word group is too short, that is, it contains few characters while the number N of sub-intervals is large, some sub-intervals may contain no characters. For a sub-interval with no characters, the present invention sets its sensitivity $p_i$ to a preset value.
The first computing unit 700 is configured to calculate the sensitivity $p_i$ of each sub-interval using a formula in which i is a positive integer less than or equal to N and denotes the i-th sub-interval; $e_{smooth}$ is the smoothing factor of the entropy, is greater than 0, and prevents $p_i$ from being 0 when a sub-interval contains no sensitive word; M is the number of sensitive words in the sub-interval; and $W_{mlevel}$ is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval.
After the sub-interval division unit 600 has divided the word-group length into N sub-intervals, the first computing unit 700 calculates the sensitivity $p_i$ of each sub-interval.
Note that the sub-interval division unit 600 divides the word group evenly over all of its characters. The content of a resulting sub-interval is therefore not necessarily a complete word. For example, the 4th sub-interval may contain a fragment such as "Yan gentle breeze", the 5th sub-interval may contain "slowly", and the 6th sub-interval may contain "Xiaoming and".
When calculating the sensitivity $p_i$ of a sub-interval, the first computing unit 700 sums the sensitivities of the sensitive words whose first character lies in that sub-interval. For example, if the 5th sub-interval contains "slowly Zhang", the first character "Zhang" of the sensitive word "Zhang Xiaoming" lies in that sub-interval, so the sensitivity $p_i$ of that sub-interval equals the sensitivity of "Zhang Xiaoming". For the content "Xiaoming and" of the 6th sub-interval, the first character "Zhang" lies in the 5th sub-interval, so although "Xiaoming" appears in the 6th sub-interval, the sensitivity of "Zhang Xiaoming" is not counted again when the sensitivity $p_i$ of the 6th sub-interval is calculated.
The cut points between sub-intervals need not fall on integers. For example, dividing the above word group of length 68 into 16 sub-intervals gives 4.25 characters per sub-interval. The first sub-interval then covers positions 0 to 4.25, the second covers positions 4.26 to 8.50, the third covers positions 8.51 to 12.75, and so on. In this case too, the present invention only needs to determine which sub-interval the first character of a sensitive word falls in, and counts that word toward the sensitivity $p_i$ of that sub-interval.
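Mapping a recorded first-character position to its sub-interval can be sketched as below. This is an illustration under stated assumptions: the function name `sub_interval_index` is hypothetical, sub-interval i is taken to cover positions greater than (i - 1) * w and up to i * w with w equal to the word-group length divided by N, and a position exactly on a cut point is assigned to the lower sub-interval, in line with the ranges listed above.

```python
def sub_interval_index(position, group_length, n_intervals):
    """Return the 1-based index of the sub-interval containing `position`.

    Sub-interval i covers positions in ((i - 1) * w, i * w], where
    w = group_length / n_intervals. Integer arithmetic avoids rounding error.
    """
    if not 1 <= position <= group_length:
        raise ValueError("position must lie between 1 and group_length")
    # Ceiling of position * n_intervals / group_length, computed exactly.
    return (position * n_intervals + group_length - 1) // group_length


X = 4        # hypothetical level that yields the 16 sub-intervals of the example
N = 2 ** X   # number of sub-intervals
indices = [sub_interval_index(pos, 68, N) for pos in (4, 5, 17, 18, 68)]
print(indices)
```

Positions 4 and 5 straddle the first cut point at 4.25, and position 17 sits exactly on the cut point 4 × 4.25, so it stays in the 4th sub-interval while position 18 moves to the 5th.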
For a sub-interval that contains no sensitive word, the sensitivity $p_i$ equals $e_{smooth}$.
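The per-sub-interval calculation can be sketched as follows. The formula for $p_i$ is not reproduced in this text, so the additive form used here, $p_i = e_{smooth} + \sum_{m=1}^{M} W_{mlevel}$, is an assumption chosen to agree with the surrounding description (a sum over the sensitive words whose first character lies in the sub-interval, and $e_{smooth}$ when there are none). The function name, the value of `e_smooth` and the sensitivity levels are hypothetical.

```python
def sub_interval_sensitivities(hits, group_length, n_intervals, e_smooth=0.01):
    """Return the list [p_1, ..., p_N].

    `hits` holds (first_char_position, sensitivity_level) pairs.
    ASSUMPTION: p_i = e_smooth + sum of the levels of the sensitive words
    whose first character falls in sub-interval i.
    """
    p = [e_smooth] * n_intervals  # empty sub-intervals keep the value e_smooth
    for position, level in hits:
        # 1-based sub-interval index: exact ceiling of position * N / length.
        index = (position * n_intervals + group_length - 1) // group_length
        p[index - 1] += level
    return p


# First-character positions from the example, with made-up sensitivity levels.
example_hits = [(15, 3), (22, 2), (30, 4), (34, 1)]
p = sub_interval_sensitivities(example_hits, group_length=68, n_intervals=16)
print(p)
```

Each sensitive word contributes only to the sub-interval that holds its first character, so a word that straddles a cut point is counted once, and two words whose first characters share a sub-interval are summed.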
The second computing unit 800 is configured to calculate the sensitivity E of the text content using the formulas

$$P_i = \frac{p_i}{\sum_{i=1}^{N} p_i}, \qquad E = -\sum_{i=1}^{N} \left( P_i \times \log_2 P_i \right).$$
After the first computing unit 700 has calculated the sensitivity $p_i$ of each sub-interval, the second computing unit 800 calculates $P_i$ for each sub-interval in turn using $P_i = p_i / \sum_{i=1}^{N} p_i$, and then calculates the sensitivity E of the text content using $E = -\sum_{i=1}^{N} \left( P_i \times \log_2 P_i \right)$.
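The two formulas used by the second computing unit can be sketched directly. The function name `text_sensitivity` and the sample inputs are hypothetical; the code only normalizes the $p_i$ values and evaluates the entropy expression as written above.

```python
import math


def text_sensitivity(p):
    """Return E = -sum(P_i * log2(P_i)), where P_i = p_i / sum(p)."""
    if not p or any(value <= 0 for value in p):
        # e_smooth > 0 keeps every p_i positive, so log2 is always defined.
        raise ValueError("every p_i must be greater than 0")
    total = sum(p)
    probabilities = [value / total for value in p]
    return -sum(q * math.log2(q) for q in probabilities)


# 32 sub-intervals (X = 5), none containing a sensitive word.
e_uniform = text_sensitivity([0.01] * 32)
# 16 sub-intervals with all sensitivity concentrated in one of them.
e_concentrated = text_sensitivity([0.01] * 15 + [3.01])
print(e_uniform, e_concentrated)
```

When all $p_i$ are equal, for instance when no sensitive word is found and every sub-interval holds $e_{smooth}$, the result is $\log_2 N$, which equals X because $N = 2^X$. The more the sensitivity is concentrated in a few sub-intervals, the smaller E becomes.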
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises it.
The text content sensitivity analysis method and device provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and embodiments of the invention, and the description of the above embodiments is intended only to aid understanding of the method and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the invention.

Claims (8)

1. A text content sensitivity analysis method, characterized in that each sensitive word is labeled with a sensitivity in advance, the method comprising:
obtaining the text content currently to be audited;
performing word segmentation on the text content to obtain a word group, the word group comprising at least one word;
searching the obtained word group for sensitive words;
when a sensitive word is found, marking the found sensitive word and recording the position of the first character of the sensitive word in the word-group length, the word-group length being the number of all characters in the word group;
dividing the word-group length into N sub-intervals according to the highest sensitivity level X permitted by the text content, where $N = 2^X$ and N and X are positive integers;
calculating the sensitivity $p_i$ of each sub-interval using a formula in which i is a positive integer less than or equal to N and denotes the i-th sub-interval, $e_{smooth}$ is the smoothing factor of the entropy, is greater than 0, and prevents $p_i$ from being 0 when a sub-interval contains no sensitive word, M is the number of sensitive words in the sub-interval, and $W_{mlevel}$ is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval;
calculating the sensitivity E of the text content using the formulas $P_i = p_i / \sum_{i=1}^{N} p_i$ and $E = -\sum_{i=1}^{N} \left( P_i \times \log_2 P_i \right)$.
2. The method according to claim 1, characterized in that, after performing word segmentation on the text content to obtain a word group, the method further comprises:
removing stop words from the word group obtained by word segmentation.
3. The method according to claim 1, characterized in that searching the obtained word group for sensitive words comprises:
comparing the words in the word group one by one with the words in a sensitive word dictionary, the sensitive word dictionary being used to store sensitive words.
4. The method according to any one of claims 1 to 3, characterized in that the highest sensitivity level X permitted by the text content equals 5.
5. A text content sensitivity analysis device, characterized by comprising:
a sensitivity labeling unit, configured to label each sensitive word with a sensitivity;
an acquiring unit, configured to obtain the text content currently to be audited;
a word segmentation unit, configured to perform word segmentation on the text content to obtain a word group, the word group comprising at least one word;
a search unit, configured to search the obtained word group for sensitive words;
a marking and recording unit, configured to, when the search unit finds a sensitive word, mark the found sensitive word and record the position of the first character of the sensitive word in the word-group length, the word-group length being the number of all characters in the word group;
a sub-interval division unit, configured to divide the word-group length into N sub-intervals according to the highest sensitivity level X permitted by the text content, where $N = 2^X$ and N and X are positive integers;
a first computing unit, configured to calculate the sensitivity $p_i$ of each sub-interval using a formula in which i is a positive integer less than or equal to N and denotes the i-th sub-interval, $e_{smooth}$ is the smoothing factor of the entropy, is greater than 0, and prevents $p_i$ from being 0 when a sub-interval contains no sensitive word, M is the number of sensitive words in the sub-interval, and $W_{mlevel}$ is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval;
a second computing unit, configured to calculate the sensitivity E of the text content using the formulas $P_i = p_i / \sum_{i=1}^{N} p_i$ and $E = -\sum_{i=1}^{N} \left( P_i \times \log_2 P_i \right)$.
6. The device according to claim 5, characterized by further comprising:
a stop word processing unit, configured to remove stop words from the word group obtained by word segmentation.
7. The device according to claim 5, characterized in that the search unit is specifically configured to compare the words in the word group one by one with the words in a sensitive word dictionary, the sensitive word dictionary being used to store sensitive words.
8. The device according to any one of claims 5 to 7, characterized in that the highest sensitivity level X permitted by the text content equals 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510509318.7A CN105373528B (en) 2015-08-18 2015-08-18 Method and device for analyzing sensitivity of text contents

Publications (2)

Publication Number Publication Date
CN105373528A (en) 2016-03-02
CN105373528B (en) 2019-03-12

Family

ID=55375736

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant