CN105373528B - Text content sensitivity analysis method and device - Google Patents

Text content sensitivity analysis method and device

Info

Publication number
CN105373528B
Authority
CN
China
Prior art keywords
stages
sensitive word
words
text
susceptibility
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510509318.7A
Other languages
Chinese (zh)
Other versions
CN105373528A (en)
Inventor
秦玉芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XINHUA NETWORK CO Ltd
Original Assignee
XINHUA NETWORK CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XINHUA NETWORK CO Ltd
Priority to CN201510509318.7A
Publication of CN105373528A
Application granted
Publication of CN105373528B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention provides a text content sensitivity analysis method and device. The method includes: annotating each sensitive word with a sensitivity in advance; obtaining the text content currently to be processed; performing word segmentation on the text content to obtain a word group; searching the obtained word group for sensitive words; when a sensitive word is found, marking it and recording the position of its first character within the word group length; dividing the word group length into N subintervals according to the highest sensitivity level X allowed for the text content, where N = 2^X; calculating the sensitivity p_i of each subinterval using the formula [formula not reproduced]; and calculating the sensitivity E of the text content using the formula [formula not reproduced]. The invention enables a manuscript publishing system to analyze the sensitivity of text content automatically, so editors no longer need to perform sensitivity analysis on all text content awaiting release, which greatly reduces their workload and improves the efficiency of publishing manuscripts.

Description

Text content sensitivity analysis method and device
Technical field
The present invention relates to the technical field of text information processing, and more specifically to a text content sensitivity analysis method and device.
Background art
Reading internet news on news portal websites has become one of the main ways people obtain information every day. News items on a portal are published mainly as original articles or as reprints.
To guarantee the quality of the news articles that a portal publishes, an editor must review the sensitivity of each article before it is published. If the sensitivity of the reviewed article is low, it can be published directly; if the sensitivity is high, the article must be revised by an editor before it is published.
With information developing as rapidly as it does today, deciding whether an article can be published by manually reviewing its sensitivity consumes substantial human resources and is inefficient.
Summary of the invention
In view of this, the present invention provides a text content sensitivity analysis method and device to solve the prior-art problem that manually reviewing the sensitivity of news articles awaiting release consumes substantial human resources and is inefficient. The technical solution is as follows:
According to one aspect of the present invention, a text content sensitivity analysis method is provided in which each sensitive word is annotated with a sensitivity in advance. The method includes:
Obtaining the text content currently to be processed;
Performing word segmentation on the text content to obtain a word group, the word group including at least one word;
Searching the obtained word group for sensitive words;
When a sensitive word is found, marking the sensitive word and recording the position of its first character within the word group length, the word group length being the number of characters in the word group;
Dividing the word group length into N subintervals according to the highest sensitivity level X allowed for the text content, where N = 2^X and N and X are positive integers;
Calculating the sensitivity p_i of each subinterval using the formula [formula not reproduced], where i is a positive integer less than or equal to N that denotes the i-th subinterval; e_smooth is the smoothing factor of the entropy, is greater than 0, and prevents p_i from being 0 when a subinterval contains no sensitive word; M is the number of sensitive words in the subinterval; and W_mlevel is the sensitivity of a sensitive word whose first character lies in the i-th subinterval;
Calculating the sensitivity E of the text content using the formula [formula not reproduced].
Preferably, after performing word segmentation on the text content to obtain a word group, the method further includes:
Removing the stop words from the word group obtained by word segmentation.
Preferably, searching the obtained word group for sensitive words includes:
Comparing the words in the word group one by one with the words in a sensitive word dictionary, the sensitive word dictionary being used to store sensitive words.
Preferably, the highest sensitivity level X allowed for the text content is equal to 5.
According to another aspect of the present invention, a text content sensitivity analysis device is also provided, comprising:
A sensitivity annotation unit, configured to annotate each sensitive word with a sensitivity;
An acquiring unit, configured to obtain the text content currently to be processed;
A word segmentation unit, configured to perform word segmentation on the text content to obtain a word group, the word group including at least one word;
A searching unit, configured to search the obtained word group for sensitive words;
A marking and recording unit, configured to, when the searching unit finds a sensitive word, mark the sensitive word and record the position of its first character within the word group length, the word group length being the number of characters in the word group;
A subinterval division unit, configured to divide the word group length into N subintervals according to the highest sensitivity level X allowed for the text content, where N = 2^X and N and X are positive integers;
A first computing unit, configured to calculate the sensitivity p_i of each subinterval using the formula [formula not reproduced], where i is a positive integer less than or equal to N that denotes the i-th subinterval; e_smooth is the smoothing factor of the entropy, is greater than 0, and prevents p_i from being 0 when a subinterval contains no sensitive word; M is the number of sensitive words in the subinterval; and W_mlevel is the sensitivity of a sensitive word whose first character lies in the i-th subinterval;
A second computing unit, configured to calculate the sensitivity E of the text content using the formula [formula not reproduced].
Preferably, the device further includes:
A stop word processing unit, configured to remove the stop words from the word group obtained by word segmentation.
Preferably, the searching unit is specifically configured to compare the words in the word group one by one with the words in a sensitive word dictionary, the sensitive word dictionary being used to store sensitive words.
Preferably, the highest sensitivity level X allowed for the text content is equal to 5.
With the above technical solution, in the text content sensitivity analysis method provided by the present invention, each sensitive word is annotated with a sensitivity in advance, and the method specifically includes: obtaining the text content currently to be processed; performing word segmentation on the text content to obtain a word group, the word group including at least one word; searching the obtained word group for sensitive words; and, when a sensitive word is found, marking it and recording the position of its first character within the word group length, the word group length being the number of Chinese characters in the word group;
Dividing the word group length into N subintervals according to the highest sensitivity level X allowed for the text content, where N = 2^X and N and X are positive integers;
Calculating the sensitivity p_i of each subinterval using the formula [formula not reproduced], where i is a positive integer less than or equal to N that denotes the i-th subinterval; e_smooth is the smoothing factor of the entropy, is greater than 0, and prevents p_i from being 0 when a subinterval contains no sensitive word; M is the number of sensitive words in the subinterval; and W_mlevel is the sensitivity of a sensitive word whose first character lies in the i-th subinterval;
Finally, calculating the sensitivity E of the text content using the formula [formula not reproduced].
The present invention therefore enables a manuscript publishing system to analyze the sensitivity of text content automatically. When the system finds that the sensitivity of text content awaiting release is low, it publishes the text content directly; when it finds that the sensitivity is high, it forwards the text content to an editor or flags it so that an editor can review and edit it further. Editors no longer need to perform sensitivity analysis on all text content awaiting release, which greatly reduces their workload and saves substantial human resources, and the automated processing of the publishing system greatly improves the efficiency of publishing manuscripts.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a text content sensitivity analysis method provided by the present invention;
Fig. 2 is another flow chart of a text content sensitivity analysis method provided by the present invention;
Fig. 3 is a structural schematic diagram of a text content sensitivity analysis device provided by the present invention;
Fig. 4 is another structural schematic diagram of a text content sensitivity analysis device provided by the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, which shows a flow chart of a text content sensitivity analysis method provided by the present invention, the method includes:
Step 101: annotate each sensitive word with a sensitivity in advance.
In the present invention, sensitive words are words with unhealthy connotations or uncivil language, and also include special sensitive words that some websites define for their own circumstances and that apply only to those websites. To determine which words are sensitive, a sensitive word dictionary recording the sensitive words is generally maintained, and a word is judged to be sensitive by checking whether it appears in that dictionary.
Accordingly, the present invention annotates in advance the sensitivity of every sensitive word recorded in the sensitive word dictionary. For example, sensitive word A is annotated with sensitivity 0.1, sensitive word B with 0.2, sensitive word C with 0.3, and so on. Sensitive words of different natures and different degrees of sensitivity are each annotated with their own sensitivity.
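The pre-annotation in step 101 can be pictured as a lookup table from sensitive word to sensitivity. The following is a minimal sketch and not the patented implementation: the names `SENSITIVE_WORDS` and `sensitivity_of` are illustrative assumptions, and the entries A, B, and C are the placeholder words from the example above.

```python
# Hypothetical sensitive word dictionary: each sensitive word is mapped to
# the sensitivity annotated for it in advance (values from the example above).
SENSITIVE_WORDS = {"A": 0.1, "B": 0.2, "C": 0.3}


def sensitivity_of(word, dictionary=SENSITIVE_WORDS):
    """Return the annotated sensitivity, or None if the word is not sensitive."""
    return dictionary.get(word)
```

A mapping keeps the membership test and the annotated value in one place, so later steps can both detect a sensitive word and read its sensitivity with a single lookup.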
Step 102: obtain the text content currently to be processed.
For example, when a news article is about to be published, that article is determined to be the text currently to be processed, and the present invention first obtains the text content of the article.
Here, text content refers specifically to the written content of the news article.
Step 103: perform word segmentation on the text content to obtain a word group, the word group including at least one word.
In this embodiment, the present invention performs word segmentation on the text content and obtains a word group containing multiple words. Preferably, after the word group is obtained, the method may further include, as shown in Fig. 2, step 1031: removing the stop words from the word group obtained by word segmentation. The word group finally obtained is then the word group with stop words removed.
Stop words are meaningless words and function words, for example punctuation marks and common particles.
As a specific example, take the text content "The ripples on the small lake shimmer at dusk in midsummer, and a fresh breeze blows gently. Zhang Xiaoming and his wife Sun Xiaohong meet Zhao Xiaosi and his wife Li Xiaowu at the lakeside. The two couples chat warmly, exchange greetings, then stroll up the steps to Xiaoming's home to visit and enjoy themselves." After word segmentation and stop word removal, the word group is: "midsummer dusk small lakeside ripples shimmer fresh breeze gently Zhang Xiaoming and wife Sun Xiaohong at lakeside meet Zhao Xiaosi and his wife Li Xiaowu two couples of friends warmly chat exchange greetings then stroll up steps come to Xiaoming's home visit play". Evidently, the removed stop words include the punctuation marks "," and ".".
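Stop word removal as described in step 1031 can be sketched as a filter over an already segmented word list. This is an illustrative sketch only: the segmentation itself is assumed to have been done by some external segmenter, and the names `STOP_WORDS` and `remove_stop_words` as well as the sample tokens are hypothetical.

```python
# Hypothetical stop word list: punctuation marks and a few function words.
STOP_WORDS = {",", ".", "the", "is"}


def remove_stop_words(word_group, stop_words=STOP_WORDS):
    """Return the word group with every stop word filtered out, order preserved."""
    return [word for word in word_group if word not in stop_words]


# Tokens as they might come out of a word segmenter (assumed input).
tokens = ["the", "lake", "is", "calm", ",", "breeze", "."]
filtered = remove_stop_words(tokens)
```

Because the word group length is counted after this step, removing stop words also shortens the length that is later divided into subintervals.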
Step 104: search the obtained word group for sensitive words.
Continuing the example above, the present invention searches the obtained word group for sensitive words in order.
Specifically, the present invention compares the words in the word group one by one with the words in the sensitive word dictionary, which stores the sensitive words. When a word in the word group matches a word in the sensitive word dictionary, that word is determined to be a sensitive word.
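The one-by-one comparison in step 104 amounts to a membership test per word. A minimal sketch, assuming the dictionary is a Python mapping from sensitive word to sensitivity (the function name `find_sensitive_words` is hypothetical):

```python
def find_sensitive_words(word_group, sensitive_dict):
    """Compare each word with the dictionary in order and return the matches."""
    return [word for word in word_group if word in sensitive_dict]


# Hypothetical dictionary and word group for illustration.
matches = find_sensitive_words(
    ["lake", "A", "breeze", "C", "A"],
    {"A": 0.1, "B": 0.2, "C": 0.3},
)
```

Repeated occurrences are kept, since each occurrence has its own position and contributes separately in the later steps.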
Step 105: when a sensitive word is found, mark the sensitive word and record the position of its first character within the word group length, the word group length being the number of characters in the word group.
In the present invention, when a sensitive word is found, it is marked and, at the same time, the position of its first character within the word group length is recorded.
Continuing the example above, suppose that in the word group obtained above the sensitive words are "Zhang Xiaoming", "Sun Xiaohong", "Zhao Xiaosi", and "his wife Li Xiaowu". When "Zhang Xiaoming" is found, it is marked and the position of its first character "Zhang" within the word group length is recorded. The word group contains 68 characters in total (counted in the original Chinese text), so the word group length is 68, and the position of "Zhang" is 15. Similarly, when "Sun Xiaohong" is found, it is marked and position 22 of its first character "Sun" is recorded; when "Zhao Xiaosi" is found, it is marked and position 30 of its first character "Zhao" is recorded; and when "his wife Li Xiaowu" is found, it is marked and position 34 of its first character (the character translated as "his") is recorded.
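Recording first-character positions can be sketched by walking the word group while accumulating a character offset. The sketch below is illustrative: the function name is hypothetical, and 0-based offsets are an assumption.

```python
def record_first_char_positions(word_group, sensitive_dict):
    """Return (hits, length): hits are (word, first_char_offset, sensitivity)
    tuples for each sensitive word; length is the total character count."""
    hits = []
    offset = 0  # 0-based character offset of the current word (assumption)
    for word in word_group:
        if word in sensitive_dict:
            hits.append((word, offset, sensitive_dict[word]))
        offset += len(word)
    return hits, offset


hits, length = record_first_char_positions(
    ["ab", "cde", "f", "cde"], {"cde": 0.2}
)
```

Only the offset of the first character is stored, which matches the rule that a sensitive word is attributed to the subinterval containing its first character.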
Step 106: divide the word group length into N subintervals according to the highest sensitivity level X allowed for the text content, where N = 2^X and N and X are positive integers.
In practice, each manuscript to be published has a correspondingly configured highest sensitivity level X, where X is greater than 0. In theory X can be set arbitrarily, but the highest sensitivity level allowed for text content is usually set to 5, so the present invention is described using X = 5 as an example.
When X equals 5, the word group length is divided into N subintervals with N = 2^X, that is, N = 2^5 = 32.
In this embodiment, each subinterval normally contains at least one character. If the word group is too short, that is, it contains few characters while the number N of subintervals is large, some subintervals may contain no characters. For a subinterval with no characters, the present invention sets its sensitivity p_i to a preset value.
Step 107: calculate the sensitivity p_i of each subinterval using the formula [formula not reproduced].
Here i is a positive integer less than or equal to N that denotes the i-th subinterval; e_smooth is the smoothing factor of the entropy, is greater than 0, and prevents p_i from being 0 when a subinterval contains no sensitive word; M is the number of sensitive words in the subinterval; and W_mlevel is the sensitivity of a sensitive word whose first character lies in the i-th subinterval.
After dividing the word group length into N subintervals, the present invention calculates the sensitivity p_i of each subinterval.
Note that the division splits all the characters in the word group evenly, so the text in a subinterval is not necessarily made up of complete words. For example, the 4th subinterval might hold the fragment "rippling, fresh breeze", the 5th "gently, Zhang", and the 6th "Xiaoming and".
When calculating the sensitivity p_i of a subinterval, the present invention sums the sensitivities of the sensitive words whose first characters lie in that subinterval. For example, if the 5th subinterval holds "gently, Zhang", the first character "Zhang" of the sensitive word "Zhang Xiaoming" lies in that subinterval, so its p_i equals the sensitivity of "Zhang Xiaoming". For the 6th subinterval, "Xiaoming and", the first character "Zhang" lies in the 5th subinterval, so although "Xiaoming" appears in the 6th subinterval, the sensitivity of "Zhang Xiaoming" is not counted when calculating the 6th subinterval's p_i.
The boundaries between subintervals need not fall at integer positions. For example, dividing the word group of length 68 above into 16 subintervals gives each subinterval 4.25 characters: the first subinterval covers the 0th to the 4.25th character, the second the 4.26th to the 8.50th, the third the 8.51st to the 12.75th, and so on. In this case the present invention still only needs to determine which subinterval the first character of a sensitive word falls in and count the word toward that subinterval's sensitivity p_i.
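Deciding which subinterval a first character falls in, even when the boundaries are not integers, can be sketched with integer arithmetic. This is an illustrative sketch under stated assumptions: positions and subinterval indices are 0-based, and the function names `num_subintervals` and `subinterval_index` are hypothetical.

```python
def num_subintervals(max_level):
    """N = 2^X for a highest allowed sensitivity level X (positive integer)."""
    if max_level < 1:
        raise ValueError("max_level must be a positive integer")
    return 2 ** max_level


def subinterval_index(position, group_length, max_level):
    """0-based index of the subinterval containing a 0-based character position.

    Equivalent to floor(position / (group_length / N)), but computed with
    integers so that fractional widths such as 4.25 cause no rounding issues.
    """
    n = num_subintervals(max_level)
    return min(position * n // group_length, n - 1)
```

For a word group of length 68 split into 16 subintervals (X = 4), each subinterval is 4.25 characters wide, and the function simply reports which of those spans contains the given position.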
For a subinterval that contains no sensitive word, the sensitivity p_i of the subinterval equals e_smooth.
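The per-subinterval calculation can be sketched as follows. The exact formula is not reproduced in this text, so the sketch assumes an additive form, p_i = e_smooth plus the sum of the sensitivities of the sensitive words whose first characters lie in subinterval i. That form is an assumption chosen because it yields p_i = e_smooth for a subinterval with no sensitive words, as stated above; the function name and the default value of `e_smooth` are also hypothetical.

```python
def subinterval_sensitivities(hits, group_length, max_level, e_smooth=1e-6):
    """Return [p_1, ..., p_N] for N = 2**max_level subintervals.

    hits: iterable of (first_char_position, sensitivity) pairs, with 0-based
    positions. ASSUMED formula: p_i = e_smooth + sum of the sensitivities of
    the sensitive words whose first character lies in subinterval i.
    """
    n = 2 ** max_level
    p = [e_smooth] * n  # subintervals without sensitive words stay at e_smooth
    for position, sensitivity in hits:
        i = min(position * n // group_length, n - 1)
        p[i] += sensitivity
    return p


p = subinterval_sensitivities([(15, 0.1), (22, 0.2)], 68, 4, e_smooth=0.01)
```

Each sensitive word contributes to exactly one subinterval, the one containing its first character, regardless of how far the rest of the word extends.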
Step 108: calculate the sensitivity E of the text content using the formula [formula not reproduced].
After obtaining the sensitivity p_i of every subinterval, the present invention evaluates the per-subinterval term [formula not reproduced] for each subinterval in turn, and then calculates the sensitivity E of the text content using the formula [formula not reproduced].
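The formula for E is not reproduced in this text. Since e_smooth is described as the smoothing factor of the entropy, the sketch below assumes, purely for illustration, an entropy-style aggregate E = -sum(p_i * log2(p_i)). This form, the base-2 logarithm, and the function name `text_sensitivity` are all assumptions and may differ from the patented formula.

```python
import math


def text_sensitivity(p):
    """ASSUMED entropy-style aggregate: E = -sum(p_i * log2(p_i)).

    Every p_i must be greater than 0, which the smoothing factor e_smooth
    guarantees for subintervals that contain no sensitive words.
    """
    return -sum(p_i * math.log2(p_i) for p_i in p)
```

Under this assumed form the role of e_smooth becomes concrete: a p_i of exactly 0 would make the logarithm undefined, so every subinterval needs a strictly positive value.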
With the above technical solution, in the text content sensitivity analysis method provided by the present invention, each sensitive word is annotated with a sensitivity in advance, and the method then obtains the text content currently to be processed; performs word segmentation on the text content to obtain a word group, the word group including at least one word; searches the obtained word group for sensitive words; and, when a sensitive word is found, marks it and records the position of its first character within the word group length, the word group length being the number of Chinese characters in the word group;
Divides the word group length into N subintervals according to the highest sensitivity level X allowed for the text content, where N = 2^X and N and X are positive integers;
Calculates the sensitivity p_i of each subinterval using the formula [formula not reproduced], where i is a positive integer less than or equal to N that denotes the i-th subinterval; e_smooth is the smoothing factor of the entropy, is greater than 0, and prevents p_i from being 0 when a subinterval contains no sensitive word; M is the number of sensitive words in the subinterval; and W_mlevel is the sensitivity of a sensitive word whose first character lies in the i-th subinterval;
Finally, calculates the sensitivity E of the text content using the formula [formula not reproduced].
The present invention therefore enables a manuscript publishing system to analyze the sensitivity of text content automatically. When the system finds that the sensitivity of text content awaiting release is low, it publishes the text content directly; when it finds that the sensitivity is high, it forwards the text content to an editor or flags it so that an editor can review and edit it further. Editors no longer need to perform sensitivity analysis on all text content awaiting release, which greatly reduces their workload and saves substantial human resources, and the automated processing of the publishing system greatly improves the efficiency of publishing manuscripts.
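The publish-or-review decision described above can be sketched as a threshold check. The text does not state how "low" and "high" sensitivity are separated, so the `threshold` parameter, the function name, and the returned labels are hypothetical.

```python
def route_manuscript(sensitivity, threshold):
    """Publish directly when the sensitivity is below the (hypothetical)
    threshold; otherwise forward the manuscript to an editor for review."""
    if sensitivity < threshold:
        return "publish"
    return "forward_to_editor"
```

Only manuscripts routed to "forward_to_editor" need human attention, which is where the reduction in editorial workload comes from.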
Based on the text content sensitivity analysis method provided above, the present invention also provides a text content sensitivity analysis device, comprising: a sensitivity annotation unit 100, an acquiring unit 200, a word segmentation unit 300, a searching unit 400, a marking and recording unit 500, a subinterval division unit 600, a first computing unit 700, and a second computing unit 800, wherein:
The sensitivity annotation unit 100 is configured to annotate each sensitive word with a sensitivity.
In the present invention, sensitive words are words with unhealthy connotations or uncivil language, and also include special sensitive words that some websites define for their own circumstances and that apply only to those websites. To determine which words are sensitive, a sensitive word dictionary recording the sensitive words is generally maintained, and a word is judged to be sensitive by checking whether it appears in that dictionary.
Accordingly, the sensitivity annotation unit 100 annotates in advance the sensitivity of every sensitive word recorded in the sensitive word dictionary. For example, sensitive word A is annotated with sensitivity 0.1, sensitive word B with 0.2, sensitive word C with 0.3, and so on. Sensitive words of different natures and different degrees of sensitivity are each annotated with their own sensitivity.
The acquiring unit 200 is configured to obtain the text content currently to be processed.
For example, when a news article is about to be published, that article is determined to be the text currently to be processed, and the present invention first uses the acquiring unit 200 to obtain the text content of the article.
Here, text content refers specifically to the written content of the news article.
The word segmentation unit 300 is configured to perform word segmentation on the text content to obtain a word group, the word group including at least one word.
In this embodiment, the word segmentation unit 300 performs word segmentation on the text content and obtains a word group containing multiple words. Preferably, the text content sensitivity analysis device of the present invention may further include, after the word segmentation unit 300 and as shown in Fig. 4:
A stop word processing unit 900, configured to remove the stop words from the word group obtained by word segmentation.
The word group finally obtained is then the word group with stop words removed.
Stop words are meaningless words and function words, for example punctuation marks and common particles.
As a specific example, take the text content "The ripples on the small lake shimmer at dusk in midsummer, and a fresh breeze blows gently. Zhang Xiaoming and his wife Sun Xiaohong meet Zhao Xiaosi and his wife Li Xiaowu at the lakeside. The two couples chat warmly, exchange greetings, then stroll up the steps to Xiaoming's home to visit and enjoy themselves." The word segmentation unit 300 performs word segmentation on this text content to obtain a word group, and the stop word processing unit 900 then removes the stop words, giving the word group: "midsummer dusk small lakeside ripples shimmer fresh breeze gently Zhang Xiaoming and wife Sun Xiaohong at lakeside meet Zhao Xiaosi and his wife Li Xiaowu two couples of friends warmly chat exchange greetings then stroll up steps come to Xiaoming's home visit play". Evidently, the removed stop words include the punctuation marks "," and ".".
The searching unit 400 is configured to search the obtained word group for sensitive words.
Specifically, the searching unit 400 compares the words in the word group one by one with the words in the sensitive word dictionary, which stores the sensitive words.
The marking and recording unit 500 is configured to, when the searching unit 400 finds a sensitive word, mark the sensitive word and record the position of its first character within the word group length, the word group length being the number of characters in the word group.
Continuing the example above, suppose that in the word group obtained above the sensitive words are "Zhang Xiaoming", "Sun Xiaohong", "Zhao Xiaosi", and "his wife Li Xiaowu". When the searching unit 400 finds "Zhang Xiaoming", the marking and recording unit 500 marks it and records the position of its first character "Zhang" within the word group length. The word group contains 68 characters in total (counted in the original Chinese text), so the word group length is 68, and the position of "Zhang" is 15. Similarly, when the searching unit 400 finds "Sun Xiaohong", the marking and recording unit 500 marks it and records position 22 of its first character "Sun"; when the searching unit 400 finds "Zhao Xiaosi", the marking and recording unit 500 marks it and records position 30 of its first character "Zhao"; and when the searching unit 400 finds "his wife Li Xiaowu", the marking and recording unit 500 marks it and records position 34 of its first character (the character translated as "his").
The subinterval division unit 600 is configured to divide the word group length into N subintervals according to the highest sensitivity level X allowed for the text content, where N = 2^X and N and X are positive integers.
In practice, each manuscript to be published has a correspondingly configured highest sensitivity level X, where X is greater than 0. In theory X can be set arbitrarily, but the highest sensitivity level allowed for text content is usually set to 5, so the present invention is described using X = 5 as an example.
When X equals 5, the word group length is divided into N subintervals with N = 2^X, that is, N = 2^5 = 32.
In this embodiment, each subinterval normally contains at least one character. If the word group is too short, that is, it contains few characters while the number N of subintervals is large, some subintervals may contain no characters. For a subinterval with no characters, the present invention sets its sensitivity p_i to a preset value.
First computing unit 700, configured to compute the sensitivity p_i of each interval using the formula [formula not reproduced].
Here i is a positive integer less than or equal to N and denotes the i-th interval; e_smooth is the smoothing factor of the entropy, with e_smooth greater than 0, and prevents p_i from being equal to 0 when an interval contains no sensitive word; M is the number of sensitive words in the interval; and W_m^level is the sensitivity of a sensitive word whose first character lies in the i-th interval.
After interval division unit 600 has divided the word group length into N intervals, first computing unit 700 computes the sensitivity p_i of each interval.
It should be noted that interval division unit 600 divides the word group by splitting all of its characters evenly. Within each resulting interval, the characters do not necessarily form complete words. For example, the characters of the 4th interval may be "Yan fresh breeze", those of the 5th interval may be "slowly Zhang", and those of the 6th interval may be "Xiaoming and".
When computing the sensitivity p_i of each interval, first computing unit 700 sums the sensitivities of the sensitive words whose first characters lie in that interval. For example, the characters of the 5th interval are "slowly Zhang"; the first character "Zhang" of the sensitive word "Zhang Xiaoming" lies in this interval, so the sensitivity p_i of this interval equals the sensitivity of "Zhang Xiaoming". The characters of the 6th interval are "Xiaoming and"; because the first character "Zhang" of the sensitive word "Zhang Xiaoming" lies in the 5th interval, the present invention does not count the sensitivity of "Zhang Xiaoming" again when computing p_i for the 6th interval, even though "Xiaoming" appears there.
Of course, the boundaries of the intervals divided by interval division unit 600 do not necessarily fall on integers. For example, if the word group of length 68 above is divided into 16 intervals, each interval contains 4.25 characters. The first interval then covers the 0th to the 4.25th character, the second interval covers the 4.26th to the 8.50th character, the third interval covers the 8.51st to the 12.75th character, and so on. In this case the present invention still only needs to determine which interval contains the first character of each sensitive word in order to compute the sensitivity p_i of that interval.
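The mapping from a recorded first-character position to an interval can be sketched as below. The helper `interval_index` is a hypothetical name, and the sketch assumes that each interval includes its upper boundary, which matches the ranges listed above (the second of 16 intervals ends at 8.50). Integer arithmetic is used so that non-integer boundaries such as 4.25 do not introduce floating-point rounding.

```python
def interval_index(position: int, length: int, n_intervals: int) -> int:
    """Return the 1-based index of the interval holding a 1-based character position.

    Interval k is assumed to cover ((k - 1) * length / n, k * length / n].
    """
    if not 1 <= position <= length:
        raise ValueError("position must lie within the word group")
    # Ceiling of position * n_intervals / length, computed without floats.
    return -(-position * n_intervals // length)


# First-character positions from the example (15, 22, 30, 34), length 68, 16 intervals.
for pos in (15, 22, 30, 34):
    print(pos, interval_index(pos, 68, 16))

# With the highest sensitivity grade X = 5 the number of intervals would be 2 ** 5.
print(2 ** 5)
```

Under these assumptions the positions 15, 22, 30 and 34 fall in intervals 4, 6, 8 and 8 of the 16-interval division, so two of the sensitive words would contribute to the same interval.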
For an interval that contains no sensitive word, the sensitivity p_i of the interval equals e_smooth.
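As a rough sketch of the per-interval computation, the Python below assumes an additive form, p_i = e_smooth plus the sum of the sensitivities of the sensitive words whose first characters fall in interval i. That form is an assumption chosen to agree with the statements that an interval without sensitive words has p_i equal to e_smooth and that sensitivities within an interval are summed; the exact formula is not reproduced in this text. The function name, the sensitivity levels, and the value of `e_smooth` are hypothetical.

```python
from typing import List, Tuple


def interval_sensitivities(
    hits: List[Tuple[int, float]],  # (first-character position, sensitivity level)
    length: int,
    n_intervals: int,
    e_smooth: float = 0.5,  # arbitrary illustrative value; must be greater than 0
) -> List[float]:
    """Return p_1..p_N under the assumed additive form p_i = e_smooth + sum of levels."""
    # Every interval starts at the smoothing value, so none is ever 0.
    p = [e_smooth] * n_intervals
    for position, level in hits:
        # 1-based interval index, with upper boundaries assumed inclusive.
        k = -(-position * n_intervals // length)
        p[k - 1] += level
    return p


# Positions from the example with made-up sensitivity levels.
p = interval_sensitivities([(15, 3), (22, 2), (30, 4), (34, 1)], 68, 16)
print(p)
```

A sensitive word is counted only in the interval that holds its first character, so a word that straddles a boundary is never counted twice.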
Second computing unit 800, configured to compute the sensitivity E of the text content using the formula [formula not reproduced].
After first computing unit 700 has computed the sensitivity p_i of each interval, second computing unit 800 applies the formula to each interval in turn and thereby computes the sensitivity E of the text content.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The text content sensitivity analysis method and device provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the invention, and the description of the above embodiments is only intended to help in understanding the method of the invention and its core idea. At the same time, for those skilled in the art, the specific implementation and the scope of application may change in accordance with the idea of the invention. In conclusion, the content of this specification should not be construed as limiting the invention.

Claims (8)

1. A text content sensitivity analysis method, characterized in that a sensitivity is marked for each sensitive word in advance; the method comprises:
obtaining the text content currently to be processed;
performing word segmentation on the text content to obtain a word group, the word group comprising at least one word;
searching for sensitive words in the obtained word group;
when a sensitive word is found, marking the found sensitive word and recording the position of the first character of the sensitive word within the word group length, the word group length being the number of all characters in the word group;
dividing the word group length evenly into N intervals according to the highest sensitivity grade X allowed for the text content, where N = 2^X and N and X are positive integers; wherein, among the N intervals obtained by the division, some intervals contain no sensitive word and some intervals contain at least one sensitive word, and each sensitive word either lies entirely within one interval or lies partly in one interval and partly in another interval;
computing the sensitivity p_i of each interval using the formula [formula not reproduced]; wherein i is a positive integer less than or equal to N and denotes the i-th interval, e_smooth is the smoothing factor of the entropy, e_smooth is greater than 0 and prevents p_i from being equal to 0 when an interval contains no sensitive word, M is the number of sensitive words in the interval, and W_m^level is the sensitivity of a sensitive word whose first character lies in the i-th interval; wherein the sensitivity p_i of each interval is computed by summing the sensitivities of the at least one sensitive word whose first character is contained in that interval;
computing the sensitivity E of the text content using the formula [formula not reproduced].
2. The method according to claim 1, characterized in that, after performing word segmentation on the text content to obtain a word group, the method further comprises:
removing stop words from the word group obtained by the word segmentation.
3. The method according to claim 1, characterized in that searching for sensitive words in the obtained word group comprises:
comparing the words in the word group one by one with the words in a sensitive word dictionary, the sensitive word dictionary being used to store sensitive words.
4. The method according to any one of claims 1-3, characterized in that the highest sensitivity grade X allowed for the text content is equal to 5.
5. A text content sensitivity analysis device, characterized by comprising:
a sensitivity marking unit, configured to mark a sensitivity for each sensitive word;
an acquiring unit, configured to obtain the text content currently to be processed;
a word segmentation unit, configured to perform word segmentation on the text content to obtain a word group, the word group comprising at least one word;
a searching unit, configured to search for sensitive words in the obtained word group;
a marking and recording unit, configured to, when the searching unit finds a sensitive word, mark the found sensitive word and record the position of the first character of the sensitive word within the word group length, the word group length being the number of all characters in the word group;
an interval division unit, configured to divide the word group length evenly into N intervals according to the highest sensitivity grade X allowed for the text content, where N = 2^X and N and X are positive integers; wherein, among the N intervals obtained by the division, some intervals contain no sensitive word and some intervals contain at least one sensitive word, and each sensitive word either lies entirely within one interval or lies partly in one interval and partly in another interval;
a first computing unit, configured to compute the sensitivity p_i of each interval using the formula [formula not reproduced]; wherein i is a positive integer less than or equal to N and denotes the i-th interval, e_smooth is the smoothing factor of the entropy, e_smooth is greater than 0 and prevents p_i from being equal to 0 when an interval contains no sensitive word, M is the number of sensitive words in the interval, and W_m^level is the sensitivity of a sensitive word whose first character lies in the i-th interval; wherein the sensitivity p_i of each interval is computed by summing the sensitivities of the at least one sensitive word whose first character is contained in that interval;
a second computing unit, configured to compute the sensitivity E of the text content using the formula [formula not reproduced].
6. The device according to claim 5, characterized by further comprising:
a stop word processing unit, configured to remove stop words from the word group obtained by the word segmentation.
7. The device according to claim 5, characterized in that the searching unit is specifically configured to compare the words in the word group one by one with the words in a sensitive word dictionary, the sensitive word dictionary being used to store sensitive words.
8. The device according to any one of claims 5-7, characterized in that the highest sensitivity grade X allowed for the text content is equal to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510509318.7A CN105373528B (en) 2015-08-18 2015-08-18 A kind of text content sensitive analysis method and device

Publications (2)

Publication Number Publication Date
CN105373528A CN105373528A (en) 2016-03-02
CN105373528B true CN105373528B (en) 2019-03-12

Family

ID=55375736

Country Status (1)

Country Link
CN (1) CN105373528B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant