CN105373528B - Method and device for analyzing the sensitivity of text content - Google Patents
- Publication number
- CN105373528B CN201510509318.7A
- Authority
- CN
- China
- Prior art keywords
- stages
- sensitive word
- words
- text
- susceptibility
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation (AREA)
Abstract
The present invention provides a method and device for analyzing the sensitivity of text content. The method includes: assigning a sensitivity value to each sensitive word in advance; obtaining the text content currently to be processed; performing word segmentation on the text content to obtain a word group; searching the obtained word group for sensitive words; when a sensitive word is found, marking it and recording the position of its first character within the word group length; dividing the word group length into N sub-intervals according to the highest sensitivity level X allowed for the text content, where N = 2^X; calculating the sensitivity p_i of each sub-interval using the formula [formula not reproduced in this text]; and calculating the sensitivity E of the text content using the formula [formula not reproduced in this text]. The invention enables a news publishing system to analyze the sensitivity of text content automatically, so that editors no longer have to perform sensitivity analysis on every text awaiting publication, which greatly reduces their workload and improves the efficiency of publishing manuscripts.
Description
Technical field
The present invention relates to the technical field of text information processing, and more specifically to a method and device for analyzing the sensitivity of text content.
Background art
Obtaining Internet news through news portal websites has become the main way in which people acquire information every day. Each news item published on a news portal website is issued mainly as original content or as a reprint.
To guarantee the quality of the news articles that a news portal website publishes, an editor needs to review the sensitivity of each news article before it is published. If the sensitivity of the reviewed article is low, it can be published directly; if it is high, the editor must revise the article before it is published.
However, at a time when information spreads so rapidly, deciding by manual review of sensitivity whether a news article can be published undoubtedly consumes a large amount of human resources and is inefficient.
Summary of the invention
In view of this, the present invention provides a method and device for analyzing the sensitivity of text content, so as to solve the prior-art problem that manually reviewing the sensitivity of news articles awaiting publication consumes a large amount of human resources and is inefficient. The technical solution is as follows:
According to one aspect of the present invention, a method for analyzing the sensitivity of text content is provided, in which a sensitivity value is assigned to each sensitive word in advance. The method includes:
Obtaining the text content currently to be processed;
Performing word segmentation on the text content to obtain a word group, the word group including at least one word;
Searching the obtained word group for sensitive words;
When a sensitive word is found, marking the found sensitive word and recording the position of its first character within the word group length, the word group length being the number of all characters in the word group;
Dividing the word group length into N sub-intervals according to the highest sensitivity level X allowed for the text content, where N = 2^X and N and X are positive integers;
Calculating the sensitivity p_i of each sub-interval using the formula [formula not reproduced in this text], where i is a positive integer less than or equal to N that denotes the i-th sub-interval; e_smooth is the entropy smoothing factor, which is greater than 0 and prevents p_i from being equal to 0 when a sub-interval contains no sensitive word; M is the number of sensitive words in the sub-interval; and W_m^level is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval;
Calculating the sensitivity E of the text content using the formula [formula not reproduced in this text].
Preferably, after the word segmentation is performed on the text content to obtain a word group, the method further includes:
Removing the stop words from the word group obtained after word segmentation.
Preferably, searching the obtained word group for sensitive words includes:
Comparing the words in the word group one by one with the words in a sensitive word dictionary, the sensitive word dictionary being used to store sensitive words.
Preferably, the highest sensitivity level X allowed for the text content is equal to 5.
According to another aspect of the present invention, a device for analyzing the sensitivity of text content is also provided, comprising:
A sensitivity marking unit, configured to assign a sensitivity value to each sensitive word;
An acquiring unit, configured to obtain the text content currently to be processed;
A word segmentation unit, configured to perform word segmentation on the text content to obtain a word group, the word group including at least one word;
A searching unit, configured to search the obtained word group for sensitive words;
A marking and recording unit, configured to, when the searching unit finds a sensitive word, mark the found sensitive word and record the position of its first character within the word group length, the word group length being the number of all characters in the word group;
A sub-interval division unit, configured to divide the word group length into N sub-intervals according to the highest sensitivity level X allowed for the text content, where N = 2^X and N and X are positive integers;
A first computing unit, configured to calculate the sensitivity p_i of each sub-interval using the formula [formula not reproduced in this text], where i is a positive integer less than or equal to N that denotes the i-th sub-interval; e_smooth is the entropy smoothing factor, which is greater than 0 and prevents p_i from being equal to 0 when a sub-interval contains no sensitive word; M is the number of sensitive words in the sub-interval; and W_m^level is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval;
A second computing unit, configured to calculate the sensitivity E of the text content using the formula [formula not reproduced in this text].
Preferably, the device further comprises:
A stop word processing unit, configured to remove the stop words from the word group obtained after word segmentation.
Preferably, the searching unit is specifically configured to compare the words in the word group one by one with the words in a sensitive word dictionary, the sensitive word dictionary being used to store sensitive words.
Preferably, the highest sensitivity level X allowed for the text content is equal to 5.
With the above technical solution, in the method for analyzing the sensitivity of text content provided by the present invention, a sensitivity value is assigned to each sensitive word in advance, and the method specifically includes: obtaining the text content currently to be processed; performing word segmentation on the text content to obtain a word group, the word group including at least one word; searching the obtained word group for sensitive words; when a sensitive word is found, marking the found sensitive word and recording the position of its first character within the word group length, the word group length being the number of all Chinese characters in the word group;
Dividing the word group length into N sub-intervals according to the highest sensitivity level X allowed for the text content, where N = 2^X and N and X are positive integers;
Calculating the sensitivity p_i of each sub-interval using the formula [formula not reproduced in this text], where i is a positive integer less than or equal to N that denotes the i-th sub-interval; e_smooth is the entropy smoothing factor, which is greater than 0 and prevents p_i from being equal to 0 when a sub-interval contains no sensitive word; M is the number of sensitive words in the sub-interval; and W_m^level is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval;
Finally, calculating the sensitivity E of the text content using the formula [formula not reproduced in this text].
Therefore, the present invention enables a news publishing system to analyze the sensitivity of text content automatically. When the publishing system determines that the sensitivity of the text content awaiting publication is low, it publishes the text content directly; when it determines that the sensitivity is high, it forwards the text content to an editor for handling, or flags it, so that the editor can review and edit it further. Editors therefore no longer have to perform sensitivity analysis on every text awaiting publication, which greatly reduces their workload and saves a large amount of human resources, while the automated processing of the publishing system greatly improves the efficiency of publishing manuscripts.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a method for analyzing the sensitivity of text content provided by the present invention;
Fig. 2 is another flow chart of a method for analyzing the sensitivity of text content provided by the present invention;
Fig. 3 is a schematic structural diagram of a device for analyzing the sensitivity of text content provided by the present invention;
Fig. 4 is another schematic structural diagram of a device for analyzing the sensitivity of text content provided by the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, which shows a flow chart of a method for analyzing the sensitivity of text content provided by the present invention, the method includes:
Step 101: assign a sensitivity value to each sensitive word in advance.
In the present invention, sensitive words are words with unwholesome connotations or uncivil language, and also include special sensitive words that some websites set according to their own circumstances and that apply only to those websites. As for which words count as sensitive words, a sensitive word dictionary that records the sensitive words is generally set up, and whether a word is a sensitive word is judged by checking whether it appears in the sensitive word dictionary.
Therefore, the present invention can assign sensitivity values in advance to all the sensitive words recorded in the sensitive word dictionary. For example, the sensitivity of sensitive word A is marked in advance as 0.1, that of sensitive word B as 0.2, that of sensitive word C as 0.3, and so on; the present invention marks the sensitivity separately for sensitive words of different natures and different degrees of sensitivity.
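The pre-marked dictionary described in Step 101 can be pictured as a simple mapping from sensitive word to sensitivity value. The following is only a minimal sketch: the names `SENSITIVE_WORDS` and `sensitivity_of` are hypothetical, and the entries A, B and C simply mirror the example values in the text.

```python
from typing import Dict, Optional

# Hypothetical sensitive word dictionary: word -> pre-assigned sensitivity.
# The entries mirror the example in the text (A = 0.1, B = 0.2, C = 0.3).
SENSITIVE_WORDS: Dict[str, float] = {"A": 0.1, "B": 0.2, "C": 0.3}


def sensitivity_of(word: str) -> Optional[float]:
    """Return the pre-assigned sensitivity, or None if the word is not sensitive."""
    return SENSITIVE_WORDS.get(word)
```

Keeping the marking in the same structure as the dictionary means that the later lookup step yields both the match and its sensitivity in one access.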
Step 102: obtain the text content currently to be processed.
For example, when a news article is to be published, that news article is determined to be the text currently to be processed, and the present invention first obtains the text content of the news article to be published.
Here, text content refers specifically to the written content of the news article.
Step 103: perform word segmentation on the text content to obtain a word group, the word group including at least one word.
In this embodiment, the present invention performs word segmentation on the text content and obtains a word group that includes multiple words. Preferably, after performing word segmentation on the text content and obtaining a word group, the method may further include, as shown in Fig. 2: Step 1031, removing the stop words from the word group obtained after word segmentation. The word group finally obtained is then the word group with the stop words removed.
Here, stop words are words that carry no meaning of their own and certain function words, for example punctuation marks, particles, and words such as "is".
As a specific example, take the text content "By the small lake at dusk in midsummer the ripples spread, and a cool breeze blows gently. Zhang Xiaoming and his wife Sun Xiaohong meet Zhao Xiaosi and his wife Li Xiaowu at the lakeside. The two couples chat warmly and greet each other, then stroll up the steps and go to Xiaoming's home as guests." After the present invention performs word segmentation on this text content and removes the stop words, the resulting word group is: "midsummer dusk small lakeside ripples spreading cool breeze gently Zhang Xiaoming and wife Sun Xiaohong at lakeside meet Zhao Xiaosi and his wife Li Xiaowu two couples friends warmly chat greet each other then stroll climb steps come to Xiaoming's home visit play". As can be seen, the present invention has removed the particles and the punctuation marks "," and ".".
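The stop-word removal of Step 1031 can be sketched as a filter over an already segmented word list. This is an illustration under stated assumptions: the segmentation itself is taken as given, and the stop-word set `STOP_WORDS` and the function name `remove_stop_words` are hypothetical, not part of the described method.

```python
from typing import Iterable, List, Set

# Hypothetical stop-word set; a real deployment would load its own list.
STOP_WORDS: Set[str] = {",", ".", "the", "is"}


def remove_stop_words(words: Iterable[str],
                      stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Keep the segmented words, in order, that are not stop words."""
    return [w for w in words if w not in stop_words]
```

The order of the remaining words is preserved, which matters because the later steps depend on character positions within the word group.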
Step 104: search the obtained word group for sensitive words.
Continuing the above example, the present invention searches the obtained word group for sensitive words one after another.
Specifically, the present invention compares the words in the word group one by one with the words in the sensitive word dictionary. When a word in the word group matches a word in the sensitive word dictionary, that word is determined to be a sensitive word. The sensitive word dictionary is used to store sensitive words.
Step 105: when a sensitive word is found, mark the found sensitive word and record the position of its first character within the word group length; the word group length is the number of all characters in the word group.
In the present invention, when a sensitive word is found, the found sensitive word is marked and, at the same time, the position of its first character within the word group length is recorded. The word group length is the number of all characters in the word group.
Continuing the above example, for the word group "midsummer dusk small lakeside ripples spreading cool breeze gently Zhang Xiaoming and wife Sun Xiaohong at lakeside meet Zhao Xiaosi and his wife Li Xiaowu two couples friends warmly chat greet each other then stroll climb steps come to Xiaoming's home visit play", suppose that "Zhang Xiaoming", "Sun Xiaohong", "Zhao Xiaosi" and "his wife Li Xiaowu" are sensitive words. When "Zhang Xiaoming" is found, it is marked and the position of the character "Zhang" within the word group length is recorded. The word group contains 68 characters in total (counted in the original Chinese), that is, the word group length is 68, and the position of "Zhang" within the word group length is 15. Similarly, when "Sun Xiaohong" is found, it is marked and the position 22 of the character "Sun" within the word group length is recorded; when "Zhao Xiaosi" is found, it is marked and the position 30 of the character "Zhao" is recorded; and when "his wife Li Xiaowu" is found, it is marked and the position 34 of its first character "his" is recorded.
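Steps 104 and 105 can be sketched together as one pass over the word group that compares each word with the dictionary and tracks a running character offset. This is a minimal sketch, not the patented implementation: the function name `find_sensitive_words` is hypothetical, and the 0-based position convention is an assumption suggested by the "0th character" wording elsewhere in the description.

```python
from typing import Dict, List, Sequence, Tuple


def find_sensitive_words(words: Sequence[str],
                         dictionary: Dict[str, float]
                         ) -> Tuple[List[Tuple[str, int]], int]:
    """Return ([(sensitive word, position of its first character)], word group length).

    Positions are 0-based character offsets (an assumption); the word group
    length is the total number of characters in all words.
    """
    hits: List[Tuple[str, int]] = []
    offset = 0
    for word in words:
        if word in dictionary:          # one-by-one comparison with the dictionary
            hits.append((word, offset))  # record the first character's position
        offset += len(word)
    return hits, offset
```

For instance, calling it with `["ab", "X", "cde", "Y"]` and a dictionary containing `"X"` and `"Y"` records `"X"` at offset 2 and `"Y"` at offset 6, with a word group length of 7.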
Step 106: according to the highest sensitivity level X allowed for the text content, divide the word group length into N sub-intervals, where N = 2^X and N and X are positive integers.
In practical applications, every manuscript to be published has a correspondingly configured highest sensitivity level X, where X is a number greater than 0. In theory X can be set arbitrarily, but in general the highest sensitivity level X allowed for the text content is set to 5. The present invention is therefore described taking X equal to 5 as an example.
When X is equal to 5, the present invention divides the obtained word group length into N sub-intervals with N = 2^X, that is, N = 2^5 = 32.
In this embodiment, each sub-interval obtained by the division normally includes at least one word. Of course, if the word group length is too short, that is, the word group contains few characters while the number N of sub-intervals is large, some sub-intervals may contain no words. For a sub-interval that contains no words, the present invention sets its sensitivity p_i equal to a preset value.
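The arithmetic of Step 106 is small enough to sketch directly. The function names below are hypothetical; the relation N = 2^X and the even division of the word group length come from the description.

```python
def sub_interval_count(max_level: int) -> int:
    """Number of sub-intervals N = 2**X for the highest allowed sensitivity level X."""
    if max_level < 1:
        raise ValueError("X must be a positive integer")
    return 2 ** max_level


def sub_interval_width(length: int, max_level: int) -> float:
    """Characters per sub-interval when the word group is divided evenly."""
    return length / sub_interval_count(max_level)
```

With X = 5 this yields 32 sub-intervals; a word group of 68 characters divided with X = 4 yields 16 sub-intervals of 4.25 characters each, which shows why the boundaries need not fall on whole characters.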
Step 107: calculate the sensitivity p_i of each sub-interval using the formula [formula not reproduced in this text].
Here, i is a positive integer less than or equal to N that denotes the i-th sub-interval; e_smooth is the entropy smoothing factor, which is greater than 0 and prevents p_i from being equal to 0 when a sub-interval contains no sensitive word; M is the number of sensitive words in the sub-interval; and W_m^level is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval.
After the N sub-intervals have been obtained by the division, the present invention calculates the sensitivity p_i of each sub-interval.
It should be noted that, when dividing the word group, the present invention divides all the characters in the word group evenly. In the sub-intervals obtained, the text of a sub-interval is not necessarily made up of complete words; for example, the text of the 4th sub-interval may be "spreading cool breeze", that of the 5th sub-interval may be "gently Zhang", that of the 6th sub-interval may be "Xiaoming and", and so on.
When calculating the sensitivity p_i of each sub-interval, the present invention sums the sensitivities of the sensitive words whose first characters lie in that sub-interval. For example, for the 5th sub-interval with the text "gently Zhang", the first character "Zhang" of the sensitive word "Zhang Xiaoming" lies in that sub-interval, so the sensitivity p_i of that sub-interval is equal to the sensitivity of "Zhang Xiaoming". For the text "Xiaoming and" of the 6th sub-interval, since the first character "Zhang" of the sensitive word "Zhang Xiaoming" lies in the 5th sub-interval, the sensitivity of "Zhang Xiaoming" does not need to be counted when calculating the sensitivity p_i of the 6th sub-interval, even though "Xiaoming" appears in the 6th sub-interval.
Of course, the points at which the present invention divides the sub-intervals do not necessarily fall on integers. For example, when the above word group of length 68 is divided into 16 sub-intervals, each sub-interval includes 4.25 characters. The text in the first sub-interval is then the 0th to the 4.25th character, the text in the second sub-interval is the 4.26th to the 8.50th character, the text in the third sub-interval is the 8.51st to the 12.75th character, and so on. In this case, the present invention only needs to determine which sub-interval the first character of a sensitive word lies in, and then count that word toward the sensitivity p_i of that sub-interval.
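Deciding which sub-interval a first character falls in can be done without floating-point boundaries. The sketch below is one possible convention, not the one fixed by the description: it assumes 0-based character positions and half-open sub-intervals, and the function name `sub_interval_index` is hypothetical.

```python
def sub_interval_index(position: int, length: int, n: int) -> int:
    """0-based index of the sub-interval that contains the character at `position`.

    Assumes 0-based positions and an even split of `length` characters into
    `n` half-open sub-intervals of width length / n. Integer arithmetic
    avoids rounding problems at fractional boundaries such as 4.25.
    """
    if not 0 <= position < length:
        raise ValueError("position must lie inside the word group")
    return position * n // length
```

With a word group of 68 characters and 16 sub-intervals, positions 0 to 4 map to the first sub-interval and position 5 maps to the second, consistent with a width of 4.25 characters.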
In the present invention, for a sub-interval that contains no sensitive word, the sensitivity p_i of the sub-interval is equal to e_smooth.
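The per-sub-interval calculation can be sketched as follows. Because the formula itself is not reproduced in this text, the additive form used here, p_i = e_smooth plus the sum of the sensitivities W_m^level of the sensitive words whose first character lies in sub-interval i, is an assumption; it is chosen only because it agrees with the two stated properties (a sum over the sensitive words of the sub-interval, and p_i = e_smooth when there are none). The function name and the default value of `e_smooth` are likewise hypothetical.

```python
from typing import List, Sequence, Tuple


def sub_interval_sensitivities(hits: Sequence[Tuple[int, float]],
                               length: int,
                               n: int,
                               e_smooth: float = 1e-6) -> List[float]:
    """Sensitivity p_i for each of n sub-intervals.

    `hits` holds (0-based first-character position, sensitivity) pairs.
    ASSUMPTION: p_i = e_smooth + sum of the sensitivities of the sensitive
    words whose first character falls in sub-interval i.
    """
    if e_smooth <= 0:
        raise ValueError("e_smooth must be greater than 0")
    p = [e_smooth] * n
    for position, weight in hits:
        p[position * n // length] += weight  # assign by first character only
    return p
```

A sensitive word contributes to exactly one sub-interval, the one holding its first character, even if the rest of the word spills into the next sub-interval.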
Step 108: calculate the sensitivity E of the text content using the formula [formula not reproduced in this text].
After the sensitivity p_i of each sub-interval has been calculated, the present invention computes a term for each sub-interval in turn using the formula [formula not reproduced in this text], and then calculates the sensitivity E of the text content using the formula [formula not reproduced in this text].
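Since the formula for E is not reproduced in this text, the following is purely an illustrative assumption and not the patented formula: the description calls e_smooth the "smoothing factor of entropy" and requires every p_i to be greater than 0, which suggests an entropy-style aggregate. The sketch therefore uses the Shannon form E = -Σ p_i ln p_i with the natural logarithm; the function name, the log base and the absence of any normalization are all assumptions.

```python
import math
from typing import Sequence


def text_sensitivity(p: Sequence[float]) -> float:
    """Entropy-style aggregate over the sub-interval sensitivities.

    ASSUMPTION: E = -sum(p_i * ln(p_i)). The source formula is not available,
    so this only illustrates why every p_i must be greater than 0.
    """
    if any(p_i <= 0 for p_i in p):
        raise ValueError("every p_i must be greater than 0 (see e_smooth)")
    return -sum(p_i * math.log(p_i) for p_i in p)
```

Under this assumed form, a sub-interval with p_i = 0 would make the logarithm undefined, which is consistent with the stated purpose of the smoothing factor.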
Therefore, with the above technical solution, in the method for analyzing the sensitivity of text content provided by the present invention, a sensitivity value is assigned to each sensitive word in advance, and then the text content currently to be processed is obtained; word segmentation is performed on the text content to obtain a word group, the word group including at least one word; the obtained word group is searched for sensitive words; when a sensitive word is found, the found sensitive word is marked and the position of its first character within the word group length is recorded, the word group length being the number of all Chinese characters in the word group;
The word group length is divided into N sub-intervals according to the highest sensitivity level X allowed for the text content, where N = 2^X and N and X are positive integers;
The sensitivity p_i of each sub-interval is calculated using the formula [formula not reproduced in this text], where i is a positive integer less than or equal to N that denotes the i-th sub-interval; e_smooth is the entropy smoothing factor, which is greater than 0 and prevents p_i from being equal to 0 when a sub-interval contains no sensitive word; M is the number of sensitive words in the sub-interval; and W_m^level is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval;
Finally, the sensitivity E of the text content is calculated using the formula [formula not reproduced in this text].
Therefore, the present invention enables a news publishing system to analyze the sensitivity of text content automatically. When the publishing system determines that the sensitivity of the text content awaiting publication is low, it publishes the text content directly; when it determines that the sensitivity is high, it forwards the text content to an editor for handling, or flags it, so that the editor can review and edit it further. Editors therefore no longer have to perform sensitivity analysis on every text awaiting publication, which greatly reduces their workload and saves a large amount of human resources, while the automated processing of the publishing system greatly improves the efficiency of publishing manuscripts.
Based on the method for analyzing the sensitivity of text content provided above, the present invention also provides a device for analyzing the sensitivity of text content, comprising: a sensitivity marking unit 100, an acquiring unit 200, a word segmentation unit 300, a searching unit 400, a marking and recording unit 500, a sub-interval division unit 600, a first computing unit 700 and a second computing unit 800. Specifically:
The sensitivity marking unit 100 is configured to assign a sensitivity value to each sensitive word.
In the present invention, sensitive words are words with unwholesome connotations or uncivil language, and also include special sensitive words that some websites set according to their own circumstances and that apply only to those websites. As for which words count as sensitive words, a sensitive word dictionary that records the sensitive words is generally set up, and whether a word is a sensitive word is judged by checking whether it appears in the sensitive word dictionary.
Therefore, the sensitivity marking unit 100 in the present invention can assign sensitivity values in advance to all the sensitive words recorded in the sensitive word dictionary. For example, the sensitivity of sensitive word A is marked in advance as 0.1, that of sensitive word B as 0.2, that of sensitive word C as 0.3, and so on; the present invention marks the sensitivity separately for sensitive words of different natures and different degrees of sensitivity.
The acquiring unit 200 is configured to obtain the text content currently to be processed.
For example, when a news article is to be published, that news article is determined to be the text currently to be processed, and the present invention first uses the acquiring unit 200 to obtain the text content of the news article to be published.
Here, text content refers specifically to the written content of the news article.
The word segmentation unit 300 is configured to perform word segmentation on the text content to obtain a word group, the word group including at least one word.
In this embodiment, the word segmentation unit 300 performs word segmentation on the text content and obtains a word group that includes multiple words. Preferably, the device for analyzing the sensitivity of text content claimed by the present invention may further include, after the word segmentation unit 300 and as shown in Fig. 4:
A stop word processing unit 900, configured to remove the stop words from the word group obtained after word segmentation.
The word group that the present invention finally obtains is then the word group with the stop words removed.
Here, stop words are words that carry no meaning of their own and certain function words, for example punctuation marks, particles, and words such as "is".
As a specific example, take the text content "By the small lake at dusk in midsummer the ripples spread, and a cool breeze blows gently. Zhang Xiaoming and his wife Sun Xiaohong meet Zhao Xiaosi and his wife Li Xiaowu at the lakeside. The two couples chat warmly and greet each other, then stroll up the steps and go to Xiaoming's home as guests." The word segmentation unit 300 performs word segmentation on this text content to obtain a word group, and after the stop word processing unit 900 removes the stop words from that word group, the resulting word group is: "midsummer dusk small lakeside ripples spreading cool breeze gently Zhang Xiaoming and wife Sun Xiaohong at lakeside meet Zhao Xiaosi and his wife Li Xiaowu two couples friends warmly chat greet each other then stroll climb steps come to Xiaoming's home visit play". As can be seen, the present invention has removed the particles and the punctuation marks "," and ".".
The searching unit 400 is configured to search the obtained word group for sensitive words.
Specifically, the searching unit 400 is configured to compare the words in the word group one by one with the words in the sensitive word dictionary; the sensitive word dictionary is used to store sensitive words.
The marking and recording unit 500 is configured to, when the searching unit 400 finds a sensitive word, mark the found sensitive word and record the position of its first character within the word group length; the word group length is the number of all characters in the word group.
Continuing the above example, for the word group "midsummer dusk small lakeside ripples spreading cool breeze gently Zhang Xiaoming and wife Sun Xiaohong at lakeside meet Zhao Xiaosi and his wife Li Xiaowu two couples friends warmly chat greet each other then stroll climb steps come to Xiaoming's home visit play", suppose that "Zhang Xiaoming", "Sun Xiaohong", "Zhao Xiaosi" and "his wife Li Xiaowu" are sensitive words. When the searching unit 400 finds "Zhang Xiaoming", the marking and recording unit 500 marks "Zhang Xiaoming" and records the position of the character "Zhang" within the word group length. The word group contains 68 characters in total (counted in the original Chinese), that is, the word group length is 68, and the position of "Zhang" within the word group length is 15. Similarly, when the searching unit 400 finds "Sun Xiaohong", the marking and recording unit 500 marks it and records the position 22 of the character "Sun" within the word group length; when the searching unit 400 finds "Zhao Xiaosi", the marking and recording unit 500 marks it and records the position 30 of the character "Zhao"; and when the searching unit 400 finds "his wife Li Xiaowu", the marking and recording unit 500 marks it and records the position 34 of its first character "his".
The sub-interval division unit 600 is configured to divide the word group length into N sub-intervals according to the highest sensitivity level X allowed for the text content, where N = 2^X and N and X are positive integers.
In practical applications, every manuscript to be published has a correspondingly configured highest sensitivity level X, where X is a number greater than 0. In theory X can be set arbitrarily, but in general the highest sensitivity level X allowed for the text content is set to 5. The present invention is therefore described taking X equal to 5 as an example.
When X is equal to 5, the present invention divides the obtained word group length into N sub-intervals with N = 2^X, that is, N = 2^5 = 32.
In this embodiment, each sub-interval obtained by the division normally includes at least one word. Of course, if the word group length is too short, that is, the word group contains few characters while the number N of sub-intervals is large, some sub-intervals may contain no words. For a sub-interval that contains no words, the present invention sets its sensitivity p_i equal to a preset value.
The first computing unit 700 is configured to calculate the sensitivity p_i of each sub-interval using the formula [formula not reproduced in this text].
Here, i is a positive integer less than or equal to N that denotes the i-th sub-interval; e_smooth is the entropy smoothing factor, which is greater than 0 and prevents p_i from being equal to 0 when a sub-interval contains no sensitive word; M is the number of sensitive words in the sub-interval; and W_m^level is the sensitivity of a sensitive word whose first character lies in the i-th sub-interval.
After the sub-interval division unit 600 in the present invention has obtained the N sub-intervals by division, the first computing unit 700 calculates the sensitivity p_i of each sub-interval.
It should be noted that, when dividing the word group, the sub-interval division unit 600 in the present invention divides all the characters in the word group evenly. In the sub-intervals obtained, the text of a sub-interval is not necessarily made up of complete words; for example, the text of the 4th sub-interval may be "spreading cool breeze", that of the 5th sub-interval may be "gently Zhang", that of the 6th sub-interval may be "Xiaoming and", and so on.
The first computing unit 700 is in the susceptibility p for calculating each by stages in the present inventioniWhen, it is to by stages Nei Bao
The cumulative summation of the susceptibility of the lead-in of at least one sensitive word included.It such as is " slowly for the words of the 5th by stages
" for, the lead-in " opening " of sensitive word " Zhang little Ming " is in the by stages, then the susceptibility p of the by stagesiEqual to " small
It is bright " susceptibility.And for the words of the 6th by stages " Xiao Ming and ", due to the lead-in " opening " of sensitive word " Zhang little Ming "
The 5th by stages, then while " Xiao Ming " occurs the 6th by stages, the present invention is in the susceptibility for calculating the 6th by stages
piWhen, do not need the susceptibility in calculating " Zhang little Ming " yet.
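The first-character rule can be sketched in code. The formula for p_i is not reproduced in this text, so the sketch assumes the simplest form consistent with the definitions given: p_i equals e_smooth plus the sum of the sensitivities of the sensitive words whose first character lies in interval i. The function names, the value of e_smooth, the word position and the sensitivity level are hypothetical.

```python
def interval_of(pos: int, length: int, n: int) -> int:
    """1-based interval holding the 1-based character position `pos`
    (ceil(pos * n / length), computed with integers)."""
    return (pos * n + length - 1) // length


def interval_sensitivities(hits, length, n, e_smooth=0.01):
    """Assumed rule: p_i = e_smooth + sum of the levels of the sensitive
    words whose FIRST character lies in interval i.
    `hits` is a list of (first_char_pos, level) pairs."""
    p = [e_smooth] * n
    for first_char_pos, level in hits:
        p[interval_of(first_char_pos, length, n) - 1] += level
    return p


length, n = 68, 16                        # hypothetical word group length and interval count
first_pos, word_len, level = 21, 3, 3.0   # hypothetical 3-character sensitive word

# Intervals touched by any character of the word, versus where it is counted.
spanned = sorted({interval_of(first_pos + k, length, n) for k in range(word_len)})
p = interval_sensitivities([(first_pos, level)], length, n)
print(spanned, p[4], p[5])
```

In this hypothetical case the word straddles the 5th and 6th intervals, but its sensitivity is added only to the 5th, where its first character lies; the 6th interval keeps the smoothing value alone.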
Of course, the cut points at which the interval division unit 600 divides the intervals do not necessarily fall on integers. For example, when the above word group with a word group length of 68 is divided into 16 intervals, each interval contains 4.25 characters. In that case the first interval holds the 0th to the 4.25th character, the second interval holds the 4.26th to the 8.50th character, the third interval holds the 8.51st to the 12.75th character, and so on. Here too, the present invention only needs to determine which interval the first character of a sensitive word falls in, and then calculates the sensitivity p_i of that interval.
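The mapping from a character position to an interval with non-integer cut points can be sketched as below. It is an illustrative sketch only: the function name `interval_of` and the convention that a 1-based position lying exactly on a cut point belongs to the lower interval are assumptions. Exact fractions are used so that the 4.25-character width introduces no rounding error.

```python
import math
from fractions import Fraction


def interval_of(pos: int, length: int, n: int) -> int:
    """Return the 1-based interval for the 1-based character position `pos`
    when `length` characters are split evenly into `n` intervals."""
    width = Fraction(length, n)            # e.g. 68/16 = 17/4 = 4.25
    return math.ceil(Fraction(pos) / width)


length, n = 68, 16
width = Fraction(length, n)

# Upper and lower cut points of the first three intervals.
bounds = [(float((i - 1) * width), float(i * width)) for i in range(1, 4)]
print(bounds)

# Characters on either side of each cut point.
print([interval_of(pos, length, n) for pos in (4, 5, 8, 9, 12, 13)])
```

Only the position of a sensitive word's first character has to be passed to such a function; where the remaining characters fall does not affect which interval is credited.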
In the present invention, for an interval that contains no sensitive word, the sensitivity p_i of the interval is equal to e_smooth.
Second computing unit 800, configured to calculate the sensitivity E of the text content using the formula [formula not reproduced in this text].
After the first computing unit 700 has calculated the sensitivity p_i of each interval, the second computing unit 800 applies the formula [formula not reproduced in this text] to each interval in turn, and then uses the formula [formula not reproduced in this text] to calculate the sensitivity E of the text content.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Unless further limited, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The text content sensitivity analysis method and device provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, those of ordinary skill in the art may, in accordance with the idea of the invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (8)
1. A text content sensitivity analysis method, characterized in that each sensitive word is annotated with a sensitivity in advance; the method comprises:
obtaining the text content currently to be processed;
performing word segmentation on the text content to obtain a word group, the word group comprising at least one word;
searching the obtained word group for sensitive words;
when a sensitive word is found, marking the found sensitive word and recording the position of its first character within the word group length, the word group length being the number of all characters in the word group;
according to the highest sensitivity grade X allowed for the text content, dividing the word group length evenly into N intervals, N = 2^X, N and X being positive integers; wherein, among the N intervals obtained by the division, some intervals contain no sensitive word and some intervals contain at least one sensitive word, and each sensitive word either lies entirely within one interval or has one part in one interval and another part in another interval;
calculating the sensitivity p_i of each interval using the formula [formula not reproduced in this text]; wherein i is a positive integer less than or equal to N and denotes the i-th interval, e_smooth is the smoothing factor of the entropy, e_smooth is greater than 0 and prevents p_i from being equal to 0 when the interval contains no sensitive word, M is the number of sensitive words in the interval, and W_m^level is the sensitivity of a sensitive word whose first character lies in the i-th interval; wherein the sensitivity p_i of each interval is calculated by summing the sensitivities of the at least one sensitive word whose first character is contained in that interval;
calculating the sensitivity E of the text content using the formula [formula not reproduced in this text].
2. The method according to claim 1, characterized in that, after performing word segmentation on the text content to obtain a word group, the method further comprises:
removing the stop words from the word group obtained by the word segmentation.
3. The method according to claim 1, characterized in that searching the obtained word group for sensitive words comprises:
comparing the words in the word group one by one with the words in a sensitive word dictionary, the sensitive word dictionary being used to store sensitive words.
4. The method according to any one of claims 1-3, characterized in that the highest sensitivity grade X allowed for the text content is equal to 5.
5. A text content sensitivity analysis device, characterized by comprising:
a sensitivity annotation unit, configured to annotate each sensitive word with a sensitivity;
an acquiring unit, configured to obtain the text content currently to be processed;
a word segmentation unit, configured to perform word segmentation on the text content to obtain a word group, the word group comprising at least one word;
a searching unit, configured to search the obtained word group for sensitive words;
a marking and recording unit, configured to, when the searching unit finds a sensitive word, mark the found sensitive word and record the position of its first character within the word group length, the word group length being the number of all characters in the word group;
an interval division unit, configured to divide the word group length evenly into N intervals according to the highest sensitivity grade X allowed for the text content, N = 2^X, N and X being positive integers; wherein, among the N intervals obtained by the division, some intervals contain no sensitive word and some intervals contain at least one sensitive word, and each sensitive word either lies entirely within one interval or has one part in one interval and another part in another interval;
a first computing unit, configured to calculate the sensitivity p_i of each interval using the formula [formula not reproduced in this text]; wherein i is a positive integer less than or equal to N and denotes the i-th interval, e_smooth is the smoothing factor of the entropy, e_smooth is greater than 0 and prevents p_i from being equal to 0 when the interval contains no sensitive word, M is the number of sensitive words in the interval, and W_m^level is the sensitivity of a sensitive word whose first character lies in the i-th interval; wherein the sensitivity p_i of each interval is calculated by summing the sensitivities of the at least one sensitive word whose first character is contained in that interval;
a second computing unit, configured to calculate the sensitivity E of the text content using the formula [formula not reproduced in this text].
6. The device according to claim 5, characterized by further comprising:
a stop word processing unit, configured to remove the stop words from the word group obtained by the word segmentation.
7. The device according to claim 5, characterized in that the searching unit is specifically configured to compare the words in the word group one by one with the words in a sensitive word dictionary, the sensitive word dictionary being used to store sensitive words.
8. The device according to any one of claims 5-7, characterized in that the highest sensitivity grade X allowed for the text content is equal to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510509318.7A CN105373528B (en) | 2015-08-18 | 2015-08-18 | A kind of text content sensitive analysis method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105373528A CN105373528A (en) | 2016-03-02 |
CN105373528B true CN105373528B (en) | 2019-03-12 |
Family ID=55375736
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510509318.7A Active CN105373528B (en) | 2015-08-18 | 2015-08-18 | A kind of text content sensitive analysis method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105373528B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |