CN103761221B - System and method for identifying sensitive text messages - Google Patents
- Publication number
- CN103761221B (application CN201310749656.9A)
- Authority
- CN
- China
- Prior art keywords
- feature
- word
- module
- text
- test
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a system and a method for identifying sensitive text messages. The system comprises a data training module, a data testing module and an information source partitioning identification module. The data training module represents training texts as a feature space model in vector space form; the data testing module represents test texts as a feature space model in vector space form; and the information source partitioning identification module divides the test text set into a fuzzy region and a non-fuzzy region and classifies and identifies the two regions separately.
Description
Technical field
The present invention relates to a system and a method for identifying sensitive text messages.
Background art
Text messages such as microblog posts have become a booster and amplifier of diverse public opinion, and their role in the public opinion environment has changed greatly. If the online public opinion environment is illegally disrupted, contrary to the principles of freedom of information and fair dissemination, the result can be unfair commercial competition and the like, which in turn disturbs the public. The complex and changeable online public opinion environment, and the various interests driving it behind the scenes, pose unprecedented challenges to the effective monitoring of text messages on microblogs, community forums (BBS) and similar platforms.
In the field of securing the content of text messages, the techniques currently in use are mainly rule-based methods and methods based on probability statistics.
(1) A rule-based method designs a set of rules in advance to indicate whether a given message is sensitive information. As shown in Fig. 1, the method is mainly implemented by a data input preprocessing module, a data rule extraction module, a rule judgment module and a result output module. The rule judgment module, which decides whether the data satisfies the relevant rules, is the core module. In concrete implementations, several typical rule-based methods are as follows:
Rules based on IP address, domain name and routing: black and white lists are set up for IP addresses; an information source whose IP address is on the blacklist is filtered, while one whose IP address is on the whitelist is let through. Server-side configuration can be applied to access control lists, TCP wrappers, host routing tables and similar facilities. Security authentication methods include challenge/response systems, computational challenge systems and the like.
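The black and white list rule described above can be sketched as follows. This is a minimal illustration, not part of the patent: the network ranges, the function name `decide` and the third outcome `"inspect"` for sources on neither list are all assumptions.

```python
import ipaddress

# Hypothetical example lists; a real deployment would load these from configuration.
BLACKLIST = [ipaddress.ip_network("203.0.113.0/24")]
WHITELIST = [ipaddress.ip_network("198.51.100.7/32")]

def decide(source_ip):
    """Return 'pass', 'filter' or 'inspect' for the IP address of an information source."""
    ip = ipaddress.ip_address(source_ip)
    if any(ip in net for net in WHITELIST):
        return "pass"      # whitelisted sources are let through
    if any(ip in net for net in BLACKLIST):
        return "filter"    # blacklisted sources are filtered
    return "inspect"       # assumption: unknown sources go on to further checks
```

Checking the whitelist first is a design choice of this sketch; the text does not say which list takes precedence.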
Rules based on content and behavior: for example, a mail that contains too many advertisements or overuses English capital letters, overuses words related to topics such as real estate or medicine, or uses excessively gaudy HTML styles and colors. Once such a content rule is matched, the information source is filtered. If the information in the mail header indicates mass mailing, and monitoring at the MTA shows that the IP address of the information source has exceeded a traffic threshold within a specified time, special processing such as filtering is applied.
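A content rule of this kind might look like the sketch below. The keyword list, the two thresholds and the function name `violates_content_rules` are hypothetical values chosen only for illustration; the text does not specify any of them.

```python
import re

# Hypothetical keyword list and thresholds, for illustration only.
KEYWORDS = {"property", "medicine", "discount"}
MAX_UPPER_RATIO = 0.5
MAX_KEYWORD_HITS = 3

def violates_content_rules(text):
    """Flag text that overuses capital letters or listed keywords."""
    letters = [ch for ch in text if ch.isalpha()]
    upper_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    words = re.findall(r"[a-z]+", text.lower())
    keyword_hits = sum(word in KEYWORDS for word in words)
    return upper_ratio > MAX_UPPER_RATIO or keyword_hits > MAX_KEYWORD_HITS
```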
(2) A method based on probability statistics uses a number of features to classify different content: it computes a measure of how strongly the features indicate each category and takes the maximum; if the category with the maximum value is the sensitive category, the message is processed accordingly. As shown in Fig. 2, the method is mainly implemented by a test data input module, a training data training module, a classifier classification module and a result output module. Data training is a statistical learning process that yields the corresponding classifier. The classification algorithm used to train the classifier can be chosen according to the application scenario, for example naive Bayes or K-means.
Shortcomings of rule-based methods: they perform poorly in applications where the rules are not clear-cut, and normal information sources are often classified as abnormal. Even in applications where the rules are clear-cut, once the creator of an information source knows all the rules, its behavior becomes more covert in order to get around rule processing. Another problem with rule-based methods is that whether an information source is identified as sensitive depends on its readership and on where it is posted: content that is clearly marked as sensitive for certain specific users, bulletin messages or Wikipedia may be quite normal in other settings. In other words, because different users define sensitive information by different standards, separate examples and data sets also have to be built for different users, groups and so on.
Methods based on probability statistics also have shortcomings that depend on the algorithm used. The biggest defect of the naive Bayes classifier, for example, is that it cannot handle changes in the outcome produced by combinations of features: if the words "U.S." and "911" are each assumed to be non-sensitive, then "U.S. 911", which is actually sensitive information, is identified as non-sensitive and let through. The main drawback of K-means, for another example, is that to find the closest data items, each item to be predicted must be compared with all of the data items; for data sets of millions or even tens of millions of items this is very inefficient in both time and space.
Accordingly, it is desirable to provide a high-performance system and method for identifying sensitive text messages.
Summary of the invention
The present invention is proposed to solve at least one of the above disadvantages and problems of the prior art. In view of the shortcomings of the prior art, a method is proposed that partitions the set of information sources and identifies them step by step using different types of features. On the one hand, it can show higher performance when processing large data sets; on the other hand, when applied to sensitive information identification, it is also considerably more effective than common classification algorithms.
According to one aspect, the present invention proposes a system for identifying sensitive text messages, comprising: a data training module for representing training text as a feature space model in vector space form; a data testing module for representing test text as a feature space model in vector space form; and an information source partitioning identification module for dividing the test text set into a fuzzy region and a non-fuzzy region according to the distribution of text points in a two-dimensional space, and for classifying and identifying the fuzzy region and the non-fuzzy region separately.
Optionally, the data training module comprises: a training text preprocessing module for preprocessing the training text; a feature extraction module for extracting features from the preprocessing results of the training text preprocessing module; and a feature selection module for performing feature selection on the features extracted by the feature extraction module, so that feature selection over features composed of characters, words and character/word strings yields a feature space.
Optionally, the data testing module comprises: a test text preprocessing module for preprocessing the test text; a feature extraction module for extracting features from the preprocessing results of the test text preprocessing module; and a feature selection module for performing feature selection on the features extracted by the feature extraction module, so that feature selection over features composed of characters, words and character/word strings yields a feature space.
Optionally, the information source partitioning identification module comprises: a region division module for dividing the test text set into the fuzzy region and the non-fuzzy region according to the distribution of text points in the two-dimensional space; a first classification and identification module for classifying and identifying the fuzzy region using characters or words as features; and a second classification and identification module for classifying and identifying the non-fuzzy region using, as features, bigrams formed from two adjacent characters or words.
Optionally, the characters or words are obtained by a word segmentation tool.
According to another aspect of the present invention, a method for identifying sensitive text messages is provided, comprising: representing training text as a feature space model in vector space form; representing test text as a feature space model in vector space form; dividing the test text set into a fuzzy region and a non-fuzzy region according to the distribution of text points in a two-dimensional space; classifying and identifying the fuzzy region using characters or words as features; and classifying and identifying the non-fuzzy region using, as features, bigrams formed from two adjacent characters or words.
Optionally, representing the training text as a feature space model in vector space form comprises: preprocessing the training text; extracting features from the preprocessing results; and performing feature selection on the extracted features.
Optionally, representing the test text as a feature space model in vector space form comprises: preprocessing the test text; extracting features from the preprocessing results; and performing feature selection on the extracted features.
Optionally, the characters or words are obtained by a word segmentation tool.
Optionally, a classifier is trained using Bayes or K-means as the classification algorithm, and the test text set is divided into the fuzzy region and the non-fuzzy region.
Brief description of the drawings
The above and other aspects, features and advantages of certain exemplary embodiments of the present invention will become clear to those skilled in the art from the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a block diagram of an implementation of the rule-based method;
Fig. 2 is a block diagram of an implementation of the method based on probability statistics;
Fig. 3 is a block diagram of the system for identifying sensitive text messages; and
Fig. 4 is a flowchart of the method for identifying sensitive text messages.
Detailed description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the present invention. It includes various details to assist in that understanding, but these are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. Likewise, descriptions of well-known functions and structures are omitted for clarity and conciseness.
The system and method for identifying sensitive text messages of the present invention divide the test text set into two parts according to a fuzzy region, and identify the texts of the two parts separately using different types of features. This technical solution improves on traditional probability-based techniques: the precision and recall of the identification and classification results are clearly improved, different algorithms can be combined for sensitive information identification and text classification, and the approach has significant application potential for the efficient classification of large volumes of data.
Fig. 3 is a block diagram of the system for identifying sensitive text messages.
As shown in Fig. 3, the system for identifying sensitive text messages of the present invention includes a data training module 310, a data testing module 320 and an information source partitioning identification module 330.
The data training module 310 is used to represent training text as a feature space model in vector space form.
In one embodiment, the data training module 310 includes a training text preprocessing module 312, a feature extraction module 314 and a feature selection module 316.
The training text preprocessing module 312 is used to preprocess the training text. For example, the training text preprocessing module 312 can remove punctuation marks and invalid characters from the training text, filter out stop words, and then segment the training text into words. Stop words can include, for example, "be" and the like.
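The preprocessing just described can be sketched as below. This is an illustration under stated assumptions, not the patent's implementation: the stop-word list is hypothetical, whitespace splitting stands in for a real word segmentation tool, and stop words are filtered after tokenizing because tokens are the unit this sketch works with.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "be", "of"}

def preprocess(text):
    """Strip punctuation and symbols, split into tokens, then drop stop words."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())  # punctuation and symbols become spaces
    tokens = cleaned.split()                          # stand-in for a word segmentation tool
    return [tok for tok in tokens if tok not in STOP_WORDS]
```

For Chinese text, the `split()` call would be replaced by the output of whichever segmentation tool is in use.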
The feature extraction module 314 can extract features from the preprocessing results of the training text preprocessing module 312. For example, the feature extraction module 314 can, as required, select characters, words, or bigrams formed from characters and words as features. The features selected in the first step should have strong coverage and strong discriminating power; the ordinary words produced by word segmentation are therefore chosen as features, which gives better recognition than single characters or word combinations. The features selected in the second step should have strong discriminating power for those texts whose first-step recognition results are unreliable, so their semantic character needs to be more prominent. Word combinations have exactly this semantic quality: for example, the word combination "commodity attribute" carries stronger semantics than the single words "commodity" and "attribute". The second step therefore uses, as features, bigrams formed from two adjacent words after the text has been segmented. As an example, a word segmentation tool can be used: the first step takes the segmentation results of the tool directly as features, and the second step takes combinations of two adjacent words after segmentation as features.
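The two feature types can be illustrated with a short sketch. The function names are hypothetical, and joining adjacent tokens with a space is an assumption made for readability; the input is taken to be the token list produced by a segmentation tool.

```python
def unigram_features(tokens):
    """First-step features: the segmented tokens themselves."""
    return list(tokens)

def bigram_features(tokens):
    """Second-step features: each pair of adjacent tokens joined into one string."""
    return [first + " " + second for first, second in zip(tokens, tokens[1:])]
```

With tokens `["commodity", "attribute"]`, the first step sees two separate features, while the second step sees the single combined feature, which is the distinction the paragraph above relies on.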
The feature selection module 316 performs feature selection on the features extracted by the feature extraction module 314, so that feature selection over features composed of characters, words and character/word strings yields a feature space represented by a vector space model. Methods that can be used for feature selection include, for example, term frequency (TF) statistics, document frequency (DF) statistics, inverse document frequency (IDF), mutual information, the chi-square (CHI) statistic and information gain. For example, the feature selection module 316 can perform feature selection using the chi-square statistic.
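Chi-square feature selection can be sketched as follows, assuming the common 2x2 contingency form of the statistic for a term and a target class. The function names, the representation of each document as a set of features, and the tie-breaking by term name are assumptions of this sketch.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a term/class 2x2 table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def select_features(docs, labels, target, k):
    """Return the k terms with the highest chi-square score for the target class."""
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == target)
        b = sum(1 for doc, y in zip(docs, labels) if term in doc and y != target)
        c = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == target)
        d = sum(1 for doc, y in zip(docs, labels) if term not in doc and y != target)
        scores[term] = chi_square(a, b, c, d)
    # Sort by descending score, breaking ties alphabetically for a stable result.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]
```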
Like the data training module 310, the data testing module 320 is used to represent test text as a feature space model in vector space form.
In one embodiment, the data testing module 320 includes a test text preprocessing module 322, a feature extraction module 324 and a feature selection module 326.
The test text preprocessing module 322 is used to preprocess the test text. For example, the test text preprocessing module 322 can remove punctuation marks and invalid characters from the test text, filter out stop words, and then segment the test text into words. Stop words can include, for example, "be" and the like.
The feature extraction module 324 can extract features from the preprocessing results of the test text preprocessing module 322. For example, the feature extraction module 324 can, as required, select characters, words, or bigrams formed from characters and words as features. As described above, the features selected in the first step should have strong coverage and strong discriminating power; the ordinary words produced by word segmentation are therefore chosen as features, which gives better recognition than single characters or word combinations. The features selected in the second step should have strong discriminating power for those texts whose first-step recognition results are unreliable, so their semantic character needs to be more prominent. Word combinations have exactly this semantic quality: for example, the word combination "commodity attribute" carries stronger semantics than the single words "commodity" and "attribute". The second step therefore uses, as features, bigrams formed from two adjacent words after the text has been segmented.
The feature selection module 326 performs feature selection on the features extracted by the feature extraction module 324, so that feature selection over features composed of characters, words and character/word strings yields a feature space represented by a vector space model. Methods that can be used for feature selection include, for example, term frequency (TF) statistics, document frequency (DF) statistics, inverse document frequency (IDF), mutual information, the chi-square (CHI) statistic and information gain. For example, the feature selection module 326 can perform feature selection using the chi-square statistic.
The information source partitioning identification module 330 is used to divide the test text set into a fuzzy region A and a non-fuzzy region B according to the distribution of text points in a two-dimensional space, to first classify and identify the fuzzy region A using characters or words as features, and then to classify and identify the non-fuzzy region B using, as features, bigrams formed from two adjacent characters or words.
In one embodiment, the information source partitioning identification module 330 includes a region division module 332, a first classification and identification module 334 and a second classification and identification module 336. The region division module 332 can be used to divide the test text set into the fuzzy region A and the non-fuzzy region B according to the distribution of text points in the two-dimensional space. The first classification and identification module 334 can be used to classify and identify the fuzzy region A using characters or words as features. The second classification and identification module 336 can be used to classify and identify the non-fuzzy region B using, as features, bigrams formed from two adjacent characters or words.
For example, the information source partitioning identification module 330 can train a classifier using either Bayes or K-means as the classification algorithm, so as to split the test text set into two parts: text set A outside the fuzzy region and text set B inside the fuzzy region.
Specifically, taking the Bayesian classification algorithm as an example, let a binary text vector be d = (w_1, w_2, ..., w_D), where w_i = 0 or 1: w_i = 1 if the i-th feature occurs in text d, and w_i = 0 otherwise. The probability that a text d_x belongs to category c_j can be written P(c_j | d_x); after computation, text d_x is assigned to the category with the largest value. P(c_j | d_x) can be computed as:
(formula not reproduced in this text)
where n is the number of features. The discriminant function is then expressed as:
(formula not reproduced in this text)
Let
(formula not reproduced in this text)
The discriminant of the classifier can then be approximated as f(d) = MV_max - MV_sec. Taking MV_max as the abscissa and MV_sec as the ordinate to construct a two-dimensional space, a text d is represented as a point (x, y) in that space, and the distance dist from the point to the dividing line f(d) = 0, that is, the line x - y = 0, is:
dist = |x - y| / √2
The magnitude of dist reflects how reliably the first step identifies text d_x as the sensitive category c*, and thus determines which part of the first-step recognition is unreliable. Here Dist is an empirical boundary constant.
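The formulas referred to in this passage are not reproduced in the text. As a hedged reconstruction only, assuming the standard multivariate Bernoulli naive Bayes model for binary feature vectors (the patent's exact expressions may differ), and assuming that MV_max and MV_sec denote the largest and second-largest class scores, the quantities could be written as:

```latex
\begin{aligned}
P(c_j \mid d_x) &= \frac{P(c_j)\,\prod_{i=1}^{n} P(w_i = 1 \mid c_j)^{w_i}\,\bigl(1 - P(w_i = 1 \mid c_j)\bigr)^{1 - w_i}}{P(d_x)} \\
j^{*} &= \arg\max_{j} P(c_j \mid d_x), \qquad c^{*} = c_{j^{*}} \\
MV_{\max} &= P(c_{j^{*}} \mid d_x), \qquad MV_{\mathrm{sec}} = \max_{j \neq j^{*}} P(c_j \mid d_x) \\
f(d) &= MV_{\max} - MV_{\mathrm{sec}} \\
dist &= \frac{|x - y|}{\sqrt{2}} = \frac{MV_{\max} - MV_{\mathrm{sec}}}{\sqrt{2}}, \qquad (x, y) = (MV_{\max}, MV_{\mathrm{sec}})
\end{aligned}
```

Only the last line follows directly from the text, as the point-to-line distance for the line x - y = 0. Under this reading, a text whose dist falls below the empirical constant Dist would presumably be treated as unreliably classified, although the comparison itself is not spelled out.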
For example, in one embodiment a total of 5000 records are used, comprising 4000 non-sensitive records and 1000 sensitive records. In the first step, the non-sensitive and sensitive data are randomly divided into 4 equal parts, of which 3 parts form the training data set and 1 part forms the test data set. 4-fold cross-validation is performed, and the mean over the 4 folds is used as the final metric. First, characters and words are extracted from the training data and feature selection is performed to build the feature space; features of the same type are then extracted from the test text, the test text is classified with some algorithm, and the result is evaluated. If the result is reliable, it is taken as the final identification result; if it is unreliable, the text to be identified is placed in the fuzzy interval to be processed in the second step. In the second step, a feature type entirely different from that of the first step is used: the training data is first learned statistically to build the feature space, features of that type are then extracted from the test text for identification, and the final judgment is made from the recognition results of the two steps. Under cross-validation, the invention, using the Bayesian algorithm, clearly improves the precision and recall of sensitive information identification compared with the single-step Bayesian method.
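The 4-fold split in this embodiment can be sketched as follows. Splitting each class evenly across the folds is an assumption of this sketch (the text only says the data are randomly divided into 4 equal parts), and the function name and `seed` parameter are hypothetical.

```python
import random

def four_fold_splits(labels, folds=4, seed=0):
    """Yield (train_indices, test_indices) pairs, spreading each class evenly over the folds."""
    rng = random.Random(seed)
    fold_members = [[] for _ in range(folds)]
    for label in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            fold_members[position % folds].append(index)
    for k in range(folds):
        test = sorted(fold_members[k])
        train = sorted(i for j in range(folds) if j != k for i in fold_members[j])
        yield train, test
```

With 4000 non-sensitive and 1000 sensitive records, each fold then holds 1250 test records and 3750 training records, and a metric such as precision or recall would be averaged over the 4 folds.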
Fig. 4 is a flowchart of the method for identifying sensitive text messages.
As shown in Fig. 4, the method starts at step 410. In step 410, the training text is represented as a feature space model in vector space form.
First, the training text can be preprocessed. For example, punctuation marks and invalid characters can be removed from the training text, stop words filtered out, and the training text then segmented into words. Stop words can include, for example, "be" and the like.
Then, features can be extracted from the preprocessing results. For example, characters, words, or bigrams formed from characters and words can be selected as features as required.
Finally, feature selection can be performed on the extracted features, so that feature selection over features composed of characters, words and character/word strings yields a feature space represented by a vector space model. Methods that can be used for feature selection include, for example, term frequency (TF) statistics, document frequency (DF) statistics, inverse document frequency (IDF), mutual information, the chi-square (CHI) statistic and information gain. For example, feature selection can be performed using the chi-square statistic.
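Representing a text in vector space form can be sketched as below, assuming the binary (0/1) text vectors used in the Bayesian example of the Fig. 3 description. The function names and the alphabetical ordering of features are assumptions of this sketch.

```python
def build_feature_space(selected_features):
    """Map each selected feature to a fixed vector position."""
    return {feature: i for i, feature in enumerate(sorted(selected_features))}

def to_binary_vector(text_features, feature_index):
    """Return d = (w_1, ..., w_D) with w_i = 1 if feature i occurs in the text, else 0."""
    vector = [0] * len(feature_index)
    for feature in text_features:
        position = feature_index.get(feature)
        if position is not None:   # features outside the feature space are ignored
            vector[position] = 1
    return vector
```

The same feature index would be applied to both training and test texts so that their vectors are comparable.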
In step 420, the test text is represented as a feature space model in vector space form.
First, the test text can be preprocessed. For example, punctuation marks and invalid characters can be removed from the test text, stop words filtered out, and the test text then segmented into words. Stop words can include, for example, "be" and the like.
Then, features can be extracted from the preprocessing results. For example, characters, words, or bigrams formed from characters and words can be selected as features as required.
Finally, feature selection is performed on the extracted features, so that feature selection over features composed of characters, words and character/word strings yields a feature space represented by a vector space model. Methods that can be used for feature selection include, for example, term frequency (TF) statistics, document frequency (DF) statistics, inverse document frequency (IDF), mutual information, the chi-square (CHI) statistic and information gain. For example, feature selection can be performed using the chi-square statistic.
In step 430, the set of texts to be identified is divided into a fuzzy region A and a non-fuzzy region B according to the distribution of text points in the two-dimensional space.
In step 440, the fuzzy region A is classified and identified using characters or words as features.
In step 450, the non-fuzzy region B is classified and identified using, as features, bigrams formed from two adjacent characters or words.
For example, a classifier can be trained using either Bayes or K-means as the classification algorithm, so as to split the test text set into two parts: text set A outside the fuzzy region and text set B inside the fuzzy region.
For the specific method of splitting the test text set into text set A outside the fuzzy region and text set B inside the fuzzy region, see the details described above with reference to Fig. 3, which are not repeated here.
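The region division of step 430 can be sketched with a small Bernoulli naive Bayes classifier. Several points are assumptions of this sketch rather than statements from the patent: Laplace smoothing, the use of normalized posteriors as MV_max and MV_sec, the rule that a text is fuzzy when dist is below the threshold, and all function names.

```python
import math

def train_bernoulli_nb(vectors, labels):
    """Estimate class priors and per-feature probabilities with Laplace smoothing."""
    model = {}
    n_features = len(vectors[0])
    for c in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        prior = len(rows) / len(vectors)
        probs = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2) for i in range(n_features)]
        model[c] = (prior, probs)
    return model

def posteriors(model, vec):
    """Return the normalized posterior P(c | d) for every class c."""
    log_scores = {}
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for w, p in zip(vec, probs):
            score += math.log(p if w else 1.0 - p)
        log_scores[c] = score
    top = max(log_scores.values())
    exps = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def split_by_margin(model, vectors, dist_threshold):
    """Split test vectors into (fuzzy, non_fuzzy) lists of (index, predicted_class)."""
    fuzzy, non_fuzzy = [], []
    for i, vec in enumerate(vectors):
        ranked = sorted(posteriors(model, vec).items(), key=lambda kv: kv[1], reverse=True)
        (best_class, mv_max), (_, mv_sec) = ranked[0], ranked[1]
        dist = (mv_max - mv_sec) / math.sqrt(2)  # distance from (mv_max, mv_sec) to x - y = 0
        (fuzzy if dist < dist_threshold else non_fuzzy).append((i, best_class))
    return fuzzy, non_fuzzy
```

The sketch needs at least two classes. Texts landing in each list would then be handed to the classification step that the method assigns to that region.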
It should be noted that the device and method embodiments of the present invention are described separately above, but details described for one embodiment can also be applied to another embodiment.
The basic principles of the invention have been described above with reference to specific embodiments. It should be noted, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the method and system of the invention can be implemented in software, hardware, firmware or a combination thereof, which those of ordinary skill in the art can achieve with their basic programming skills after reading the description of the invention.
Therefore, the object of the present invention can also be achieved by running a software module or a group of software modules on any computing device. The computing device can be a well-known general-purpose device. The object of the present invention can thus also be achieved merely by providing a program product containing program code that implements the method or device. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. Obviously, the storage medium can be any known storage medium or any storage medium developed in the future.
Although this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired results. In some cases, multitasking and parallel processing can be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
A computer program (also known as a program, software, software application, script or code) can be written in any form of programming language, including compiled or interpreted languages and declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (for example, one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (for example, files that store one or more modules, subprograms or portions of code).
The specific embodiments described above do not limit the scope of protection of the invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, subcombinations and substitutions can occur. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (5)
1. A system for identifying sensitive text messages, comprising:
a data training module, configured to represent training text as a feature space model in vector space form;
a data testing module, configured to represent test text as a feature space model in vector space form; and
an information source partitioning identification module, configured to divide a test text set into a fuzzy region and a non-fuzzy region according to the distribution of text points in two-dimensional space, and to classify the fuzzy region and the non-fuzzy region separately;
wherein the data training module comprises: a training text preprocessing module, configured to preprocess the training text; a first feature extraction module, configured to extract features from the preprocessing results of the training text preprocessing module; and a first feature selection module, configured to perform feature selection on the features extracted by the feature extraction module, so that feature selection over features composed of characters, words, and character-word strings yields a feature space;
wherein the data testing module comprises: a test text preprocessing module, configured to preprocess the test text; a second feature extraction module, configured to extract features from the preprocessing results of the test text preprocessing module; and a second feature selection module, configured to perform feature selection on the features extracted by the feature extraction module, so that feature selection over features composed of characters, words, and character-word strings yields a feature space;
wherein the information source partitioning identification module comprises: a region division module, configured to divide the test text set into the fuzzy region and the non-fuzzy region according to the distribution of text points in two-dimensional space; a first classification module, configured to classify the fuzzy region using characters or words as features; and a second classification module, configured to classify the non-fuzzy region using, as features, two-element strings (bigrams) composed of two adjacent characters or words.
2. The system according to claim 1, wherein the characters or words are obtained by a word segmentation tool.
3. A method for identifying sensitive text messages, comprising:
representing training text as a feature space model in vector space form;
representing test text as a feature space model in vector space form;
dividing a test text set into a fuzzy region and a non-fuzzy region according to the distribution of text points in two-dimensional space;
classifying the fuzzy region using characters or words as features; and
classifying the non-fuzzy region using, as features, two-element strings (bigrams) composed of two adjacent characters or words;
wherein representing the training text as a feature space model in vector space form comprises: preprocessing the training text; extracting features from the preprocessing results; and performing feature selection on the extracted features;
wherein representing the test text as a feature space model in vector space form comprises: preprocessing the test text; extracting features from the preprocessing results; and performing feature selection on the extracted features.
4. The method according to claim 3, wherein the characters or words are obtained by a word segmentation tool.
5. The method according to claim 3, wherein a classifier is trained using Bayes or K-means as the classification algorithm, so as to partition the test text set into the fuzzy region and the non-fuzzy region.
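The flow recited in claims 3 to 5 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that the "text point in two-dimensional space" is the pair of class log-scores produced by a naive Bayes model (claim 5 names Bayes as one option), and that a text falls in the fuzzy region when the two scores differ by less than a threshold. The names `NaiveBayes`, `split_regions`, `classify`, and the `margin` parameter are hypothetical and do not appear in the patent; input texts are assumed to be already segmented into characters or words.

```python
import math
from collections import Counter


def unigrams(tokens):
    """Single characters or words used as features."""
    return list(tokens)


def bigrams(tokens):
    """Two-element strings built from adjacent characters or words."""
    return [a + b for a, b in zip(tokens, tokens[1:])]


class NaiveBayes:
    """Two-class multinomial naive Bayes with add-one smoothing (illustrative)."""

    def __init__(self, featurize):
        self.featurize = featurize
        self.counts = {0: Counter(), 1: Counter()}  # feature counts per class
        self.docs = Counter()                       # document counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for tokens, label in zip(texts, labels):
            feats = self.featurize(tokens)
            self.counts[label].update(feats)
            self.docs[label] += 1
            self.vocab.update(feats)
        return self

    def log_scores(self, tokens):
        """Return (score for class 0, score for class 1): an assumed 2-D text point."""
        total_docs = sum(self.docs.values())
        scores = []
        for label in (0, 1):
            total = sum(self.counts[label].values())
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            for feat in self.featurize(tokens):
                # The extra +1 in the denominator reserves mass for unseen features.
                score += math.log(
                    (self.counts[label][feat] + 1) / (total + len(self.vocab) + 1)
                )
            scores.append(score)
        return tuple(scores)

    def predict(self, tokens):
        s0, s1 = self.log_scores(tokens)
        return int(s1 > s0)  # ties fall back to class 0


def split_regions(model, texts, margin):
    """Assumed rule: a text is 'fuzzy' when its two class scores are close."""
    fuzzy, non_fuzzy = [], []
    for tokens in texts:
        s0, s1 = model.log_scores(tokens)
        (fuzzy if abs(s1 - s0) < margin else non_fuzzy).append(tokens)
    return fuzzy, non_fuzzy


def classify(train_texts, train_labels, test_texts, margin=1.0):
    """Fuzzy region: unigram features. Non-fuzzy region: bigram features."""
    uni = NaiveBayes(unigrams).fit(train_texts, train_labels)
    bi = NaiveBayes(bigrams).fit(train_texts, train_labels)
    fuzzy, non_fuzzy = split_regions(uni, test_texts, margin)
    results = [(tokens, uni.predict(tokens)) for tokens in fuzzy]
    results += [(tokens, bi.predict(tokens)) for tokens in non_fuzzy]
    return results
```

Here label 1 stands for "sensitive" and label 0 for "not sensitive". The choice of `margin` controls how many texts are routed to the unigram classifier and would need tuning on real data; the feature selection step recited in the claims is omitted from this sketch.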
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310749656.9A CN103761221B (en) | 2013-12-31 | 2013-12-31 | System and method for identifying sensitive text messages |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310749656.9A CN103761221B (en) | 2013-12-31 | 2013-12-31 | System and method for identifying sensitive text messages |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103761221A CN103761221A (en) | 2014-04-30 |
CN103761221B true CN103761221B (en) | 2017-05-17 |
Family
ID=50528461
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310749656.9A Active CN103761221B (en) | System and method for identifying sensitive text messages | 2013-12-31 | 2013-12-31 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103761221B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104484371B (en) * | 2014-12-05 | 2017-11-10 | 广州供电局有限公司 | Power marketing abnormal data on-line monitoring analysis method and system |
CN104933443A (en) * | 2015-06-26 | 2015-09-23 | 北京途美科技有限公司 | Automatic identifying and classifying method for sensitive data |
CN107818077A (en) * | 2016-09-13 | 2018-03-20 | 北京金山云网络技术有限公司 | A kind of sensitive content recognition methods and device |
CN107547718B (en) * | 2017-08-22 | 2020-11-03 | 电子科技大学 | Telecommunication fraud identification and defense system based on deep learning |
CN108763477A (en) * | 2018-05-29 | 2018-11-06 | 厦门快商通信息技术有限公司 | A kind of short text classification method and system |
CN109308264B (en) * | 2018-10-22 | 2021-11-16 | 北京天融信网络安全技术有限公司 | Method for evaluating data desensitization effect, corresponding device and storage medium |
US11789982B2 (en) * | 2020-09-23 | 2023-10-17 | Electronic Arts Inc. | Order independent data categorization, indication, and remediation across realtime datasets of live service environments |
CN114239591B (en) * | 2021-12-01 | 2023-08-18 | 马上消费金融股份有限公司 | Sensitive word recognition method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101876987A (en) * | 2009-12-04 | 2010-11-03 | 中国人民解放军信息工程大学 | A Two-Class Text Classification Method Oriented to Class Overlap |
CN102184262A (en) * | 2011-06-15 | 2011-09-14 | 悠易互通(北京)广告有限公司 | Web-based text classification mining system and web-based text classification mining method |
CN103136377A (en) * | 2013-03-26 | 2013-06-05 | 重庆邮电大学 | Chinese text classification method based on evolution super-network |
Non-Patent Citations (3)
Title |
---|
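Research on a text preprocessing method for harmful information filtering; Wu Huiling et al.; Microcomputer Information; 20061231; Vol. 22 (No. 12-3); 58-60 * |
Research on preprocessing methods for harmful text information filtering; Wu Huiling et al.; Network Security Technology & Application; 20061101; 61-63 * |
A Chinese text feature extraction method based on flexible matching; Shuai Zhenghua et al.; Computer Engineering; 20100831; Vol. 36 (No. 16); 63, 64, 70 * |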
Also Published As
Publication number | Publication date |
---|---|
CN103761221A (en) | 2014-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103761221B (en) | System and method for identifying sensitive text messages | |
Boenninghoff et al. | Explainable authorship verification in social media via attention-based similarity learning | |
Fang et al. | Self multi-head attention-based convolutional neural networks for fake news detection | |
WO2020164278A1 (en) | Image processing method and device, electronic equipment and readable storage medium | |
CN106709032A (en) | Method and device for extracting structured information from spreadsheet document | |
CN111753120B (en) | Question searching method and device, electronic equipment and storage medium | |
Sang et al. | Robust movie character identification and the sensitivity analysis | |
CN110209819A (en) | File classification method, device, equipment and medium | |
CN108959329A (en) | A kind of file classification method, device, medium and equipment | |
CN114218391A (en) | Sensitive information identification method based on deep learning technology | |
CN112784111A (en) | Video classification method, device, equipment and medium | |
Schmøkel et al. | FBAdLibrarian and Pykognition: open science tools for the collection and emotion detection of images in Facebook political ads with computer vision | |
CN111639185B (en) | Relation information extraction method, device, electronic equipment and readable storage medium | |
CN116977692A (en) | Data processing method, device and computer readable storage medium | |
KR102185733B1 (en) | Server and method for automatically generating profile | |
CN114973107A (en) | Unsupervised cross-domain video action identification method based on multi-discriminator cooperation and strong and weak sharing mechanism | |
Unal et al. | Visual persuasion in covid-19 social media content: A multi-modal characterization | |
CN111784360B (en) | Anti-fraud prediction method and system based on network link backtracking | |
CN113837836A (en) | Model recommendation method, device, equipment and storage medium | |
CN113962199A (en) | Text recognition method, text recognition device, text recognition equipment, storage medium and program product | |
CN112861750A (en) | Video extraction method, device, equipment and medium based on inflection point detection | |
CN111581975A (en) | Case writing text processing method and device, storage medium and processor | |
CN108170838B (en) | Topic evolution visualization display method, application server and computer readable storage medium | |
CN111931229B (en) | Data identification method, device and storage medium | |
CN116863406A (en) | Intelligent alarm security system based on artificial intelligence and method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |