CN107818077A - Sensitive content recognition method and device - Google Patents
- Publication number
- CN107818077A (CN201610822280.3A)
- Authority
- CN
- China
- Prior art keywords
- content
- word segment
- target communication
- sensitive
- communication
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Abstract
Embodiments of the present invention provide a sensitive content recognition method and device. The method includes: performing word segmentation on target communication content to be identified to obtain at least one corresponding target word segment; generating a word segment attribute for each target word segment using a preset word segment attribute generation rule, and generating a target feature vector corresponding to the target communication content; and inputting the target feature vector into a pre-established sensitive content recognition model to obtain a recognition result indicating whether the target communication content is sensitive content. With this scheme, recognition does not rely on simple literal matching between keywords in the communication content and sensitive words in a preset dictionary. Instead, it uses word segment attributes, which capture more than the literal form of the word segments of the communication content, generates a feature vector from the attribute of each word segment, and performs recognition with a sensitive content recognition model trained by a machine learning algorithm, which improves the accuracy of sensitive content recognition.
Description
Technical field
The present invention relates to the technical field of network security, and in particular to a sensitive content recognition method and device.
Background
In recent years, as network technology has developed, the requirements for network security have risen steadily, and they are especially pressing in areas such as corporate management, copyright management, and national security. Web content and network users therefore need to be monitored so that sensitive content in the network can be identified promptly and sensitive users can then be identified from that content, thereby safeguarding network security.
The prior art provides a sensitive content recognition scheme based on keyword matching. To identify sensitive content, the communication content to be identified is first obtained and segmented into words according to a preset word segmentation rule. Keywords are then filtered out of the resulting word segments and matched against the sensitive words in a preset dictionary. Specifically, when a sensitive word identical to a filtered keyword is found in the preset dictionary, that keyword is considered a matched sensitive word. When the matching result meets a preset condition (for example, the number of matched sensitive words is greater than a preset number), the communication content to be identified is judged to be sensitive content.
As can be seen from the above, although this scheme can identify sensitive content in the network, it matches the keywords in the communication content only at the literal level, which easily leads to low recognition accuracy. For example, suppose the communication content to be identified contains the keyword "Apple mobile phone", and the preset dictionary contains "iphone" but not "Apple mobile phone". "Apple mobile phone" cannot be matched with "iphone", so "Apple mobile phone" is not considered a sensitive word. A great many keywords in the communication content to be identified may thus fail to match any sensitive word in the preset dictionary, so the accuracy of sensitive content recognition is low.
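The limitation of literal matching described above can be illustrated with a minimal sketch. The function name, the dictionary contents, and the `preset_number` parameter are assumptions made for illustration only; the prior-art scheme is described in the text only in general terms.

```python
def keyword_match_is_sensitive(keywords, sensitive_words, preset_number=0):
    """Literal keyword matching: a keyword counts only if it is identical
    to an entry in the preset dictionary (hypothetical helper)."""
    matched = [kw for kw in keywords if kw in sensitive_words]
    # Content is judged sensitive when more than preset_number words match.
    return len(matched) > preset_number, matched


sensitive_words = {"iphone"}                  # hypothetical preset dictionary
keywords = ["Apple mobile phone", "price"]    # keywords filtered from the content
flagged, matched = keyword_match_is_sensitive(keywords, sensitive_words)
print(flagged, matched)  # False []
```

Because "Apple mobile phone" and "iphone" differ as strings, nothing matches and the content is not flagged, even though both refer to the same thing.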
Summary of the invention
The purpose of the embodiments of the present invention is to provide a sensitive content recognition method and device that improve the accuracy of sensitive content recognition.
To achieve this purpose, an embodiment of the present invention discloses a sensitive content recognition method, the method including:
Performing word segmentation on target communication content to be identified to obtain at least one target word segment corresponding to the target communication content;
Generating a word segment attribute for each target word segment using a preset word segment attribute generation rule;
Generating, from the word segment attributes of the target word segments, a target feature vector corresponding to the target communication content;
Inputting the target feature vector into a pre-established sensitive content recognition model to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the sensitive content recognition model is a classification model obtained by using a preset machine learning algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels, and the classification labels include a first label identifying sensitive content or a second label identifying non-sensitive content.
Optionally, the step of obtaining the target communication content to be identified includes:
Collecting the data packets transmitted in the network;
Performing restoration processing on the data content in the data packets based on a preset application layer protocol;
Determining the restored data content to be the target communication content to be identified.
Optionally, the step of performing word segmentation on the target communication content to be identified to obtain at least one target word segment corresponding to the target communication content includes:
Dividing the target communication content into several words according to a preset first word division rule;
Taking the words obtained by the division as the target word segments corresponding to the target communication content.
Optionally, the step of performing word segmentation on the target communication content to be identified to obtain at least one target word segment corresponding to the target communication content includes:
Dividing the target communication content into several words according to a preset second word division rule;
Removing the stop words from the words obtained by the division, wherein a stop word is a word whose part of speech is adverb, preposition, or pronoun;
Taking the remaining words as the target word segments corresponding to the target communication content.
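The second segmentation variant above (divide into words, then drop stop words by part of speech) might look like the following sketch. The regular-expression split merely stands in for the unspecified "second word division rule", and the `POS` table is a hypothetical lookup; a real system would use a proper segmenter and part-of-speech tagger.

```python
import re

# Hypothetical part-of-speech lookup used only for this illustration.
POS = {"they": "pronoun", "to": "preposition", "in": "preposition", "very": "adverb"}
STOP_POS = {"adverb", "preposition", "pronoun"}


def segment(content, remove_stop_words=True):
    """Split content into words, optionally removing stop words."""
    words = re.findall(r"\w+", content.lower())   # stand-in division rule
    if not remove_stop_words:
        return words                              # first variant: keep every word
    # Second variant: drop words whose part of speech marks them as stop words.
    return [w for w in words if POS.get(w) not in STOP_POS]


target_segments = segment("They sent the contract to Beijing")
print(target_segments)  # ['sent', 'the', 'contract', 'beijing']
```

Passing `remove_stop_words=False` corresponds to the first variant, where every word obtained by the division becomes a target word segment.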
Optionally, the step of generating a word segment attribute for each target word segment using a preset word segment attribute generation rule includes:
Calculating the weight of each target word segment according to a preset weighting algorithm;
Determining the calculated weight to be the word segment attribute of the corresponding target word segment.
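As a rough illustration of the weight calculation, the sketch below uses normalized term frequency. This choice is an assumption: the text only says "a preset weighting algorithm" and does not name one, and `segment_weights` is a hypothetical helper.

```python
from collections import Counter


def segment_weights(target_segments):
    """Return {word segment: weight}, with weight = relative frequency
    of the segment within the content (assumed weighting algorithm)."""
    counts = Counter(target_segments)
    total = len(target_segments)
    return {word: count / total for word, count in counts.items()}


weights = segment_weights(["contract", "price", "contract", "beijing"])
print(weights)  # {'contract': 0.5, 'price': 0.25, 'beijing': 0.25}
```

Any other weighting, such as one that also accounts for how rare a word is across many communication contents, could be substituted without changing the later steps.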
Optionally, the step of generating, from the word segment attributes of the target word segments, the target feature vector corresponding to the target communication content includes:
Judging, for each keyword in a preset keyword list, whether the keyword is identical to any target word segment;
Determining the word segment attribute of the target word segment corresponding to a first-class keyword to be the feature attribute of that first-class keyword, and determining a preset default value to be the feature attribute of each second-class keyword, wherein a first-class keyword is a keyword identical to a target word segment and a second-class keyword is a keyword that differs from every target word segment;
Taking the feature attribute of each keyword as a vector element of the target feature vector, and generating the target feature vector corresponding to the target communication content according to the position of each keyword in the keyword list.
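The feature vector construction above can be sketched as follows. The keyword list, the attribute values, and the default value of 0.0 are illustrative assumptions; only the mapping rule (attribute for matching keywords, default otherwise, ordered by keyword position) comes from the text.

```python
def build_feature_vector(keyword_list, segment_attributes, default=0.0):
    """One element per keyword, in keyword-list order: the attribute of the
    identical target word segment if there is one, else the default value."""
    return [segment_attributes.get(keyword, default) for keyword in keyword_list]


keyword_list = ["contract", "salary", "price", "password"]               # hypothetical preset list
segment_attributes = {"contract": 0.5, "price": 0.25, "beijing": 0.25}   # word segment -> weight
target_vector = build_feature_vector(keyword_list, segment_attributes)
print(target_vector)  # [0.5, 0.0, 0.25, 0.0]
```

Word segments that are not in the keyword list ("beijing" here) do not contribute, so every communication content maps to a vector of the same length.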
Optionally, the sensitive content recognition model includes:
A classification model established based on a support vector machine algorithm, a Bayesian classification algorithm, or a neural network classification algorithm.
Optionally, before the step of inputting the target feature vector into the pre-established sensitive content recognition model, the method further includes:
Determining the information type of the target communication content according to the protocol type used when the target communication content is transmitted, wherein the information type includes active information sent to the outside and passive information sent from the outside;
The step of inputting the target feature vector into the pre-established sensitive content recognition model to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the sensitive content recognition model is a classification model obtained by using a preset machine learning algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels, includes:
When the target communication content is active information, inputting the target feature vector into a pre-established first sensitive content recognition model to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the first sensitive content recognition model is a classification model obtained by using a support vector machine algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels;
When the target communication content is passive information, inputting the target feature vector into a pre-established second sensitive content recognition model to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the second sensitive content recognition model is a classification model obtained by using a Bayesian classification algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels.
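The selection between the two models could be organized as in the sketch below. It assumes that the second model handles passive information, which is what the training step for the second model indicates. The two stand-in models are hypothetical callables with made-up decision rules, not trained classifiers.

```python
def recognize(feature_vector, info_type, first_model, second_model):
    """Route the feature vector to the model matching its information type."""
    if info_type == "active":
        model = first_model       # e.g. a support vector machine classifier
    elif info_type == "passive":
        model = second_model      # e.g. a Bayesian classifier
    else:
        raise ValueError(f"unknown information type: {info_type!r}")
    return model(feature_vector)


# Hypothetical stand-ins so the sketch runs without a trained model.
def first_model(vector):
    return "sensitive" if vector[0] > 0.3 else "non_sensitive"


def second_model(vector):
    return "sensitive" if sum(vector) > 1.0 else "non_sensitive"


print(recognize([0.5, 0.0, 0.25, 0.0], "active", first_model, second_model))   # sensitive
print(recognize([0.5, 0.0, 0.25, 0.0], "passive", first_model, second_model))  # non_sensitive
```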
Optionally, the step of establishing the first sensitive content recognition model includes:
Obtaining a plurality of first-class communication content samples carrying classification labels, wherein a first-class communication content sample is a sample whose information type is active information;
Performing word segmentation on each first-class communication content sample to obtain at least one first-class word segment corresponding to each first-class communication content sample;
Generating, using the word segment attribute generation rule, the word segment attributes of the first-class word segments corresponding to each first-class communication content sample;
Generating, from the word segment attributes of the first-class word segments corresponding to each first-class communication content sample, a first-class feature vector corresponding to each first-class communication content sample;
Training a first initial recognition model on the first-class feature vector and classification label of each first-class communication content sample to obtain the first sensitive content recognition model, wherein the first initial recognition model is a model based on a support vector machine algorithm.
Optionally, the step of establishing the second sensitive content recognition model includes:
Obtaining a plurality of second-class communication content samples carrying classification labels, wherein a second-class communication content sample is a sample whose information type is passive information;
Performing word segmentation on each second-class communication content sample to obtain at least one second-class word segment corresponding to each second-class communication content sample;
Generating, using the word segment attribute generation rule, the word segment attributes of the second-class word segments corresponding to each second-class communication content sample;
Generating, from the word segment attributes of the second-class word segments corresponding to each second-class communication content sample, a second-class feature vector corresponding to each second-class communication content sample;
Training a second initial recognition model on the second-class feature vector and classification label of each second-class communication content sample to obtain the second sensitive content recognition model, wherein the second initial recognition model is a model based on a Bayesian classification algorithm.
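The final training step for a Bayesian model can be sketched with a small multinomial naive Bayes classifier written in plain Python. The text does not say which Bayesian classifier is meant, so the multinomial variant, the smoothing parameter `alpha`, the label strings, and the toy samples are all assumptions for illustration.

```python
import math


def train_naive_bayes(vectors, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model on non-negative feature vectors."""
    n_features = len(vectors[0])
    model = {}
    for label in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        totals = [sum(column) for column in zip(*rows)]    # per-feature mass in this class
        denom = sum(totals) + alpha * n_features           # Laplace-smoothed normalizer
        model[label] = {
            "log_prior": math.log(len(rows) / len(vectors)),
            "log_likelihood": [math.log((t + alpha) / denom) for t in totals],
        }
    return model


def predict(model, vector):
    """Return the label with the highest posterior log score."""
    def score(params):
        return params["log_prior"] + sum(
            x * ll for x, ll in zip(vector, params["log_likelihood"]))
    return max(model, key=lambda label: score(model[label]))


# Toy second-class feature vectors over a hypothetical 4-keyword list.
samples = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 0], [1, 0, 1, 0]]
labels = ["sensitive", "sensitive", "non_sensitive", "non_sensitive"]
second_model = train_naive_bayes(samples, labels)
print(predict(second_model, [0, 0, 0, 1]))  # sensitive
print(predict(second_model, [0, 0, 1, 0]))  # non_sensitive
```

In practice the feature vectors would come from the segmentation, attribute generation, and vector generation steps listed above rather than being written by hand.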
Optionally, the step of determining the information type of the target communication content according to the protocol type used when the target communication content is transmitted includes:
When the target communication content is a webpage: if the protocol type used to transmit the webpage is the POST type, determining that the target communication content is active information; if the protocol type used to transmit the webpage is the GET type, determining that the target communication content is passive information;
When the communication content to be identified is mail: if the mail is transmitted using a mail sending protocol, determining that the target communication content is active information; if the mail is transmitted using a mail receiving protocol, determining that the target communication content is passive information.
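The rules above amount to a small lookup, sketched below. The string labels and the function name are hypothetical; SMTP and POP3 are used as the mail sending and mail receiving protocols because they are the mail protocols this document lists among its application layer protocols.

```python
MAIL_SENDING_PROTOCOLS = {"SMTP"}     # mail sending protocol
MAIL_RECEIVING_PROTOCOLS = {"POP3"}   # mail receiving protocol


def information_type(content_kind, protocol_type):
    """Return 'active', 'passive', or None when no rule applies."""
    protocol_type = protocol_type.upper()
    if content_kind == "webpage":
        if protocol_type == "POST":
            return "active"           # data submitted to the outside
        if protocol_type == "GET":
            return "passive"          # data fetched from the outside
    elif content_kind == "mail":
        if protocol_type in MAIL_SENDING_PROTOCOLS:
            return "active"
        if protocol_type in MAIL_RECEIVING_PROTOCOLS:
            return "passive"
    return None


print(information_type("webpage", "POST"))  # active
print(information_type("mail", "pop3"))     # passive
```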
To achieve the above purpose, an embodiment of the present invention discloses a sensitive content recognition device, the device including:
A word segmentation module, configured to perform word segmentation on target communication content to be identified to obtain at least one target word segment corresponding to the target communication content;
A word segment attribute generation module, configured to generate a word segment attribute for each target word segment using a preset word segment attribute generation rule;
A feature vector generation module, configured to generate, from the word segment attributes of the target word segments, a target feature vector corresponding to the target communication content;
A sensitive content recognition module, configured to input the target feature vector into a pre-established sensitive content recognition model to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the sensitive content recognition model is a classification model obtained by using a preset machine learning algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels, and the classification labels include a first label identifying sensitive content or a second label identifying non-sensitive content.
Optionally, the sensitive content recognition device provided by the embodiment of the present invention further includes:
A communication content obtaining module, specifically configured to:
Collect the data packets transmitted in the network;
Perform restoration processing on the data content in the data packets based on a preset application layer protocol;
Determine the restored data content to be the target communication content to be identified.
Optionally, the word segmentation module includes a first division submodule and a first word segment determination submodule, wherein:
The first division submodule is configured to divide the target communication content into several words according to a preset first word division rule;
The first word segment determination submodule is configured to take the words obtained by the division as the target word segments corresponding to the target communication content.
Optionally, the word segmentation module includes a second division submodule, a word removal submodule, and a second word segment determination submodule, wherein:
The second division submodule is configured to divide the target communication content into several words according to a preset second word division rule;
The word removal submodule is configured to remove the stop words from the words obtained by the division, wherein a stop word is a word whose part of speech is adverb, preposition, or pronoun;
The second word segment determination submodule is configured to take the remaining words as the target word segments corresponding to the target communication content.
Optionally, the word segment attribute generation module includes a weight calculation submodule and a word segment attribute determination submodule, wherein:
The weight calculation submodule is configured to calculate the weight of each target word segment according to a preset weighting algorithm;
The word segment attribute determination submodule is configured to determine the calculated weight to be the word segment attribute of the corresponding target word segment.
Optionally, the feature vector generation module includes a word segment judging submodule, a feature attribute determination submodule, and a feature vector generation submodule, wherein:
The word segment judging submodule is configured to judge, for each keyword in a preset keyword list, whether the keyword is identical to any target word segment;
The feature attribute determination submodule is configured to determine the word segment attribute of the target word segment corresponding to a first-class keyword to be the feature attribute of that first-class keyword, and to determine a preset default value to be the feature attribute of each second-class keyword, wherein a first-class keyword is a keyword identical to a target word segment and a second-class keyword is a keyword that differs from every target word segment;
The feature vector generation submodule is configured to take the feature attribute of each keyword as a vector element of the target feature vector, and to generate the target feature vector corresponding to the target communication content according to the position of each keyword in the keyword list.
Optionally, the sensitive content recognition model includes:
A classification model established based on a support vector machine algorithm, a Bayesian classification algorithm, or a neural network classification algorithm.
Optionally, the sensitive content recognition device provided by the embodiment of the present invention further includes:
An information type determining module, configured to:
Determine, before the target feature vector is input into the pre-established sensitive content recognition model, the information type of the target communication content according to the protocol type used when the target communication content is transmitted, wherein the information type includes active information sent to the outside and passive information sent from the outside;
The sensitive content recognition module includes a first recognition submodule and a second recognition submodule, wherein:
The first recognition submodule is configured to, when the target communication content is active information, input the target feature vector into a pre-established first sensitive content recognition model to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the first sensitive content recognition model is a classification model obtained by using a support vector machine algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels;
The second recognition submodule is configured to, when the target communication content is passive information, input the target feature vector into a pre-established second sensitive content recognition model to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the second sensitive content recognition model is a classification model obtained by using a Bayesian classification algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels.
Optionally, the step of establishing the first sensitive content recognition model includes:
Obtaining a plurality of first-class communication content samples carrying classification labels, wherein a first-class communication content sample is a sample whose information type is active information;
Performing word segmentation on each first-class communication content sample to obtain at least one first-class word segment corresponding to each first-class communication content sample;
Generating, using the word segment attribute generation rule, the word segment attributes of the first-class word segments corresponding to each first-class communication content sample;
Generating, from the word segment attributes of the first-class word segments corresponding to each first-class communication content sample, a first-class feature vector corresponding to each first-class communication content sample;
Training a first initial recognition model on the first-class feature vector and classification label of each first-class communication content sample to obtain the first sensitive content recognition model, wherein the first initial recognition model is a model based on a support vector machine algorithm.
Optionally, the step of establishing the second sensitive content recognition model includes:
Obtaining a plurality of second-class communication content samples carrying classification labels, wherein a second-class communication content sample is a sample whose information type is passive information;
Performing word segmentation on each second-class communication content sample to obtain at least one second-class word segment corresponding to each second-class communication content sample;
Generating, using the word segment attribute generation rule, the word segment attributes of the second-class word segments corresponding to each second-class communication content sample;
Generating, from the word segment attributes of the second-class word segments corresponding to each second-class communication content sample, a second-class feature vector corresponding to each second-class communication content sample;
Training a second initial recognition model on the second-class feature vector and classification label of each second-class communication content sample to obtain the second sensitive content recognition model, wherein the second initial recognition model is a model based on a Bayesian classification algorithm.
Optionally, the information type determining module includes a first information type determination submodule and a second information type determination submodule, wherein:
The first information type determination submodule is configured to, when the target communication content is a webpage: determine that the target communication content is active information if the protocol type used to transmit the webpage is the POST type, and determine that the target communication content is passive information if the protocol type used to transmit the webpage is the GET type;
The second information type determination submodule is configured to, when the communication content to be identified is mail: determine that the target communication content is active information if the mail is transmitted using a mail sending protocol, and determine that the target communication content is passive information if the mail is transmitted using a mail receiving protocol.
The embodiments of the present invention provide a sensitive content recognition method and device. To recognize sensitive content, word segmentation is first performed on the target communication content to be identified to obtain at least one corresponding target word segment. A word segment attribute is then generated for each target word segment using a preset word segment attribute generation rule. Next, a target feature vector corresponding to the target communication content is generated from the word segment attributes of the target word segments. Finally, the target feature vector is input into a pre-established sensitive content recognition model to obtain a recognition result indicating whether the target communication content is sensitive content.
As can be seen from the above, when sensitive content is recognized using the scheme provided by the embodiments of the present invention, recognition does not rely on simple literal matching between keywords in the communication content and sensitive words in a preset dictionary. Instead, the scheme uses word segment attributes, which capture more than the literal form of the word segments of the communication content, generates a feature vector from the attribute of each word segment, and performs recognition with a sensitive content recognition model trained by a machine learning algorithm, thereby improving the accuracy of sensitive content recognition.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a sensitive content recognition method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another sensitive content recognition method provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a sensitive content recognition device provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of another sensitive content recognition device provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. The described embodiments are obviously only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
To improve the accuracy of sensitive content recognition, the embodiments of the present invention provide a sensitive content recognition method and device.
The sensitive content recognition method provided by an embodiment of the present invention is introduced first.
Note that the method provided by the embodiment of the present invention may be executed by a sensitive content recognition device. The device may be a plug-in for existing functional software or independent functional software; either is reasonable.
As shown in Fig. 1, a sensitive content recognition method provided by an embodiment of the present invention may include the following steps:
S101: Perform word segmentation on the target communication content to be identified to obtain at least one target word segment corresponding to the target communication content.
When sensitive content recognition is needed for some communication content, word segmentation can first be performed on the target communication content to be identified, yielding at least one target word segment corresponding to the target communication content for subsequent processing.
The target communication content to be identified can be obtained in many ways. In one implementation, it can be obtained as follows:
(1) Collect the data packets transmitted in the network.
Specifically, the gateway server in the network can be configured so that the data traffic flowing into the gateway server is directed to the network adapter of a preset experimental machine, and the network adapter of the experimental machine is set to promiscuous mode, which makes it possible to collect the data packets transmitted in the network.
(2) Based on a preset application layer protocol, perform restoration processing on the data content in the data packets.
Specifically, after the packet transmitted in network is collected, data flow can be carried out using transport layer protocol
Restructuring, corresponding application layer protocol is then recycled to carry out data convert processing to the data content in packet.In general, number
It can be divided into TCP (Transmission Control Protocol, transmission control protocol) data flows and UDP (User according to stream
Datagram Protocol, UDP) two kinds of forms of data flow, therefore, it is necessary to according to corresponding to packet data
Stream type determines corresponding application layer protocol, and then utilizes the corresponding application layer protocol determined to the number in packet
Reduction treatment is carried out according to content.For example, application layer protocol can include HTTP (HTTP-Hypertext transfer
Protocol, HTTP), FTP (File Transfer Protocol, FTP), MSN
(Microsoft Service Network, Microsoft MSN), SMTP (Simple Mail Transfer Protocol, simple postal
Part host-host protocol) and POP3 (Post Office Protocol-Version 3, Post Office Protocol,Version 3) etc., it is necessary to explanation
It is that the embodiment of the present invention need not be simultaneously defined to specific application layer protocol.
In addition, it is necessary to illustrate, the whole needed for a TCP communication connection procedure has been usually contained in TCP data stream
TCP data bag, therefore, it is possible that phenomenon that is out of order or retransmitting in the transmitting procedure of packet, therefore, it is necessary to by five yuan
Group is (i.e.:Source internet protocol address, source port, purpose internet protocol address, destination interface and transport layer protocol) identical TCP numbers
It is ranked up according to bag according to ACK (Acknowledgement, confirming character) and SEQ (Sequence, sequence number) numberings, to solve
The problem of out of order and re-transmission of TCP data bag.
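The reordering step described above can be illustrated with a minimal Python sketch. The `Packet` record and the `reassemble_streams` helper are hypothetical names introduced here for illustration, and the sketch makes simplifying assumptions that the text does not state: it uses only the SEQ number, treats a repeated SEQ as a retransmission, and ignores sequence-number wraparound and overlapping segments.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical, simplified view of a captured TCP packet.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    seq: int          # sequence number of the first payload byte
    payload: bytes


def reassemble_streams(packets):
    """Group packets by five-tuple, drop retransmissions, order by SEQ."""
    streams = defaultdict(dict)
    for p in packets:
        key = (p.src_ip, p.src_port, p.dst_ip, p.dst_port, p.protocol)
        # A retransmitted segment repeats a SEQ already seen; keep the first copy.
        streams[key].setdefault(p.seq, p.payload)
    return {
        key: b"".join(payload for _, payload in sorted(segments.items()))
        for key, segments in streams.items()
    }


packets = [
    Packet("10.0.0.1", 40000, "10.0.0.2", 80, "TCP", 6, b"world"),
    Packet("10.0.0.1", 40000, "10.0.0.2", 80, "TCP", 1, b"hello"),
    Packet("10.0.0.1", 40000, "10.0.0.2", 80, "TCP", 6, b"world"),  # retransmission
]
streams = reassemble_streams(packets)
```

A production reassembler would also need to track the ACK numbers and connection state mentioned above; the sketch only shows why grouping by five-tuple has to precede sorting.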
(3) Determine the restored data content as the target communication content to be identified.
Specifically, after the collected data packets are restored according to the corresponding application layer protocol, the data content corresponding to the packets can be obtained, for example, a browsed webpage, a transmitted file, or the content of a transmitted message. The restored data content can then be determined as the target communication content to be identified, for subsequent sensitive content recognition.
It should be noted that the above way of obtaining the target communication content is merely exemplary and should not be construed as limiting the embodiment of the present invention. For example, the target communication content to be identified can also be obtained by a web crawler, or entered manually, and so on. It should be emphasized that the embodiment of the present invention does not limit the scenarios in which sensitive content recognition is needed, and therefore does not limit the way in which the target communication content to be identified is obtained.
It can be understood that there are many specific ways to perform word segmentation on the target communication content and obtain at least one target word segment corresponding to it. In one implementation, word segmentation can be performed through the following steps to obtain at least one target word segment corresponding to the target communication content:
(1) According to a preset first word division rule, divide the target communication content into several words.
Specifically, word division can be performed on the target communication content using the ICTCLAS word segmentation system (Institute of Computing Technology, Chinese Lexical Analysis System) provided by the Institute of Computing Technology, Chinese Academy of Sciences, to obtain the several divided words. The main functions of ICTCLAS include Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word recognition, with support for user dictionaries. It has been upgraded to version ICTCLAS 3.0, which has a single-machine segmentation speed of 996 KB/s and a segmentation precision of 98.45%; its API (Application Programming Interface) is no larger than 200 KB, and its dictionary data is less than 3 MB after compression. It is currently regarded as the best Chinese lexical analyzer in the world.
It should be noted that the ICTCLAS word segmentation system described here is only one specific form of the first word division rule. The present invention does not limit the specific form of the first word division rule, and any feasible implementation can be applied to the present invention.
(2) Take the words obtained by division as the target word segments corresponding to the target communication content.
Specifically, after the target communication content is divided into several words according to the preset first word division rule, the words obtained by division can be determined as the target word segments corresponding to the target communication content. For example, if the target communication content is "perform sensitive content recognition on the communication content in the network", and the words obtained by division are "on", "network", "in", "of", "communication", "content", "perform", "sensitive", "content", "recognition", then according to this implementation these ten words can be determined as the target word segments corresponding to the target communication content.
It should be noted that the words obtained by division can be stored in a data table in the order in which they were divided. Of course, the embodiment of the present invention does not limit the storage form of the divided words, and any feasible implementation can be applied to the present invention.
In another implementation, word segmentation can be performed through the following steps to obtain at least one target word segment corresponding to the target communication content:
(1) According to a preset second word division rule, divide the target communication content into several words.
It should be noted that the second word division rule described here may be the same as the first word division rule mentioned above; for example, word division may again be performed on the target communication content using the ICTCLAS word segmentation system provided by the Institute of Computing Technology, Chinese Academy of Sciences. It may of course also differ from the first word division rule. The embodiment of the present invention likewise does not limit the specific form of the second word division rule, and any feasible implementation can be applied to the present invention.
(2) Remove the stop words from the words obtained by division, where a stop word is a word whose part of speech is adverb, preposition, or pronoun.
In general, after word division is performed on the target communication content, the words obtained usually include some that carry no practical meaning, for example "especially", "very", "almost", "I", and "they". Such words tend to interfere with the subsequent recognition of sensitive content. The words that affect the recognition result, namely the stop words, therefore need to be removed from the words obtained by division.
It should be noted that words whose part of speech is adverb, preposition, or pronoun are listed here only as specific examples of stop words. Stop words can of course also include words of other parts of speech. The present invention does not limit the specific form of stop words, and those skilled in the art should set them reasonably according to the actual application.
(3) Take the remaining words as the target word segments corresponding to the target communication content.
Specifically, after the target communication content is divided into several words according to the preset second word division rule, the divided words are not used directly as the target word segments. Instead, the words remaining after the stop words are removed are determined as the target word segments corresponding to the target communication content, which helps improve the accuracy of sensitive content recognition.
Take again the target communication content "perform sensitive content recognition on the communication content in the network" as an example. The words obtained by division according to the second word division rule are "on", "network", "in", "of", "communication", "content", "perform", "sensitive", "content", "recognition". After the stop words are removed, the following words remain: "network", "communication", "content", "sensitive", "content", "recognition". Finally, the words remaining after stop-word removal are determined as the target word segments corresponding to the target communication content.
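The stop-word removal in step (2) can be sketched as follows in Python. This is a minimal illustration, assuming the word division step has already produced (word, part-of-speech) pairs; the tag names and the `remove_stop_words` helper are hypothetical and do not correspond to the output format of any particular segmenter.

```python
# Parts of speech treated as stop words, following the example in the text.
# The tag names are assumptions; a real segmenter has its own tag set.
STOP_POS = {"adverb", "preposition", "pronoun"}


def remove_stop_words(tagged_words, stop_pos=STOP_POS):
    """Keep only the words whose part of speech is not a stop-word category."""
    return [word for word, pos in tagged_words if pos not in stop_pos]


tagged = [
    ("they", "pronoun"),
    ("especially", "adverb"),
    ("network", "noun"),
    ("communication", "noun"),
    ("very", "adverb"),
    ("sensitive", "adjective"),
]
target_segments = remove_stop_words(tagged)
```

Passing a different `stop_pos` set is one way to extend the stop words to other parts of speech, as the text allows.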
It should be noted that the above are only two specific ways of performing word segmentation and obtaining at least one target word segment corresponding to the target communication content. The present invention does not limit how word segmentation is performed on the target communication content or how the corresponding target word segments are obtained. Any feasible implementation can be applied to the present invention, and those skilled in the art should make a reasonable choice according to the actual application.
S102: Using a preset segment attribute generation rule, generate the segment attribute of each target word segment.
In one implementation, the segment attribute of each target word segment can be generated through the following steps:
(1) According to a preset weighting algorithm, calculate the weight of each target word segment.
Specifically, the weighting can be performed according to the TF-IDF (term frequency-inverse document frequency) algorithm, a weighting technique commonly used in information retrieval and data mining, to obtain the weight of each target word segment. It should be noted that only one specific weighting algorithm is listed here. The present invention does not limit the specific form of the weighting algorithm; any feasible implementation can be applied to the present invention, and those skilled in the art should make a reasonable choice according to the actual application.
(2) Determine the calculated weight as the segment attribute of the corresponding target word segment.
Specifically, after each target word segment is weighted according to the specific weighting algorithm, the weight corresponding to each target word segment is obtained, and the calculated weight can then be determined as the segment attribute of that target word segment. It should be noted that only one specific way of generating the segment attribute of each target word segment is listed here. The present invention does not limit the specific way of generating the segment attributes, and those skilled in the art should make a reasonable choice according to the actual application.
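As a minimal sketch of the TF-IDF weighting mentioned above, the following Python code computes one weight per distinct target word segment. The text does not specify which TF-IDF variant is used, so the smoothed inverse document frequency `log((1 + N) / (1 + df)) + 1` chosen here is an assumption, as are the `tf_idf` function name and the toy corpus.

```python
import math
from collections import Counter


def tf_idf(target_segments, corpus):
    """Return {segment: weight} for one document.

    target_segments: list of word segments of the document being weighted.
    corpus: list of reference documents, each a list of word segments.
    """
    counts = Counter(target_segments)
    total = len(target_segments)
    n_docs = len(corpus)
    weights = {}
    for segment, count in counts.items():
        tf = count / total                                   # term frequency
        df = sum(1 for doc in corpus if segment in doc)      # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1          # smoothed IDF (assumed variant)
        weights[segment] = tf * idf
    return weights


corpus = [["network", "content"], ["content"], ["content", "sports"]]
weights = tf_idf(["network", "content", "content"], corpus)
```

In this toy corpus "content" occurs in every document, so its inverse document frequency is 1 and its weight equals its term frequency, whereas the rarer "network" receives a larger inverse document frequency.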
S103: According to the segment attribute of each target word segment, generate the target feature vector corresponding to the target communication content.
In one implementation, the target feature vector can be generated as follows:
(1) Determine, for each keyword in a preset keyword list, whether it is identical to any target word segment.
The keywords in the preset keyword list described here can be obtained as follows: first, after word segmentation is performed on a large preset corpus, the chi-square value of each word segment is calculated using the chi-square test, and a preset number of word segments with the largest chi-square values are then selected as the keywords in the preset keyword list.
The chi-square test is a widely used hypothesis testing method. Its applications in statistical inference for categorical data include: the chi-square test for comparing two rates or two constituent ratios; the chi-square test for comparing multiple rates or multiple constituent ratios; and correlation analysis of categorical data. The statistic of the chi-square test is the chi-square value, which is the sum, over the cells of a fourfold table, of the squared difference between the actual frequency A and the theoretical frequency T divided by the theoretical frequency T. The larger the chi-square value, the more pronounced the difference between the actual and theoretical frequencies. It should be noted that the chi-square test is disclosed in the prior art, so its specific calculation process is not described in detail here. The chi-square test can effectively extract the word segments that strongly indicate whether a corpus item is sensitive.
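The chi-square statistic described above can be sketched in Python for a single word segment. The layout of the fourfold table (sensitive or non-sensitive documents, with or without the word) and the names `chi_square` and `top_keywords` are assumptions made for illustration; the sketch also assumes every row and column total is non-zero, since otherwise a theoretical frequency of zero would cause a division by zero.

```python
def chi_square(a, b, c, d):
    """Chi-square value of a fourfold (2x2) table.

    a: sensitive documents containing the word
    b: non-sensitive documents containing the word
    c: sensitive documents without the word
    d: non-sensitive documents without the word
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    value = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n   # theoretical frequency T
            value += (observed[i][j] - expected) ** 2 / expected
    return value


def top_keywords(tables, k):
    """Select the k word segments with the largest chi-square values."""
    scores = {word: chi_square(*table) for word, table in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


tables = {"x": (20, 0, 0, 20), "y": (10, 10, 10, 10)}
keywords = top_keywords(tables, 1)
```

Here the hypothetical word "x" appears only in sensitive documents and scores 40, while "y" is evenly spread and scores 0, so "x" is the one selected.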
(2) Determine the segment attribute of the target word segment corresponding to each first-type keyword as the feature attribute of that keyword, and determine a preset default value as the feature attribute of each second-type keyword. Here, a first-type keyword is a keyword identical to a target word segment, and a second-type keyword is a keyword that differs from every target word segment.
It should be noted that the specific value of the preset default value is not limited here, and those skilled in the art should set it reasonably according to the actual application.
(3) Take the feature attribute corresponding to each keyword as a vector element of the target feature vector, and generate the target feature vector corresponding to the target communication content according to the position of each keyword in the keyword list.
For example, suppose the keywords in the keyword list are, in order: webpage, compete, find, to, tourism; the target word segments corresponding to target communication content 1 are: webpage, recruitment, military, athlete; and the target word segments corresponding to target communication content 2 are: equipment, compete, find, fight, visitor. Clearly, only the keyword "webpage" in the keyword list is identical to a target word segment of target communication content 1, and only the keywords "compete" and "find" are identical to target word segments of target communication content 2.
Assume that the segment attributes of the target word segments corresponding to target communication content 1 are 707, 1146, 1797, and 455, respectively, and that those corresponding to target communication content 2 are 931, 2420, 1184, 1227, and 1041. If the default value is set to 0, then in the manner described above, the feature vector corresponding to target communication content 1 is {707, 0, 0, 0, 0}, and the feature vector corresponding to target communication content 2 is {0, 2420, 1184, 0, 0}.
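The worked example above can be reproduced with a short Python sketch. The `build_feature_vector` helper is a hypothetical name, and the attributes of target communication content 1 are read here as 707, 1146, 1797, and 455, one per target word segment in the order listed.

```python
def build_feature_vector(keywords, segment_attrs, default=0):
    """One element per keyword, in keyword-list order.

    segment_attrs maps each target word segment to its segment attribute;
    keywords that match no target word segment receive the default value.
    """
    return [segment_attrs.get(keyword, default) for keyword in keywords]


keywords = ["webpage", "compete", "find", "to", "tourism"]

content_1 = {"webpage": 707, "recruitment": 1146, "military": 1797, "athlete": 455}
content_2 = {"equipment": 931, "compete": 2420, "find": 1184,
             "fight": 1227, "visitor": 1041}

vector_1 = build_feature_vector(keywords, content_1)  # [707, 0, 0, 0, 0]
vector_2 = build_feature_vector(keywords, content_2)  # [0, 2420, 1184, 0, 0]
```

Because the vector is indexed by the keyword list rather than by the target word segments, both contents yield vectors of length five even though they contain four and five word segments respectively.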
It should be understood that different target communication contents yield different target word segments and different numbers of them. Consequently, the keywords in the keyword list that their target word segments can cover, and the number of such keywords, also differ. However, regardless of whether the specific content is the same, the feature vectors finally obtained for all target communication contents have the same dimension, namely the number of keywords in the keyword list.
It can be seen that the number of keywords in the preset keyword list directly determines the dimension of the feature vector corresponding to the target communication content, and the dimension of the feature vector directly affects the recognition result of whether the target communication content is sensitive content. The number of keywords in the preset keyword list is therefore particularly important. If the number is too small, the list cannot cover the target word segments of the target communication content, so the feature vector may be all zeros, which reduces the accuracy of sensitive content recognition. Conversely, if the number is too large, the calculation takes longer, which reduces the recognition speed. It should be noted that the embodiment of the present invention does not limit the number of keywords in the preset keyword list, and those skilled in the art should set it reasonably according to the actual application.
S104: Input the target feature vector into a pre-established sensitive content recognition model to obtain a recognition result of whether the target communication content is sensitive content.
Specifically, after the feature vector corresponding to the target communication content is generated, it can be used as the input of the sensitive content recognition model. After the feature vector is input, the model performs its calculation on the vector according to its own model parameters and outputs a recognition result indicating that the target communication content is, or is not, sensitive content.
The sensitive content recognition model is a classification model obtained by using a preset machine learning algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples that carry classification labels. The classification labels include a first label for identifying sensitive content or a second label for identifying non-sensitive content.
Specifically, the sensitive content recognition model may include a classification model established on the basis of a support vector machine algorithm, a Bayesian classification algorithm, or a neural network classification algorithm.
It should be noted that establishing the sensitive content recognition model requires a large number of training samples, and each training sample carries a classification label. In this way, when the feature vector corresponding to each training sample is input into an initial recognition model, the recognition result output by the model can be compared with the actual classification label of that training sample, and the model parameters of the initial recognition model can be corrected according to the comparison result. When the classification results output by the initial recognition model satisfy a preset training termination condition (for example, the classification margin is greater than a first preset value, the error is less than a second preset value, or the number of training iterations is greater than a third preset value), the establishment of the sensitive content recognition model is complete, and its recognition results approximate the real situation as closely as possible.
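The compare-and-correct training loop described above can be illustrated with a deliberately simple Python sketch. It uses a perceptron as a stand-in, which is not one of the algorithms named in the text (support vector machine, Bayesian classification, neural network); the function names, learning rate, and toy data are all assumptions. The loop stops when the number of misclassified samples falls to a threshold or when the iteration limit is reached, mirroring two of the termination conditions listed.

```python
def predict(weights, bias, vector):
    """Return 1 (sensitive) or 0 (non-sensitive) for one feature vector."""
    score = sum(w * v for w, v in zip(weights, vector)) + bias
    return 1 if score > 0 else 0


def train(samples, labels, max_iterations=100, max_errors=0, learning_rate=1.0):
    """Correct the parameters whenever the output disagrees with the label."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_iterations):          # iteration-count termination
        errors = 0
        for vector, label in zip(samples, labels):
            output = predict(weights, bias, vector)
            if output != label:               # compare with the actual label
                errors += 1
                step = learning_rate * (label - output)
                weights = [w + step * v for w, v in zip(weights, vector)]
                bias += step
        if errors <= max_errors:              # error-based termination
            break
    return weights, bias


samples = [[2, 0], [3, 1], [0, 2], [1, 3]]
labels = [1, 1, 0, 0]
weights, bias = train(samples, labels)
predictions = [predict(weights, bias, s) for s in samples]
```

On this linearly separable toy data the loop stops after the second pass; real feature vectors would have as many elements as there are keywords in the keyword list.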
As can be seen from the above, when sensitive content recognition is performed using the scheme provided by the embodiment of the present invention, recognition does not rely on simple literal matching between keywords in the communication content and sensitive words in a preset dictionary. Instead, the scheme uses segment attributes, which go deeper than the literal level of the word segments corresponding to the communication content, generates a feature vector from the segment attribute of each word segment, and performs recognition with a sensitive content recognition model trained using a machine learning algorithm, thereby improving the accuracy of sensitive content recognition.
Fig. 2 shows another sensitive content recognition method provided by an embodiment of the present invention. On the basis of the method embodiment shown in Fig. 1, the method may further include the following step before step S104:
S105: According to the protocol type used when the target communication content is transmitted, determine the information type of the target communication content.
The information type can include active information, which is sent to the outside, and passive information, which is sent from the outside.
For example, active information can include sent mail, posts published on a forum, and so on, and passive information can include browsed webpages, received mail, and so on. Of course, these are only examples of the specific forms of active information and passive information, and the present invention does not limit those forms.
In one implementation, the information type of the target communication content can be determined as follows:
(1) When the target communication content is a webpage: if the protocol type used when transmitting the webpage is the POST type, the target communication content is determined to be active information; if the protocol type used when transmitting the webpage is the GET type, the target communication content is determined to be passive information.
(2) When the target communication content is mail: if the mail is transmitted using a mail sending protocol, the target communication content is determined to be active information; if the mail is transmitted using a mail receiving protocol, the target communication content is determined to be passive information.
It should be noted that the above way of determining the information type of the target communication content is merely exemplary and should not be construed as limiting the embodiment of the present invention.
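The two rules above amount to a small lookup, sketched below in Python. The text names only "a mail sending protocol" and "a mail receiving protocol"; mapping them to SMTP and POP3 is an assumption based on the application layer protocols listed earlier, and the `information_type` function and its string values are hypothetical.

```python
# Assumed protocol sets; the text does not enumerate them.
MAIL_SENDING_PROTOCOLS = {"SMTP"}
MAIL_RECEIVING_PROTOCOLS = {"POP3"}


def information_type(content_kind, protocol_type):
    """Return "active" or "passive" for a webpage or mail."""
    kind = content_kind.lower()
    proto = protocol_type.upper()
    if kind == "webpage":
        if proto == "POST":
            return "active"      # content submitted to the outside
        if proto == "GET":
            return "passive"     # content fetched from the outside
    elif kind == "mail":
        if proto in MAIL_SENDING_PROTOCOLS:
            return "active"
        if proto in MAIL_RECEIVING_PROTOCOLS:
            return "passive"
    raise ValueError(f"unsupported combination: {content_kind!r}, {protocol_type!r}")


examples = [
    information_type("webpage", "POST"),
    information_type("webpage", "GET"),
    information_type("mail", "SMTP"),
    information_type("mail", "POP3"),
]
```

Raising an error for unknown combinations is a design choice of this sketch; an implementation could equally fall back to a default information type.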
It should be noted that active information is usually information sent to the outside, and passive information is usually information sent from the outside. Comparatively, those skilled in the art are more concerned with the accuracy of sensitive content recognition for active information. Because the volume of passive information sent from the outside is usually very large, for passive information they are more concerned with the speed of sensitive content recognition.
Accordingly, step S104 may include two sub-steps, S1041 and S1042, so that target communication contents of different information types are recognized using different sensitive content recognition models. The two sub-steps are as follows:
S1041: When the target communication content is active information, input the target feature vector into a pre-established first sensitive content recognition model to obtain a recognition result of whether the target communication content is sensitive content.
The first sensitive content recognition model is a classification model obtained by using a support vector machine algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples that carry classification labels.
S1042: When the target communication content is passive information, input the target feature vector into a pre-established second sensitive content recognition model to obtain a recognition result of whether the target communication content is sensitive content.
The second sensitive content recognition model is a classification model obtained by using a Bayesian classification algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples that carry classification labels.
It should be noted that the support vector machine algorithm classifies more accurately and can therefore ensure the accuracy of sensitive content recognition for active information, while the Bayesian classification algorithm classifies faster and can therefore ensure the speed of sensitive content recognition for passive information.
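The choice between the two models can be sketched as a simple dispatch in Python. The two models are represented here by hypothetical stand-in callables rather than trained support vector machine or Bayesian classifiers, and the `recognize` function and the "active"/"passive" strings are names introduced only for illustration.

```python
def recognize(feature_vector, info_type, first_model, second_model):
    """Route the feature vector to the model matching the information type.

    first_model:  callable for active information (accuracy-oriented).
    second_model: callable for passive information (speed-oriented).
    Each callable returns True when the content is judged sensitive.
    """
    if info_type == "active":
        return first_model(feature_vector)
    if info_type == "passive":
        return second_model(feature_vector)
    raise ValueError(f"unknown information type: {info_type!r}")


# Stand-in models used only to demonstrate the routing.
first_model = lambda vector: sum(vector) > 1000
second_model = lambda vector: any(v > 0 for v in vector)

active_result = recognize([707, 0, 0, 0, 0], "active", first_model, second_model)
passive_result = recognize([707, 0, 0, 0, 0], "passive", first_model, second_model)
```

Keeping the routing separate from the models means either model could be replaced without touching the step that determines the information type.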
In one implementation, the first sensitive content recognition model can be established as follows:
A1: Obtain a plurality of first-type communication content samples carrying classification labels, where a first-type communication content sample is a sample whose information type is active information.
It should be noted that before the first sensitive content recognition model is established, a large number of training samples (namely first-type communication content samples) need to be obtained for training it. In addition, each training sample carries a classification label, so that when the feature vector corresponding to each training sample is subsequently input into a first initial recognition model, the recognition result output by the model can be compared with the actual classification label of that training sample, and the current model parameters of the first initial recognition model can be corrected according to the comparison result, so that the recognition results of the first initial recognition model gradually approach the real situation.
A2: Perform word segmentation on each first-type communication content sample to obtain at least one first-type word segment corresponding to each first-type communication content sample.
A3: Using the segment attribute generation rule, generate the segment attribute of each first-type word segment corresponding to each first-type communication content sample.
A4: According to the segment attributes of the first-type word segments corresponding to each first-type communication content sample, generate the first-type feature vector corresponding to each first-type communication content sample.
It should be noted that for the details of steps A2 to A4, reference may be made to steps S101 to S103 of the method embodiment shown in Fig. 1, which are not repeated here.
A5: Based on the first-type feature vector and the classification label corresponding to each first-type communication content sample, train the first initial recognition model to obtain the first sensitive content recognition model, where the first initial recognition model is a model based on the support vector machine algorithm.
Specifically, the following steps can be performed on each first-type feature vector in turn. The first-type feature vector is input into the first initial recognition model built in advance. Using its current model parameters, the model calculates a first confidence that the first-type feature vector is sensitive content and a second confidence that it is not sensitive content, and outputs, according to the first confidence and the second confidence, a classification result of whether the first-type communication content sample is sensitive content. The current model parameters of the first initial recognition model are then corrected according to the result of comparing the output classification result with the classification label of the first-type communication content sample. When the classification results output by the first initial recognition model satisfy a preset first model training termination condition, the establishment of the first sensitive content recognition model is complete.
In one implementation, the second sensitive content recognition model can be established as follows:
B1: Obtain a plurality of second-type communication content samples carrying classification labels, where a second-type communication content sample is a sample whose information type is passive information.
It should be noted that for the details of step B1, reference may be made to step A1. The difference is that the first-type communication content samples selected for the first sensitive content recognition model are samples whose information type is active information, whereas the second-type communication content samples selected for the second sensitive content recognition model are samples whose information type is passive information. It should be pointed out that the second-type communication content samples may be obtained in the same way as the first-type communication content samples or in a different way, and the embodiment of the present invention does not limit this.
B2: Perform word segmentation on each second-type communication content sample to obtain at least one second-type word segment corresponding to each second-type communication content sample.
B3: Using the segment attribute generation rule, generate the segment attribute of each second-type word segment corresponding to each second-type communication content sample.
B4: According to the segment attributes of the second-type word segments corresponding to each second-type communication content sample, generate the second-type feature vector corresponding to each second-type communication content sample.
It should be noted that for the details of steps B2 to B4, reference may be made to steps S101 to S103 of the method embodiment shown in Fig. 1, which are not repeated here.
B5: Based on the second-type feature vector and the classification label corresponding to each second-type communication content sample, train a second initial recognition model to obtain the second sensitive content recognition model, where the second initial recognition model is a model based on the Bayesian classification algorithm.
Specifically, the following steps can be performed on each second-type feature vector in turn. The second-type feature vector is input into the second initial recognition model built in advance. According to the input second-type feature vectors, the model optimizes the classification hyperplane that divides the second-type communication content samples into sensitive content and non-sensitive content. When the classification margin between the two classes separated by the hyperplane (namely the class of sensitive content and the class of non-sensitive content) is at its maximum, the establishment of the second sensitive content recognition model is complete.
It should be emphasized that the specific steps for establishing a classification model based on the support vector machine algorithm or on the Bayesian classification algorithm are disclosed in the prior art. Reference may be made to the relevant steps in the prior art, which are not described in detail here.
In addition, the terms "first-type" and "second-type" used in the embodiment of the present invention do not have any limiting meaning.
As can be seen from the above, when sensitive content recognition is performed using the scheme provided by this embodiment of the present invention, the scheme has all the advantages of the method embodiment shown in Fig. 1 and, in addition, takes full account of the information type of the target communication content: target communication contents of different information types are recognized using different sensitive content recognition models. For example, the first sensitive content recognition model, based on the support vector machine algorithm, is used for target communication content whose information type is active information, and the second sensitive content recognition model, based on the Bayesian classification algorithm, is used for target communication content whose information type is passive information. As noted above, active information is usually information sent to the outside, and passive information is usually information sent from the outside. Comparatively, those skilled in the art are more concerned with the accuracy of sensitive content recognition for active information; because the volume of passive information sent from the outside is usually very large, for passive information they are more concerned with the recognition speed. The support vector machine algorithm classifies more accurately and can therefore ensure the accuracy of sensitive content recognition for active information, while the Bayesian classification algorithm classifies faster and can therefore ensure the recognition speed for passive information. The needs of sensitive content recognition are thus met more reasonably.
Corresponding to the above method embodiments, an embodiment of the present invention further provides a sensitive content identification device, which is described below.
Fig. 3 shows a sensitive content identification device provided by an embodiment of the present invention. The device may include:
A word segmentation module 210, configured to perform word segmentation on target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content.
A segment attribute generation module 220, configured to generate a segment attribute for each target word segment using a preset segment attribute generation rule.
A feature vector generation module 230, configured to generate, according to the segment attribute of each target word segment, the target feature vector corresponding to the target communication content.
A sensitive content identification module 240, configured to input the target feature vector into a pre-established sensitive content identification model, to obtain an identification result indicating whether the target communication content is sensitive content.
The sensitive content identification model is a classification model obtained by training, with a preset machine learning algorithm, on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels. The classification labels include a first label identifying sensitive content or a second label identifying non-sensitive content.
As can be seen from the above, sensitive content identification using the scheme provided by this embodiment of the present invention does not rely on simple literal matching between keywords in the communication content and sensitive words in a preset dictionary. Instead, it uses segment attributes, which go deeper than the literal level of the word segments of the communication content, generates a feature vector from the segment attribute of each word segment, and performs identification with a sensitive content identification model trained using a machine learning algorithm, thereby improving the accuracy of sensitive content identification.
Specifically, the device may further include a communication content acquisition module, which is specifically configured to:
(1) Capture data packets transmitted in the network.
(2) Restore the data content in the data packets based on a preset application layer protocol.
(3) Determine the restored data content to be the target communication content to be identified.
In one implementation, the word segmentation module 210 may include a first division submodule and a first segment determination submodule.
The first division submodule is configured to divide the target communication content into several words according to a preset first word division rule.
The first segment determination submodule is configured to take the words obtained by the division as the target word segments corresponding to the target communication content.
In another implementation, the word segmentation module 210 may include a second division submodule, a segment removal submodule, and a second segment determination submodule.
The second division submodule is configured to divide the target communication content into several words according to a preset second word division rule.
The segment removal submodule is configured to remove the stop words from the words obtained by the division, where a stop word is a word whose part of speech is adverb, preposition, or pronoun.
The second segment determination submodule is configured to take the remaining words as the target word segments corresponding to the target communication content.
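The second implementation above can be sketched as follows. This is only a minimal illustration under stated assumptions: the whitespace split stands in for the preset word division rule, which the text does not specify, and the small `POS_TAGS` table is a hypothetical substitute for a real part-of-speech tagger.

```python
# Hypothetical part-of-speech lookup; a real system would use a tagger.
POS_TAGS = {
    "he": "pronoun",
    "them": "pronoun",
    "to": "preposition",
    "very": "adverb",
}

# Parts of speech that the text treats as stop words.
STOP_POS = {"adverb", "preposition", "pronoun"}


def divide_into_words(content):
    """Assumed stand-in for the preset word division rule: split on whitespace."""
    return content.lower().split()


def target_segments(content, remove_stop_words=True):
    """Return the target word segments of the content.

    With remove_stop_words=False this mirrors the first implementation
    (all divided words are kept); with True it mirrors the second one.
    """
    words = divide_into_words(content)
    if not remove_stop_words:
        return words
    return [w for w in words if POS_TAGS.get(w) not in STOP_POS]
```

Words missing from the lookup table are kept, so an incomplete table errs on the side of retaining segments rather than dropping them.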
The segment attribute generation module 220 may include a weight calculation submodule and a segment attribute determination submodule.
The weight calculation submodule is configured to calculate the weight of each target word segment according to a preset weighting algorithm.
The segment attribute determination submodule is configured to determine each calculated weight as the segment attribute of the corresponding target word segment.
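As an illustration of the weight calculation, the sketch below uses plain term frequency as the weighting algorithm. This choice is an assumption made for the example only; the passage above says only that a preset weighting algorithm is used, and the function name `segment_weights` is hypothetical.

```python
from collections import Counter


def segment_weights(segments):
    """Map each target word segment to its term frequency.

    Term frequency is an assumed example of a "preset weighting algorithm";
    the returned weight serves as the segment attribute.
    """
    total = len(segments)
    if total == 0:
        return {}
    counts = Counter(segments)
    return {segment: count / total for segment, count in counts.items()}
```

For instance, `segment_weights(["a", "b", "a", "c"])` gives the segment `"a"` a weight of 0.5 and the other two a weight of 0.25 each.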
The feature vector generation module 230 may include a segment judgment submodule, a feature attribute determination submodule, and a feature vector generation submodule.
The segment judgment submodule is configured to judge, for each keyword in a preset keyword list, whether the keyword is identical to any target word segment.
The feature attribute determination submodule is configured to determine the segment attribute of the target word segment corresponding to a first-class keyword as the feature attribute of that first-class keyword, and to determine a preset default value as the feature attribute of each second-class keyword.
A first-class keyword is a keyword identical to a target word segment, and a second-class keyword is a keyword not identical to any target word segment.
The feature vector generation submodule is configured to take the feature attribute of each keyword as a vector element of the target feature vector and, according to the element position of each keyword in the keyword list, generate the target feature vector corresponding to the target communication content.
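The feature vector construction described above can be sketched in a few lines. The function name, the example keywords, and the default value of 0.0 are assumptions for illustration; the text specifies only that a preset default value is used for keywords that match no target word segment.

```python
def build_feature_vector(keyword_list, segment_attributes, default_value=0.0):
    """Build the target feature vector in keyword-list order.

    keyword_list: preset keywords; their order fixes the element positions.
    segment_attributes: dict mapping each target word segment to its attribute.
    default_value: assumed preset default for keywords with no matching segment.
    """
    return [segment_attributes.get(keyword, default_value)
            for keyword in keyword_list]
```

With the keyword list `["password", "salary", "contract"]` and segment attributes `{"password": 0.5, "meeting": 0.5}`, the result is `[0.5, 0.0, 0.0]`: the segment "meeting" is ignored because it is not in the keyword list, and the vector length always equals the length of the keyword list.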
In one implementation, the above sensitive content identification model may include a classification model established based on the support vector machine algorithm, the Bayesian classification algorithm, or a neural network classification algorithm.
In one implementation, as shown in Fig. 4, the device may further include an information type determination module 250, configured to:
Before the target feature vector is input into the pre-established sensitive content identification model, determine the information type of the target communication content according to the protocol type used when the target communication content is transmitted.
The information types include active information, which is sent to the outside, and passive information, which is sent from the outside.
The information type determination module 250 may include a first information type determination submodule and a second information type determination submodule.
The first information type determination submodule is configured to, when the target communication content is a webpage: determine that the target communication content is active information if the protocol type used to transmit the webpage is the POST type, and determine that the target communication content is passive information if the protocol type used to transmit the webpage is the GET type.
The second information type determination submodule is configured to, when the communication content to be identified is a mail: determine that the target communication content is active information when the mail is transmitted according to a mail sending protocol, and determine that the target communication content is passive information when the mail is transmitted according to a mail receiving protocol.
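The decision rules of the two submodules can be sketched as a single lookup. The concrete protocol names SMTP, POP3, and IMAP are assumptions chosen as common examples of mail sending and mail receiving protocols; the passage above does not name them. The function and constant names are likewise hypothetical.

```python
ACTIVE = "active"    # information sent to the outside
PASSIVE = "passive"  # information sent from the outside

# Assumed examples of mail protocols; not named in the text above.
MAIL_SENDING_PROTOCOLS = {"SMTP"}
MAIL_RECEIVING_PROTOCOLS = {"POP3", "IMAP"}


def determine_information_type(content_kind, protocol_type):
    """Return ACTIVE, PASSIVE, or None if no rule applies.

    content_kind: "webpage" or "mail".
    protocol_type: e.g. "POST", "GET", "SMTP", "POP3" (case-insensitive).
    """
    protocol = protocol_type.upper()
    if content_kind == "webpage":
        if protocol == "POST":
            return ACTIVE
        if protocol == "GET":
            return PASSIVE
    elif content_kind == "mail":
        if protocol in MAIL_SENDING_PROTOCOLS:
            return ACTIVE
        if protocol in MAIL_RECEIVING_PROTOCOLS:
            return PASSIVE
    return None
```

Returning `None` for unmatched combinations is a design choice of this sketch; the text does not say how content of other kinds or protocols is handled.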
Accordingly, the sensitive content identification module 240 may include a first identification submodule 241 and a second identification submodule 242.
The first identification submodule 241 is configured to, when the target communication content is active information, input the target feature vector into a pre-established first sensitive content identification model, to obtain an identification result indicating whether the target communication content is sensitive content.
The first sensitive content identification model is a classification model obtained by training, with the support vector machine algorithm, on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels.
The second identification submodule 242 is configured to, when the target communication content is passive information, input the target feature vector into a pre-established second sensitive content identification model, to obtain an identification result indicating whether the target communication content is sensitive content.
The second sensitive content identification model is a classification model obtained by training, with the Bayesian classification algorithm, on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels.
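The routing between the two identification submodules can be sketched as follows. The two models are passed in as plain callables, which is an assumption of this sketch: any trained classifier that maps a feature vector to a result could be wrapped this way. The string values "active" and "passive" are hypothetical encodings of the two information types.

```python
def identify_sensitive_content(feature_vector, information_type,
                               first_model, second_model):
    """Dispatch the target feature vector to the model for its information type.

    first_model: callable for active information (e.g. an SVM-based model).
    second_model: callable for passive information (e.g. a Bayesian model).
    Both callables are hypothetical stand-ins for trained classifiers.
    """
    if information_type == "active":
        return first_model(feature_vector)
    if information_type == "passive":
        return second_model(feature_vector)
    raise ValueError(f"unknown information type: {information_type!r}")
```

Keeping the dispatch separate from the models means either model can be retrained or replaced without touching the routing logic.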
In one implementation, the first sensitive content identification model may be established according to the following steps:
Obtain a plurality of first-class communication content samples carrying classification labels, where a first-class communication content sample is a sample whose information type is active information;
Perform word segmentation on each first-class communication content sample, to obtain at least one first-class word segment corresponding to each first-class communication content sample;
Generate, using the segment attribute generation rule, the segment attribute of each first-class word segment corresponding to each first-class communication content sample;
Generate, according to the segment attributes of the first-class word segments corresponding to each first-class communication content sample, the first-class feature vector corresponding to each first-class communication content sample;
Train a first initial identification model based on the first-class feature vector and the classification label of each first-class communication content sample, to obtain the first sensitive content identification model, where the first initial identification model is a model based on the support vector machine algorithm.
In one implementation, the second sensitive content identification model may be established according to the following steps:
Obtain a plurality of second-class communication content samples carrying classification labels, where a second-class communication content sample is a sample whose information type is passive information;
Perform word segmentation on each second-class communication content sample, to obtain at least one second-class word segment corresponding to each second-class communication content sample;
Generate, using the segment attribute generation rule, the segment attribute of each second-class word segment corresponding to each second-class communication content sample;
Generate, according to the segment attributes of the second-class word segments corresponding to each second-class communication content sample, the second-class feature vector corresponding to each second-class communication content sample;
Train a second initial identification model based on the second-class feature vector and the classification label of each second-class communication content sample, to obtain the second sensitive content identification model, where the second initial identification model is a model based on the Bayesian classification algorithm.
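The final training step for the second model can be illustrated with a small multinomial naive Bayes classifier. This is a sketch under assumptions, not the patent's own procedure: the text says only that the model is based on the Bayesian classification algorithm, so the multinomial variant, the Laplace smoothing parameter `alpha`, and all names below are choices made for the example. Feature vectors are assumed to hold non-negative weights.

```python
import math


def train_naive_bayes(vectors, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model on labeled feature vectors.

    vectors: list of equal-length lists of non-negative weights.
    labels: classification label of each vector.
    alpha: Laplace smoothing constant (an assumed default).
    """
    n_features = len(vectors[0])
    model = {}
    for label in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        # Total weight observed for each keyword position within this class.
        totals = [sum(column) for column in zip(*rows)]
        denominator = sum(totals) + alpha * n_features
        model[label] = {
            "log_prior": math.log(len(rows) / len(vectors)),
            "log_likelihood": [math.log((t + alpha) / denominator)
                               for t in totals],
        }
    return model


def predict(model, vector):
    """Return the label with the highest posterior log-score."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            x * ll for x, ll in zip(vector, params["log_likelihood"]))
    return max(model, key=score)
```

Training needs only one pass over the samples and prediction is a single weighted sum per class, which is consistent with the speed argument the text makes for using a Bayesian classifier on high-volume passive information.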
As can be seen from the above, sensitive content identification using the scheme provided by this embodiment of the present invention has all the advantages of the device embodiment shown in Fig. 3 and, in addition, takes full account of the information type of the target communication content: target communication content of different information types is identified with different sensitive content identification models. For example, the first sensitive content identification model, based on the support vector machine algorithm, identifies sensitive content in target communication content whose information type is active information, and the second sensitive content identification model, based on the Bayesian classification algorithm, identifies sensitive content in target communication content whose information type is passive information.
It should be noted that active information is generally information sent to the outside, whereas passive information is generally information sent from the outside. For active information, those skilled in the art are comparatively more concerned with the accuracy of sensitive content identification. For passive information, whose data volume is usually very large, they are more concerned with the speed of sensitive content identification. Because the support vector machine algorithm classifies more accurately, it can ensure the accuracy of sensitive content identification for active information; because the Bayesian classification algorithm classifies faster, it can ensure the speed of sensitive content identification for passive information. The needs of sensitive content identification are thereby met more reasonably.
Because the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for the relevant details, refer to the corresponding description of the method embodiment.
It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
A person of ordinary skill in the art will understand that all or part of the steps in the above method embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.
Claims (22)
1. A sensitive content identification method, characterized in that the method comprises:
performing word segmentation on target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content;
generating a segment attribute for each target word segment using a preset segment attribute generation rule;
generating, according to the segment attribute of each target word segment, a target feature vector corresponding to the target communication content;
inputting the target feature vector into a pre-established sensitive content identification model, to obtain an identification result indicating whether the target communication content is sensitive content, wherein the sensitive content identification model is a classification model obtained by training, with a preset machine learning algorithm, on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels, and the classification labels include a first label for identifying sensitive content or a second label for identifying non-sensitive content.
2. The method according to claim 1, characterized in that the step of obtaining the target communication content to be identified comprises:
capturing data packets transmitted in a network;
restoring the data content in the data packets based on a preset application layer protocol;
determining the restored data content to be the target communication content to be identified.
3. The method according to claim 1, characterized in that the step of performing word segmentation on the target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content, comprises:
dividing the target communication content into several words according to a preset first word division rule;
taking the words obtained by the division as the target word segments corresponding to the target communication content.
4. The method according to claim 1, characterized in that the step of performing word segmentation on the target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content, comprises:
dividing the target communication content into several words according to a preset second word division rule;
removing the stop words from the words obtained by the division, wherein a stop word is a word whose part of speech is adverb, preposition, or pronoun;
taking the remaining words as the target word segments corresponding to the target communication content.
5. The method according to claim 1, characterized in that the step of generating a segment attribute for each target word segment using a preset segment attribute generation rule comprises:
calculating the weight of each target word segment according to a preset weighting algorithm;
determining each calculated weight as the segment attribute of the corresponding target word segment.
6. The method according to claim 1, characterized in that the step of generating, according to the segment attribute of each target word segment, the target feature vector corresponding to the target communication content comprises:
judging, for each keyword in a preset keyword list, whether the keyword is identical to any target word segment;
determining the segment attribute of the target word segment corresponding to a first-class keyword as the feature attribute of the first-class keyword, and determining a preset default value as the feature attribute of a second-class keyword, wherein the first-class keyword is a keyword identical to a target word segment, and the second-class keyword is a keyword not identical to any target word segment;
taking the feature attribute of each keyword as a vector element of the target feature vector and, according to the element position of each keyword in the keyword list, generating the target feature vector corresponding to the target communication content.
7. The method according to claim 1, characterized in that the sensitive content identification model comprises:
a classification model established based on the support vector machine algorithm, the Bayesian classification algorithm, or a neural network classification algorithm.
8. The method according to claim 1, characterized in that, before the step of inputting the target feature vector into the pre-established sensitive content identification model, the method further comprises:
determining the information type of the target communication content according to the protocol type used when the target communication content is transmitted, wherein the information types include active information sent to the outside and passive information sent from the outside;
and the step of inputting the target feature vector into the pre-established sensitive content identification model, to obtain an identification result indicating whether the target communication content is sensitive content, wherein the sensitive content identification model is a classification model obtained by training, with a preset machine learning algorithm, on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels, comprises:
when the target communication content is active information, inputting the target feature vector into a pre-established first sensitive content identification model, to obtain an identification result indicating whether the target communication content is sensitive content, wherein the first sensitive content identification model is a classification model obtained by training, with the support vector machine algorithm, on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels;
when the target communication content is passive information, inputting the target feature vector into a pre-established second sensitive content identification model, to obtain an identification result indicating whether the target communication content is sensitive content, wherein the second sensitive content identification model is a classification model obtained by training, with the Bayesian classification algorithm, on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels.
9. The method according to claim 8, characterized in that the step of establishing the first sensitive content identification model comprises:
obtaining a plurality of first-class communication content samples carrying classification labels, wherein a first-class communication content sample is a sample whose information type is active information;
performing word segmentation on each first-class communication content sample, to obtain at least one first-class word segment corresponding to each first-class communication content sample;
generating, using the segment attribute generation rule, the segment attribute of each first-class word segment corresponding to each first-class communication content sample;
generating, according to the segment attributes of the first-class word segments corresponding to each first-class communication content sample, the first-class feature vector corresponding to each first-class communication content sample;
training a first initial identification model based on the first-class feature vector and the classification label of each first-class communication content sample, to obtain the first sensitive content identification model, wherein the first initial identification model is a model based on the support vector machine algorithm.
10. The method according to claim 8, characterized in that the step of establishing the second sensitive content identification model comprises:
obtaining a plurality of second-class communication content samples carrying classification labels, wherein a second-class communication content sample is a sample whose information type is passive information;
performing word segmentation on each second-class communication content sample, to obtain at least one second-class word segment corresponding to each second-class communication content sample;
generating, using the segment attribute generation rule, the segment attribute of each second-class word segment corresponding to each second-class communication content sample;
generating, according to the segment attributes of the second-class word segments corresponding to each second-class communication content sample, the second-class feature vector corresponding to each second-class communication content sample;
training a second initial identification model based on the second-class feature vector and the classification label of each second-class communication content sample, to obtain the second sensitive content identification model, wherein the second initial identification model is a model based on the Bayesian classification algorithm.
11. The method according to claim 8, characterized in that the step of determining the information type of the target communication content according to the protocol type used when the target communication content is transmitted comprises:
when the target communication content is a webpage: if the protocol type used to transmit the webpage is the POST type, determining that the target communication content is active information; if the protocol type used to transmit the webpage is the GET type, determining that the target communication content is passive information;
when the communication content to be identified is a mail: when the mail is transmitted according to a mail sending protocol, determining that the target communication content is active information; when the mail is transmitted according to a mail receiving protocol, determining that the target communication content is passive information.
12. A sensitive content identification device, characterized in that the device comprises:
a word segmentation module, configured to perform word segmentation on target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content;
a segment attribute generation module, configured to generate a segment attribute for each target word segment using a preset segment attribute generation rule;
a feature vector generation module, configured to generate, according to the segment attribute of each target word segment, a target feature vector corresponding to the target communication content;
a sensitive content identification module, configured to input the target feature vector into a pre-established sensitive content identification model, to obtain an identification result indicating whether the target communication content is sensitive content, wherein the sensitive content identification model is a classification model obtained by training, with a preset machine learning algorithm, on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels, and the classification labels include a first label for identifying sensitive content or a second label for identifying non-sensitive content.
13. The device according to claim 12, characterized in that the device further comprises:
a communication content acquisition module, specifically configured to:
capture data packets transmitted in a network;
restore the data content in the data packets based on a preset application layer protocol;
determine the restored data content to be the target communication content to be identified.
14. The device according to claim 12, characterized in that the word segmentation module comprises a first division submodule and a first segment determination submodule, wherein:
the first division submodule is configured to divide the target communication content into several words according to a preset first word division rule;
the first segment determination submodule is configured to take the words obtained by the division as the target word segments corresponding to the target communication content.
15. The device according to claim 12, characterized in that the word segmentation module comprises a second division submodule, a segment removal submodule, and a second segment determination submodule, wherein:
the second division submodule is configured to divide the target communication content into several words according to a preset second word division rule;
the segment removal submodule is configured to remove the stop words from the words obtained by the division, wherein a stop word is a word whose part of speech is adverb, preposition, or pronoun;
the second segment determination submodule is configured to take the remaining words as the target word segments corresponding to the target communication content.
16. The device according to claim 12, characterized in that the segment attribute generation module comprises a weight calculation submodule and a segment attribute determination submodule, wherein:
the weight calculation submodule is configured to calculate the weight of each target word segment according to a preset weighting algorithm;
the segment attribute determination submodule is configured to determine each calculated weight as the segment attribute of the corresponding target word segment.
17. The device according to claim 12, characterized in that the feature vector generation module comprises a segment judgment submodule, a feature attribute determination submodule, and a feature vector generation submodule, wherein:
the segment judgment submodule is configured to judge, for each keyword in a preset keyword list, whether the keyword is identical to any target word segment;
the feature attribute determination submodule is configured to determine the segment attribute of the target word segment corresponding to a first-class keyword as the feature attribute of the first-class keyword, and to determine a preset default value as the feature attribute of a second-class keyword, wherein the first-class keyword is a keyword identical to a target word segment, and the second-class keyword is a keyword not identical to any target word segment;
the feature vector generation submodule is configured to take the feature attribute of each keyword as a vector element of the target feature vector and, according to the element position of each keyword in the keyword list, generate the target feature vector corresponding to the target communication content.
18. The device according to claim 12, characterized in that the sensitive content identification model comprises:
a classification model established based on the support vector machine algorithm, the Bayesian classification algorithm, or a neural network classification algorithm.
19. The device according to claim 12, characterized in that the device further comprises:
an information type determination module, configured to:
before the target feature vector is input into the pre-established sensitive content identification model, determine the information type of the target communication content according to the protocol type used when the target communication content is transmitted, wherein the information types include active information sent to the outside and passive information sent from the outside;
and the sensitive content identification module comprises a first identification submodule and a second identification submodule, wherein:
the first identification submodule is configured to, when the target communication content is active information, input the target feature vector into a pre-established first sensitive content identification model, to obtain an identification result indicating whether the target communication content is sensitive content, wherein the first sensitive content identification model is a classification model obtained by training, with the support vector machine algorithm, on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels;
the second identification submodule is configured to, when the target communication content is passive information, input the target feature vector into a pre-established second sensitive content identification model, to obtain an identification result indicating whether the target communication content is sensitive content, wherein the second sensitive content identification model is a classification model obtained by training, with the Bayesian classification algorithm, on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels.
20. The device according to claim 19, characterized in that the step of establishing the first sensitive content identification model comprises:
obtaining a plurality of first-class communication content samples carrying classification labels; wherein the first-class communication content samples are samples whose information type is active information;
performing word segmentation on each first-class communication content sample respectively, to obtain at least one first-class word segment corresponding to each first-class communication content sample;
generating, using the word-segment attribute generation rule, the word-segment attributes of the first-class word segments corresponding to each first-class communication content sample;
generating, according to the word-segment attributes of the first-class word segments corresponding to each first-class communication content sample, the first-class feature vector corresponding to each first-class communication content sample;
training a first initial identification model based on the first-class feature vector and classification label corresponding to each first-class communication content sample, to obtain the first sensitive content identification model, wherein the first initial identification model is a model based on a support vector machine algorithm.
21. The device according to claim 19, characterized in that the step of establishing the second sensitive content identification model comprises:
obtaining a plurality of second-class communication content samples carrying classification labels; wherein the second-class communication content samples are samples whose information type is passive information;
performing word segmentation on each second-class communication content sample respectively, to obtain at least one second-class word segment corresponding to each second-class communication content sample;
generating, using the word-segment attribute generation rule, the word-segment attributes of the second-class word segments corresponding to each second-class communication content sample;
generating, according to the word-segment attributes of the second-class word segments corresponding to each second-class communication content sample, the second-class feature vector corresponding to each second-class communication content sample;
training a second initial identification model based on the second-class feature vector and classification label corresponding to each second-class communication content sample, to obtain the second sensitive content identification model, wherein the second initial identification model is a model based on a Bayesian classification algorithm.
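The training steps shared by claims 20 and 21 (labelled samples, word segmentation, attribute generation, feature vectors, model training) might be sketched as follows. This is only an illustration under loud assumptions: whitespace splitting stands in for real word segmentation, a plain occurrence count stands in for the word-segment attribute generation rule (which this section does not define), and a small multinomial naive Bayes with add-one smoothing stands in for the Bayesian classification algorithm. All function names and the sample texts are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def segment(text: str) -> List[str]:
    # Assumption: whitespace split as a placeholder for word segmentation.
    return text.lower().split()


def build_vocab(samples: List[Tuple[str, bool]]) -> Dict[str, int]:
    vocab: Dict[str, int] = {}
    for text, _ in samples:
        for word in segment(text):
            vocab.setdefault(word, len(vocab))
    return vocab


def to_vector(text: str, vocab: Dict[str, int]) -> List[int]:
    # Assumption: the only attribute used is the occurrence count per word.
    vector = [0] * len(vocab)
    for word in segment(text):
        if word in vocab:
            vector[vocab[word]] += 1
    return vector


def train_naive_bayes(vectors: List[List[int]], labels: List[bool]):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    n_features = len(vectors[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(vectors))
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + n_features
        log_likelihood[c] = [math.log((t + 1) / denom) for t in totals]
    return log_prior, log_likelihood


def predict(model, vector: List[int]) -> bool:
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(x * w for x, w in zip(vector, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical labelled samples: True marks sensitive content.
samples = [
    ("secret contract price list", True),
    ("confidential salary secret", True),
    ("weather forecast sunny", False),
    ("lunch menu today", False),
]
vocab = build_vocab(samples)
vectors = [to_vector(text, vocab) for text, _ in samples]
labels = [label for _, label in samples]
model = train_naive_bayes(vectors, labels)
```

With these toy samples, `predict(model, to_vector("secret salary", vocab))` classifies the text as sensitive. Swapping `train_naive_bayes` for a support vector machine trainer would give the analogue of claim 20 while leaving the segmentation and vectorization steps unchanged.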
22. The device according to claim 19, characterized in that the information type determining module comprises: a first information type determination sub-module and a second information type determination sub-module; wherein,
the first information type determination sub-module is configured to, in the case where the target communication content is a web page: if the protocol type used when transmitting the web page is the POST type, determine that the target communication content is active information; if the protocol type used when transmitting the web page is the GET type, determine that the target communication content is passive information;
the second information type determination sub-module is configured to, in the case where the target communication content to be identified is mail: when the mail is transmitted according to a mail sending protocol, determine that the target communication content is active information; when the mail is transmitted according to a mail receiving protocol, determine that the target communication content is passive information.
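The protocol-based rule of claim 22 can be expressed as a small lookup. The sketch below is hedged: the claim only speaks of "mail sending" and "mail receiving" protocols, so treating SMTP as the sending protocol and POP3 and IMAP as receiving protocols is an assumption, as are the function name `determine_info_type` and the string labels used for content kinds and information types.

```python
from typing import Optional

# Assumption: concrete protocol names are not given in the claim.
MAIL_SENDING_PROTOCOLS = {"SMTP"}
MAIL_RECEIVING_PROTOCOLS = {"POP3", "IMAP"}


def determine_info_type(content_kind: str, protocol: str) -> Optional[str]:
    """Return "active", "passive", or None when the rule does not apply."""
    name = protocol.upper()
    if content_kind == "webpage":
        if name == "POST":
            return "active"    # page data sent to the outside
        if name == "GET":
            return "passive"   # page data received from the outside
    elif content_kind == "mail":
        if name in MAIL_SENDING_PROTOCOLS:
            return "active"
        if name in MAIL_RECEIVING_PROTOCOLS:
            return "passive"
    return None  # content kind or protocol not covered by the claim
```

Returning `None` for unknown combinations leaves it to the caller to decide how to handle traffic that the claim does not classify.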
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610822280.3A CN107818077A (en) | 2016-09-13 | 2016-09-13 | A kind of sensitive content recognition methods and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107818077A true CN107818077A (en) | 2018-03-20 |
Family
ID=61600614
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610822280.3A Pending CN107818077A (en) | 2016-09-13 | 2016-09-13 | A kind of sensitive content recognition methods and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107818077A (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108829680A (en) * | 2018-06-22 | 2018-11-16 | 北京百悟科技有限公司 | A kind of violation publicity detection method and device, computer readable storage medium |
CN109063155A (en) * | 2018-08-10 | 2018-12-21 | 广州锋网信息科技有限公司 | Language model parameter determination method, device and computer equipment |
CN109241330A (en) * | 2018-08-20 | 2019-01-18 | 北京百度网讯科技有限公司 | The method, apparatus, equipment and medium of key phrase in audio for identification |
CN109407504A (en) * | 2018-11-30 | 2019-03-01 | 华南理工大学 | A kind of personal safety detection system and method based on smartwatch |
CN109858280A (en) * | 2019-01-21 | 2019-06-07 | 深圳昂楷科技有限公司 | A kind of desensitization method based on machine learning, device and desensitization equipment |
CN110008470A (en) * | 2019-03-19 | 2019-07-12 | 阿里巴巴集团控股有限公司 | The sensibility stage division and device of report |
CN110134966A (en) * | 2019-05-21 | 2019-08-16 | 中电健康云科技有限公司 | A kind of sensitive information determines method and device |
CN110275958A (en) * | 2019-06-26 | 2019-09-24 | 北京市博汇科技股份有限公司 | Site information recognition methods, device and electronic equipment |
CN110472234A (en) * | 2019-07-19 | 2019-11-19 | 平安科技(深圳)有限公司 | Sensitive text recognition method, device, medium and computer equipment |
CN110737770A (en) * | 2018-07-03 | 2020-01-31 | 百度在线网络技术(北京)有限公司 | Text data sensitivity identification method and device, electronic equipment and storage medium |
CN110750981A (en) * | 2019-10-16 | 2020-02-04 | 杭州安恒信息技术股份有限公司 | High-accuracy website sensitive word detection method based on machine learning |
CN110784330A (en) * | 2018-07-30 | 2020-02-11 | 华为技术有限公司 | Method and device for generating application recognition model |
CN110826320A (en) * | 2019-11-28 | 2020-02-21 | 上海观安信息技术股份有限公司 | Sensitive data discovery method and system based on text recognition |
WO2020063071A1 (en) * | 2018-09-27 | 2020-04-02 | 厦门快商通信息技术有限公司 | Sentence vector calculation method based on chi-square test, and text classification method and system |
CN111291570A (en) * | 2018-12-07 | 2020-06-16 | 北京国双科技有限公司 | Method and device for realizing element identification in judicial documents |
CN111898060A (en) * | 2020-07-14 | 2020-11-06 | 大汉软件股份有限公司 | Content automatic monitoring method based on deep learning |
CN112069500A (en) * | 2020-09-17 | 2020-12-11 | 杭州安恒信息技术股份有限公司 | Application software detection method, device and medium |
CN112434331A (en) * | 2020-11-20 | 2021-03-02 | 百度在线网络技术(北京)有限公司 | Data desensitization method, device, equipment and storage medium |
CN113220980A (en) * | 2020-02-06 | 2021-08-06 | 北京沃东天骏信息技术有限公司 | Article attribute word recognition method, device, equipment and storage medium |
CN113434775A (en) * | 2021-07-15 | 2021-09-24 | 北京达佳互联信息技术有限公司 | Method and device for determining search content |
US20230012656A1 (en) * | 2021-04-02 | 2023-01-19 | Sift Science, Inc. | Systems and methods for automated labeling of subscriber digital event data in a machine learning-based digital threat mitigation platform |
CN117197538A (en) * | 2023-08-16 | 2023-12-08 | 哈尔滨工业大学 | Bayesian convolution neural network structure apparent damage identification method based on Gaussian distribution weight sampling |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070276773A1 (en) * | 2006-03-06 | 2007-11-29 | Murali Aravamudan | Methods and systems for selecting and presenting content on a first system based on user preferences learned on a second system |
CN102012985A (en) * | 2010-11-19 | 2011-04-13 | 国网电力科学研究院 | Sensitive data dynamic identification method based on data mining |
CN102289522A (en) * | 2011-09-19 | 2011-12-21 | 北京金和软件股份有限公司 | Method of intelligently classifying texts |
CN103744905A (en) * | 2013-12-25 | 2014-04-23 | 新浪网技术(中国)有限公司 | Junk mail judgment method and device |
CN103761221A (en) * | 2013-12-31 | 2014-04-30 | 北京京东尚科信息技术有限公司 | System and method for identifying sensitive text messages |
CN105631049A (en) * | 2016-02-17 | 2016-06-01 | 北京奇虎科技有限公司 | Method and system for recognizing defrauding short messages |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108829680A (en) * | 2018-06-22 | 2018-11-16 | 北京百悟科技有限公司 | A kind of violation publicity detection method and device, computer readable storage medium |
CN110737770A (en) * | 2018-07-03 | 2020-01-31 | 百度在线网络技术(北京)有限公司 | Text data sensitivity identification method and device, electronic equipment and storage medium |
CN110737770B (en) * | 2018-07-03 | 2023-01-20 | 百度在线网络技术(北京)有限公司 | Text data sensitivity identification method and device, electronic equipment and storage medium |
CN110784330B (en) * | 2018-07-30 | 2022-04-05 | 华为技术有限公司 | Method and device for generating application recognition model |
CN110784330A (en) * | 2018-07-30 | 2020-02-11 | 华为技术有限公司 | Method and device for generating application recognition model |
CN109063155A (en) * | 2018-08-10 | 2018-12-21 | 广州锋网信息科技有限公司 | Language model parameter determination method, device and computer equipment |
CN109241330A (en) * | 2018-08-20 | 2019-01-18 | 北京百度网讯科技有限公司 | The method, apparatus, equipment and medium of key phrase in audio for identification |
US11308937B2 (en) | 2018-08-20 | 2022-04-19 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for identifying key phrase in audio, device and medium |
WO2020063071A1 (en) * | 2018-09-27 | 2020-04-02 | 厦门快商通信息技术有限公司 | Sentence vector calculation method based on chi-square test, and text classification method and system |
CN109407504A (en) * | 2018-11-30 | 2019-03-01 | 华南理工大学 | A kind of personal safety detection system and method based on smartwatch |
CN109407504B (en) * | 2018-11-30 | 2021-05-14 | 华南理工大学 | Personal safety detection system and method based on smart watch |
CN111291570A (en) * | 2018-12-07 | 2020-06-16 | 北京国双科技有限公司 | Method and device for realizing element identification in judicial documents |
CN109858280A (en) * | 2019-01-21 | 2019-06-07 | 深圳昂楷科技有限公司 | A kind of desensitization method based on machine learning, device and desensitization equipment |
CN110008470A (en) * | 2019-03-19 | 2019-07-12 | 阿里巴巴集团控股有限公司 | The sensibility stage division and device of report |
CN110134966A (en) * | 2019-05-21 | 2019-08-16 | 中电健康云科技有限公司 | A kind of sensitive information determines method and device |
CN110275958A (en) * | 2019-06-26 | 2019-09-24 | 北京市博汇科技股份有限公司 | Site information recognition methods, device and electronic equipment |
CN110275958B (en) * | 2019-06-26 | 2021-07-27 | 北京市博汇科技股份有限公司 | Website information identification method and device and electronic equipment |
CN110472234A (en) * | 2019-07-19 | 2019-11-19 | 平安科技(深圳)有限公司 | Sensitive text recognition method, device, medium and computer equipment |
CN110750981A (en) * | 2019-10-16 | 2020-02-04 | 杭州安恒信息技术股份有限公司 | High-accuracy website sensitive word detection method based on machine learning |
CN110826320B (en) * | 2019-11-28 | 2023-10-13 | 上海观安信息技术股份有限公司 | Sensitive data discovery method and system based on text recognition |
CN110826320A (en) * | 2019-11-28 | 2020-02-21 | 上海观安信息技术股份有限公司 | Sensitive data discovery method and system based on text recognition |
CN113220980A (en) * | 2020-02-06 | 2021-08-06 | 北京沃东天骏信息技术有限公司 | Article attribute word recognition method, device, equipment and storage medium |
CN111898060A (en) * | 2020-07-14 | 2020-11-06 | 大汉软件股份有限公司 | Content automatic monitoring method based on deep learning |
CN112069500A (en) * | 2020-09-17 | 2020-12-11 | 杭州安恒信息技术股份有限公司 | Application software detection method, device and medium |
CN112434331B (en) * | 2020-11-20 | 2023-08-18 | 百度在线网络技术(北京)有限公司 | Data desensitization method, device, equipment and storage medium |
CN112434331A (en) * | 2020-11-20 | 2021-03-02 | 百度在线网络技术(北京)有限公司 | Data desensitization method, device, equipment and storage medium |
US20230012656A1 (en) * | 2021-04-02 | 2023-01-19 | Sift Science, Inc. | Systems and methods for automated labeling of subscriber digital event data in a machine learning-based digital threat mitigation platform |
US11645386B2 (en) * | 2021-04-02 | 2023-05-09 | Sift Science, Inc. | Systems and methods for automated labeling of subscriber digital event data in a machine learning-based digital threat mitigation platform |
CN113434775A (en) * | 2021-07-15 | 2021-09-24 | 北京达佳互联信息技术有限公司 | Method and device for determining search content |
CN113434775B (en) * | 2021-07-15 | 2024-03-26 | 北京达佳互联信息技术有限公司 | Method and device for determining search content |
CN117197538A (en) * | 2023-08-16 | 2023-12-08 | 哈尔滨工业大学 | Bayesian convolution neural network structure apparent damage identification method based on Gaussian distribution weight sampling |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107818077A (en) | A kind of sensitive content recognition methods and device | |
CN107391760B (en) | User interest recognition methods, device and computer readable storage medium | |
US20170063893A1 (en) | Learning detector of malicious network traffic from weak labels | |
CN105630767B (en) | The comparative approach and device of a kind of text similarity | |
CN107992596A (en) | A kind of Text Clustering Method, device, server and storage medium | |
CN106489149A (en) | A kind of data mask method based on data mining and mass-rent and system | |
CN107491435A (en) | Method and device based on Computer Automatic Recognition user feeling | |
WO2019179010A1 (en) | Data set acquisition method, classification method and device, apparatus, and storage medium | |
CN112686022A (en) | Method and device for detecting illegal corpus, computer equipment and storage medium | |
CN107391675A (en) | Method and apparatus for generating structure information | |
CN113762377B (en) | Network traffic identification method, device, equipment and storage medium | |
CN108491866A (en) | Porny identification method, electronic device and readable storage medium storing program for executing | |
CN110347840A (en) | Complain prediction technique, system, equipment and the storage medium of text categories | |
CN110019790A (en) | Text identification, text monitoring, data object identification, data processing method | |
CN116489152B (en) | Linkage control method and device for Internet of things equipment, electronic equipment and medium | |
CN109766441A (en) | File classification method, apparatus and system | |
CN110458296A (en) | The labeling method and device of object event, storage medium and electronic device | |
CN112580108A (en) | Signature and seal integrity verification method and computer equipment | |
CN112884121A (en) | Traffic identification method based on generation of confrontation deep convolutional network | |
CN108536673A (en) | Media event abstracting method and device | |
Shrivastav et al. | Network traffic classification using semi-supervised approach | |
CN108268461A (en) | A kind of document sorting apparatus based on hybrid classifer | |
WO2024055603A1 (en) | Method and apparatus for identifying text from minor | |
CN117371049A (en) | Machine-generated text detection method and system based on blockchain and generated countermeasure network | |
CN107506407A (en) | A kind of document classification, the method and device called |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180320 |