CN107818077A - Sensitive content recognition method and device - Google Patents

Publication number
CN107818077A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610822280.3A
Other languages
Chinese (zh)
Inventor
吕昭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Beijing Kingsoft Cloud Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Beijing Kingsoft Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd and Beijing Kingsoft Cloud Technology Co Ltd
Priority to CN201610822280.3A
Publication of CN107818077A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Abstract

Embodiments of the present invention provide a sensitive content recognition method and device. The method includes: performing word segmentation on target communication content to be identified, to obtain at least one corresponding target word segment; generating a word-segment attribute for each target word segment using a preset word-segment attribute generation rule, and generating a target feature vector corresponding to the target communication content; and inputting the target feature vector into a pre-established sensitive content recognition model to obtain a recognition result indicating whether the target communication content is sensitive content. With the scheme provided by the embodiments, recognition does not rely on simple literal matching between keywords in the communication content and sensitive words in a preset dictionary. Instead, the scheme uses word-segment attributes, which go deeper than the literal level of the word segments of the communication content, generates a feature vector from the word-segment attribute of each word segment, and performs recognition with a sensitive content recognition model trained by a machine learning algorithm, which improves the accuracy of sensitive content recognition.

Description

Sensitive content recognition method and device
Technical field
The present invention relates to the technical field of network security, and in particular to a sensitive content recognition method and device.
Background
In recent years, with the development of network technology, the requirements for network security have become increasingly high. In particular, company management, copyright management, national security and similar areas have increasingly urgent network security requirements. In view of this, network content and network users need to be monitored so that sensitive content in the network can be identified in time and sensitive users in the network can then be identified from that content, thereby safeguarding network security.
The prior art provides a sensitive content recognition scheme based on keyword matching. To identify sensitive content, the communication content to be identified is first obtained and segmented into words according to a preset word segmentation rule. Keywords are then selected from the resulting word segments and matched against the sensitive words in a preset dictionary. Specifically, when a sensitive word identical to a selected keyword is found in the preset dictionary, that keyword is regarded as a matched sensitive word; when the matching result meets a preset condition (for example, the number of matched sensitive words exceeds a preset number), the communication content to be identified is judged to be sensitive content.
As can be seen from the above, although this scheme can identify sensitive content in the network, it matches the keywords in the communication content to be identified only at the literal level, which easily leads to low recognition accuracy. For example, suppose the communication content to be identified contains the keyword "Apple mobile phone", and the preset dictionary contains "iphone" but not "Apple mobile phone". "Apple mobile phone" cannot be matched with "iphone", so "Apple mobile phone" is not considered a sensitive word. The scheme can therefore leave a great many keywords in the communication content unmatched with the sensitive words in the preset dictionary, resulting in low recognition accuracy.
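The limitation of literal matching can be illustrated with a minimal Python sketch. The function name `literal_match_is_sensitive`, the `preset_number` parameter and the sample dictionary are hypothetical names chosen for illustration; the patent does not specify an implementation of the prior-art scheme.

```python
def literal_match_is_sensitive(keywords, sensitive_words, preset_number=0):
    """Prior-art style check: count keywords that literally equal a dictionary entry."""
    matched = [kw for kw in keywords if kw in sensitive_words]
    # Content is judged sensitive when the match count exceeds the preset number.
    return len(matched) > preset_number, matched


# Hypothetical preset dictionary containing only the literal form "iphone".
sensitive_words = {"iphone"}

# "Apple mobile phone" refers to the same thing but is literally different, so it is missed.
missed, missed_hits = literal_match_is_sensitive(["Apple mobile phone", "price"], sensitive_words)
found, found_hits = literal_match_is_sensitive(["iphone", "price"], sensitive_words)
```

Under these assumptions the first call reports no match while the second does, which is the accuracy gap the invention aims to close.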
Summary of the invention
The purpose of the embodiments of the present invention is to provide a sensitive content recognition method and device, so as to improve the accuracy of sensitive content recognition.
To achieve the above purpose, an embodiment of the present invention discloses a sensitive content recognition method, the method including:
Performing word segmentation on target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content;
Generating a word-segment attribute for each target word segment using a preset word-segment attribute generation rule;
Generating, according to the word-segment attribute of each target word segment, a target feature vector corresponding to the target communication content;
Inputting the target feature vector into a pre-established sensitive content recognition model to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the sensitive content recognition model is a classification model obtained by using a preset machine learning algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels, and the classification label is either a first label identifying sensitive content or a second label identifying non-sensitive content.
Optionally, the step of obtaining the target communication content to be identified includes:
Collecting data packets transmitted in the network;
Performing restoration processing on the data content in the data packets based on a preset application-layer protocol;
Determining the restored data content as the target communication content to be identified.
Optionally, the step of performing word segmentation on the target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content, includes:
Dividing the target communication content into several words according to a preset first word division rule;
Taking the words obtained by the division as the target word segments corresponding to the target communication content.
Optionally, the step of performing word segmentation on the target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content, includes:
Dividing the target communication content into several words according to a preset second word division rule;
Removing the stop words from the words obtained by the division, wherein a stop word is a word whose part of speech is adverb, preposition or pronoun;
Taking the remaining words as the target word segments corresponding to the target communication content.
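The second segmentation variant (divide, then drop stop words by part of speech) can be sketched as follows. This is only an illustration under stated assumptions: the patent does not define the word division rule, so a simple regular-expression split on English text stands in for it, and `POS_LOOKUP` is a hypothetical hand-written table standing in for a real part-of-speech tagger.

```python
import re

# Hypothetical part-of-speech table; a real system would use a tagger.
POS_LOOKUP = {
    "they": "pronoun", "it": "pronoun",
    "to": "preposition", "in": "preposition",
    "very": "adverb", "quickly": "adverb",
}
STOP_PARTS_OF_SPEECH = {"adverb", "preposition", "pronoun"}


def divide_into_words(text):
    """Stand-in for the preset word division rule: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def target_word_segments(text):
    """Divide the text, then remove words whose part of speech marks them as stop words."""
    return [
        word for word in divide_into_words(text)
        if POS_LOOKUP.get(word) not in STOP_PARTS_OF_SPEECH
    ]


segments = target_word_segments("They quickly sent it to Bob")
```

The first variant corresponds to using `divide_into_words` alone, without the stop-word filter.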
Optionally, the step of generating the word-segment attribute of each target word segment using the preset word-segment attribute generation rule includes:
Calculating the weight of each target word segment according to a preset weight algorithm;
Determining the calculated weight as the word-segment attribute of the corresponding target word segment.
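The weight calculation can be sketched as below. The patent leaves the preset weight algorithm open, so this example assumes plain term frequency (occurrences divided by the total number of word segments) purely for illustration; `segment_weights` is a hypothetical name.

```python
from collections import Counter


def segment_weights(segments):
    """Assumed weight algorithm: term frequency of each word segment in the content."""
    counts = Counter(segments)
    total = len(segments)
    # An empty input yields an empty mapping, so no division by zero occurs.
    return {word: count / total for word, count in counts.items()}


weights = segment_weights(["transfer", "secret", "transfer", "file"])
```

Any other weighting scheme could be substituted, as long as it yields one attribute value per target word segment.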
Optionally, the step of generating, according to the word-segment attribute of each target word segment, the target feature vector corresponding to the target communication content includes:
Judging, for each keyword in a preset keyword list, whether the keyword is identical to any target word segment;
Determining the word-segment attribute of the target word segment corresponding to a first-class keyword as the feature attribute of that first-class keyword, and determining a preset default value as the feature attribute of a second-class keyword, wherein a first-class keyword is a keyword identical to a target word segment, and a second-class keyword is a keyword identical to none of the target word segments;
Taking the feature attribute corresponding to each keyword as a vector element of the target feature vector, and generating the target feature vector corresponding to the target communication content according to the position of each keyword in the keyword list.
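The feature vector construction described above can be sketched in a few lines. The function name `build_feature_vector`, the sample keyword list and the default value of 0.0 are assumptions for illustration; the patent only requires a preset keyword list and a preset default value.

```python
def build_feature_vector(keyword_list, segment_attributes, default_value=0.0):
    """Place each keyword's attribute at the keyword's position in the list.

    Keywords that match a target word segment (first class) take that segment's
    attribute; keywords with no match (second class) take the default value.
    """
    return [segment_attributes.get(keyword, default_value) for keyword in keyword_list]


keyword_list = ["secret", "iphone", "transfer"]           # hypothetical preset keyword list
segment_attributes = {"iphone": 0.5, "hello": 0.5}        # attribute per target word segment
vector = build_feature_vector(keyword_list, segment_attributes)
```

Because the vector length and element order depend only on the keyword list, every piece of communication content maps to a vector of the same shape, which is what the classification model requires.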
Optionally, the sensitive content recognition model includes:
A classification model established based on a support vector machine algorithm, a Bayesian classification algorithm or a neural network classification algorithm.
Optionally, before the step of inputting the target feature vector into the pre-established sensitive content recognition model, the method further includes:
Determining the information type of the target communication content according to the protocol type used when the target communication content is transmitted, wherein the information type includes active information sent to the outside and passive information sent from the outside;
The step of inputting the target feature vector into the pre-established sensitive content recognition model to obtain the recognition result indicating whether the target communication content is sensitive content, wherein the sensitive content recognition model is a classification model obtained by using a preset machine learning algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels, includes:
When the target communication content is active information, inputting the target feature vector into a pre-established first sensitive content recognition model to obtain the recognition result indicating whether the target communication content is sensitive content, wherein the first sensitive content recognition model is a classification model obtained by using a support vector machine algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels;
When the target communication content is passive information, inputting the target feature vector into a pre-established second sensitive content recognition model to obtain the recognition result indicating whether the target communication content is sensitive content, wherein the second sensitive content recognition model is a classification model obtained by using a Bayesian classification algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels.
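The selection between the two models can be sketched as a simple dispatch. This sketch assumes that passive information is routed to the second model, consistent with the second model being trained on passive-information samples; the function `recognize`, the string values "active" and "passive", and the callable model interface are all hypothetical.

```python
def recognize(feature_vector, information_type, first_model, second_model):
    """Route the feature vector to the model that matches the information type.

    first_model and second_model are any callables that take a feature vector
    and return True for sensitive content and False otherwise.
    """
    if information_type == "active":
        return first_model(feature_vector)
    if information_type == "passive":
        return second_model(feature_vector)
    raise ValueError(f"unknown information type: {information_type!r}")


# Placeholder models for illustration only; real models would be trained classifiers.
def first_model(vector):
    return sum(vector) > 0.5


def second_model(vector):
    return sum(vector) > 1.5


active_result = recognize([0.4, 0.4], "active", first_model, second_model)
passive_result = recognize([0.4, 0.4], "passive", first_model, second_model)
```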
Optionally, the step of establishing the first sensitive content recognition model includes:
Obtaining a plurality of first-class communication content samples carrying classification labels, wherein a first-class communication content sample is a sample whose information type is active information;
Performing word segmentation on each first-class communication content sample, to obtain at least one first-class word segment corresponding to each first-class communication content sample;
Generating, using the word-segment attribute generation rule, the word-segment attributes of the first-class word segments corresponding to each first-class communication content sample;
Generating, according to the word-segment attributes of the first-class word segments corresponding to each first-class communication content sample, a first-class feature vector corresponding to each first-class communication content sample;
Training a first initial recognition model based on the first-class feature vector and the classification label corresponding to each first-class communication content sample, to obtain the first sensitive content recognition model, wherein the first initial recognition model is a model based on a support vector machine algorithm.
Optionally, the step of establishing the second sensitive content recognition model includes:
Obtaining a plurality of second-class communication content samples carrying classification labels, wherein a second-class communication content sample is a sample whose information type is passive information;
Performing word segmentation on each second-class communication content sample, to obtain at least one second-class word segment corresponding to each second-class communication content sample;
Generating, using the word-segment attribute generation rule, the word-segment attributes of the second-class word segments corresponding to each second-class communication content sample;
Generating, according to the word-segment attributes of the second-class word segments corresponding to each second-class communication content sample, a second-class feature vector corresponding to each second-class communication content sample;
Training a second initial recognition model based on the second-class feature vector and the classification label corresponding to each second-class communication content sample, to obtain the second sensitive content recognition model, wherein the second initial recognition model is a model based on a Bayesian classification algorithm.
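The final training step for a Bayesian model can be sketched as follows. The patent names only "a Bayesian classification algorithm", so this example assumes a multinomial naive Bayes classifier with Laplace smoothing, written in plain Python; the function names, label strings and toy feature vectors are hypothetical, and a production system would more likely use an established machine learning library.

```python
import math


def train_naive_bayes(vectors, labels, alpha=1.0):
    """Fit multinomial naive Bayes on non-negative feature vectors with labels."""
    n_features = len(vectors[0])
    model = {}
    for label in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        sums = [sum(row[j] for row in rows) for j in range(n_features)]
        total = sum(sums) + alpha * n_features  # Laplace-smoothed normaliser
        model[label] = {
            "log_prior": math.log(len(rows) / len(vectors)),
            "log_likelihood": [math.log((s + alpha) / total) for s in sums],
        }
    return model


def predict(model, vector):
    """Return the label with the highest log posterior score."""
    def score(params):
        return params["log_prior"] + sum(
            x * ll for x, ll in zip(vector, params["log_likelihood"])
        )
    return max(model, key=lambda label: score(model[label]))


# Toy feature vectors over a hypothetical keyword list ["transfer", "secret", "lunch"].
vectors = [[1, 1, 0], [0, 2, 0], [0, 0, 1], [0, 0, 2]]
labels = ["sensitive", "sensitive", "non_sensitive", "non_sensitive"]
model = train_naive_bayes(vectors, labels)
```

The first sensitive content recognition model would be trained the same way on first-class samples, with a support vector machine in place of the Bayesian classifier.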
Optionally, the step of determining the information type of the target communication content according to the protocol type used when the target communication content is transmitted includes:
In the case where the target communication content is a webpage: if the protocol type used when transmitting the webpage is the POST type, determining that the target communication content is active information; if the protocol type used when transmitting the webpage is the GET type, determining that the target communication content is passive information;
In the case where the communication content to be identified is mail: when the mail is transmitted according to a mail sending protocol, determining that the target communication content is active information; when the mail is transmitted according to a mail receiving protocol, determining that the target communication content is passive information.
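The information type rules above can be sketched as a small lookup function. The claims do not name specific mail protocols, so this sketch assumes SMTP as the mail sending protocol and POP3 as the mail receiving protocol; the function name, the content kind strings and the return values are hypothetical.

```python
MAIL_SENDING_PROTOCOLS = {"SMTP"}     # assumption: SMTP is the mail sending protocol
MAIL_RECEIVING_PROTOCOLS = {"POP3"}   # assumption: POP3 is the mail receiving protocol


def determine_information_type(content_kind, protocol_type):
    """Map (content kind, protocol type) to 'active', 'passive', or None if undetermined."""
    protocol_type = protocol_type.upper()
    if content_kind == "webpage":
        if protocol_type == "POST":
            return "active"
        if protocol_type == "GET":
            return "passive"
    elif content_kind == "mail":
        if protocol_type in MAIL_SENDING_PROTOCOLS:
            return "active"
        if protocol_type in MAIL_RECEIVING_PROTOCOLS:
            return "passive"
    return None


web_upload = determine_information_type("webpage", "POST")
web_browse = determine_information_type("webpage", "get")
mail_out = determine_information_type("mail", "SMTP")
mail_in = determine_information_type("mail", "POP3")
```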
To achieve the above purpose, an embodiment of the present invention discloses a sensitive content recognition device, the device including:
A word segmentation module, configured to perform word segmentation on target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content;
A word-segment attribute generation module, configured to generate a word-segment attribute for each target word segment using a preset word-segment attribute generation rule;
A feature vector generation module, configured to generate, according to the word-segment attribute of each target word segment, a target feature vector corresponding to the target communication content;
A sensitive content recognition module, configured to input the target feature vector into a pre-established sensitive content recognition model to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the sensitive content recognition model is a classification model obtained by using a preset machine learning algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels, and the classification label is either a first label identifying sensitive content or a second label identifying non-sensitive content.
Optionally, the sensitive content recognition device provided by the embodiment of the present invention further includes:
A communication content obtaining module, specifically configured to:
Collect data packets transmitted in the network;
Perform restoration processing on the data content in the data packets based on a preset application-layer protocol;
Determine the restored data content as the target communication content to be identified.
Optionally, the word segmentation module includes a first division submodule and a first word-segment determination submodule, wherein:
The first division submodule is configured to divide the target communication content into several words according to a preset first word division rule;
The first word-segment determination submodule is configured to take the words obtained by the division as the target word segments corresponding to the target communication content.
Optionally, the word segmentation module includes a second division submodule, a word removal submodule and a second word-segment determination submodule, wherein:
The second division submodule is configured to divide the target communication content into several words according to a preset second word division rule;
The word removal submodule is configured to remove the stop words from the words obtained by the division, wherein a stop word is a word whose part of speech is adverb, preposition or pronoun;
The second word-segment determination submodule is configured to take the remaining words as the target word segments corresponding to the target communication content.
Optionally, the word-segment attribute generation module includes a weight calculation submodule and a word-segment attribute determination submodule, wherein:
The weight calculation submodule is configured to calculate the weight of each target word segment according to a preset weight algorithm;
The word-segment attribute determination submodule is configured to determine the calculated weight as the word-segment attribute of the corresponding target word segment.
Optionally, the feature vector generation module includes a word-segment judging submodule, a feature attribute determination submodule and a feature vector generation submodule, wherein:
The word-segment judging submodule is configured to judge, for each keyword in a preset keyword list, whether the keyword is identical to any target word segment;
The feature attribute determination submodule is configured to determine the word-segment attribute of the target word segment corresponding to a first-class keyword as the feature attribute of that first-class keyword, and to determine a preset default value as the feature attribute of a second-class keyword, wherein a first-class keyword is a keyword identical to a target word segment, and a second-class keyword is a keyword identical to none of the target word segments;
The feature vector generation submodule is configured to take the feature attribute corresponding to each keyword as a vector element of the target feature vector, and to generate the target feature vector corresponding to the target communication content according to the position of each keyword in the keyword list.
Optionally, the sensitive content recognition model includes:
A classification model established based on a support vector machine algorithm, a Bayesian classification algorithm or a neural network classification algorithm.
Optionally, the sensitive content recognition device provided by the embodiment of the present invention further includes:
An information type determining module, configured to:
Before the target feature vector is input into the pre-established sensitive content recognition model, determine the information type of the target communication content according to the protocol type used when the target communication content is transmitted, wherein the information type includes active information sent to the outside and passive information sent from the outside;
The sensitive content recognition module includes a first recognition submodule and a second recognition submodule, wherein:
The first recognition submodule is configured to, when the target communication content is active information, input the target feature vector into a pre-established first sensitive content recognition model to obtain the recognition result indicating whether the target communication content is sensitive content, wherein the first sensitive content recognition model is a classification model obtained by using a support vector machine algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels;
The second recognition submodule is configured to, when the target communication content is passive information, input the target feature vector into a pre-established second sensitive content recognition model to obtain the recognition result indicating whether the target communication content is sensitive content, wherein the second sensitive content recognition model is a classification model obtained by using a Bayesian classification algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels.
Optionally, the step of establishing the first sensitive content recognition model includes:
Obtaining a plurality of first-class communication content samples carrying classification labels, wherein a first-class communication content sample is a sample whose information type is active information;
Performing word segmentation on each first-class communication content sample, to obtain at least one first-class word segment corresponding to each first-class communication content sample;
Generating, using the word-segment attribute generation rule, the word-segment attributes of the first-class word segments corresponding to each first-class communication content sample;
Generating, according to the word-segment attributes of the first-class word segments corresponding to each first-class communication content sample, a first-class feature vector corresponding to each first-class communication content sample;
Training a first initial recognition model based on the first-class feature vector and the classification label corresponding to each first-class communication content sample, to obtain the first sensitive content recognition model, wherein the first initial recognition model is a model based on a support vector machine algorithm.
Optionally, the step of establishing the second sensitive content recognition model includes:
Obtaining a plurality of second-class communication content samples carrying classification labels, wherein a second-class communication content sample is a sample whose information type is passive information;
Performing word segmentation on each second-class communication content sample, to obtain at least one second-class word segment corresponding to each second-class communication content sample;
Generating, using the word-segment attribute generation rule, the word-segment attributes of the second-class word segments corresponding to each second-class communication content sample;
Generating, according to the word-segment attributes of the second-class word segments corresponding to each second-class communication content sample, a second-class feature vector corresponding to each second-class communication content sample;
Training a second initial recognition model based on the second-class feature vector and the classification label corresponding to each second-class communication content sample, to obtain the second sensitive content recognition model, wherein the second initial recognition model is a model based on a Bayesian classification algorithm.
Optionally, the information type determining module includes a first information type determination submodule and a second information type determination submodule, wherein:
The first information type determination submodule is configured to, in the case where the target communication content is a webpage: if the protocol type used when transmitting the webpage is the POST type, determine that the target communication content is active information; if the protocol type used when transmitting the webpage is the GET type, determine that the target communication content is passive information;
The second information type determination submodule is configured to, in the case where the communication content to be identified is mail: when the mail is transmitted using a mail sending protocol, determine that the target communication content is active information; when the mail is transmitted according to a mail receiving protocol, determine that the target communication content is passive information.
The embodiments of the present invention provide a sensitive content recognition method and device. When sensitive content recognition is performed, word segmentation is first performed on the target communication content to be identified, to obtain at least one corresponding target word segment; next, the word-segment attribute of each target word segment is generated using a preset word-segment attribute generation rule; then, the target feature vector corresponding to the target communication content is generated according to the word-segment attribute of each target word segment; finally, the target feature vector is input into a pre-established sensitive content recognition model to obtain a recognition result indicating whether the target communication content is sensitive content.
As can be seen from the above, when sensitive content recognition is performed using the scheme provided by the embodiments of the present invention, recognition does not rely on simple literal matching between keywords in the communication content and sensitive words in a preset dictionary. Instead, the scheme uses word-segment attributes, which go deeper than the literal level of the word segments of the communication content, generates a feature vector from the word-segment attribute of each word segment, and performs recognition with a sensitive content recognition model trained by a machine learning algorithm, thereby improving the accuracy of sensitive content recognition.
Brief description of the drawings
In order to illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a sensitive content recognition method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another sensitive content recognition method provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a sensitive content recognition device provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of another sensitive content recognition device provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In order to improve the accuracy of sensitive content recognition, the embodiments of the present invention provide a sensitive content recognition method and device.
A sensitive content recognition method provided by an embodiment of the present invention is introduced first.
It should be noted that the execution subject of the sensitive content recognition method provided by the embodiment of the present invention may be a sensitive content recognition device. The sensitive content recognition device may be a plug-in in existing functional software or independent functional software; both are reasonable.
As shown in Fig. 1, a sensitive content recognition method provided by an embodiment of the present invention may include the following steps:
S101:Word segmentation processing is carried out to destinations traffic content to be identified, obtained at least one corresponding to destinations traffic content Individual target participle.
When needing to carry out sensitive identification to a certain Content of Communication, destinations traffic content to be identified can be carried out first Word segmentation processing, so as to obtain at least one target participle corresponding to the destinations traffic content, and carry out follow-up processing.
The target communication content to be identified can be obtained in various ways. In one implementation, it can be obtained as follows:
(1) Capture the data packets transmitted in the network.
Specifically, the gateway server in the network can be configured so that the data traffic flowing into the gateway server is directed to the network adapter of a preset experimental machine, and the network adapter of the experimental machine is set to promiscuous mode, thereby capturing the data packets transmitted in the network.
(2) Restore the data content of the data packets based on a preset application layer protocol.
Specifically, after the data packets transmitted in the network are captured, the data streams can be reassembled using the transport layer protocol, and the corresponding application layer protocol is then used to restore the data content of the data packets. In general, data streams take two forms: TCP (Transmission Control Protocol) data streams and UDP (User Datagram Protocol) data streams. It is therefore necessary to determine the corresponding application layer protocol according to the type of data stream to which a data packet belongs, and then to use that application layer protocol to restore the data content of the data packet. For example, the application layer protocols may include HTTP (Hypertext Transfer Protocol), FTP (File Transfer Protocol), MSN (Microsoft Service Network), SMTP (Simple Mail Transfer Protocol), POP3 (Post Office Protocol, Version 3), and so on. It should be noted that the embodiments of the present invention do not limit the specific application layer protocol.
In addition, it should be noted that a TCP data stream usually contains all the TCP data packets required for one TCP connection, and data packets may arrive out of order or be retransmitted during transmission. Therefore, TCP data packets with the same five-tuple (i.e., source IP address, source port, destination IP address, destination port, and transport layer protocol) need to be sorted according to their ACK (acknowledgment number) and SEQ (sequence number) values, so as to resolve the out-of-order arrival and retransmission of TCP data packets.
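The five-tuple grouping and sequence-number ordering described above can be sketched as follows. This is a minimal illustration only, not the implementation of the embodiment: the `Segment` type and the `reassemble` function are hypothetical names, only SEQ is used for ordering (ACK handling, sequence-number wraparound, and partially overlapping retransmissions are ignored), and the sample addresses are made up.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    # Hypothetical, simplified view of one captured TCP packet.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    seq: int        # sequence number of the first payload byte
    payload: bytes


def reassemble(segments):
    """Group packets by five-tuple, then rebuild each stream's payload."""
    flows = defaultdict(dict)
    for s in segments:
        key = (s.src_ip, s.src_port, s.dst_ip, s.dst_port, s.protocol)
        # A retransmission repeats a SEQ already seen; keep the first copy.
        flows[key].setdefault(s.seq, s.payload)
    # Sorting by SEQ restores the sending order of out-of-order packets.
    return {
        key: b"".join(parts[seq] for seq in sorted(parts))
        for key, parts in flows.items()
    }


segments = [
    Segment("10.0.0.2", 40000, "10.0.0.9", 80, "TCP", 5, b"/index"),
    Segment("10.0.0.2", 40000, "10.0.0.9", 80, "TCP", 1, b"GET "),
    Segment("10.0.0.2", 40000, "10.0.0.9", 80, "TCP", 5, b"/index"),  # retransmitted
]
streams = reassemble(segments)
```

The reassembled byte stream would then be handed to the parser of the matching application layer protocol for content restoration.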
(3) Determine the restored data content as the target communication content to be identified.
Specifically, after the captured data packets are restored according to the corresponding application layer protocol, the data content corresponding to the data packets can be obtained, for example, a browsed web page, a transmitted file, or the content of a transmitted message. The restored data content can then be determined as the target communication content to be identified, so that subsequent sensitive content identification can be performed.
It should be noted that the above way of obtaining the target communication content is merely exemplary and should not be construed as limiting the embodiments of the present invention. For example, the target communication content to be identified can also be crawled by a web crawler, or entered manually, and so on. It should be emphasized that the embodiments of the present invention do not limit the scenarios in which sensitive content identification is needed, and therefore do not limit the way in which the target communication content to be identified is obtained.
It is understood that carrying out word segmentation processing to destinations traffic content, obtain at least one corresponding to destinations traffic content The specific implementation of individual target participle exists a variety of.Wherein, in a kind of implementation, can be segmented by following steps Processing, so as to obtain at least one target participle corresponding to destinations traffic content:
(1) according to default first word division rule, by destinations traffic division of teaching contents into several words.
Specifically, the ICTCLAS Words partition systems that can be provided according to Inst. of Computing Techn. Academia Sinica (Institute of Computing Technology, Chinese Lexical Analysis System, Chinese lexical point Analysis system) word division is carried out to destinations traffic content, and then obtain several words marked off.ICTCLAS participles system The major function of system includes:Chinese word segmentation, part-of-speech tagging, name Entity recognition, new word identification while support user-oriented dictionary.Mesh It is preceding to be upgraded to ICTCLAS3.0 versions, the participle speed unit 996KB/s of the ICTCLAS3.0 versions, the precision of word segmentation 98.45%, API (Application Programming Interface, application programming interface) are no more than 200KB, respectively It is most handy Chinese lexical analysis device in the world at present less than 3M after the compression of kind of dictionary data.
It should be noted that ICTCLAS Words partition systems described herein are only a kind of tool of the first word division rule Body form, the present invention need not be simultaneously defined, any possible realization to the concrete form of the first word division rule Mode can apply to the present invention.
(2) word obtained dividing segments as target corresponding to destinations traffic content.
Specifically, after the target communication content is divided into several words according to the preset first word division rule, the words obtained by the division can be determined as the target word segments corresponding to the target communication content. For example, if the target communication content is "carry out sensitive content identification on the communication content in the network", the words obtained by the division are: "to", "network", "in", "of", "communication", "content", "carry out", "sensitive", "content", "identification". According to this implementation, the divided words "to", "network", "in", "of", "communication", "content", "carry out", "sensitive", "content", "identification" can be determined as the target word segments corresponding to the target communication content.
It should be noted that the words obtained by the division can be stored in a data table in the order in which they were divided. Of course, the embodiments of the present invention do not limit the form in which the divided words are stored, and any possible implementation can be applied to the present invention.
In another implementation, word segmentation can be performed through the following steps to obtain the at least one target word segment corresponding to the target communication content:
(1) Divide the target communication content into several words according to a preset second word division rule.
It should be noted that the second word division rule described here can be the same as the first word division rule mentioned above; for example, word division can likewise be performed on the target communication content using the ICTCLAS word segmentation system provided by the Institute of Computing Technology of the Chinese Academy of Sciences. It can of course also differ from the first word division rule. The embodiments of the present invention likewise do not limit the specific form of the second word division rule, and any possible implementation can be applied to the present invention.
(2) Remove the stop words from the words obtained by the division, where a stop word is a word whose part of speech is adverb, preposition, or pronoun.
In general, after word division is performed on the target communication content, the words obtained usually include some words that carry no practical meaning, for example, "especially", "very", "almost", "of", "I", "they", and so on. These words tend to interfere with the subsequent identification of sensitive content. Therefore, these words that affect the identification result, namely the stop words, need to be removed from the words obtained by the division.
It should be noted that words whose part of speech is adverb, preposition, or pronoun are only a few specific examples of stop words. Stop words can of course also include words of other parts of speech. The present invention does not limit the specific form of the stop words, and those skilled in the art may make reasonable settings according to the specific circumstances of the practical application.
(3) Take the remaining words as the target word segments corresponding to the target communication content.
Specifically, after the target communication content is divided into several words according to the preset second word division rule, the divided words are not taken directly as the target word segments. Instead, the words that remain after the stop words have been removed are determined as the target word segments corresponding to the target communication content, which helps to improve the accuracy of sensitive content identification.
Taking the target communication content "carry out sensitive content identification on the communication content in the network" as an example again, the words obtained by division according to the second word division rule are: "to", "network", "in", "of", "communication", "content", "carry out", "sensitive", "content", "identification". After the stop words are removed, the following words remain: "network", "communication", "content", "sensitive", "content", "identification". Finally, the words that remain after stop word removal are determined as the target word segments corresponding to the target communication content.
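The stop-word removal in the second implementation can be sketched as below. This is a hedged illustration: the part-of-speech tags are assumed to come from whatever segmenter is used (they are written by hand here, not produced by ICTCLAS), and `remove_stop_words`, `STOP_POS`, and `extra_stop_words` are hypothetical names. The extra stop-word set reflects the statement above that stop words may also include words of other parts of speech.

```python
# Parts of speech treated as stop words in the text above.
STOP_POS = {"adverb", "preposition", "pronoun"}


def remove_stop_words(tagged_words, stop_pos=STOP_POS, extra_stop_words=frozenset()):
    """Keep words whose part of speech and surface form are not stop-listed."""
    return [
        word
        for word, pos in tagged_words
        if pos not in stop_pos and word not in extra_stop_words
    ]


# Hand-tagged version of the example sentence (tags are assumptions).
tagged = [
    ("to", "preposition"), ("network", "noun"), ("in", "preposition"),
    ("of", "particle"), ("communication", "noun"), ("content", "noun"),
    ("carry out", "verb"), ("sensitive", "adjective"), ("content", "noun"),
    ("identification", "noun"),
]
target_segments = remove_stop_words(tagged, extra_stop_words={"of", "carry out"})
```

With these assumed tags, the remaining words match the six target word segments listed in the example.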
It should be noted that the above lists only two specific ways of performing word segmentation and obtaining at least one target word segment corresponding to the target communication content. The present invention does not limit how word segmentation is performed on the target communication content or how the corresponding target word segments are obtained. Any possible implementation can be applied to the present invention, and those skilled in the art may make reasonable settings according to the specific circumstances of the practical application.
S102: Generate the word segment attribute of each target word segment using a preset word segment attribute generation rule.
In one implementation, the word segment attribute of each target word segment can be generated through the following steps:
(1) Calculate the weight of each target word segment according to a preset weighting algorithm.
Specifically, the weighting can be performed according to the TF-IDF (term frequency-inverse document frequency) algorithm, a weighting technique commonly used in information retrieval and text mining, to obtain the weight of each target word segment. It should be noted that this is only one specific weighting algorithm. The present invention does not limit the specific form of the weighting algorithm; any possible implementation can be applied to the present invention, and those skilled in the art may make reasonable settings according to the specific circumstances of the practical application.
(2) Determine the calculated weight as the word segment attribute of the corresponding target word segment.
Specifically, after each target word segment is weighted according to the specific weighting algorithm, the weight corresponding to each target word segment is obtained, and the calculated weight can then be determined as the word segment attribute of that target word segment. It should be noted that this is only one specific form of generating the word segment attribute of each target word segment. The present invention does not limit the specific form of generating the word segment attributes, and those skilled in the art may make reasonable settings according to the specific circumstances of the practical application.
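As a rough illustration of the weighting step, the sketch below computes a TF-IDF weight for each word of one segmented document against a small segmented corpus. The text does not specify which TF-IDF variant is used, so the smoothed inverse document frequency chosen here, the function name `tf_idf`, and the toy corpus are all assumptions.

```python
import math
from collections import Counter


def tf_idf(target_words, corpus):
    """Return {word: weight} for one segmented document.

    corpus is a list of segmented reference documents (lists of words).
    """
    counts = Counter(target_words)
    total = len(target_words)
    weights = {}
    for word, count in counts.items():
        tf = count / total
        docs_with_word = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumed variant): always positive, never divides by zero.
        idf = math.log((1 + len(corpus)) / (1 + docs_with_word)) + 1
        weights[word] = tf * idf
    return weights


corpus = [
    ["network", "content"],
    ["sports", "match"],
    ["network", "sensitive"],
    ["travel", "webpage"],
]
document = ["network", "communication", "content", "sensitive", "content", "identification"]
weights = tf_idf(document, corpus)
```

Words that are frequent in the document but rare in the corpus receive the largest weights; each weight would then serve as the word segment attribute of its word.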
S103: Generate the target feature vector corresponding to the target communication content according to the word segment attribute of each target word segment.
In one implementation, the target feature vector can be generated as follows:
(1) Judge, for each keyword in a preset keyword list, whether the keyword is identical to any target word segment.
The keywords in the preset keyword list can be obtained as follows: first, word segmentation is performed on a large preset corpus; the chi-square value of each word segment is then calculated using the chi-square test; and a preset number of word segments with the largest chi-square values are selected as the keywords in the preset keyword list.
The chi-square test is a widely used hypothesis testing method. Its applications in statistical inference on categorical data include: the chi-square test for comparing two rates or two composition ratios; the chi-square test for comparing multiple rates or multiple composition ratios; and the correlation analysis of categorical data. The statistic of the chi-square test is the chi-square value, which is the sum, over every cell of the fourfold (2×2) table, of the squared difference between the actual frequency A and the theoretical frequency T divided by the theoretical frequency T. The larger the chi-square value, the more pronounced the difference between the actual and theoretical frequencies. It should be noted that the chi-square test is disclosed in the prior art, so its specific calculation process is not described in detail here. The chi-square test can effectively extract the word segments that clearly indicate whether a piece of corpus text is sensitive.
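The keyword screening described above can be sketched as follows, assuming a labelled corpus in which each document is a set of words plus a flag saying whether it is sensitive. The function names and the toy corpus are hypothetical; the statistic itself is the one stated in the text, the sum of (A - T)^2 / T over the four cells of the fourfold table.

```python
def chi_square(a, b, c, d):
    """Chi-square value of a fourfold table.

    a: sensitive documents containing the word
    b: non-sensitive documents containing the word
    c: sensitive documents without the word
    d: non-sensitive documents without the word
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n   # theoretical frequency T
            if expected:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def select_keywords(labelled_docs, top_n):
    """Return the top_n words with the largest chi-square values."""
    vocabulary = set().union(*(words for words, _ in labelled_docs))
    n_sensitive = sum(1 for _, sensitive in labelled_docs if sensitive)
    n_other = len(labelled_docs) - n_sensitive
    scores = {}
    for word in vocabulary:
        a = sum(1 for words, s in labelled_docs if s and word in words)
        b = sum(1 for words, s in labelled_docs if not s and word in words)
        scores[word] = chi_square(a, b, n_sensitive - a, n_other - b)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


docs = [
    ({"weapon", "sale"}, True),
    ({"weapon", "news"}, True),
    ({"travel", "news"}, False),
    ({"travel", "sale"}, False),
]
keywords = select_keywords(docs, top_n=2)
```

In this toy corpus "weapon" and "travel" each separate the two classes perfectly, while "sale" and "news" occur equally in both classes and score zero.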
(2) Determine the word segment attribute of the target word segment corresponding to a first-class keyword as the feature attribute of that first-class keyword, and determine a preset default value as the feature attribute of each second-class keyword. A first-class keyword is a keyword that is identical to a target word segment, and a second-class keyword is a keyword that is not identical to any target word segment.
It should be noted that the specific value of the preset default value is not limited here, and those skilled in the art may make reasonable settings according to the specific circumstances of the practical application.
(3) Take the feature attribute corresponding to each keyword as a vector element of the target feature vector, and generate the target feature vector corresponding to the target communication content according to the position of each keyword in the keyword list.
For example, suppose the keywords in the keyword list are, in order: "webpage", "compete", "find", "to", "tourism"; the target word segments corresponding to target communication content 1 are "webpage", "recruitment", "military", "athlete"; and the target word segments corresponding to target communication content 2 are "equipment", "compete", "find", "fight", "tourist". It can then be determined that, among the keywords in the keyword list, only "webpage" is identical to a target word segment of target communication content 1, and only "compete" and "find" are identical to target word segments of target communication content 2.
Assume that the word segment attributes of the target word segments corresponding to target communication content 1 are 707, 1146, 1797, and 455, respectively, and that the word segment attributes of the target word segments corresponding to target communication content 2 are 931, 2420, 1184, 1227, and 1041, respectively. If the default value is set to 0, then, in the above manner, the feature vector corresponding to target communication content 1 is determined to be {707, 0, 0, 0, 0}, and the feature vector corresponding to target communication content 2 is determined to be {0, 2420, 1184, 0, 0}.
It should be understood that different target communication content yields different target word segments and a different number of target word segments. Consequently, the keywords in the keyword list that are covered by the target word segments, and the number of such keywords, also differ from one piece of target communication content to another. However, regardless of whether the specific content is the same, the feature vectors finally obtained for all target communication content have the same dimension, which equals the number of keywords in the keyword list.
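The vector construction can be sketched with the two example contents, assuming the four attribute values of content 1 are 707, 1146, 1797, and 455. The function name `build_feature_vector` is hypothetical, and a dictionary is assumed as the container for the word segment attributes.

```python
def build_feature_vector(keywords, segment_attributes, default=0):
    """One element per keyword, in keyword-list order.

    A keyword that matches a target word segment takes that segment's
    attribute; every other keyword takes the default value.
    """
    return [segment_attributes.get(keyword, default) for keyword in keywords]


keyword_list = ["webpage", "compete", "find", "to", "tourism"]

content_1 = {"webpage": 707, "recruitment": 1146, "military": 1797, "athlete": 455}
content_2 = {"equipment": 931, "compete": 2420, "find": 1184, "fight": 1227, "tourist": 1041}

vector_1 = build_feature_vector(keyword_list, content_1)
vector_2 = build_feature_vector(keyword_list, content_2)
```

Both vectors have five elements because the keyword list has five entries, regardless of how many target word segments each content produced.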
It can be seen that the number of keywords in the preset keyword list directly determines the dimension of the feature vector corresponding to the target communication content, and that this dimension directly affects the result of identifying whether the target communication content is sensitive content. The number of keywords in the preset keyword list is therefore particularly important. If there are too few keywords, they are not enough to cover the target word segments corresponding to the target communication content, so the feature vector may consist entirely of zeros, which reduces the accuracy of sensitive content identification. Conversely, if there are too many keywords, the calculation takes longer, which reduces the identification speed. It should be noted that the embodiments of the present invention do not limit the number of keywords in the preset keyword list, and those skilled in the art may make reasonable settings according to the specific circumstances of the practical application.
S104: Input the target feature vector into a pre-established sensitive content identification model to obtain an identification result indicating whether the target communication content is sensitive content.
Specifically, after the feature vector corresponding to the target communication content is generated, it can be used as the input of the sensitive content identification model. After the feature vector is input into the model, the sensitive content identification model performs a calculation on the feature vector according to its own model parameters and produces an identification result indicating that the target communication content is, or is not, sensitive content.
The sensitive content identification model is a classification model obtained by training, with a preset machine learning algorithm, on the feature vectors corresponding to a plurality of preset communication content samples that carry classification labels. The classification labels include a first label for identifying sensitive content or a second label for identifying non-sensitive content.
Specifically, the sensitive content identification model can include a classification model established based on a support vector machine algorithm, a Bayesian classification algorithm, or a neural network classification algorithm.
It should be noted that establishing the sensitive content identification model requires a large number of training samples, and each training sample carries a classification label. The feature vector corresponding to each training sample is input into an initial identification model, the identification result output by the model is compared with the actual classification label of that training sample, and the model parameters of the initial identification model are corrected according to the comparison result. When the classification results output by the initial identification model satisfy a preset model training termination condition (for example, the classification margin is greater than a first preset value, the error is less than a second preset value, or the number of training iterations is greater than a third preset value), the establishment of the sensitive content identification model is complete, and the identification results of the model are then as close as possible to the true situation.
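The compare-and-correct training loop with its termination conditions can be illustrated with a deliberately simple stand-in. The sketch below uses a perceptron, which is not one of the algorithms named in the text (support vector machine, Bayesian classification, neural network); it is chosen only because it makes the loop visible in a few lines. The names `train` and `predict`, the thresholds, and the sample vectors are assumptions.

```python
def predict(weights, bias, vector):
    """Return 1 (sensitive) or 0 (non-sensitive) for one feature vector."""
    score = sum(w * v for w, v in zip(weights, vector)) + bias
    return 1 if score > 0 else 0


def train(samples, max_iterations=100, max_error=0.0, learning_rate=0.1):
    """samples: list of (feature_vector, label), label 1 = sensitive, 0 = not."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for iteration in range(1, max_iterations + 1):
        errors = 0
        for vector, label in samples:
            # Compare the model output with the actual classification label.
            delta = label - predict(weights, bias, vector)
            if delta:
                errors += 1
                # Correct the current model parameters from the comparison.
                weights = [w + learning_rate * delta * v for w, v in zip(weights, vector)]
                bias += learning_rate * delta
        # Termination: error small enough, or iteration budget used up.
        if errors / len(samples) <= max_error:
            break
    return weights, bias, iteration


samples = [
    ([1.0, 0.0], 1),
    ([0.9, 0.1], 1),
    ([0.0, 1.0], 0),
    ([0.1, 0.8], 0),
]
weights, bias, iterations_used = train(samples)
```

A real system would substitute the chosen classifier for the perceptron while keeping the same overall pattern of predicting, comparing with the label, correcting, and checking the termination condition.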
As can be seen from the above, when sensitive content identification is performed using the scheme provided by the embodiments of the present invention, identification does not rely on simple literal matching between keywords in the communication content and sensitive words in a preset dictionary. Instead, the scheme uses the word segment attributes of the word segments corresponding to the communication content, which go deeper than the literal level, generates a feature vector according to the word segment attribute of each word segment, and performs identification with a sensitive content identification model trained using a machine learning algorithm, thereby improving the accuracy of sensitive content identification.
Fig. 2 shows another sensitive content identification method provided by an embodiment of the present invention. On the basis of the method embodiment shown in Fig. 1, the following step can also be included before step S104:
S105: Determine the information type of the target communication content according to the protocol type used when the target communication content is transmitted.
The information type can include active information, which is sent to the outside, and passive information, which is sent from the outside.
For example, active information can include sent mail, posts published on a forum, and so on, and passive information can include browsed web pages, received mail, and so on. These are only examples of the specific forms of active information and passive information, and the present invention does not limit those specific forms.
In one implementation, the information type of the target communication content can be determined as follows:
(1) When the target communication content is a web page: if the protocol type used to transmit the web page is the POST type, the target communication content is determined to be active information; if the protocol type used to transmit the web page is the GET type, the target communication content is determined to be passive information.
(2) When the target communication content to be identified is mail: if the mail is transmitted using a mail sending protocol, the target communication content is determined to be active information; if the mail is transmitted using a mail receiving protocol, the target communication content is determined to be passive information.
It should be noted that the above way of determining the information type of the target communication content is merely exemplary and should not be construed as limiting the embodiments of the present invention.
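The two rules above amount to a small lookup, sketched below. The text names only "mail sending protocol" and "mail receiving protocol"; treating SMTP as the sending protocol and POP3 as the receiving protocol is an assumption based on the application layer protocols listed earlier, and the function and constant names are hypothetical.

```python
ACTIVE = "active"     # information sent to the outside
PASSIVE = "passive"   # information sent from the outside

# Assumed examples of mail protocols; the text does not fix a list.
MAIL_SENDING_PROTOCOLS = {"SMTP"}
MAIL_RECEIVING_PROTOCOLS = {"POP3"}


def information_type(content_kind, protocol_type):
    """Return ACTIVE, PASSIVE, or None when no rule applies."""
    if content_kind == "webpage":
        if protocol_type == "POST":
            return ACTIVE
        if protocol_type == "GET":
            return PASSIVE
    elif content_kind == "mail":
        if protocol_type in MAIL_SENDING_PROTOCOLS:
            return ACTIVE
        if protocol_type in MAIL_RECEIVING_PROTOCOLS:
            return PASSIVE
    return None


forum_post = information_type("webpage", "POST")
browsed_page = information_type("webpage", "GET")
received_mail = information_type("mail", "POP3")
```

Returning `None` for unmatched combinations leaves the caller free to decide how to handle content that neither rule covers.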
It should be noted that active information is generally information sent to the outside, and passive information is generally information sent from the outside. For active information, those skilled in the art are comparatively more concerned with the accuracy of sensitive content identification. Because the volume of passive information sent from the outside is usually very large, for passive information those skilled in the art are more concerned with the speed of sensitive content identification.
Accordingly, step S104 can include two sub-steps, S1041 and S1042, so that different sensitive content identification models are selected to identify sensitive content for target communication content of different information types. The two sub-steps are as follows:
S1041: When the target communication content is active information, input the target feature vector into a pre-established first sensitive content identification model to obtain an identification result indicating whether the target communication content is sensitive content.
The first sensitive content identification model is a classification model obtained by training, with a support vector machine algorithm, on the feature vectors corresponding to a plurality of preset communication content samples that carry classification labels.
S1042: When the target communication content is passive information, input the target feature vector into a pre-established second sensitive content identification model to obtain an identification result indicating whether the target communication content is sensitive content.
The second sensitive content identification model is a classification model obtained by training, with a Bayesian classification algorithm, on the feature vectors corresponding to a plurality of preset communication content samples that carry classification labels.
It should be noted that the support vector machine algorithm classifies more accurately and can therefore ensure the accuracy of sensitive content identification for active information, whereas the Bayesian classification algorithm classifies faster and can therefore ensure the speed of sensitive content identification for passive information.
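The selection between the two models in S1041 and S1042 can be sketched as a simple dispatch. The two models are represented by hypothetical callables with made-up decision rules; they stand in for the trained support vector machine model and Bayesian model and are not implementations of either.

```python
def identify_by_type(target_vector, info_type, first_model, second_model):
    """Route the target feature vector to the model for its information type."""
    if info_type == "active":      # S1041: accuracy matters most
        return first_model(target_vector)
    if info_type == "passive":     # S1042: speed matters most
        return second_model(target_vector)
    raise ValueError(f"unknown information type: {info_type!r}")


# Hypothetical stand-ins for the two trained models.
def first_model(vector):
    return ("first model", sum(vector) > 1000)


def second_model(vector):
    return ("second model", any(vector))


active_result = identify_by_type([707, 0, 0, 0, 0], "active", first_model, second_model)
passive_result = identify_by_type([0, 2420, 1184, 0, 0], "passive", first_model, second_model)
```

Keeping the dispatch separate from the models means either model can be retrained or replaced without touching the routing logic.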
In one implementation, the first sensitive content identification model can be established as follows:
A1: Obtain a plurality of first-class communication content samples that carry classification labels, where a first-class communication content sample is a sample whose information type is active information.
It should be noted that, before the first sensitive content identification model is established, a large number of training samples (i.e., first-class communication content samples) need to be obtained for training the model. Each training sample carries a classification label, so that when the feature vector corresponding to each training sample is subsequently input into a first initial identification model, the identification result output by the model can be compared with the actual classification label of that training sample, and the current model parameters of the first initial identification model can be corrected according to the comparison result, so that the identification results of the first initial identification model gradually approach the true situation.
A2: Perform word segmentation on each first-class communication content sample to obtain at least one first-class word segment corresponding to each first-class communication content sample.
A3: Generate, using the word segment attribute generation rule, the word segment attributes of the first-class word segments corresponding to each first-class communication content sample.
A4: Generate the first-class feature vector corresponding to each first-class communication content sample according to the word segment attributes of the first-class word segments corresponding to that sample.
It should be noted that, for the details of steps A2 to A4, reference can be made to the related steps S101 to S103 of the method embodiment shown in Fig. 1, which are not repeated here.
A5: Train the first initial identification model based on the first-class feature vector and classification label corresponding to each first-class communication content sample to obtain the first sensitive content identification model, where the first initial identification model is a model based on the support vector machine algorithm.
Specifically, the following steps can be performed for each first-class feature vector in turn: the first-class feature vector is input into the first initial identification model built in advance; using its current model parameters, the model calculates a first confidence that the first-class feature vector is sensitive content and a second confidence that it is non-sensitive content, and, according to the first confidence and the second confidence, outputs a classification result indicating whether the first-class communication content sample is sensitive content; the current model parameters of the first initial identification model are then corrected according to the result of comparing the output classification result with the classification label of the first-class communication content sample. When the classification results output by the first initial identification model satisfy a preset first model training termination condition, the establishment of the first sensitive content identification model is complete.
In one implementation, the second sensitive content identification model can be established as follows:
B1: Obtain a plurality of second-class communication content samples that carry classification labels, where a second-class communication content sample is a sample whose information type is passive information.
It should be noted that, for the details of step B1, reference can be made to the related step A1. The difference is that the first-class communication content samples selected for the first sensitive content identification model are samples whose information type is active information, whereas the second-class communication content samples for the second sensitive content identification model are samples whose information type is passive information. It should be pointed out that the second-class communication content samples may be obtained in the same way as the first-class communication content samples or in a different way, and the embodiments of the present invention do not limit this.
B2: Perform word segmentation on each second-class communication content sample to obtain at least one second-class word segment corresponding to each second-class communication content sample.
B3: Generate, using the word segment attribute generation rule, the word segment attributes of the second-class word segments corresponding to each second-class communication content sample.
B4: Generate the second-class feature vector corresponding to each second-class communication content sample according to the word segment attributes of the second-class word segments corresponding to that sample.
It should be noted that, for the details of steps B2 to B4, reference can be made to the related steps S101 to S103 of the method embodiment shown in Fig. 1, which are not repeated here.
B5: Train the second initial identification model based on the second-class feature vector and classification label corresponding to each second-class communication content sample to obtain the second sensitive content identification model, where the second initial identification model is a model based on the Bayesian classification algorithm.
Specifically, the following steps can be performed for each second-class feature vector in turn: the second-class feature vector is input into the second initial identification model built in advance, and the model uses the input second-class feature vectors to optimize the classification hyperplane that divides the second-class communication content samples into sensitive content and non-sensitive content. When the classification margin between the two categories divided by the classification hyperplane (i.e., the category of sensitive content and the category of non-sensitive content) is maximal, the establishment of the second sensitive content identification model is complete.
It should be emphasized that the specific steps of establishing a classification model based on the support vector machine algorithm and based on the Bayesian classification algorithm have been disclosed in the prior art. Reference can be made to the related steps in the prior art, which are not described in detail here.
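For readers unfamiliar with Bayesian classification, a minimal sketch is given below. It is one textbook variant (a Bernoulli naive Bayes classifier with Laplace smoothing), not the procedure of this embodiment: each element of a feature vector is reduced to "keyword present" or "keyword absent", and the function names and sample vectors are assumptions.

```python
import math


def train_naive_bayes(samples):
    """samples: list of (feature_vector, label). Returns {label: (prior, present)}."""
    dimension = len(samples[0][0])
    model = {}
    for label in {lab for _, lab in samples}:
        rows = [vector for vector, lab in samples if lab == label]
        prior = len(rows) / len(samples)
        # Laplace-smoothed probability that each keyword is present in this class.
        present = [
            (sum(1 for row in rows if row[i] != 0) + 1) / (len(rows) + 2)
            for i in range(dimension)
        ]
        model[label] = (prior, present)
    return model


def classify(model, vector):
    """Return the label with the highest log posterior for the vector."""
    def log_posterior(label):
        prior, present = model[label]
        return math.log(prior) + sum(
            math.log(p if value != 0 else 1 - p)
            for p, value in zip(present, vector)
        )
    return max(model, key=log_posterior)


# Label 1 = sensitive content, label 0 = non-sensitive content.
training_samples = [
    ([707, 0, 0], 1),
    ([300, 120, 0], 1),
    ([0, 0, 50], 0),
    ([0, 90, 40], 0),
]
bayes_model = train_naive_bayes(training_samples)
label_a = classify(bayes_model, [500, 0, 0])
label_b = classify(bayes_model, [0, 0, 10])
```

Training is a single counting pass and classification is a sum of logarithms per class, which is consistent with the speed advantage attributed to Bayesian classification for high-volume passive information.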
In addition, the terms "first-class" and "second-class" used in the embodiments of the present invention do not have any limiting meaning.
As can be seen from the above, sensitive content identification using the scheme provided by this embodiment has all the advantages of the method embodiment shown in Fig. 1 and also takes full account of the information type of the target communication content, selecting different sensitive content identification models for target communication content of different information types. The first sensitive content identification model, based on the support vector machine algorithm, identifies sensitive content in target communication content whose information type is active information, and the second sensitive content identification model, based on the Bayesian classification algorithm, identifies sensitive content in target communication content whose information type is passive information. Active information is generally information sent to the outside, and passive information is generally information sent from the outside. For active information, those skilled in the art are comparatively more concerned with the accuracy of sensitive content identification; because the volume of passive information sent from the outside is usually very large, for passive information they are more concerned with the identification speed. The support vector machine algorithm classifies more accurately and can therefore ensure the accuracy of sensitive content identification for active information, whereas the Bayesian classification algorithm classifies faster and can therefore ensure the speed of sensitive content identification for passive information. The scheme thus meets the needs of sensitive content identification more reasonably.
Corresponding to the above method embodiment, an embodiment of the present invention also provides a sensitive content identification device.
The sensitive content identification device provided by the embodiment of the present invention is described below.
Fig. 3 shows a sensitive content identification device provided by an embodiment of the present invention. The device may include:
A word segmentation processing module 210, configured to perform word segmentation on target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content.
A word segment attribute generation module 220, configured to generate a word segment attribute for each target word segment using a preset word segment attribute generation rule.
A feature vector generation module 230, configured to generate, according to the word segment attribute of each target word segment, the target feature vector corresponding to the target communication content.
A sensitive content identification module 240, configured to input the target feature vector into a pre-established sensitive content identification model, to obtain a recognition result indicating whether the target communication content is sensitive content.
Here, the sensitive content identification model is a classification model obtained by using a preset machine learning algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels. The classification label is either a first label identifying sensitive content or a second label identifying non-sensitive content.
As can be seen from the above, when sensitive content is identified using the scheme provided by this embodiment of the present invention, identification does not rely on simple literal matching between keywords in the communication content and sensitive words in a preset dictionary. Instead, the scheme uses word segment attributes, which go deeper than the literal level of the word segments of the communication content, generates a feature vector from the word segment attribute of each word segment, and performs identification with a sensitive content identification model trained by a machine learning algorithm, thereby improving the accuracy of sensitive content identification.
Specifically, the device may also include:
A communication content obtaining module, specifically configured to:
(1) Capture packets transmitted in the network.
(2) Restore the data content carried in the packets, based on a preset application layer protocol.
(3) Determine the restored data content to be the target communication content to be identified.
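Steps (1) to (3) can be illustrated with a minimal sketch. It assumes the packets have already been captured and reduced to (sequence number, payload) pairs, and that the preset application layer protocol is HTTP/1.x. The `Segment` type and both function names are hypothetical, and the sketch ignores retransmissions, gaps and content encodings that a real capture pipeline would have to handle.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int        # sequence number of the first payload byte (hypothetical field)
    payload: bytes  # application-layer bytes carried by this packet


def reassemble(segments):
    """Order captured segments by sequence number and join their payloads."""
    ordered = sorted(segments, key=lambda s: s.seq)
    return b"".join(s.payload for s in ordered)


def restore_http_body(stream: bytes):
    """Split a reassembled HTTP/1.x request into (method, body text)."""
    head, _, body = stream.partition(b"\r\n\r\n")
    request_line = head.split(b"\r\n", 1)[0].decode("ascii", "replace")
    method = request_line.split(" ", 1)[0]
    return method, body.decode("utf-8", "replace")


# Two packets captured out of order.
segments = [
    Segment(seq=39, payload=b"\r\n\r\nmsg=quarterly report"),
    Segment(seq=0, payload=b"POST /send HTTP/1.1\r\nHost: example.test"),
]
method, content = restore_http_body(reassemble(segments))
# `content` is what would be handed on as the target communication content.
```

The request method is returned alongside the body because it can later serve as the protocol type used during transmission.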
In one implementation, the word segmentation processing module 210 may include a first division submodule and a first word segment determination submodule.
The first division submodule is configured to divide the target communication content into several words according to a preset first word division rule.
The first word segment determination submodule is configured to take the words obtained by the division as the target word segments corresponding to the target communication content.
In another implementation, the word segmentation processing module 210 may include a second division submodule, a word segment removal submodule and a second word segment determination submodule.
The second division submodule is configured to divide the target communication content into several words according to a preset second word division rule.
The word segment removal submodule is configured to remove the stop words from the several words obtained by the division, where a stop word is a word whose part of speech is adverb, preposition or pronoun.
The second word segment determination submodule is configured to take the remaining words as the target word segments corresponding to the target communication content.
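The two implementations above differ only in whether stop words are removed after division. The following sketch makes that concrete under simplifying assumptions: the word division rule is a plain regular-expression split suitable for English text (segmenting Chinese text would need a dedicated segmenter), and the part of speech of each word comes from a small hypothetical lookup table `POS` rather than a real tagger.

```python
import re

# Hypothetical part-of-speech lookup; a real system would use a tagger.
POS = {
    "very": "adverb", "quickly": "adverb",
    "to": "preposition", "in": "preposition",
    "it": "pronoun", "he": "pronoun",
}
STOP_POS = {"adverb", "preposition", "pronoun"}


def split_words(text):
    """Assumed word division rule: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def target_word_segments(text, remove_stop_words=True):
    """Return the target word segments, optionally dropping stop words."""
    words = split_words(text)
    if not remove_stop_words:
        return words  # first implementation: keep every word
    # second implementation: drop adverbs, prepositions and pronouns
    return [w for w in words if POS.get(w) not in STOP_POS]


tokens = target_word_segments("He sent it to the bank very quickly")
# tokens == ["sent", "the", "bank"]
```

Words missing from the lookup table are kept, so an incomplete table errs on the side of retaining content.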
The word segment attribute generation module 220 may include a weight calculation submodule and a word segment attribute determination submodule.
The weight calculation submodule is configured to calculate the weight of each target word segment according to a preset weighting algorithm.
The word segment attribute determination submodule is configured to determine the calculated weight of each target word segment to be the word segment attribute of that target word segment.
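The text above leaves the weighting algorithm open. As one possible illustration, the sketch below assumes relative term frequency as the weight; this choice and the function name `word_segment_weights` are assumptions, not something fixed by the passage.

```python
from collections import Counter


def word_segment_weights(segments):
    """Map each distinct word segment to its relative frequency."""
    total = len(segments)
    if total == 0:
        return {}
    counts = Counter(segments)
    return {segment: n / total for segment, n in counts.items()}


weights = word_segment_weights(["bank", "transfer", "bank", "account"])
# weights == {"bank": 0.5, "transfer": 0.25, "account": 0.25}
```

Any other scheme that assigns one number per word segment, such as a corpus-based weighting, could be substituted without changing the later steps.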
The feature vector generation module 230 may include a word segment judging submodule, a feature attribute determination submodule and a feature vector generation submodule.
The word segment judging submodule is configured to judge, for each keyword in a preset keyword list, whether the keyword is identical to any target word segment.
The feature attribute determination submodule is configured to determine the word segment attribute of the target word segment corresponding to a first-type keyword to be the feature attribute of that first-type keyword, and to determine a preset default value to be the feature attribute of a second-type keyword.
Here, a first-type keyword is a keyword that is identical to a target word segment, and a second-type keyword is a keyword that is not identical to any target word segment.
The feature vector generation submodule is configured to take the feature attribute corresponding to each keyword as a vector element of the target feature vector and, according to the position of each keyword in the keyword list, generate the target feature vector corresponding to the target communication content.
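The mapping from word segment attributes to a fixed-length vector can be sketched in a few lines. The keyword list, the default value of 0.0 and the function name are illustrative assumptions.

```python
def target_feature_vector(keywords, segment_weights, default=0.0):
    """Build one vector element per keyword, in keyword-list order.

    A keyword that matches a target word segment (first type) takes that
    segment's attribute; any other keyword (second type) takes the default.
    """
    return [segment_weights.get(keyword, default) for keyword in keywords]


keywords = ["account", "password", "bank"]
segment_weights = {"bank": 0.5, "transfer": 0.25, "account": 0.25}
vector = target_feature_vector(keywords, segment_weights)
# vector == [0.25, 0.0, 0.5]
```

Because the element order follows the keyword list, every communication content yields a vector of the same length and layout, which is what allows a single classification model to be trained on all samples. Word segments that are not in the keyword list, such as "transfer" here, do not contribute.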
In one implementation, the above sensitive content identification model may include:
A classification model established based on a support vector machine algorithm, a Bayesian classification algorithm or a neural network classification algorithm.
In one implementation, as shown in Fig. 4, the device may also include an information type determination module 250, configured to:
Before the target feature vector is input into the pre-established sensitive content identification model, determine the information type of the target communication content according to the protocol type used when the target communication content was transmitted.
Here, the information type is either active information sent to the outside or passive information sent from the outside.
The information type determination module 250 may include a first information type determination submodule and a second information type determination submodule.
The first information type determination submodule is configured to, when the target communication content is a web page: determine that the target communication content is active information if the protocol type used to transmit the web page is the POST type, and determine that the target communication content is passive information if the protocol type used to transmit the web page is the GET type.
The second information type determination submodule is configured to, when the target communication content is mail: determine that the target communication content is active information when the mail is transmitted according to a mail sending protocol, and determine that the target communication content is passive information when the mail is transmitted according to a mail receiving protocol.
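The decision rules of the two submodules can be written as a small lookup. In the sketch below the content kinds "webpage" and "mail" are illustrative string values, and SMTP, POP3 and IMAP are used as assumed examples of a mail sending protocol and mail receiving protocols; the passage itself does not name specific mail protocols.

```python
from enum import Enum


class InfoType(Enum):
    ACTIVE = "active"    # information sent to the outside
    PASSIVE = "passive"  # information sent from the outside


MAIL_SENDING = {"SMTP"}            # assumed example of a sending protocol
MAIL_RECEIVING = {"POP3", "IMAP"}  # assumed examples of receiving protocols


def determine_info_type(content_kind, protocol_type):
    """Classify content as active or passive from its transport protocol."""
    protocol = protocol_type.upper()
    if content_kind == "webpage":
        if protocol == "POST":
            return InfoType.ACTIVE
        if protocol == "GET":
            return InfoType.PASSIVE
    elif content_kind == "mail":
        if protocol in MAIL_SENDING:
            return InfoType.ACTIVE
        if protocol in MAIL_RECEIVING:
            return InfoType.PASSIVE
    raise ValueError(f"unsupported combination: {content_kind!r}, {protocol_type!r}")
```

For example, `determine_info_type("webpage", "POST")` yields `InfoType.ACTIVE`, while `determine_info_type("mail", "pop3")` yields `InfoType.PASSIVE`. Combinations not covered by the rules raise an error instead of being guessed.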
Accordingly, the sensitive content identification module 240 may include a first identification submodule 241 and a second identification submodule 242.
The first identification submodule 241 is configured to, when the target communication content is active information, input the target feature vector into a pre-established first sensitive content identification model, to obtain a recognition result indicating whether the target communication content is sensitive content.
Here, the first sensitive content identification model is a classification model obtained by using the support vector machine algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels.
The second identification submodule 242 is configured to, when the target communication content is passive information, input the target feature vector into a pre-established second sensitive content identification model, to obtain a recognition result indicating whether the target communication content is sensitive content.
Here, the second sensitive content identification model is a classification model obtained by using the Bayesian classification algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels.
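The division of work between the two identification submodules amounts to a dispatch on the information type. The sketch below assumes each trained model is available as a callable that takes a feature vector and returns True for sensitive content; the stand-in lambdas are placeholders for illustration only, not trained models.

```python
def identify_sensitive(vector, info_type, first_model, second_model):
    """Route the feature vector to the model that matches its information type."""
    if info_type == "active":
        return first_model(vector)   # model intended to favour accuracy
    if info_type == "passive":
        return second_model(vector)  # model intended to favour speed
    raise ValueError(f"unknown information type: {info_type!r}")


# Placeholder models, for illustration only.
first_model = lambda v: sum(v) > 0.5
second_model = lambda v: max(v) > 0.9

active_result = identify_sensitive([0.25, 0.0, 0.5], "active", first_model, second_model)
passive_result = identify_sensitive([0.25, 0.0, 0.5], "passive", first_model, second_model)
```

The same feature vector can receive different results from the two models, which is why the information type has to be determined before identification.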
In one implementation, the first sensitive content identification model may be established according to the following steps:
Obtain a plurality of first-type communication content samples carrying classification labels, where a first-type communication content sample is a sample whose information type is active information;
Perform word segmentation on each first-type communication content sample, to obtain at least one first-type word segment corresponding to each first-type communication content sample;
Generate, using the word segment attribute generation rule, the word segment attribute of each first-type word segment corresponding to each first-type communication content sample;
Generate, according to the word segment attributes of the first-type word segments corresponding to each first-type communication content sample, the first-type feature vector corresponding to each first-type communication content sample;
Train a first initial identification model based on the first-type feature vector and the classification label corresponding to each first-type communication content sample, to obtain the first sensitive content identification model, where the first initial identification model is a model based on the support vector machine algorithm.
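The final training step can be illustrated with a deliberately simplified stand-in: a linear support vector machine trained by sub-gradient descent on the regularised hinge loss, written in plain Python. The kernel, solver and hyperparameters are not specified in the text, so everything below (function names, learning rate, regularisation strength, toy data) is an assumption. Labels are encoded as +1 for the first label (sensitive) and -1 for the second label (non-sensitive).

```python
def train_linear_svm(vectors, labels, epochs=50, lr=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularised hinge loss."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Margin violated: take a step that increases this sample's margin.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Margin satisfied: only the regularisation term contributes.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(model, x):
    """Return +1 (sensitive) or -1 (non-sensitive) for one feature vector."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy first-type feature vectors (two keywords) with classification labels.
samples = [[1.0, 0.0], [0.8, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, -1, -1]
model = train_linear_svm(samples, labels)
```

In practice a library implementation of the support vector machine would normally replace this hand-written loop; the sketch only shows how labelled feature vectors turn into a decision function.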
In one implementation, the second sensitive content identification model may be established according to the following steps:
Obtain a plurality of second-type communication content samples carrying classification labels, where a second-type communication content sample is a sample whose information type is passive information;
Perform word segmentation on each second-type communication content sample, to obtain at least one second-type word segment corresponding to each second-type communication content sample;
Generate, using the word segment attribute generation rule, the word segment attribute of each second-type word segment corresponding to each second-type communication content sample;
Generate, according to the word segment attributes of the second-type word segments corresponding to each second-type communication content sample, the second-type feature vector corresponding to each second-type communication content sample;
Train a second initial identification model based on the second-type feature vector and the classification label corresponding to each second-type communication content sample, to obtain the second sensitive content identification model, where the second initial identification model is a model based on the Bayesian classification algorithm.
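As an illustration of the last step, the sketch below trains a Bernoulli naive Bayes classifier, one common Bayesian classification algorithm. It assumes that each vector element is reduced to "keyword present" (value greater than zero) or "keyword absent", and it uses Laplace smoothing; the text does not fix either choice, and the function names and toy data are hypothetical.

```python
import math


def train_naive_bayes(vectors, labels):
    """Estimate a log prior and per-keyword presence probability for each class."""
    model = {}
    dims = len(vectors[0])
    for label in set(labels):
        rows = [v for v, l in zip(vectors, labels) if l == label]
        log_prior = math.log(len(rows) / len(vectors))
        # Laplace-smoothed probability that keyword j is present in this class.
        present = [
            (sum(1 for r in rows if r[j] > 0) + 1) / (len(rows) + 2)
            for j in range(dims)
        ]
        model[label] = (log_prior, present)
    return model


def classify(model, vector):
    """Return the label with the highest posterior log-probability."""
    def score(label):
        log_prior, present = model[label]
        return log_prior + sum(
            math.log(p if x > 0 else 1.0 - p) for p, x in zip(present, vector)
        )
    return max(model, key=score)


# Toy second-type feature vectors (three keywords) with classification labels.
vectors = [[0.5, 0.0, 0.25], [0.3, 0.0, 0.0], [0.0, 0.4, 0.0], [0.0, 0.2, 0.1]]
labels = ["sensitive", "sensitive", "non-sensitive", "non-sensitive"]
nb_model = train_naive_bayes(vectors, labels)
```

Training is a single counting pass and classification is one sum per class, which is consistent with the speed argument made for the Bayesian model in the surrounding text.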
As can be seen from the above, when sensitive content is identified using the scheme provided by this embodiment of the present invention, the scheme has all the advantages of the device embodiment shown in Fig. 3 and, in addition, takes full account of the information type of the target communication content, using different sensitive content identification models for target communication content of different information types. For example, the first sensitive content identification model, based on the support vector machine algorithm, identifies sensitive content in target communication content whose information type is active information, and the second sensitive content identification model, based on the Bayesian classification algorithm, identifies sensitive content in target communication content whose information type is passive information. It should be noted that active information is usually information sent to the outside, and passive information is usually information sent from the outside. For active information, those skilled in the art are comparatively more concerned with the accuracy of sensitive content identification; because the volume of passive information sent from the outside is usually very large, for passive information they are more concerned with the speed of identification. The support vector machine algorithm classifies more accurately and can therefore ensure the accuracy of sensitive content identification for active information, while the Bayesian classification algorithm classifies faster and can therefore ensure the identification speed for passive information, so that the needs of sensitive content identification are met more reasonably.
Because the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for the relevant details, refer to the corresponding parts of the description of the method embodiment.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any other variants are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
A person of ordinary skill in the art will understand that all or part of the steps in the above method embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc.
The above are merely preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims (22)

1. A sensitive content identification method, characterized in that the method comprises:
performing word segmentation on target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content;
generating a word segment attribute for each target word segment using a preset word segment attribute generation rule;
generating, according to the word segment attribute of each target word segment, the target feature vector corresponding to the target communication content;
inputting the target feature vector into a pre-established sensitive content identification model, to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the sensitive content identification model is a classification model obtained by using a preset machine learning algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels, and the classification label is either a first label identifying sensitive content or a second label identifying non-sensitive content.
2. The method according to claim 1, characterized in that the step of obtaining the target communication content to be identified comprises:
capturing packets transmitted in the network;
restoring the data content carried in the packets, based on a preset application layer protocol;
determining the restored data content to be the target communication content to be identified.
3. The method according to claim 1, characterized in that the step of performing word segmentation on the target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content, comprises:
dividing the target communication content into several words according to a preset first word division rule;
taking the words obtained by the division as the target word segments corresponding to the target communication content.
4. The method according to claim 1, characterized in that the step of performing word segmentation on the target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content, comprises:
dividing the target communication content into several words according to a preset second word division rule; removing the stop words from the several words obtained by the division, wherein a stop word is a word whose part of speech is adverb, preposition or pronoun;
taking the remaining words as the target word segments corresponding to the target communication content.
5. The method according to claim 1, characterized in that the step of generating a word segment attribute for each target word segment using a preset word segment attribute generation rule comprises:
calculating the weight of each target word segment according to a preset weighting algorithm;
determining the calculated weight of each target word segment to be the word segment attribute of that target word segment.
6. The method according to claim 1, characterized in that the step of generating, according to the word segment attribute of each target word segment, the target feature vector corresponding to the target communication content comprises:
judging, for each keyword in a preset keyword list, whether the keyword is identical to any target word segment;
determining the word segment attribute of the target word segment corresponding to a first-type keyword to be the feature attribute of that first-type keyword, and determining a preset default value to be the feature attribute of a second-type keyword, wherein a first-type keyword is a keyword that is identical to a target word segment, and a second-type keyword is a keyword that is not identical to any target word segment;
taking the feature attribute corresponding to each keyword as a vector element of the target feature vector and, according to the position of each keyword in the keyword list, generating the target feature vector corresponding to the target communication content.
7. The method according to claim 1, characterized in that the sensitive content identification model comprises:
a classification model established based on a support vector machine algorithm, a Bayesian classification algorithm or a neural network classification algorithm.
8. The method according to claim 1, characterized in that, before the step of inputting the target feature vector into the pre-established sensitive content identification model, the method further comprises:
determining the information type of the target communication content according to the protocol type used when the target communication content was transmitted, wherein the information type is either active information sent to the outside or passive information sent from the outside;
and the step of inputting the target feature vector into the pre-established sensitive content identification model, to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the sensitive content identification model is a classification model obtained by using a preset machine learning algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels, comprises:
when the target communication content is active information, inputting the target feature vector into a pre-established first sensitive content identification model, to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the first sensitive content identification model is a classification model obtained by using the support vector machine algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels;
when the target communication content is passive information, inputting the target feature vector into a pre-established second sensitive content identification model, to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the second sensitive content identification model is a classification model obtained by using the Bayesian classification algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels.
9. The method according to claim 8, characterized in that the step of establishing the first sensitive content identification model comprises:
obtaining a plurality of first-type communication content samples carrying classification labels, wherein a first-type communication content sample is a sample whose information type is active information;
performing word segmentation on each first-type communication content sample, to obtain at least one first-type word segment corresponding to each first-type communication content sample;
generating, using the word segment attribute generation rule, the word segment attribute of each first-type word segment corresponding to each first-type communication content sample;
generating, according to the word segment attributes of the first-type word segments corresponding to each first-type communication content sample, the first-type feature vector corresponding to each first-type communication content sample;
training a first initial identification model based on the first-type feature vector and the classification label corresponding to each first-type communication content sample, to obtain the first sensitive content identification model, wherein the first initial identification model is a model based on the support vector machine algorithm.
10. The method according to claim 8, characterized in that the step of establishing the second sensitive content identification model comprises:
obtaining a plurality of second-type communication content samples carrying classification labels, wherein a second-type communication content sample is a sample whose information type is passive information;
performing word segmentation on each second-type communication content sample, to obtain at least one second-type word segment corresponding to each second-type communication content sample;
generating, using the word segment attribute generation rule, the word segment attribute of each second-type word segment corresponding to each second-type communication content sample;
generating, according to the word segment attributes of the second-type word segments corresponding to each second-type communication content sample, the second-type feature vector corresponding to each second-type communication content sample;
training a second initial identification model based on the second-type feature vector and the classification label corresponding to each second-type communication content sample, to obtain the second sensitive content identification model, wherein the second initial identification model is a model based on the Bayesian classification algorithm.
11. The method according to claim 8, characterized in that the step of determining the information type of the target communication content according to the protocol type used when the target communication content was transmitted comprises:
when the target communication content is a web page: determining that the target communication content is active information if the protocol type used to transmit the web page is the POST type, and determining that the target communication content is passive information if the protocol type used to transmit the web page is the GET type;
when the target communication content is mail: determining that the target communication content is active information when the mail is transmitted according to a mail sending protocol, and determining that the target communication content is passive information when the mail is transmitted according to a mail receiving protocol.
12. A sensitive content identification device, characterized in that the device comprises:
a word segmentation processing module, configured to perform word segmentation on target communication content to be identified, to obtain at least one target word segment corresponding to the target communication content;
a word segment attribute generation module, configured to generate a word segment attribute for each target word segment using a preset word segment attribute generation rule;
a feature vector generation module, configured to generate, according to the word segment attribute of each target word segment, the target feature vector corresponding to the target communication content;
a sensitive content identification module, configured to input the target feature vector into a pre-established sensitive content identification model, to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the sensitive content identification model is a classification model obtained by using a preset machine learning algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels, and the classification label is either a first label identifying sensitive content or a second label identifying non-sensitive content.
13. The device according to claim 12, characterized in that the device further comprises:
a communication content obtaining module, specifically configured to:
capture packets transmitted in the network;
restore the data content carried in the packets, based on a preset application layer protocol;
determine the restored data content to be the target communication content to be identified.
14. The device according to claim 12, characterized in that the word segmentation processing module comprises a first division submodule and a first word segment determination submodule, wherein
the first division submodule is configured to divide the target communication content into several words according to a preset first word division rule;
the first word segment determination submodule is configured to take the words obtained by the division as the target word segments corresponding to the target communication content.
15. The device according to claim 12, characterized in that the word segmentation processing module comprises a second division submodule, a word segment removal submodule and a second word segment determination submodule, wherein
the second division submodule is configured to divide the target communication content into several words according to a preset second word division rule;
the word segment removal submodule is configured to remove the stop words from the several words obtained by the division, wherein a stop word is a word whose part of speech is adverb, preposition or pronoun;
the second word segment determination submodule is configured to take the remaining words as the target word segments corresponding to the target communication content.
16. The device according to claim 12, characterized in that the word segment attribute generation module comprises a weight calculation submodule and a word segment attribute determination submodule, wherein
the weight calculation submodule is configured to calculate the weight of each target word segment according to a preset weighting algorithm;
the word segment attribute determination submodule is configured to determine the calculated weight of each target word segment to be the word segment attribute of that target word segment.
17. The device according to claim 12, characterized in that the feature vector generation module comprises a word segment judging submodule, a feature attribute determination submodule and a feature vector generation submodule, wherein
the word segment judging submodule is configured to judge, for each keyword in a preset keyword list, whether the keyword is identical to any target word segment;
the feature attribute determination submodule is configured to determine the word segment attribute of the target word segment corresponding to a first-type keyword to be the feature attribute of that first-type keyword, and to determine a preset default value to be the feature attribute of a second-type keyword, wherein a first-type keyword is a keyword that is identical to a target word segment, and a second-type keyword is a keyword that is not identical to any target word segment;
the feature vector generation submodule is configured to take the feature attribute corresponding to each keyword as a vector element of the target feature vector and, according to the position of each keyword in the keyword list, generate the target feature vector corresponding to the target communication content.
18. The device according to claim 12, characterized in that the sensitive content identification model comprises:
a classification model established based on a support vector machine algorithm, a Bayesian classification algorithm or a neural network classification algorithm.
19. The device according to claim 12, characterized in that the device further comprises:
an information type determination module, configured to:
before the target feature vector is input into the pre-established sensitive content identification model, determine the information type of the target communication content according to the protocol type used when the target communication content was transmitted, wherein the information type is either active information sent to the outside or passive information sent from the outside;
and the sensitive content identification module comprises a first identification submodule and a second identification submodule, wherein
the first identification submodule is configured to, when the target communication content is active information, input the target feature vector into a pre-established first sensitive content identification model, to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the first sensitive content identification model is a classification model obtained by using the support vector machine algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels;
the second identification submodule is configured to, when the target communication content is passive information, input the target feature vector into a pre-established second sensitive content identification model, to obtain a recognition result indicating whether the target communication content is sensitive content, wherein the second sensitive content identification model is a classification model obtained by using the Bayesian classification algorithm to train on the feature vectors corresponding to a plurality of preset communication content samples carrying classification labels.
20. The device according to claim 19, characterized in that the step of establishing the first sensitive content recognition model comprises:
Obtaining a plurality of first-class communication content samples carrying classification labels; wherein the first-class communication content samples are samples whose information type is active information;
Performing word segmentation on each first-class communication content sample to obtain at least one first-class segmented word corresponding to each first-class communication content sample;
Generating, using the word attribute generation rule, the word attributes of the first-class segmented words corresponding to each first-class communication content sample;
Generating, according to the word attributes of the first-class segmented words corresponding to each first-class communication content sample, the first-class feature vector corresponding to each first-class communication content sample;
Training a first initial recognition model based on the first-class feature vector and classification label corresponding to each first-class communication content sample to obtain the first sensitive content recognition model, wherein the first initial recognition model is a model based on a support vector machine algorithm.
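The training steps of claim 20 can be sketched end to end in plain Python. Several parts are assumptions made for illustration only: whitespace splitting stands in for a real word segmenter, term frequency stands in for the word attribute generation rule (which is defined elsewhere in the application), and the support vector machine is a linear soft-margin model trained by stochastic subgradient descent on the hinge loss. The sample texts and all function names are hypothetical.

```python
from collections import Counter


def segment(text):
    """Stand-in word segmenter (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def word_attributes(words):
    """Hypothetical attribute rule: the attribute of a word is its count."""
    return Counter(words)


def build_vocabulary(samples):
    """Assign each distinct word a fixed position in the feature vector."""
    vocab = {}
    for text in samples:
        for word in segment(text):
            vocab.setdefault(word, len(vocab))
    return vocab


def to_feature_vector(text, vocab):
    """Turn the word attributes of one sample into a fixed-length vector."""
    vec = [0.0] * len(vocab)
    for word, count in word_attributes(segment(text)).items():
        if word in vocab:  # words unseen during training are ignored
            vec[vocab[word]] = float(count)
    return vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_linear_svm(vectors, labels, epochs=200, lr=0.1, lam=0.01):
    """Linear soft-margin SVM: subgradient descent on L2-regularised hinge loss.

    Labels must be +1 (sensitive) or -1 (not sensitive).
    """
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (dot(w, x) + b)
            w = [wi * (1.0 - lr * lam) for wi in w]  # regularisation step
            if margin < 1.0:                          # hinge loss is active
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def is_sensitive(text, vocab, w, b):
    return dot(w, to_feature_vector(text, vocab)) + b > 0.0


# Hypothetical first-class (active information) samples with labels.
samples = [
    "send password to external server",
    "leak customer password list",
    "weekly lunch menu",
    "team meeting agenda",
]
labels = [1, 1, -1, -1]

vocab = build_vocabulary(samples)
vectors = [to_feature_vector(text, vocab) for text in samples]
w, b = train_linear_svm(vectors, labels)
```

In practice a library implementation and a real segmenter would replace the hand-written pieces; the sketch only shows how segmentation, attribute generation, vectorisation, and training connect.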
21. The device according to claim 19, characterized in that the step of establishing the second sensitive content recognition model comprises:
Obtaining a plurality of second-class communication content samples carrying classification labels; wherein the second-class communication content samples are samples whose information type is passive information;
Performing word segmentation on each second-class communication content sample to obtain at least one second-class segmented word corresponding to each second-class communication content sample;
Generating, using the word attribute generation rule, the word attributes of the second-class segmented words corresponding to each second-class communication content sample;
Generating, according to the word attributes of the second-class segmented words corresponding to each second-class communication content sample, the second-class feature vector corresponding to each second-class communication content sample;
Training a second initial recognition model based on the second-class feature vector and classification label corresponding to each second-class communication content sample to obtain the second sensitive content recognition model, wherein the second initial recognition model is a model based on a Bayesian classification algorithm.
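The steps of claim 21 can be illustrated with a small multinomial naive Bayes classifier. This is a sketch under stated assumptions rather than the method of the application: whitespace splitting replaces a real word segmenter, raw word counts serve as the word attributes and feature vector, Laplace (add-one) smoothing is used, and the sample texts, labels, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def segment(text):
    """Stand-in word segmenter (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def train_naive_bayes(samples, labels):
    """Collect the class and per-class word counts needed for classification."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(samples, labels):
        words = segment(text)
        word_counts[label].update(words)  # word counts act as the feature vector
        vocab.update(words)
    return {
        "class_counts": class_counts,
        "word_counts": word_counts,
        "vocab": vocab,
        "total": len(labels),
    }


def classify(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in model["class_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n / model["total"])  # log prior
        for word in segment(text):
            if word in model["vocab"]:        # skip words never seen in training
                score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical second-class (passive information) samples with labels.
samples = [
    "win free prize click link",
    "free lottery prize claim now",
    "project status report attached",
    "meeting notes attached for review",
]
labels = ["sensitive", "sensitive", "normal", "normal"]

model = train_naive_bayes(samples, labels)
```

For example, `classify(model, "free prize")` scores the two words against each class; both words occur only in the samples labeled sensitive, so that class receives the higher posterior.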
22. The device according to claim 19, characterized in that the information type determination module comprises: a first information type determination submodule and a second information type determination submodule; wherein,
The first information type determination submodule is configured to, when the target communication content is a web page: determine that the target communication content is active information if the protocol type used to transmit the web page is the POST type; and determine that the target communication content is passive information if the protocol type used to transmit the web page is the GET type;
The second information type determination submodule is configured to, when the communication content to be identified is mail: determine that the target communication content is active information when the mail is transmitted according to a mail sending protocol; and determine that the target communication content is passive information when the mail is transmitted according to a mail receiving protocol.
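The decision rule of claim 22 can be written as a small lookup. This is an illustrative sketch: the claim names only POST, GET, and the general categories of mail sending and mail receiving protocols, so the use of SMTP as a sending protocol and POP3 and IMAP as receiving protocols is an assumption, as are the names `InfoType` and `determine_info_type`.

```python
from enum import Enum


class InfoType(Enum):
    ACTIVE = "active"    # information sent to the outside
    PASSIVE = "passive"  # information sent from the outside


# Assumed examples of mail protocols; the claim does not name specific ones.
MAIL_SENDING_PROTOCOLS = {"SMTP"}
MAIL_RECEIVING_PROTOCOLS = {"POP3", "IMAP"}


def determine_info_type(content_kind: str, protocol: str) -> InfoType:
    """Map (kind of content, protocol type) to the information type."""
    protocol = protocol.upper()
    if content_kind == "webpage":
        if protocol == "POST":
            return InfoType.ACTIVE
        if protocol == "GET":
            return InfoType.PASSIVE
    elif content_kind == "mail":
        if protocol in MAIL_SENDING_PROTOCOLS:
            return InfoType.ACTIVE
        if protocol in MAIL_RECEIVING_PROTOCOLS:
            return InfoType.PASSIVE
    raise ValueError(f"cannot determine information type for {content_kind!r}/{protocol!r}")


upload_type = determine_info_type("webpage", "post")
inbox_type = determine_info_type("mail", "IMAP")
```

Raising an error for unlisted combinations keeps the sketch faithful to the claim, which defines the outcome only for the four cases above.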
CN201610822280.3A 2016-09-13 2016-09-13 A kind of sensitive content recognition methods and device Pending CN107818077A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610822280.3A CN107818077A (en) 2016-09-13 2016-09-13 A kind of sensitive content recognition methods and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610822280.3A CN107818077A (en) 2016-09-13 2016-09-13 A kind of sensitive content recognition methods and device

Publications (1)

Publication Number Publication Date
CN107818077A true CN107818077A (en) 2018-03-20

Family

ID=61600614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610822280.3A Pending CN107818077A (en) 2016-09-13 2016-09-13 A kind of sensitive content recognition methods and device

Country Status (1)

Country Link
CN (1) CN107818077A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829680A (en) * 2018-06-22 2018-11-16 北京百悟科技有限公司 A kind of violation publicity detection method and device, computer readable storage medium
CN109063155A (en) * 2018-08-10 2018-12-21 广州锋网信息科技有限公司 Language model parameter determination method, device and computer equipment
CN109241330A (en) * 2018-08-20 2019-01-18 北京百度网讯科技有限公司 The method, apparatus, equipment and medium of key phrase in audio for identification
CN109407504A (en) * 2018-11-30 2019-03-01 华南理工大学 A kind of personal safety detection system and method based on smartwatch
CN109858280A (en) * 2019-01-21 2019-06-07 深圳昂楷科技有限公司 A kind of desensitization method based on machine learning, device and desensitization equipment
CN110008470A (en) * 2019-03-19 2019-07-12 阿里巴巴集团控股有限公司 The sensibility stage division and device of report
CN110134966A (en) * 2019-05-21 2019-08-16 中电健康云科技有限公司 A kind of sensitive information determines method and device
CN110275958A (en) * 2019-06-26 2019-09-24 北京市博汇科技股份有限公司 Site information recognition methods, device and electronic equipment
CN110472234A (en) * 2019-07-19 2019-11-19 平安科技(深圳)有限公司 Sensitive text recognition method, device, medium and computer equipment
CN110737770A (en) * 2018-07-03 2020-01-31 百度在线网络技术(北京)有限公司 Text data sensitivity identification method and device, electronic equipment and storage medium
CN110750981A (en) * 2019-10-16 2020-02-04 杭州安恒信息技术股份有限公司 High-accuracy website sensitive word detection method based on machine learning
CN110784330A (en) * 2018-07-30 2020-02-11 华为技术有限公司 Method and device for generating application recognition model
CN110826320A (en) * 2019-11-28 2020-02-21 上海观安信息技术股份有限公司 Sensitive data discovery method and system based on text recognition
WO2020063071A1 (en) * 2018-09-27 2020-04-02 厦门快商通信息技术有限公司 Sentence vector calculation method based on chi-square test, and text classification method and system
CN111291570A (en) * 2018-12-07 2020-06-16 北京国双科技有限公司 Method and device for realizing element identification in judicial documents
CN111898060A (en) * 2020-07-14 2020-11-06 大汉软件股份有限公司 Content automatic monitoring method based on deep learning
CN112069500A (en) * 2020-09-17 2020-12-11 杭州安恒信息技术股份有限公司 Application software detection method, device and medium
CN112434331A (en) * 2020-11-20 2021-03-02 百度在线网络技术(北京)有限公司 Data desensitization method, device, equipment and storage medium
CN113220980A (en) * 2020-02-06 2021-08-06 北京沃东天骏信息技术有限公司 Article attribute word recognition method, device, equipment and storage medium
CN113434775A (en) * 2021-07-15 2021-09-24 北京达佳互联信息技术有限公司 Method and device for determining search content
US20230012656A1 (en) * 2021-04-02 2023-01-19 Sift Science, Inc. Systems and methods for automated labeling of subscriber digital event data in a machine learning-based digital threat mitigation platform
CN117197538A (en) * 2023-08-16 2023-12-08 哈尔滨工业大学 Bayesian convolution neural network structure apparent damage identification method based on Gaussian distribution weight sampling

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070276773A1 (en) * 2006-03-06 2007-11-29 Murali Aravamudan Methods and systems for selecting and presenting content on a first system based on user preferences learned on a second system
CN102012985A (en) * 2010-11-19 2011-04-13 国网电力科学研究院 Sensitive data dynamic identification method based on data mining
CN102289522A (en) * 2011-09-19 2011-12-21 北京金和软件股份有限公司 Method of intelligently classifying texts
CN103744905A (en) * 2013-12-25 2014-04-23 新浪网技术(中国)有限公司 Junk mail judgment method and device
CN103761221A (en) * 2013-12-31 2014-04-30 北京京东尚科信息技术有限公司 System and method for identifying sensitive text messages
CN105631049A (en) * 2016-02-17 2016-06-01 北京奇虎科技有限公司 Method and system for recognizing defrauding short messages

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829680A (en) * 2018-06-22 2018-11-16 北京百悟科技有限公司 A kind of violation publicity detection method and device, computer readable storage medium
CN110737770A (en) * 2018-07-03 2020-01-31 百度在线网络技术(北京)有限公司 Text data sensitivity identification method and device, electronic equipment and storage medium
CN110737770B (en) * 2018-07-03 2023-01-20 百度在线网络技术(北京)有限公司 Text data sensitivity identification method and device, electronic equipment and storage medium
CN110784330B (en) * 2018-07-30 2022-04-05 华为技术有限公司 Method and device for generating application recognition model
CN110784330A (en) * 2018-07-30 2020-02-11 华为技术有限公司 Method and device for generating application recognition model
CN109063155A (en) * 2018-08-10 2018-12-21 广州锋网信息科技有限公司 Language model parameter determination method, device and computer equipment
CN109241330A (en) * 2018-08-20 2019-01-18 北京百度网讯科技有限公司 The method, apparatus, equipment and medium of key phrase in audio for identification
US11308937B2 (en) 2018-08-20 2022-04-19 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for identifying key phrase in audio, device and medium
WO2020063071A1 (en) * 2018-09-27 2020-04-02 厦门快商通信息技术有限公司 Sentence vector calculation method based on chi-square test, and text classification method and system
CN109407504A (en) * 2018-11-30 2019-03-01 华南理工大学 A kind of personal safety detection system and method based on smartwatch
CN109407504B (en) * 2018-11-30 2021-05-14 华南理工大学 Personal safety detection system and method based on smart watch
CN111291570A (en) * 2018-12-07 2020-06-16 北京国双科技有限公司 Method and device for realizing element identification in judicial documents
CN109858280A (en) * 2019-01-21 2019-06-07 深圳昂楷科技有限公司 A kind of desensitization method based on machine learning, device and desensitization equipment
CN110008470A (en) * 2019-03-19 2019-07-12 阿里巴巴集团控股有限公司 The sensibility stage division and device of report
CN110134966A (en) * 2019-05-21 2019-08-16 中电健康云科技有限公司 A kind of sensitive information determines method and device
CN110275958A (en) * 2019-06-26 2019-09-24 北京市博汇科技股份有限公司 Site information recognition methods, device and electronic equipment
CN110275958B (en) * 2019-06-26 2021-07-27 北京市博汇科技股份有限公司 Website information identification method and device and electronic equipment
CN110472234A (en) * 2019-07-19 2019-11-19 平安科技(深圳)有限公司 Sensitive text recognition method, device, medium and computer equipment
CN110750981A (en) * 2019-10-16 2020-02-04 杭州安恒信息技术股份有限公司 High-accuracy website sensitive word detection method based on machine learning
CN110826320B (en) * 2019-11-28 2023-10-13 上海观安信息技术股份有限公司 Sensitive data discovery method and system based on text recognition
CN110826320A (en) * 2019-11-28 2020-02-21 上海观安信息技术股份有限公司 Sensitive data discovery method and system based on text recognition
CN113220980A (en) * 2020-02-06 2021-08-06 北京沃东天骏信息技术有限公司 Article attribute word recognition method, device, equipment and storage medium
CN111898060A (en) * 2020-07-14 2020-11-06 大汉软件股份有限公司 Content automatic monitoring method based on deep learning
CN112069500A (en) * 2020-09-17 2020-12-11 杭州安恒信息技术股份有限公司 Application software detection method, device and medium
CN112434331B (en) * 2020-11-20 2023-08-18 百度在线网络技术(北京)有限公司 Data desensitization method, device, equipment and storage medium
CN112434331A (en) * 2020-11-20 2021-03-02 百度在线网络技术(北京)有限公司 Data desensitization method, device, equipment and storage medium
US20230012656A1 (en) * 2021-04-02 2023-01-19 Sift Science, Inc. Systems and methods for automated labeling of subscriber digital event data in a machine learning-based digital threat mitigation platform
US11645386B2 (en) * 2021-04-02 2023-05-09 Sift Science, Inc. Systems and methods for automated labeling of subscriber digital event data in a machine learning-based digital threat mitigation platform
CN113434775A (en) * 2021-07-15 2021-09-24 北京达佳互联信息技术有限公司 Method and device for determining search content
CN113434775B (en) * 2021-07-15 2024-03-26 北京达佳互联信息技术有限公司 Method and device for determining search content
CN117197538A (en) * 2023-08-16 2023-12-08 哈尔滨工业大学 Bayesian convolution neural network structure apparent damage identification method based on Gaussian distribution weight sampling

Similar Documents

Publication Publication Date Title
CN107818077A (en) A kind of sensitive content recognition methods and device
CN107391760B (en) User interest recognition methods, device and computer readable storage medium
US20170063893A1 (en) Learning detector of malicious network traffic from weak labels
CN105630767B (en) The comparative approach and device of a kind of text similarity
CN107992596A (en) A kind of Text Clustering Method, device, server and storage medium
CN106489149A (en) A kind of data mask method based on data mining and mass-rent and system
CN107491435A (en) Method and device based on Computer Automatic Recognition user feeling
WO2019179010A1 (en) Data set acquisition method, classification method and device, apparatus, and storage medium
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN107391675A (en) Method and apparatus for generating structure information
CN113762377B (en) Network traffic identification method, device, equipment and storage medium
CN108491866A (en) Porny identification method, electronic device and readable storage medium storing program for executing
CN110347840A (en) Complain prediction technique, system, equipment and the storage medium of text categories
CN110019790A (en) Text identification, text monitoring, data object identification, data processing method
CN116489152B (en) Linkage control method and device for Internet of things equipment, electronic equipment and medium
CN109766441A (en) File classification method, apparatus and system
CN110458296A (en) The labeling method and device of object event, storage medium and electronic device
CN112580108A (en) Signature and seal integrity verification method and computer equipment
CN112884121A (en) Traffic identification method based on generation of confrontation deep convolutional network
CN108536673A (en) Media event abstracting method and device
Shrivastav et al. Network traffic classification using semi-supervised approach
CN108268461A (en) A kind of document sorting apparatus based on hybrid classifer
WO2024055603A1 (en) Method and apparatus for identifying text from minor
CN117371049A (en) Machine-generated text detection method and system based on blockchain and generated countermeasure network
CN107506407A (en) A kind of document classification, the method and device called

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180320