CN106874253A - Recognize the method and device of sensitive information - Google Patents

Method and device for identifying sensitive information

Info

Publication number
CN106874253A
Authority
CN
China
Prior art keywords
information
participle
sensitive information
sensitive
text message
Prior art date
Application number
CN201510919548.0A
Other languages
Chinese (zh)
Inventor
付星辉
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to CN201510919548.0A priority Critical patent/CN106874253A/en
Publication of CN106874253A publication Critical patent/CN106874253A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Abstract

An embodiment of the invention discloses a method for identifying sensitive information, including: receiving target information and extracting the text information contained in the target information; calculating a hash value of the text information and, when the hash value of the text information differs from the hash value of preset feature sensitive information, segmenting the text information into words to obtain a word set; calculating hash values of the words in the word set and generating, according to those hash values, a similarity between the target information and the preset feature sensitive information; and judging, according to the similarity and/or a semantic analysis of the text information, whether the target information is sensitive information. The invention also discloses a corresponding device for identifying sensitive information. The method and device achieve higher recognition accuracy when judging whether content published by a user is sensitive information.

Description

Method and device for identifying sensitive information

Technical field

The present invention relates to the field of computer technology, and in particular to a method and device for identifying sensitive information.

Background

In existing Web 2.0 social networking applications, content is no longer published and pushed mainly by the server; more of it is published by users themselves through interaction. For example, a user can take a photo with a mobile phone and share it with other users over the network, or can edit text content such as forum topics, blogs, forum posts and microblogs and share it with other users. However, content shared by users may be illegal or may violate ethical norms, for example profanity, violence, obscenity or fraud. The content published by users therefore needs to be checked so that sensitive information is identified and intercepted.

Existing methods for intercepting sensitive information online generally use a single text similarity strategy, such as matching the MD5 of the full text, to find and intercept sensitive information. Although the precision of this approach is very high, its recall depends heavily on the size of the existing sensitive information feature library. Sensitive information mutates easily, and this kind of matching has difficulty finding similar text, so the recall for sensitive information is low. Moreover, manually adding sensitive information features always lags behind by some time, which makes the problem of message variants hard to solve.

Therefore, because manually added sensitive information features lag behind by some time, conventional online interception methods identify sensitive information with low accuracy and cannot reliably identify variant or approximate sensitive information.

Summary of the invention

In view of the technical problem that conventional online interception methods identify sensitive information with low accuracy because manually added sensitive information features lag behind by some time, a method for identifying sensitive information is provided.

A method for identifying sensitive information, including:

receiving target information, and extracting the text information contained in the target information;

calculating a hash value of the text information and, when the hash value of the text information differs from the hash value of preset feature sensitive information, segmenting the text information into words to obtain a word set;

calculating hash values of the words in the word set, and generating a similarity between the target information and the preset feature sensitive information according to the hash values of the words in the word set;

judging, according to the similarity and/or a semantic analysis of the text information, whether the target information is sensitive information.

In one embodiment, the step of generating the similarity between the target information and the preset feature sensitive information according to the hash values of the words in the word set includes:

calculating the proportion, within the word set, of words whose hash values match the hash values of words of the preset feature sensitive information;

generating the similarity between the target information and the preset feature sensitive information according to the proportion.

In one embodiment, the step of generating the similarity between the target information and the preset feature sensitive information according to the hash values of the words in the word set includes:

using the simhash algorithm, generating a first simhash value of the target information according to the hash values of the words in the word set;

calculating the difference between the first simhash value and a second simhash value of the preset feature sensitive information;

generating the similarity between the target information and the preset feature sensitive information according to the difference.

In one embodiment, after the step of extracting the text information contained in the target information, the method further includes:

when the target information contains no text information, obtaining the user identifier that published the target information;

obtaining behavior feature data of the user identifier, and judging whether the target information is sensitive information according to the behavior feature data.

In one embodiment, after the step of calculating the hash value of the text information, the method further includes:

when the hash value of the text information is identical to the hash value of the preset feature sensitive information, judging that the target information is sensitive information.

In one embodiment, the step of judging, according to the similarity and/or a semantic analysis of the text information, whether the target information is sensitive information further includes:

extracting text features of the text information according to a preset machine learning probability model;

taking the text features as input, and performing semantic analysis on the text information by calculating a sensitivity confidence of the target information according to the preset machine learning probability model;

judging whether the target information is sensitive information according to the similarity and/or the sensitivity confidence.

In one embodiment, after the step of judging whether the target information is sensitive information according to the similarity and/or the sensitivity confidence, the method further includes:

if the target information is judged to be sensitive information, storing the target information as feature sensitive information.

In one embodiment, after the step of extracting the text information contained in the target information, the method further includes:

filtering out symbol information and redundant semantic information from the text information.

In addition, in view of the technical problem that conventional online interception methods identify sensitive information with low accuracy because manually added sensitive information features lag behind by some time, a device for identifying sensitive information is provided.

A device for identifying sensitive information, including:

a text information extraction module, configured to receive target information and extract the text information contained in the target information;

a full-text hash identification module, configured to calculate a hash value of the text information;

a word segmentation module, configured to segment the text information into words to obtain a word set when the hash value of the text information differs from the hash value of preset feature sensitive information;

a similarity calculation module, configured to calculate hash values of the words in the word set and generate a similarity between the target information and the preset feature sensitive information according to the hash values of the words in the word set;

a sensitive information determination module, configured to judge, according to the similarity and/or a semantic analysis of the text information, whether the target information is sensitive information.

In one embodiment, the similarity calculation module is further configured to calculate the proportion, within the word set, of words whose hash values match the hash values of words of the preset feature sensitive information;

and to generate the similarity between the target information and the preset feature sensitive information according to the proportion.

In one embodiment, the similarity calculation module is further configured to use the simhash algorithm to generate a first simhash value of the target information according to the hash values of the words in the word set; calculate the difference between the first simhash value and a second simhash value of the preset feature sensitive information; and generate the similarity between the target information and the preset feature sensitive information according to the difference.

In one embodiment, the device further includes a behavior identification module, configured to obtain the user identifier that published the target information when the target information contains no text information, obtain behavior feature data of the user identifier, and judge whether the target information is sensitive information according to the behavior feature data.

In one embodiment, the full-text hash identification module is further configured to judge that the target information is sensitive information when the hash value of the text information is identical to the hash value of the preset feature sensitive information.

In one embodiment, the device further includes a semantic identification module, configured to extract text features of the text information according to a preset machine learning probability model and, taking the text features as input, perform semantic analysis on the text information by calculating a sensitivity confidence of the target information according to the preset machine learning probability model;

the sensitive information determination module is further configured to judge whether the target information is sensitive information according to the similarity and/or the sensitivity confidence.

In one embodiment, the semantic identification module is further configured to store the target information as feature sensitive information when the target information is judged to be sensitive information.

In one embodiment, the text information extraction module is further configured to filter out symbol information and redundant semantic information from the text information.

Implementing the embodiments of the invention has the following beneficial effects:

With the above method and device for identifying sensitive information, the hash value of the text information in the incoming target information is calculated first and a full-text hash comparison is performed. When the target information is not exactly identical to the feature sensitive information in the feature library, the target information is segmented into words, and the hash values of the words are used to obtain the similarity between the target information and the feature sensitive information in the feature library. This is then combined with the result of a semantic analysis of the target information to judge whether the target information is sensitive information. Judging sensitive information thus combines several means: full-text hash comparison, similarity comparison and semantic comparison. Compared with conventional techniques, approximate or variant sensitive information can be identified rather than missed even when the target information is not exactly identical to the feature sensitive information, which improves recognition accuracy.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Wherein:

Fig. 1 is a flowchart of a method for identifying sensitive information in one embodiment;

Fig. 2 is a schematic diagram of the process of calculating the simhash value of target information in one embodiment;

Fig. 3 is a system function diagram that integrates multiple identification modes in one embodiment;

Fig. 4 is a schematic diagram of a device for identifying sensitive information in one embodiment;

Fig. 5 is a schematic structural diagram of a computer device that runs the foregoing method for identifying sensitive information in one embodiment.

Detailed description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.

In view of the technical problem that conventional online interception methods identify sensitive information with low accuracy because manually added sensitive information features lag behind by some time, a method for identifying sensitive information is provided. The method can be implemented by a computer program that runs on a computer system based on the von Neumann architecture. The computer system can be the server of a website or mobile app that gives users a platform for publishing content, such as a social networking website or application, an online game website or application, or an online forum application.

Specifically, as shown in Fig. 1, a method for identifying sensitive information includes:

Step S102: Receive target information, and extract the text information contained in the target information.

As described above, the application scenario is a Web 2.0 application, that is, a website or mobile app that gives users a platform for publishing content, and the method is executed by the server of that website or mobile app. Content that a user enters through a web page or mobile app client is sent by the terminal to the server and then forwarded by the server to other users. The content entered by the user is the target information.

For example, in a microblog application, a user takes a photo with the mobile microblog client, adds a caption to the photo and publishes it on the microblog. The photo and the caption are the content published by the user. The photo and caption received by the server are the target information, which must be judged as to whether it contains sensitive information such as violence, pornography, anti-party or anti-people content, fraud or pyramid selling.

As described above, the content published by a user can contain not only text information but also pictures, audio and video. In this embodiment, the means of identifying sensitive information can be selected according to the media types of the content contained in the target information. That is, if the target information contains text information, the text information is identified; if the target information contains no text information, for example when only a picture or a video is uploaded, the judgment is made according to the behavior features of the user who published the target information.

That is, after the step of extracting the text information contained in the target information, when the target information contains no text information, the server can obtain the user identifier that published the target information, obtain behavior feature data of the user identifier, and judge whether the target information is sensitive information according to the behavior feature data.

Behavior feature data corresponding to the user identifier, such as the number of content publications, the publication frequency and the number of times the user has been reported, can be obtained and used to calculate the likelihood that the target information is sensitive information. When the likelihood exceeds a threshold, the target information is judged to be sensitive information.

For example, if a user has published a large number of pictures within a short time, the pictures have been reported many times, and the user has also been reported many times in the past, the server can judge that the content published by the user is sensitive information and block it.
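As a non-limiting illustration of the behavior-based judgment described above, the following Python sketch scores a user from the publication frequency and the number of reports. The field names, the weights, the caps of 50 posts and 10 reports, and the threshold of 0.7 are all assumptions chosen for illustration; the embodiment does not specify how the likelihood is calculated.

```python
from dataclasses import dataclass


@dataclass
class BehaviorFeatures:
    """Hypothetical behavior feature data for one user identifier."""
    posts_last_hour: int   # publication frequency
    times_reported: int    # number of times the user was reported


def behavior_risk(features: BehaviorFeatures) -> float:
    """Return an illustrative likelihood in [0, 1]; the weights are assumptions."""
    frequency_term = min(features.posts_last_hour / 50.0, 1.0)
    report_term = min(features.times_reported / 10.0, 1.0)
    return 0.4 * frequency_term + 0.6 * report_term


def is_sensitive_by_behavior(features: BehaviorFeatures,
                             threshold: float = 0.7) -> bool:
    """Judge the target information sensitive when the likelihood exceeds the threshold."""
    return behavior_risk(features) > threshold
```

In practice the score would more likely come from a model fitted to labeled accounts than from fixed weights.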

In this embodiment, if the content published by the user, that is, the target information received by the server, contains text content, the subsequent step S104 is executed.

Step S104: Calculate the hash value of the text information and, when the hash value of the text information differs from the hash value of the preset feature sensitive information, segment the text information into words to obtain a word set.

In this embodiment, if the hash value of the text information is identical to the hash value of preset feature sensitive information, the target information can be judged directly to be sensitive information, because identical hash values mean that the target information is exactly the same as feature sensitive information in the feature library, so the identification hits. For example, the MD5 or SHA1 checksum of the text information can be calculated directly and compared with the MD5 or SHA1 checksums of the pre-stored feature sensitive information. If they are identical, the text information is identical to the feature sensitive information, so target information that is exactly the same as pre-stored feature sensitive information in the feature library is identified quickly. Only target information whose text hash value differs from the hash values of the preset feature sensitive information goes on to the subsequent identification process (similarity identification and/or semantic identification). Target information that exactly matches feature sensitive information is thus identified quickly, which avoids needless computation and improves execution efficiency.
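The full-text comparison described above can be sketched in Python as follows. This is a minimal sketch: MD5 is one of the hash functions named in the embodiment, while the in-memory set used as the feature library and the sample text are assumptions made for illustration.

```python
import hashlib
from typing import Set


def text_hash(text: str) -> str:
    """MD5 digest of the full text, as a hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical feature library: hashes of known sensitive texts.
feature_hashes: Set[str] = {text_hash("join now and earn 10000 a day")}


def exact_match(text: str, known_hashes: Set[str]) -> bool:
    """True when the text is byte-for-byte identical to a stored sample."""
    return text_hash(text) in known_hashes
```

A single changed character produces a different digest, which is why texts that miss this check go on to the similarity stage.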

In this embodiment, when the calculated hash value of the text information differs from the hash value of the preset feature sensitive information, the text information is segmented into words. For example, in a forum application scenario, the text information published by a user may be a lengthy pyramid selling advertisement. The text of the advertisement can be segmented into individual words, and these words form the word set obtained by segmenting the extracted text information. In this embodiment, various open source or proprietary word segmentation tools can be used, for example open source tools such as StandardAnalyzer, ChineseAnalyzer and CJKAnalyzer.

Preferably, after the text information contained in the target information is extracted and during word segmentation, preprocessing can also be performed to filter out symbol information and redundant semantic information from the text information. For example, content published by a user on a microblog may contain many emoticons or punctuation marks. Because emoticons and punctuation marks are essentially unrelated to sensitive information, they can be filtered out in advance to reduce computation. In addition, repeated words that are obviously slips of the pen can be filtered out, which also reduces computation.

Step S106: Calculate the hash values of the words in the word set, and generate the similarity between the target information and the preset feature sensitive information according to the hash values of the words in the word set.

The similarity between the target information and the preset feature sensitive information is a quantitative measure of how similar the two are. In this embodiment, there are two ways to calculate the similarity between the target information and the preset feature sensitive information from the hash values of the words in the word set.

The first way generates the similarity from the proportion of matching words, specifically:

calculating the proportion, within the word set, of words whose hash values match the hash values of words of the preset feature sensitive information; and generating the similarity between the target information and the preset feature sensitive information according to the proportion.

For example, suppose the target information contains 10 words, 8 of which appear in the feature sensitive information; that is, the hash values (for example MD5 values) of these 8 words are identical to the hash values of 8 words of the feature sensitive information. Then 80% of the words in the word set are the same as words of the feature sensitive information, and the similarity can be generated from this 80%.
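The proportion in the example above can be computed as in the following Python sketch. Using MD5 as the per-word hash follows the example; the function names and the choice to count every occurrence of a word in the word set are assumptions.

```python
import hashlib
from typing import Iterable, List


def word_hash(word: str) -> str:
    """MD5 digest of a single word."""
    return hashlib.md5(word.encode("utf-8")).hexdigest()


def match_ratio(words: List[str], feature_words: Iterable[str]) -> float:
    """Share of words whose hash also occurs among the feature words' hashes."""
    if not words:
        return 0.0
    feature_hashes = {word_hash(w) for w in feature_words}
    matched = sum(1 for w in words if word_hash(w) in feature_hashes)
    return matched / len(words)
```

With 10 words of which 8 occur in the feature sensitive information, `match_ratio` returns 0.8, matching the 80% in the example.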

The second way calculates the simhash value of the text information of the target information with the simhash algorithm and generates the similarity from that simhash value, specifically:

using the simhash algorithm, generating a first simhash value of the target information according to the hash values of the words in the word set; calculating the difference between the first simhash value and a second simhash value of the preset feature sensitive information; and generating the similarity between the target information and the preset feature sensitive information according to the difference.

For example, Fig. 2 illustrates the process of calculating the simhash value of the target information, where the simhash value is preset to 6 bits. In Fig. 2, 1 to n are the 1st to n-th words in the word set, and the weight coefficients W1 to Wn of the words are the numbers of times the respective words occur in the target information. A 6-bit hash value can be calculated in advance for each of the 1st to n-th words with a specific hash algorithm, and each bit of the simhash value of the target information is then calculated from these hash values.

To calculate bit i of the simhash value of the target information, the value of each word's hash value at bit i is obtained. For example, in Fig. 2,

the hash value of the 1st word is: 100110;

the hash value of the 2nd word is: 110000;

……

the hash value of the n-th word is: 001001;

To calculate bit 1 of the simhash value of the target information, the following are obtained:

the value of the hash value of the 1st word at bit 1 is 1;

the value of the hash value of the 2nd word at bit 1 is 1;

the value of the hash value of the n-th word at bit 1 is 0;

The sign of each word's weight coefficient is then determined by the value of that word's hash value at bit i (positive if the bit is 1, negative if it is 0), and the signed weight coefficients are summed. The value of bit i of the simhash value of the target information is obtained from the sign of the sum.

In the example above, the value of the hash value of the 1st word at bit 1 is 1, the value of the hash value of the 2nd word at bit 1 is 1, and the value of the hash value of the n-th word at bit 1 is 0. The summation expression obtained after assigning the corresponding signs to the weight coefficients W1 to Wn is therefore:

+W1 + W2 + … - Wn

If the sum is greater than 0, bit 1 of the simhash value of the target information is 1; otherwise it is 0.

The calculated simhash value is therefore also a 6-bit value. It should be noted that the length of the simhash value can be set arbitrarily, but is usually set to 32 or 64 bits. Preferably, the minhash algorithm can also be used to pre-screen the target information for similarity before simhash is applied, which reduces computation.
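The bit-by-bit procedure above can be sketched in Python as follows. The embodiment only says that a specific hash algorithm produces the per-word hash values, so deriving them from the leading bits of MD5, and the function and parameter names, are assumptions; the weights are the occurrence counts, as in Fig. 2.

```python
import hashlib
from collections import Counter
from typing import Callable, Iterable


def word_bits(word: str, bits: int) -> int:
    """Assumed per-word hash: the leading `bits` bits of the word's MD5 digest."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def simhash(words: Iterable[str], bits: int = 64,
            hash_fn: Callable[[str, int], int] = word_bits) -> int:
    """Weighted simhash; each word's weight is its number of occurrences."""
    weights = Counter(words)
    totals = [0] * bits
    for word, weight in weights.items():
        h = hash_fn(word, bits)
        for i in range(bits):
            bit = (h >> (bits - 1 - i)) & 1      # i = 0 is the leftmost bit
            totals[i] += weight if bit else -weight
    value = 0
    for total in totals:
        # A positive sum yields 1; zero or negative yields 0.
        value = (value << 1) | (1 if total > 0 else 0)
    return value
```

Feeding in the three 6-bit hash values listed above (100110, 110000 and 001001), each with weight 1 and with no other words, gives column sums of +1, -1, -1, -1, -1, -1 and therefore the simhash value 100000.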

If the calculated simhash value of the target information is 111001 and, among the feature sensitive information, the simhash value closest to it is 111000, the two differ only by 000001, and the similarity can be generated from 000001.
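One common way to turn such a difference into a number is the Hamming distance, that is, the count of differing bits. The Python sketch below takes that approach; mapping the distance to a similarity as 1 minus the fraction of differing bits is an assumption, since the embodiment does not fix the formula.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two simhash values differ."""
    return bin(a ^ b).count("1")


def simhash_similarity(a: int, b: int, bits: int) -> float:
    """Assumed mapping: 1.0 for identical values, 0.0 when every bit differs."""
    return 1.0 - hamming_distance(a, b) / bits
```

For 111001 and 111000 the XOR is 000001, so the distance is 1 and the similarity under this mapping is 5/6.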

It should be noted that the two ways of calculating similarity are not mutually exclusive and can be used in the same embodiment. That is, when generating the similarity, both the proportion, within the word set, of words whose hash values match the hash values of words of the preset feature sensitive information and the difference between the simhash values can be calculated, and the similarity is then generated from a combination of the two (for example, a weighted average).
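A weighted average of the two measures could be written as in the Python sketch below. It assumes both inputs have already been normalized to the range 0 to 1, and the default weight of 0.5 is illustrative only.

```python
def combined_similarity(ratio: float, simhash_sim: float,
                        ratio_weight: float = 0.5) -> float:
    """Weighted average of the matching-word ratio and the simhash similarity."""
    return ratio_weight * ratio + (1.0 - ratio_weight) * simhash_sim
```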

Step S108: Judge, according to the similarity and/or a semantic analysis of the text information, whether the target information is sensitive information.

In this embodiment, because the similarity is a quantized value, it can be compared with a threshold. If the similarity is greater than the threshold, most of the words in the target information duplicate words in the feature sensitive information, so the similarity is high and the target information can be judged to be sensitive information.

Optionally, for target information whose calculated similarity is low, identification can continue. Because the manually added feature sensitive information used as the reference for comparison is not updated in real time, target information may still be sensitive information even though its similarity to the stored feature sensitive information is low and its words differ considerably. A risk of missed identification therefore remains, and continuing the identification improves recognition accuracy.

Specifically, after the step of extracting the text information contained in the target information, the server can also perform semantic identification. That is, text features of the text information are extracted according to a preset machine learning probability model; the text features are taken as input, and semantic analysis is performed on the text information by calculating the sensitivity confidence of the target information according to the preset machine learning probability model; and whether the target information is sensitive information is judged according to the similarity and/or the sensitivity confidence.

The text features of the text information can include data such as the words that match preset feature keywords and the order in which the words occur. Samples of sensitive information can be input in advance for machine learning so that the server builds the machine learning probability model. After the machine learning probability model has been trained, the extracted text features can be input into it to calculate a confidence. When the confidence is greater than a threshold, machine identification has succeeded and the target information can be judged to be sensitive information; otherwise, the target information is judged to be non-sensitive information.
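The embodiment does not name a particular model, so the following Python sketch uses a hand-set logistic model purely as a stand-in for the trained machine learning probability model. The keywords, weights, bias and threshold are all hypothetical; in a real system they would be learned from labeled samples of sensitive information.

```python
import math
from typing import Dict, List

# Hypothetical keyword weights standing in for learned model parameters.
KEYWORD_WEIGHTS: Dict[str, float] = {"earn": 1.5, "recruit": 2.0, "downline": 2.5}
BIAS = -3.0


def sensitive_confidence(words: List[str]) -> float:
    """Map keyword-match features to a confidence in (0, 1) with a logistic function."""
    score = BIAS + sum(KEYWORD_WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))


def is_sensitive_by_semantics(words: List[str], threshold: float = 0.5) -> bool:
    """Judge the target information sensitive when the confidence exceeds the threshold."""
    return sensitive_confidence(words) > threshold
```

This stand-in ignores the order in which the words occur, which the embodiment lists as a possible text feature.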

In this embodiment, when judging whether the target information is sensitive information, the similarity and the sensitivity confidence can be combined. For example, similarity analysis can be performed first: if the similarity is higher than a first threshold, the target information is judged directly to be sensitive information; if the similarity is lower than a second threshold, the target information is judged directly to be non-sensitive information; and if the similarity lies between the second threshold and the first threshold, semantic identification is performed on the target information. Alternatively, semantic identification can be performed first: if the sensitivity confidence is higher than a third threshold, the target information is judged directly to be sensitive information; if the sensitivity confidence is lower than a fourth threshold, the target information is judged directly to be non-sensitive information; and if the sensitivity confidence lies between the fourth threshold and the third threshold, similarity identification is performed on the target information. (Because the machine learning computation of semantic identification is relatively expensive, similarity identification is preferably performed before semantic identification.) In another embodiment, the similarity and the sensitivity confidence can also be considered together, and the target information is identified as sensitive information when both satisfy preset conditions. That is, in the various application scenarios of this method, the designer can set the conditions that the similarity and the sensitivity confidence must satisfy according to how strictly sensitive information is to be reviewed, so that sensitive information can be identified with different flexible strategies.
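The similarity-first strategy described above can be sketched in Python as follows. The threshold values are assumptions, and the semantic stage is passed in as a callable so that the more expensive model is evaluated only when the similarity falls in the undecided band.

```python
from typing import Callable


def judge(similarity: float,
          confidence_fn: Callable[[], float],
          first_threshold: float = 0.8,
          second_threshold: float = 0.3,
          confidence_threshold: float = 0.5) -> bool:
    """Similarity first; semantic identification only for the undecided band."""
    if similarity > first_threshold:
        return True                      # clearly similar to a known sample
    if similarity < second_threshold:
        return False                     # clearly dissimilar
    # Undecided: fall back to the sensitivity confidence.
    return confidence_fn() > confidence_threshold
```

Swapping the roles of the two measures gives the semantic-first variant, and tightening or loosening the thresholds gives the stricter or more lenient strategies mentioned above.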

Preferably, after the step of determining, according to the sensitivity confidence, whether the target information is sensitive information, the method further includes:

If the target information is determined to be sensitive information, storing the target information as feature sensitive information.

That is, if sensitive information is successfully identified by the machine-learning probabilistic model, it is stored as feature sensitive information and added to the preset feature database as a reference sample for comparison. If a user later inputs the same content again, the full-text MD5 comparison performed up front can quickly identify it as sensitive information, without going through the subsequent, more laborious word-segment matching and machine-learning recognition, which reduces the amount of computation and improves execution efficiency.
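A minimal sketch of this feedback loop is shown below, assuming the feature database can be modelled as an in-memory set of full-text MD5 digests. The class name `FeatureStore` and its methods are hypothetical; the patent does not specify how the feature database is stored.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Full-text MD5 digest of the (already pre-processed) text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class FeatureStore:
    """Hypothetical in-memory feature database keyed by full-text MD5."""

    def __init__(self) -> None:
        self._digests: set = set()

    def add(self, text: str) -> None:
        # Called after the model has identified `text` as sensitive.
        self._digests.add(md5_hex(text))

    def contains(self, text: str) -> bool:
        # Exact identification: one hash and one set lookup.
        return md5_hex(text) in self._digests
```

Once a text has been added, a repeat submission of the identical text is caught by `contains` before any word segmentation or model inference is needed.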

It should be noted that, as shown in Fig. 3, on a server that integrates sensitive information identification, the identification modes described above can all be applied in the same system: the exact identification mode based on full-text MD5 comparison; the similarity identification mode, which compares the MD5 values of word segments or compares simhash values after word segmentation; the semantic identification mode based on machine learning; and the behavior identification mode, which identifies according to the behavioral feature data of the user ID that published the target information. In this system, the content published by a user is first pre-processed to remove semantically redundant repeated word segments, punctuation marks and emoticons. An identification mode is then selected according to whether the content contains text information: if it contains no text information, behavior identification is selected; if it contains text information, the three stages of exact identification, similarity identification and semantic identification may be carried out in turn, and whether the target information is sensitive information is finally determined from the results of these three stages. Because it identifies from multiple dimensions, a system that combines multiple identification modes in this way has higher identification accuracy.
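The pre-processing and mode selection just described could be organised roughly as in the sketch below. Everything here is an assumption made for illustration: the names `strip_symbols` and `identify`, the convention that a stage returns `None` when it cannot decide, and the use of Unicode categories to drop punctuation and emoticons. Removal of semantically redundant word segments is omitted.

```python
import unicodedata
from typing import Callable, Optional, Sequence


def strip_symbols(text: str) -> str:
    """Drop punctuation (category P*) and symbols such as emoji (S*)."""
    kept = (ch for ch in text if unicodedata.category(ch)[0] not in ("P", "S"))
    return " ".join("".join(kept).split())


def identify(text: Optional[str],
             stages: Sequence[Callable[[str], Optional[bool]]],
             behavior_check: Callable[[], bool]) -> bool:
    """Run exact, similarity and semantic stages in order on cleaned text.

    Each stage returns True/False when it can decide and None otherwise.
    Content without text falls back to behavior identification.
    """
    cleaned = strip_symbols(text or "")
    if not cleaned:
        return behavior_check()
    for stage in stages:
        verdict = stage(cleaned)
        if verdict is not None:
            return verdict
    return False  # assumed default when no stage reaches a decision
```

Note that this filter also removes currency and mathematical symbols, since they share the Unicode "symbol" categories with emoji; a production filter would likely be more selective.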

To address the technical problem in the conventional art that online interception of sensitive information has low identification accuracy, because manually added sensitive information features lag behind by a certain time, a device for identifying sensitive information is provided. As shown in Fig. 4, the device includes a text information extraction module 102, a full-text hash identification module 104, a word segmentation module 106, a similarity calculation module 108 and a sensitive information determination module 110, wherein:

The text information extraction module 102 is configured to receive target information and extract the text information contained in the target information.

The full-text hash identification module 104 is configured to calculate the hash value of the text information.

The word segmentation module 106 is configured to perform word segmentation on the text information to obtain a word-segment set when the hash value of the text information differs from the hash value of preset feature sensitive information.

The similarity calculation module 108 is configured to calculate the hash values of the word segments in the word-segment set, and to generate the similarity between the target information and the preset feature sensitive information according to the hash values of the word segments in the word-segment set.

The sensitive information determination module 110 is configured to determine, according to the similarity and/or semantic analysis of the text information, that the target information is sensitive information.

In one embodiment, the similarity calculation module 108 is further configured to calculate the proportion of the word-segment set made up of word segments whose hash values match the hash values of word segments of the preset feature sensitive information;

and to generate the similarity between the target information and the preset feature sensitive information according to that proportion.
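As a rough illustration of this proportion-based similarity, the sketch below assumes word segmentation has already been done and that MD5 is used as the per-segment hash (the text mentions MD5 comparison after segmentation). The function name `match_ratio` is hypothetical, and using the proportion directly as the similarity is one possible reading.

```python
import hashlib
from typing import Iterable


def md5_hex(token: str) -> str:
    return hashlib.md5(token.encode("utf-8")).hexdigest()


def match_ratio(tokens: Iterable[str], feature_tokens: Iterable[str]) -> float:
    """Share of the target's word-segment set whose hashes match the feature's."""
    hashes = {md5_hex(t) for t in tokens}            # word-segment set of the target
    if not hashes:
        return 0.0
    feature_hashes = {md5_hex(t) for t in feature_tokens}
    matched = len(hashes & feature_hashes)
    return matched / len(hashes)
```

For example, a target segmented into four distinct words, two of which also occur in the feature sensitive information, yields a ratio of 0.5.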

In one embodiment, the similarity calculation module 108 is further configured to generate, using the simhash algorithm, a first simhash value of the target information according to the hash values of the word segments in the word-segment set; to calculate the difference between the first simhash value and a second simhash value of the preset feature sensitive information; and to generate the similarity between the target information and the preset feature sensitive information according to that difference.
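The simhash variant can be sketched as follows. The assumptions are stated plainly: each word segment has weight 1, the per-segment hash is the first 64 bits of its MD5 digest, the "difference" between two simhash values is taken to be their Hamming distance, and the similarity is one minus the normalised distance. None of these specifics are fixed by the text above.

```python
import hashlib
from typing import Iterable


def simhash(tokens: Iterable[str], bits: int = 64) -> int:
    """Unweighted simhash fingerprint (bits must be a multiple of 8, at most 128)."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each segment votes +1 or -1 on every bit position.
            weights[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def simhash_similarity(tokens_a: Iterable[str], tokens_b: Iterable[str],
                       bits: int = 64) -> float:
    distance = hamming_distance(simhash(tokens_a, bits), simhash(tokens_b, bits))
    return 1 - distance / bits
```

Because similar segment sets flip only a few fingerprint bits, a small Hamming distance indicates that the target information is a near-duplicate or variant of the feature sensitive information.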

In one embodiment, as shown in Fig. 4, the device further includes a behavior identification module 112, configured to obtain, when the target information contains no text information, the user ID that published the target information; to obtain the behavioral feature data of that user ID; and to determine, according to the behavioral feature data, whether the target information is sensitive information.

In one embodiment, as shown in Fig. 4, the full-text hash identification module 104 is further configured to determine that the target information is sensitive information when the hash value of the text information is identical to the hash value of the preset feature sensitive information.

In one embodiment, as shown in Fig. 4, the device further includes a semantic identification module 114, configured to extract the text features of the text information according to a preset machine-learning probabilistic model, and, taking the text features as input, to perform semantic analysis on the text information by calculating the sensitivity confidence of the target information according to the preset machine-learning probabilistic model. The sensitive information determination module is further configured to determine, according to the similarity and/or the sensitivity confidence, whether the target information is sensitive information.

In one embodiment, the semantic identification module 114 is further configured to store the target information as feature sensitive information when the target information is determined to be sensitive information.

In one embodiment, the text information extraction module 102 is further configured to filter out symbol information and semantically redundant information from the text information.

Implementing the embodiments of the present invention has the following beneficial effects:

With the above method and device for identifying sensitive information, the hash value of the text information in the input target information is calculated first and a full-text hash comparison is performed. When the target information is not completely identical to the feature sensitive information in the feature database, the target information can be segmented into words and the hash values of the word segments calculated to obtain the similarity between the target information and the feature sensitive information in the feature database; whether the target information is sensitive information is then determined in combination with the result of semantic analysis of the target information. The determination thus uses several means together: full-text hash comparison, similarity comparison and semantic comparison. Compared with the conventional art, approximate or variant sensitive information can be identified, rather than missed, even when the target information is not completely identical to the feature sensitive information, which improves identification accuracy.

In one embodiment, Fig. 5 illustrates a terminal 10 of a computer system based on the von Neumann architecture that runs the above method for identifying sensitive information. The computer system may be a terminal device such as a smartphone, tablet computer, palmtop computer, notebook computer or personal computer. Specifically, it may include an external input interface 1001, a processor 1002, a memory 1003 and an output interface 1004 connected by a system bus. The external input interface 1001 may optionally include at least a network interface 10012. The memory 1003 may include an external memory 10032 (for example a hard disk, optical disc or floppy disk) and an internal memory 10034. The output interface 1004 may include at least a device such as a display screen 10042.

In this embodiment, the method runs as a computer program. The program file of the computer program is stored in the external memory 10032 of the aforementioned von Neumann computer system 10, loaded into the internal memory 10034 at run time, and then compiled into machine code and passed to the processor 1002 for execution, so that the logical text information extraction module 102, full-text hash identification module 104, word segmentation module 106, similarity calculation module 108 and sensitive information determination module 110 are formed in the von Neumann computer system 10. During execution of the above method for identifying sensitive information, all input parameters are received through the external input interface 1001 and passed to the memory 1003 for caching, and are then input to the processor 1002 for processing; the resulting data is either cached in the memory 1003 for subsequent processing or passed to the output interface 1004 for output.

The above discloses only preferred embodiments of the present invention, which of course cannot be used to limit the scope of the rights of the present invention; equivalent changes made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims (16)

1. A method for identifying sensitive information, characterized by comprising:
receiving target information, and extracting the text information contained in the target information;
calculating a hash value of the text information, and, when the hash value of the text information differs from the hash value of preset feature sensitive information, performing word segmentation on the text information to obtain a word-segment set;
calculating hash values of the word segments in the word-segment set, and generating a similarity between the target information and the preset feature sensitive information according to the hash values of the word segments in the word-segment set;
determining, according to the similarity and/or semantic analysis of the text information, that the target information is sensitive information.
2. The method for identifying sensitive information according to claim 1, characterized in that the step of generating the similarity between the target information and the preset feature sensitive information according to the hash values of the word segments in the word-segment set comprises:
calculating the proportion of the word-segment set made up of word segments whose hash values match the hash values of word segments of the preset feature sensitive information;
generating the similarity between the target information and the preset feature sensitive information according to the proportion.
3. The method for identifying sensitive information according to claim 1, characterized in that the step of generating the similarity between the target information and the preset feature sensitive information according to the hash values of the word segments in the word-segment set comprises:
generating, using the simhash algorithm, a first simhash value of the target information according to the hash values of the word segments in the word-segment set;
calculating the difference between the first simhash value and a second simhash value of the preset feature sensitive information;
generating the similarity between the target information and the preset feature sensitive information according to the difference.
4. The method for identifying sensitive information according to claim 1, characterized in that, after the step of extracting the text information contained in the target information, the method further comprises:
when the target information contains no text information, obtaining the user ID that published the target information;
obtaining behavioral feature data of the user ID, and determining, according to the behavioral feature data, whether the target information is sensitive information.
5. The method for identifying sensitive information according to claim 1, characterized in that, after the step of calculating the hash value of the text information, the method further comprises:
when the hash value of the text information is identical to the hash value of the preset feature sensitive information, determining that the target information is sensitive information.
6. The method for identifying sensitive information according to claim 1, characterized in that the step of determining, according to the similarity and/or semantic analysis of the text information, that the target information is sensitive information further comprises:
extracting text features of the text information according to a preset machine-learning probabilistic model;
taking the text features as input, performing semantic analysis on the text information by calculating a sensitivity confidence of the target information according to the preset machine-learning probabilistic model;
determining, according to the similarity and/or the sensitivity confidence, whether the target information is sensitive information.
7. The method for identifying sensitive information according to claim 6, characterized in that, after the step of determining, according to the similarity and/or the sensitivity confidence, whether the target information is sensitive information, the method further comprises:
if the target information is determined to be sensitive information, storing the target information as feature sensitive information.
8. The method for identifying sensitive information according to claim 1, characterized in that, after the step of extracting the text information contained in the target information, the method further comprises:
filtering out symbol information and semantically redundant information from the text information.
9. A device for identifying sensitive information, characterized by comprising:
a text information extraction module, configured to receive target information and extract the text information contained in the target information;
a full-text hash identification module, configured to calculate a hash value of the text information;
a word segmentation module, configured to perform word segmentation on the text information to obtain a word-segment set when the hash value of the text information differs from the hash value of preset feature sensitive information;
a similarity calculation module, configured to calculate hash values of the word segments in the word-segment set, and to generate a similarity between the target information and the preset feature sensitive information according to the hash values of the word segments in the word-segment set;
a sensitive information determination module, configured to determine, according to the similarity and/or semantic analysis of the text information, that the target information is sensitive information.
10. The device for identifying sensitive information according to claim 9, characterized in that the similarity calculation module is further configured to calculate the proportion of the word-segment set made up of word segments whose hash values match the hash values of word segments of the preset feature sensitive information;
and to generate the similarity between the target information and the preset feature sensitive information according to the proportion.
11. The device for identifying sensitive information according to claim 9, characterized in that the similarity calculation module is further configured to generate, using the simhash algorithm, a first simhash value of the target information according to the hash values of the word segments in the word-segment set; to calculate the difference between the first simhash value and a second simhash value of the preset feature sensitive information; and to generate the similarity between the target information and the preset feature sensitive information according to the difference.
12. The device for identifying sensitive information according to claim 9, characterized in that the device further comprises a behavior identification module, configured to obtain, when the target information contains no text information, the user ID that published the target information; to obtain behavioral feature data of the user ID; and to determine, according to the behavioral feature data, whether the target information is sensitive information.
13. The device for identifying sensitive information according to claim 9, characterized in that the full-text hash identification module is further configured to calculate the hash value of the text information, to determine whether the hash value of the text information is identical to the hash value of the preset feature sensitive information, and, if so, to determine that the target information is sensitive information.
14. The device for identifying sensitive information according to claim 9, characterized in that the device further comprises a semantic identification module, configured to extract text features of the text information according to a preset machine-learning probabilistic model, and, taking the text features as input, to perform semantic analysis on the text information by calculating a sensitivity confidence of the target information according to the preset machine-learning probabilistic model;
the sensitive information determination module is further configured to determine, according to the similarity and/or the sensitivity confidence, whether the target information is sensitive information.
15. The device for identifying sensitive information according to claim 14, characterized in that the semantic identification module is further configured to store the target information as feature sensitive information when the target information is determined to be sensitive information.
16. The device for identifying sensitive information according to claim 9, characterized in that the text information extraction module is further configured to filter out symbol information and semantically redundant information from the text information.
CN201510919548.0A 2015-12-11 2015-12-11 Recognize the method and device of sensitive information CN106874253A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510919548.0A CN106874253A (en) 2015-12-11 2015-12-11 Recognize the method and device of sensitive information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510919548.0A CN106874253A (en) 2015-12-11 2015-12-11 Recognize the method and device of sensitive information

Publications (1)

Publication Number Publication Date
CN106874253A true CN106874253A (en) 2017-06-20

Family

ID=59177054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510919548.0A CN106874253A (en) 2015-12-11 2015-12-11 Recognize the method and device of sensitive information

Country Status (1)

Country Link
CN (1) CN106874253A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107943954A (en) * 2017-11-24 2018-04-20 杭州安恒信息技术有限公司 Detection method, device and the electronic equipment of webpage sensitive information
CN108287823A (en) * 2018-02-07 2018-07-17 平安科技(深圳)有限公司 Message data processing method, device, computer equipment and storage medium
CN108491458A (en) * 2018-03-02 2018-09-04 深圳市联软科技股份有限公司 A kind of sensitive document detection method, medium and equipment
CN109597987A (en) * 2018-10-25 2019-04-09 阿里巴巴集团控股有限公司 A kind of text restoring method, device and electronic equipment
CN110084065A (en) * 2019-04-29 2019-08-02 北京口袋时尚科技有限公司 Data desensitization method and device
CN111008264A (en) * 2018-10-10 2020-04-14 腾讯科技(深圳)有限公司 Audit item storage method and device, electronic equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226576A (en) * 2013-04-01 2013-07-31 杭州电子科技大学 Comment spam filtering method based on semantic similarity
EP2657852A1 (en) * 2010-12-24 2013-10-30 Peking University Founder Group Co., Ltd Method and device for filtering harmful information
CN103441924A (en) * 2013-09-03 2013-12-11 盈世信息科技(北京)有限公司 Method and device for spam filtering based on short text
CN104714938A (en) * 2013-12-12 2015-06-17 联想(北京)有限公司 Message processing method and electronic device
CN104866478A (en) * 2014-02-21 2015-08-26 腾讯科技(深圳)有限公司 Detection recognition method and device of malicious text
CN105022815A (en) * 2015-07-13 2015-11-04 腾讯科技(深圳)有限公司 Information interception method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Fu Xinghui

Inventor after: Sheng Lihua

Inventor after: Lv Lei

Inventor before: Fu Xinghui