CN106250769A

CN106250769A - Multi-stage filtering source code data detection method and device

Info

Publication number: CN106250769A
Application number: CN201610618081.0A
Authority: CN
Inventors: 邸宏宇; 王志海; 魏效征; 张静; 何晋昊; 喻波; 安鹏
Original assignee: Beijing Wondersoft Technology Co Ltd
Current assignee: Beijing Wondersoft Technology Co Ltd
Priority date: 2016-07-30
Filing date: 2016-07-30
Publication date: 2016-12-21
Anticipated expiration: 2036-07-30
Also published as: CN106250769B

Abstract

The invention discloses a multi-stage filtering source code data detection method and device. The device comprises: a file type detection filtering module, configured to determine whether an input file is of a source code file type; a lexical analysis filtering module, configured to extract the lexical tokens in the file, determine their respective weights, calculate the total weight score, and determine whether the total weight score exceeds a specified threshold; a syntactic analysis filtering module, configured to intercept text of a specified length from the file as suspicious text, extract the grammatical phrases and expressions contained in the suspicious text, and determine how important those phrases and expressions are to constituting source code; a semantic analysis filtering module, configured to extract the semantic features of the text and perform similarity analysis between them and the semantic features of specified core source code; a source-code-containing data protection module, configured to apply sensitive data protection to files containing source program data; and a no-source-code marking module, configured to mark the file as containing no source code. This scheme improves the accuracy of source code detection and strengthens the security protection of source code.

Description

Multi-stage filtering source code data detection method and device

Technical field

The present invention relates to the technical field of source code data detection, and in particular to a multi-stage filtering source code data detection method and device.

Background art

For R&D and design enterprises, data such as design documents, drawings, and source code are core intellectual assets and the basis of the enterprise's core competitiveness; effective management and control of these core data is the top priority of enterprise information security work. Because source code data exists in the form of text or text fragments, it is easily mixed into or embedded in ordinary text files, which then leads to loss, leakage, or uncontrolled diffusion that endangers enterprise information security. Most source code data loss incidents are caused by unintentional operations by internal staff; a minority result from deliberate leaks by internal staff or malicious attacks from outside the enterprise. Such data loss events can have disastrous consequences for an R&D and design enterprise. Enterprises therefore need comprehensive control over the distribution, storage, circulation, and outbound transfer of source code data, and a source code data detection method is the basis for that control.

The development of data security control technology has gone through three stages: DSM (data encryption software), DSA (data security isolation), and DLP (data leakage protection). When source code data is protected, the processing and invocation of source code is very complex, and encrypting it can easily damage the code or degrade system performance, so DSM is not suitable for source code data protection. The source code protection mainly used today is based on DSA or DLP. DSA does not encrypt source code; instead it ensures that source code circulates only within an isolated data security zone and cannot be sent out or uploaded in any form without approval. DLP moves away from the cage-style information security strategies of DSM's "encrypt everything" and DSA's "isolate everything"; by classifying and grading files, it provides network protection and endpoint protection for sensitive data. The core function of DLP is to identify the content of files or data streams and to prevent data loss based on that identification. Based on the results of source code data detection, DLP source code control can isolate sensitive code, block outbound code, and distinguish internal code from publicly available network code.

Source code data detection is essentially text data detection. Current text data detection techniques include basic detection techniques and advanced detection techniques.

Basic detection techniques include document attribute detection, keyword matching, and regular expressions. These techniques do not involve the semantics, structure, or logical meaning of a document; they are simple and efficient, but they cannot easily achieve fine-grained classification.

Advanced detection techniques include exact data matching (EDM), document fingerprinting (FingerPrint, essentially a form of index data matching, IDM), and machine learning (also called statistical learning). Exact data matching is a matching and classification method that extracts pattern features after an in-depth analysis of a file's logical and structural features. Document fingerprinting is based on semantic analysis of the document: it uses a keyword-based hashing algorithm to obtain a document fingerprint as a file index (also called an information digest) and detects whether documents or document fragments are similar by matching indexes. Machine learning methods obtain statistical features of a file's internal data or semantics through learning algorithms and then perform pattern recognition.

Because source code is written in a specific programming language and has lexical, syntactic, and semantic features, exact data matching can be used to detect source code data, but an effective matching information-processing flow must be designed. The detection method proposed in the present invention combines several of the above text data detection techniques into a multi-stage filtering information-processing flow that filters information repeatedly at different levels of information granularity, achieving accurate detection of source code data.

The prior art closest to the present invention is an existing patented technique that intercepts a network data stream and obtains a character stream by performing protocol parsing on it; obtains preset detection strings and/or syntax analysis library functions corresponding to a programming language; and determines, according to the detection strings and/or syntax analysis library functions, whether the parsed character stream contains source code, and if so, blocks the network data stream.

The above prior art has the following disadvantages:

(1) The method of the above patent only intercepts network data streams. It is not suitable for detecting whether locally stored files contain source code data, and so cannot determine how source code is distributed on local storage.

(2) In the above patent, the basis for determining whether a character stream contains source code is the "preset detection strings and/or syntax analysis library functions corresponding to a programming language". It does not analyze the file type, lexical token attributes, or semantic content of the character stream, so its precision in detecting source code data is limited.

(3) In enterprises whose software development process is not highly standardized, there are numerous development tools, varying coding styles, nonstandard identifier naming, and complex code tokens. For such source code, relying only on basic detection techniques such as file attribute detection, keyword matching, and regular expression matching suffers from imprecise localization (it is difficult to determine where in an ordinary document the source code fragments are), while methods such as document fingerprinting and statistical learning suffer from insufficient generality because sample collection is incomplete.

(4) When the above advanced text data detection techniques perform semantic analysis on a long text, they compute over the full text, which involves a large amount of computation and can make the detection result slow to produce.

Summary of the invention

Definitions of terms:

Source code: also called source program; a text file or text fragment written according to the specification of a programming language and not yet compiled, consisting of a series of human-readable computer language instructions. The process of translating source code into binary instructions that a computer can execute is compilation.

Lexical analysis: the process of converting a sequence of characters into a sequence of lexical tokens. The lexical tokens in a source program include keywords, identifiers, constants, operators, delimiters, and so on.

Keyword: a word that has a special meaning to the compiler of a given programming language. It is used only to denote a data type or to control program flow and cannot be used for other purposes (a keyword cannot be used as an identifier to name a constant, variable, method, function, program, class, package, or parameter).

Identifier: in a programming language, a name the user chooses when writing source code, used to label variables, constants, functions, methods, classes, objects, statement blocks, files, parameters, and so on. An identifier can be a combination of letters, digits, underscores, and special symbols (such as '$').

Constant: a fixed value in source code, including integer, real, Boolean, and other types.

Operator: a symbol that denotes a specific operation and connects operands to form expressions. Operators include arithmetic, relational, assignment, logical, and concatenation operators.

Delimiter: a special symbol that marks the boundaries of a source statement block. Delimiters generally occur in pairs (such as () {} [] "" ''); the separator that marks the end of a statement in source code can also be treated as a delimiter (such as the semicolon in C and C++).

Syntactic analysis: forming the lexical token sequence into grammatical phrases of various kinds according to operators and delimiters, identifying the corresponding grammatical categories, and checking syntactic consistency at the same time.

Semantic analysis: semantics is the description and logical representation of things in text. Semantic analysis identifies the semantics contained in text, using statistical analysis or machine learning methods to build a computational model that uncovers deeper-level concepts in the text.

LSH: Locality-Sensitive Hashing, a method that hashes the original text to generate an information digest (also called an index). Using the digest in place of the original text for comparison or retrieval greatly reduces the amount of comparison computation and improves retrieval efficiency. The digests that LSH generates for texts with similar content are themselves highly similar, so LSH can be used in fields such as text similarity detection and web search.

DSM: Data Security Manager (data security management). Because the earliest products all used data encryption to manage enterprise data securely, DSM in the enterprise information security field now generally refers to data encryption software.

DSA: Data Security Area (data security isolation), one of the effective means of current data leakage prevention. Using physical isolation (disks, storage devices, networks) and software isolation (dividing logical security zones), it builds data security zones so that files containing sensitive information, such as source code and drawings, can only be operated on within a security zone. It allows sensitive data to circulate among the security zones of different terminals and audits outbound sensitive data.

DLP: Data Loss Prevention, also called Data Leakage Prevention, the name for the enterprise information security and data protection systems that are mainstream in the information field today. Through data processing and analysis methods combined with the enterprise's information security management policy, DLP classifies and grades all electronic information and data in the enterprise for management and control, preventing the enterprise's information assets or critical data from being lost, leaked, or diffused without control.
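The property in the LSH definition above (similar texts yield similar digests) can be illustrated with SimHash, one well-known locality-sensitive scheme. The patent does not say which LSH variant is used, so the choice of SimHash, the 64-bit digest size, and the function names `simhash` and `hamming_distance` are all assumptions made for this sketch.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash-style digest built from the word tokens of `text`."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text):
        # Hash each token to a fixed-width integer (first 8 bytes of SHA-256).
        h = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    digest = 0
    for i, c in enumerate(counts):
        if c > 0:
            digest |= 1 << i
    return digest


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two digests differ."""
    return bin(a ^ b).count("1")
```

Two digests are then compared by Hamming distance rather than by comparing the full texts; a small distance suggests that the underlying texts share most of their tokens.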

To solve the above technical problems, the present invention proposes and implements a multi-stage filtering source code data detection method and device. Through a series of information-processing steps applied to the input text or data stream (file type detection filtering, file format conversion, lexical analysis filtering, suspicious text interception, syntactic analysis filtering, and semantic analysis filtering), the method can determine whether the input text or data stream contains source code data, thereby detecting source code files and text files that contain source code fragments.

To solve the above technical problems, the invention provides a multi-stage filtering source code data detection method comprising the following steps:

(1) File type detection filtering, including: determining whether the input file is of a source code file type; if so, judging the file to be a file containing source code data and proceeding to step (5); otherwise proceeding to step (2);

(2) Lexical analysis filtering, including: uniformly converting the file into a standard file, extracting the lexical tokens in the standard file, establishing corresponding weights for the different lexical tokens, calculating from those weights the total weight score of the lexical tokens in the file, and determining whether the total weight score exceeds a specified threshold; if so, judging the file to be a file containing source code data and proceeding to step (5); otherwise proceeding to step (3);

(3) Syntactic analysis filtering, including: intercepting text of a specified length from the file as suspicious text and extracting the grammatical phrases and expressions contained in the suspicious text; judging the importance of a grammatical phrase according to its level in the syntax tree, and judging the importance of an expression according to its computational complexity; if the importance of the grammatical phrases or expressions to constituting source code exceeds a specified threshold, judging the file to be a file containing source code data and proceeding to step (5); otherwise proceeding to step (4);

(4) Semantic analysis filtering, including: extracting the semantic features of the text and performing similarity analysis between them and the semantic features of specified core source code; if they are similar, judging the file to be a file containing source code data and proceeding to step (5); otherwise proceeding to step (6);

(5) Applying sensitive data protection to the file containing source program data, and ending detection;

(6) Marking the file as containing no source code, and ending detection.
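The control flow of steps (1) to (6) is a short-circuiting cascade: each filter either classifies the file as containing source code or hands it to the next, more expensive filter. A minimal sketch of that flow follows; the `InputFile` type, the `run_cascade` function, and the two result labels are hypothetical names introduced here, and the individual filters are passed in as plain callables rather than implemented.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class InputFile:
    name: str  # file name, used by the file type filter
    text: str  # text content after format conversion


# A stage returns True when it judges that the file contains source code.
Stage = Callable[[InputFile], bool]


def run_cascade(f: InputFile, stages: Sequence[Stage]) -> str:
    """Apply the filters in order; the first one that fires ends detection."""
    for stage in stages:
        if stage(f):
            return "protect"              # step (5): sensitive data protection
    return "mark_no_source_code"          # step (6): no stage fired
```

Because later stages run only when earlier ones do not fire, cheap checks such as suffix matching shield the costlier syntactic and semantic stages from most inputs.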

Further, the input file in step (1) is a file stored locally or sent out over the network.

Further, the lexical tokens include keywords, identifiers, operators, and delimiters of a specific programming language.

Further, intercepting suspicious text of a specified length from the file in step (3) specifically includes: determining the specified length according to the performance of the device executing the data detection method, and intercepting, as the suspicious text, text of the specified length following a lexical token whose weight exceeds a specified threshold, the intercepted suspicious text including that lexical token.

Further, extracting the grammatical phrases and expressions contained in the suspicious text in step (3) specifically includes: performing syntactic analysis on the suspicious text and combining sequentially adjacent lexical tokens into grammatical phrases and expressions that conform to the rules of the programming language.

Further, step (4) uses keyword word-frequency statistics or an LSH method to extract the semantic features of the standard file; the semantic features of the specified core source code are built into a sensitive information digest sample library; semantic features are extracted from the suspicious text to obtain its information digest; similarity analysis is performed between this digest and the samples in the sensitive information digest sample library; and if the digest of the suspicious text approximates a sample in the library, the file is judged to be a file containing source code data.

To solve the above technical problems, the invention provides a multi-stage filtering source code data detection device, comprising:

A file type detection filtering module, configured to determine whether the input file is of a source code file type;

A lexical analysis filtering module, configured to uniformly convert the file into a standard file, extract the lexical tokens in the standard file, establish corresponding weights for the different lexical tokens, calculate the total weight score of the lexical tokens in the file, and determine whether the total weight score exceeds a specified threshold;

A syntactic analysis filtering module, configured to intercept text of a specified length from the file as suspicious text, extract the grammatical phrases and expressions contained in the suspicious text, and determine how important those phrases and expressions are to constituting meaningful source code;

A semantic analysis filtering module, configured to extract the semantic features of the text and perform similarity analysis between them and the semantic features of specified core source code;

A sensitive data protection module, configured to apply sensitive data protection to files containing source program data;

A no-source-code marking module, configured to mark the file as containing no source code.

Further, the input file is a file stored locally or sent out over the network.

Further, intercepting suspicious text of the specified length from the file specifically includes: determining the specified length according to the performance of the detection device, and intercepting, as the suspicious text, text of the specified length following a lexical token whose weight exceeds a specified threshold, the intercepted suspicious text including that lexical token.

Further, extracting the grammatical phrases and expressions contained in the suspicious text specifically includes: performing syntactic analysis on the suspicious text and combining sequentially adjacent lexical tokens into grammatical phrases and expressions that conform to the rules of the programming language.

Further, the semantic analysis filtering module uses keyword word-frequency statistics or an LSH method to extract the semantic features of the standard file; the semantic features of the specified core source code are built into a sensitive information digest sample library; semantic features are extracted from the suspicious text to obtain its information digest; similarity analysis is performed between this digest and the samples in the sensitive information digest sample library; and if the digest of the suspicious text approximates a sample in the library, the file is judged to be a file containing source code data.

Technical effects of the invention:

The invention improves the accuracy of source code data detection and strengthens the enterprise's ability to control source code data security. Through the specific combination of multi-stage detection filtering modules, source code written in different programming languages and in different coding styles can be detected, which makes the source code detection method general. The suspicious text interception design effectively avoids excessive computation in the semantic analysis stage and speeds up information processing during detection.

Brief description of the drawings

Fig. 1 is a flowchart of source code detection according to the present invention.

Detailed description of the invention

The multi-stage filtering source code data detection method proposed in the present invention addresses the need for source code data detection in enterprise data security control: determining, among a large number of text documents and data streams, which are source code files or files containing source code. Starting from the programming language specification in which source code is written, and based on the lexical, syntactic, semantic, and file format features of source code text, the method adopts a multi-stage filtering data-processing structure consisting of file type detection, lexical analysis, syntactic analysis, and semantic analysis, forming a new source code data detection solution.

First-stage filtering: file type detection filtering. File type detection is performed on files stored within the enterprise or sent out over the network. Files are filtered by matching file name suffixes, and a file with a source program suffix is judged to contain source program data.

Second-stage filtering: lexical analysis filtering. At the lexical analysis level, the regularities with which keywords, operators, and delimiters occur in different programming languages make it possible to overcome the detection difficulties caused by varying coding styles and nonstandard identifier naming. First, the files or data streams that passed the first-stage filter are uniformly converted to txt format. Second, lexical analysis is performed to detect whether they contain source program keywords, special identifiers, operators, and delimiters. Next, weights are established for the different lexical tokens, and the total weight score of the lexical tokens in the txt file is calculated. Finally, a discrimination threshold is set; if the weight score of the lexical tokens appearing in a txt file exceeds the threshold, the file is judged to contain source program data.

Third-stage filtering: syntactic analysis filtering. To avoid the heavy computation of performing semantic analysis on long text, suspicious text is intercepted before syntactic analysis and only the intercepted text segments are analyzed, which significantly reduces the amount of computation. Syntactic analysis extracts specific grammatical phrases and expressions from the text, and these serve as the criterion for whether a suspicious text segment contains source program data.

Fourth-stage filtering: semantic analysis filtering. Semantic analysis methods such as word-frequency statistics and LSH (locality-sensitive hashing) are used to extract the semantic features of the suspicious text segment, and similarity analysis is performed between these features and the semantic features of the enterprise's core source code, to determine whether the suspicious text contains source code that has been partially modified.

The information-processing flow of the multi-stage filtering source code data detection method proposed by the present invention is shown in Fig. 1. It comprises eight processing modules: file type detection filtering, file format conversion, lexical analysis filtering, suspicious text interception, syntactic analysis filtering, semantic analysis filtering, the source-code-containing data protection operation, and no-source-code marking. Through this series of information-processing steps, it can be detected whether a file or data stream contains source code data. The eight modules are described one by one below:

(1) File type detection filtering

For a file input to the detection device, check whether its suffix is one of the source code data file types marked in the enterprise data security control policy (for example, source code formats such as *.c, *.cpp, *.h, *.hpp, *.py, *.vbs, *.java, *.jar). If so, the file is judged to be a source code file; otherwise it passes to the next stage of information filtering.
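A suffix check of this kind can be sketched in a few lines. The suffix set below simply copies the examples listed in the paragraph above; in practice it would come from the enterprise's security policy. The names `SOURCE_SUFFIXES` and `is_source_file_type` are hypothetical, and the case-insensitive comparison is an assumption of this sketch.

```python
from pathlib import PurePath

# Example suffixes taken from the text; a real deployment would load
# this set from the enterprise data security control policy.
SOURCE_SUFFIXES = {".c", ".cpp", ".h", ".hpp", ".py", ".vbs", ".java", ".jar"}


def is_source_file_type(filename: str, suffixes=SOURCE_SUFFIXES) -> bool:
    """Return True when the file name ends in a marked source code suffix."""
    return PurePath(filename).suffix.lower() in suffixes
```

For example, `is_source_file_type("main.CPP")` is true, while `is_source_file_type("manual.pdf")` is false, so the pdf moves on to format conversion and lexical analysis.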

(2) File format conversion

Text files (for example in doc, docx, pdf, or rtf format) or text data streams input to the detection device are uniformly converted to txt text document format so that subsequent modules can process them uniformly.

(3) Lexical analysis filtering

Lexical analysis is performed on the text document converted by module (2) to extract the keywords, identifiers, operators, and delimiters of specific programming languages. Weights are established according to the importance of the different lexical tokens (for example, identifiers from library files that are critical in a given enterprise should be given higher weights, whereas keywords such as "if" and "then", which also appear in ordinary text, should be given lower weights), and the total weight score of the lexical tokens in the txt file is calculated. A dynamic discrimination threshold can be set according to the length of the txt file (as the file length increases, the threshold increases accordingly). If the weight score of the lexical tokens appearing in a txt file exceeds the threshold, the file is judged to contain source program data; otherwise it passes to the next stage of information filtering.
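The weighted scoring and length-dependent threshold described above might look like the following sketch. The patent gives no concrete weights, token list, or threshold formula, so the table `TOKEN_WEIGHTS`, the tokenizer pattern, and the linear threshold `base + per_char * len(text)` are illustrative assumptions only.

```python
import re

# Hypothetical weights: code-specific tokens score high, while words that
# also occur in ordinary prose ("if", "then") score low.
TOKEN_WEIGHTS = {
    "#include": 3.0, "int": 1.0, "void": 1.5, "return": 1.5,
    "if": 0.2, "then": 0.2,
    "{": 0.5, "}": 0.5, ";": 0.5, "==": 1.0,
}

# Very small tokenizer: words (optionally prefixed by '#'), '==', and a few delimiters.
TOKEN_RE = re.compile(r"#?\w+|==|[{};]")


def weight_score(text: str) -> float:
    """Sum the weights of all recognized lexical tokens in the text."""
    return sum(TOKEN_WEIGHTS.get(tok, 0.0) for tok in TOKEN_RE.findall(text))


def dynamic_threshold(text: str, base: float = 5.0, per_char: float = 0.01) -> float:
    """Threshold that grows with the length of the text (assumed to be linear)."""
    return base + per_char * len(text)


def lexical_filter(text: str) -> bool:
    """Return True when the total weight score exceeds the dynamic threshold."""
    return weight_score(text) > dynamic_threshold(text)
```

With these assumed weights, a two-line C fragment such as `#include <stdio.h>` followed by `int main(void) { return 0; }` scores 8.5 and passes the threshold, while an ordinary sentence containing "then" scores far below it.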

(4) Suspicious text interception

The length of the suspicious text to intercept can be specified according to the memory size and computing speed of the device executing source code data detection. Based on the lexical analysis result, text of the specified length following the position where a high-weight lexical token occurs is intercepted as suspicious text (the intercepted text includes the high-weight lexical token).
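As a rough illustration of the interception step, the sketch below cuts a fixed-length window starting at each high-weight token and merges windows that overlap. The function name `intercept_suspicious_text` is hypothetical, merging is a design choice of this sketch rather than something the text specifies, and for brevity tokens are located by raw substring search instead of reusing the positions reported by the lexical analyzer.

```python
import re


def intercept_suspicious_text(text: str, high_weight_tokens, length: int):
    """Return windows of `length` characters that start at high-weight tokens.

    Overlapping windows are merged so that no text is analyzed twice.
    """
    if not high_weight_tokens:
        return []
    # Try longer tokens first so that a short token never shadows a longer one.
    ordered = sorted(high_weight_tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    spans = []
    for m in pattern.finditer(text):
        start = m.start()
        end = min(start + length, len(text))
        if spans and start <= spans[-1][1]:
            # Window overlaps the previous one: extend it instead of adding a new one.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return [text[s:e] for s, e in spans]
```

The `length` argument is where device capacity enters: a machine with less memory or a slower processor would be configured with a shorter window, trading recall for speed.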

(5) Syntactic analysis filtering

Syntactic analysis is performed on the suspicious text, and sequentially adjacent lexical tokens are combined into grammatical phrases and expressions that conform to the rules of the programming language. The importance of a grammatical phrase is judged according to its level in the syntax tree, and the importance of an expression is judged according to its computational complexity. If the importance of the grammatical phrases or expressions to constituting source code exceeds a specified threshold, the text is judged to contain source program data; otherwise it passes to the next stage of information filtering.
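The two importance criteria (phrase level in the syntax tree, expression complexity) can be illustrated with Python's standard `ast` module standing in for a language-specific parser. This is only a sketch under strong assumptions: the suspicious text is assumed to be Python, function and class definitions are treated as the "high-level" phrases, complexity is approximated by counting operator and call nodes, and the threshold of 2 is arbitrary. A window cut in the middle of a statement will fail to parse, which a real implementation would need to handle with an error-tolerant parser.

```python
import ast


def expression_complexity(node: ast.AST) -> int:
    """Count operator, comparison, and call nodes inside one expression."""
    kinds = (ast.BinOp, ast.BoolOp, ast.Compare, ast.UnaryOp, ast.Call)
    return sum(isinstance(n, kinds) for n in ast.walk(node))


def syntactic_filter(snippet: str, threshold: int = 2) -> bool:
    """Return True when the snippet contains an important phrase or expression."""
    try:
        tree = ast.parse(snippet)
    except SyntaxError:
        return False  # not parseable as Python: no evidence from this stage
    for node in ast.walk(tree):
        # High-level phrases: definitions sit near the top of the syntax tree.
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            return True
        # Expressions: judged by how many operations they combine.
        if isinstance(node, ast.expr) and expression_complexity(node) >= threshold:
            return True
    return False
```

Under these assumptions, an assignment such as `x = (a + b) * c - d / e` is flagged because its right-hand side combines four operations, while a single bare word parses as a trivial expression and is not flagged.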

(6) Semantic analysis filtering

Methods such as keyword word-frequency statistics or LSH are used to extract the semantic features of text, and the semantic features of the enterprise's core source code are built into a sensitive information digest sample library. Semantic features are extracted from the suspicious text to obtain its information digest, and similarity analysis is performed between that digest and the samples in the sensitive information digest sample library. If the digest of the suspicious text approximates a sample in the library, the suspicious text is considered to contain core source code data; otherwise the suspicious text should be marked as containing no source code.
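For the word-frequency variant, one simple reading is to treat the identifier frequency table of a text as its digest and to compare digests by cosine similarity. The patent does not state which similarity measure or threshold is used, so cosine similarity, the 0.8 default threshold, and the function names below are assumptions of this sketch.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Digest of a text: how often each identifier-like word occurs."""
    return Counter(re.findall(r"[A-Za-z_]\w*", text))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two frequency vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def matches_sample_library(text: str, library, threshold: float = 0.8) -> bool:
    """Return True when the text's digest is close to any sample digest."""
    digest = term_frequencies(text)
    return any(cosine_similarity(digest, sample) >= threshold for sample in library)
```

The sample library would be built once by applying `term_frequencies` to each core source file. Because the digest is keyed on identifiers, a pseudocode rewrite that keeps the original identifier names still scores well against the sample it was derived from.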

(7) Source-code-containing data protection operation

If a file or data stream input to the detection device is judged to contain source code data, sensitive data protection (such as encryption, blocking, approval, or logging) should be applied to the storage, saving, transmission, and outbound transfer of that file or data stream.

(8) No-source-code marking

If a file or data stream input to the detection device is judged not to contain source code data, the file or data stream can be marked as containing no source code. As long as the file or data stream is not edited or changed, no source code data protection operation needs to be performed on it.

Specific embodiment 1

A developer downloads a CPP source code file from the SPID server to his or her own client computer, and the data security software on the client scans the newly added file.

When the file passes through the file type detection filtering module of the source code data detection device, the module finds that the file's suffix is ".cpp" and therefore judges the file to be a source code file. Based on this result, the data security software allows the file to be saved only in the security zone of the client computer.

Specific embodiment 2

While writing a user manual, a developer pastes several segments of source code into a text document and saves it in pdf format. The data security software installed by the enterprise performs a full-disk scan and inputs this file into the device for detection.

The file type detection filtering module of the source code data detection device finds that the file is not a source code file but is in a text file format, so file format conversion and subsequent processing are performed. After lexical analysis of the converted txt file, the total weight score of the lexical tokens it contains is found to exceed the discrimination threshold, so the file is judged to contain source code data. Having obtained this detection result, the enterprise should restrict the distribution scope of the file.

Specific embodiment 3

Certain steady Personnel Who Left wants by some enterprise key source codes in addition, these source codes for this batch modification The suffix of file and keyword therein, when these files are dumped to USB flash disk, data fail-safe software is by the most defeated for these files Enter and detect in this device.

The file type detection filtering module of the source code data detection device finds that the file is not a source code file but is in a text document format, so the file undergoes document format conversion and subsequent processing. Lexical analysis of the converted TXT file finds that the weighted score sum of the lexical tokens it contains does not exceed the discrimination threshold, so the file is passed on to syntax analysis. In the syntax analysis stage, the file is found to contain multiple expressions, so it is judged to contain source code data. After obtaining this detection result, the data security software should forbid copying the file to the USB flash drive.
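To illustrate why expressions survive even when keywords have been replaced, the sketch below looks for assignment-like expressions and scores each by an operator count. The regular expressions, the operator-count measure of computational complexity, and the threshold are all assumptions made for illustration; a real syntax analysis stage would build grammatical phrases from a token stream rather than rely on pattern matching.

```python
import re

# Assumed pattern: "name = <right-hand side>;" on a single line.
ASSIGNMENT = re.compile(r"\b[A-Za-z_]\w*\s*=\s*([^;=\n]+);")
# Assumed complexity measure: arithmetic, bitwise, and comparison operators,
# plus opening parentheses (grouping or calls).
OPERATOR = re.compile(r"[-+*/%&|^<>]|\(")


def expression_complexities(text: str) -> list:
    """Return one operator count per assignment-like expression found."""
    return [len(OPERATOR.findall(rhs)) for rhs in ASSIGNMENT.findall(text)]


def contains_significant_expression(text: str, threshold: int = 1) -> bool:
    """Third-stage decision: flag text holding an expression above the threshold."""
    return any(c > threshold for c in expression_complexities(text))
```

In a line such as `zahl total = (price * qty) + tax(rate);` the type keyword has been disguised, yet the expression structure is still detected.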

Specific embodiment 4

While writing a paper, a developer converts the enterprise's core source code into a pseudocode description and sends the paper out by e-mail. The enterprise mail gateway inputs this data stream into the device for detection.

The file type detection filtering module of the source code data detection device finds that the data stream is not a source code file but is in a text format, so it undergoes document format conversion and subsequent processing. Lexical analysis of the converted TXT file finds that the weighted score sum of the lexical tokens it contains does not exceed the discrimination threshold, so the file is passed on to syntax analysis. In the syntax analysis stage, no sensitive grammatical phrases or expressions are found, so the file proceeds to semantic analysis. In the semantic analysis stage, because the identifiers were not renamed in the pseudocode description, a semantic analysis based on word frequency statistics over identifiers finds that this text is semantically similar to the enterprise's core source code, so the file is judged to contain source code data. After obtaining this detection result, the enterprise mail gateway should block the mail from being sent out directly.
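The identifier-based word frequency comparison can be sketched as follows. Cosine similarity, the threshold of 0.5, and the treatment of every word-like token as a candidate identifier are assumptions made for illustration; the invention names word frequency statistics but does not fix a particular similarity measure.

```python
import math
import re
from collections import Counter

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def identifier_frequencies(text: str) -> Counter:
    """Count word-like tokens, case-insensitively, as candidate identifiers."""
    return Counter(tok.lower() for tok in IDENTIFIER.findall(text))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two frequency vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def is_semantically_similar(text: str, core_samples: list, threshold: float = 0.5) -> bool:
    """Fourth-stage decision: compare the text against each core source sample."""
    freq = identifier_frequencies(text)
    return any(
        cosine_similarity(freq, identifier_frequencies(sample)) > threshold
        for sample in core_samples
    )
```

A pseudocode paragraph that keeps identifiers such as `riskWeight` and `decayFactor` from the core code shares most of its high-frequency terms with that code, so its similarity is high even though no keywords or operators remain.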

Currently known source code detection schemes use only one of file type detection, lexical analysis, syntax analysis, and semantic analysis, or a combination of two of them. None of them provides a multistage filtering information processing scheme that applies all of these techniques together, as the present invention does.
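The cascading character of such a multistage filter can be sketched as an ordered list of stage predicates in which the first stage that flags the input ends the detection. The `detect` function and the toy predicates in the usage example are hypothetical stand-ins, not the invention's actual modules.

```python
from typing import Callable, Iterable, Optional, Tuple

# A stage is a (name, predicate) pair; the predicate returns True when the
# stage judges that the text contains source code data.
Stage = Tuple[str, Callable[[str], bool]]


def detect(text: str, stages: Iterable[Stage]) -> Optional[str]:
    """Run the stages in order and return the name of the first one that
    flags the text, or None if every stage passes it (no source code)."""
    for name, flags_source in stages:
        if flags_source(text):
            return name
    return None


# Toy stand-in predicates, for illustration only.
example_stages = [
    ("lexical", lambda t: "#include" in t),
    ("syntax", lambda t: "=" in t and ";" in t),
    ("semantic", lambda t: "riskWeight" in t),
]
```

Cheap stages run first, so most inputs are decided before the more expensive later stages are reached; a result of `None` corresponds to applying the no-source-code mark.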

The multistage filtering source code detection solution proposed by the present invention improves the accuracy of source code data detection and strengthens an enterprise's ability to manage and control the security of its source code data. Through the specific combination of multistage detection filtering modules, source code written in different programming languages and different coding styles can be detected, which makes the detection method general-purpose. The suspicious-text interception design effectively avoids the drawback of excessive computation in the semantic analysis stage and speeds up the detection process.
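The suspicious-text interception can be sketched as taking a fixed-length window that starts at each high-weight lexical token, so that later stages process only those windows rather than the whole file. Measuring the length in characters and locating tokens by plain substring search are simplifying assumptions; the function name and parameters are hypothetical.

```python
def intercept_suspicious_text(text, token_weights, weight_threshold, length):
    """Return windows of `length` characters, each starting at (and therefore
    including) a token whose weight exceeds `weight_threshold`."""
    snippets = []
    for token, weight in token_weights.items():
        if weight <= weight_threshold:
            continue  # low-weight tokens do not trigger interception
        start = text.find(token)
        while start != -1:
            snippets.append(text[start:start + length])
            start = text.find(token, start + 1)
    return snippets
```

The window length is the tuning knob: a device with less processing capacity can use a shorter window, trading some detection coverage for speed.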

The foregoing is only a preferred embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A multistage filtering source code data detection method, comprising the following steps:

(1) file type detection filtering, including: judging whether the input file is of a source code file type; if so, judging the file to be a file containing source code data and proceeding to step (5); otherwise proceeding to step (2);

(2) lexical analysis filtering, including: uniformly converting the file into a standardized document, extracting the lexical tokens in the standardized document, establishing corresponding weights for different lexical tokens, calculating the weighted score sum of the lexical tokens in the file according to the weights, and judging whether the weighted score sum exceeds a designated threshold; if so, judging the file to be a file containing source code data and proceeding to step (5); otherwise proceeding to step (3);

(3) syntax analysis filtering, including: intercepting text of a designated length from the file as suspicious text, extracting the grammatical phrases and expressions contained in the suspicious text, and judging whether the importance of the grammatical phrases or expressions to the composition of source code exceeds a designated threshold; if so, judging the file to be a file containing source code data and proceeding to step (5); otherwise proceeding to step (4);

(4) semantic analysis filtering, including: extracting the semantic features of the text and performing similarity analysis between them and the semantic features of designated core source code; if they are similar, judging the file to be a file containing source code data and proceeding to step (5); otherwise proceeding to step (6);

(6) applying a no-source-code mark to the file and ending the detection.

2. The method according to claim 1, wherein the input file in step (1) is a file stored locally or a file sent out over the network.

3. The method according to claim 1, wherein the lexical tokens include the keywords, identifiers, operators, and delimiters of a specific programming language.

4. The method according to claim 1, wherein intercepting suspicious text of a designated length from the file in step (3) specifically includes: determining the designated length according to the performance of the device executing the data detection method, and intercepting, as the suspicious text, text of the designated length following a lexical token whose weight exceeds a designated threshold, the intercepted suspicious text including that lexical token.

5. The method according to claim 4, wherein extracting the grammatical phrases and expressions contained in the suspicious text in step (3) specifically includes: performing syntax analysis on the suspicious text and combining adjacent lexical tokens, in order, into grammatical phrases and expressions that conform to the rules of the programming language.

6. The method according to claim 4 or 5, wherein judging the importance of the grammatical phrases or expressions to the composition of source code specifically includes: judging the importance of a grammatical phrase according to its level in the syntax tree, and judging the importance of an expression according to its computational complexity.

7. The method according to claim 1, wherein in step (4) the semantic features of the standardized document are extracted using keyword word frequency statistics or locality-sensitive hashing (LSH); the semantic features of the designated core source code are built into a sensitive information digest sample library; semantic feature extraction is performed on the suspicious text to obtain its information digest; similarity analysis is performed between this information digest and the samples in the sensitive information digest sample library; and if the information digest of the suspicious text approximates a sample in the information digest sample library, the file is judged to be a file containing source code data.

8. A multistage filtering source code data detection device, comprising:

a file type detection filtering module, configured to judge whether the input file is of a source code file type and thereby judge whether the file contains source code data;

a lexical analysis filtering module, which uniformly converts the file into a standardized document, extracts the lexical tokens in the standardized document, establishes corresponding weights for different lexical tokens, calculates the weighted score sum of the lexical tokens in the file according to the weights, and judges whether the weighted score sum exceeds a designated threshold, thereby judging whether the file contains source code data;

a syntax analysis filtering module, which intercepts text of a designated length from the file as suspicious text, extracts the grammatical phrases and expressions contained in the suspicious text, and judges whether the importance of the grammatical phrases or expressions to the composition of source code exceeds a designated threshold, thereby judging whether the file contains source code data;

a semantic analysis filtering module, which extracts the semantic features of the text and performs similarity analysis between them and the semantic features of designated core source code, thereby judging whether the file contains source code data;

a no-source-code marking module, which applies a no-source-code mark to the file.

9. The device according to claim 8, wherein the input file is a file stored locally or a file sent out over the network.

10. The device according to claim 8, wherein the lexical tokens include the keywords, identifiers, operators, and delimiters of a specific programming language.

11. The device according to claim 8, wherein intercepting suspicious text of a designated length from the file specifically includes: determining the designated length according to the performance of the detection device, and intercepting, as the suspicious text, text of the designated length following a lexical token whose weight exceeds a designated threshold, the intercepted suspicious text including that lexical token.

12. The device according to claim 10, wherein extracting the grammatical phrases and expressions contained in the suspicious text specifically includes: performing syntax analysis on the suspicious text and combining adjacent lexical tokens, in order, into grammatical phrases and expressions that conform to the rules of the programming language.

13. The device according to claim 8, wherein the semantic analysis filtering module extracts the semantic features of the standardized document using keyword word frequency statistics or locality-sensitive hashing (LSH); the semantic features of the designated core source code are built into a sensitive information digest sample library; semantic feature extraction is performed on the suspicious text to obtain its information digest; similarity analysis is performed between this information digest and the samples in the sensitive information digest sample library; and if the information digest of the suspicious text approximates a sample in the information digest sample library, the file is judged to be a file containing source code data.