CN106250769A - Multi-stage filtering source code data detection method and device - Google Patents
- Publication number
- CN106250769A (application number CN201610618081.0A)
- Authority
- CN
- China
- Prior art keywords
- source code
- file
- text
- data
- document
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
Abstract
The invention discloses a multi-stage filtering source code data detection method and device. The device includes: a file type detection filtering module, which judges whether the input file is of a source code file type; a lexical analysis filtering module, which extracts the lexical tokens in the file, determines their respective weights, calculates the total weight score, and judges whether the total exceeds a specified threshold; a syntax analysis filtering module, which intercepts text of a specified length from the file as suspicious text, extracts the grammatical phrases and expressions contained in the suspicious text, and judges how significant those phrases and expressions are to constituting source code; a semantic analysis filtering module, which extracts the semantic features of the text and performs similarity analysis against the semantic features of designated core source code; a source-code-containing data protection module, which applies sensitive data protection to files that contain source program data; and a no-source-code marking module, which marks the file as containing no source code. This scheme improves the accuracy of source code detection and strengthens the security protection of source code.
Description
Technical field
The present invention relates to the technical field of source code data detection, and in particular to a multi-stage filtering source code data detection method and device.
Background
For an R&D and design enterprise, data such as design documents, drawings, and source code are the enterprise's core intellectual assets and the basis of its core competitiveness. Effective management and control of these core data is the top priority of enterprise information security work. Because source code data exists as text or text fragments, it is easily mixed or embedded into ordinary text files, which then leads to loss, leakage, or uncontrolled diffusion that harms enterprise information security. Most source code data loss is caused by unintentional operations by internal staff; a minority comes from deliberate leaks by internal staff and malicious attacks from outside the enterprise. Such data loss events may have disastrous consequences for an R&D and design enterprise. An enterprise therefore needs comprehensive management and control over the distribution, storage, circulation, and outgoing transmission of source code data, and a source code data detection method is the basis for that control.
Data security management and control technology has developed through three stages: DSM (data encryption software), DSA (data security isolation), and DLP (data leakage protection). The way source code is processed and invoked is very complex, and encrypting it easily damages the code or degrades system performance, so DSM is not well suited to protecting source code data. The source code protection mainly used at present is based on DSA or DLP. DSA does not encrypt the source code; it ensures that source code circulates only within an isolated data security area and cannot be sent out or uploaded in any form without approval. DLP moves away from the cage-like information security strategies of DSM ("encrypt everything") and DSA ("isolate everything"): by classifying and grading files, it provides network protection and terminal protection for sensitive data. The core function of DLP is to identify the content of files or data streams and to prevent data loss based on that identification. Based on the results of source code data detection, DLP source code control can isolate sensitive code, block outgoing code, and distinguish internal code from publicly available network code.
Source code data detection is essentially text data detection. Current text data detection techniques fall into basic detection techniques and advanced detection techniques.
Basic detection techniques include document attribute detection, keyword matching, and regular expressions. These techniques do not consider the semantics, structure, or logical meaning of a document. They are simple and efficient, but they cannot readily achieve fine-grained classification.
Advanced detection techniques include exact data matching (Exact Data Matching, EDM), document fingerprinting (FingerPrint, in essence an Index Data Matching, or IDM, method), and machine learning (also called statistical learning). Exact data matching is a matching and classification method that extracts pattern features after an in-depth analysis of the logical and structural features of a file. Document fingerprinting is based on semantic analysis of the document: it uses a keyword-based hashing algorithm to obtain a document fingerprint that serves as a file index (also called an information digest), and detects whether documents or document fragments are similar by matching indexes. Machine learning methods use a learning algorithm to obtain statistical features of the data or semantics inside a file and then perform pattern recognition.
Because source code is written in a specific programming language, it has lexical, syntactic, and semantic features, so exact data matching can be used to detect source code data. However, an effective exact-matching information processing flow must be designed. The detection method proposed in the present invention combines several of the above text data detection techniques into a multi-stage filtering information processing flow that filters information repeatedly at different levels of information granularity, achieving accurate detection of source code data.
In the prior art, the technology closest to the present invention is an existing patented technique. It intercepts a network data stream and performs protocol parsing on it to obtain a character stream; obtains preset detection strings and/or syntax analysis library functions corresponding to a programming language; and judges, according to the detection strings and/or syntax analysis library functions, whether the parsed character stream contains source code, blocking the network data stream if it does.
The above prior art has the following disadvantages:
(1) The method of the above patent intercepts only network data streams. It is not suited to detecting whether locally stored files contain source code data, so the distribution of source code on local machines cannot be determined.
(2) In the above patent, the basis for judging whether a character stream contains source code is the "preset detection strings and/or syntax analysis library functions corresponding to a programming language". It does not analyze the file type, lexical token attributes, or semantic content of the character stream, so its source code detection precision is limited.
(3) In enterprises where the software development process is not highly standardized, development tools are numerous, coding styles vary, identifier naming is irregular, and code tokens are complex. To detect such source code, relying only on basic detection techniques such as file attribute detection, keyword matching, and regular expression matching leads to imprecise localization (it is hard to determine where the source code fragments sit inside an ordinary document), while methods such as document fingerprinting and statistical learning lack generality because the collected samples are incomplete.
(4) When the advanced text detection techniques above perform semantic analysis on a long text, they compute over the full text, which is computationally expensive and makes detection results slow to produce.
Summary of the invention
Explanation of terms:
Source code: also called a source program. It is a text file or text fragment written according to the specification of a programming language and not yet compiled; it is a series of human-readable computer language instructions. The process of translating source code into binary instructions that a computer can execute is called compilation.
Lexical analysis: the process of converting a character sequence into a sequence of lexical tokens. The lexical tokens in a source program include keywords, identifiers, constants, operators, and delimiters.
Keyword: a word that has a special meaning to the compiler of a programming language. Keywords are used only to denote data types or to control program flow and cannot be used for other purposes (a keyword cannot serve as an identifier naming a constant, variable, method, function, program, class, package, or parameter).
Identifier: in a programming language, an identifier is a name that the user chooses when writing source code to label variables, constants, functions, methods, classes, objects, statement blocks, files, parameters, and so on. An identifier may combine letters, digits, underscores, and special symbols (such as '$').
Constant: a fixed value in source code, including integer, real, and Boolean values.
Operator: a symbol that denotes a specific operation and connects operands to form expressions. Operators include arithmetic, relational, assignment, logical, and concatenation operators.
Delimiter: a special symbol that marks the boundaries of a block of source statements. Delimiters generally occur in pairs (such as ( ), { }, [ ], " ", and ' '). A separator that marks the end of a statement (such as the semicolon in C and C++) can also be treated as a delimiter.
Syntax analysis: combining the lexical token sequence into grammatical phrases of various kinds according to the operators and delimiters, identifying the grammatical category of each, and checking syntactic consistency at the same time.
Semantic analysis: semantics is the description and logical representation of things in a text. Semantic analysis identifies the semantics a text contains, using statistical analysis or machine learning to build a computational model that mines the deeper concepts in the text.
LSH: Locality-Sensitive Hashing, a method that hashes an original text to generate an information digest (also called an index). Using the digest in place of the original text for comparison or retrieval greatly reduces the amount of computation and improves retrieval efficiency. Texts with similar content produce highly similar digests under LSH, so LSH can be used in fields such as text similarity detection and web search.
DSM: Data Security Manager (data security management). Because the earliest products all used data encryption to manage enterprise data securely, DSM in the enterprise information security field currently refers generally to data encryption software.
DSA: Data Security Area (data security isolation), one of the effective current means of preventing data leakage. Physical isolation (disks, storage devices, networks) and software isolation (partitioned logical security zones) are used to build a data security area, so that files containing sensitive information, such as source code and drawings, can be operated on only inside the security area. Sensitive data may circulate among the security areas of different terminals, and outgoing sensitive data is subject to review.
DLP: Data Loss Prevention, also called Data Leakage Prevention, the name for enterprise information security and data protection systems that is mainstream in the information field today. Using data processing and analysis methods together with the enterprise's information security management policy, DLP classifies and grades all electronic information and data in the enterprise for control, preventing the enterprise's information assets or critical data from being lost, leaked, or spread without control.
To solve the above technical problems, the present invention proposes and implements a multi-stage filtering source code data detection method and device. The method passes the input text file or data stream through a series of information processing steps: file type detection filtering, file format conversion, lexical analysis filtering, suspicious text segment interception, syntax analysis filtering, and semantic analysis filtering. It can thereby determine whether the input text or data stream contains source code data, providing detection for source code files and for text files that contain source code fragments.
To solve the above technical problems, the present invention provides a multi-stage filtering source code data detection method comprising the following steps:
(1) File type detection filtering: judge whether the input file is of a source code file type. If it is, judge the file to contain source code data and go to step (5); otherwise go to step (2).
(2) Lexical analysis filtering: convert the file into a uniform standard file, extract the lexical tokens in the standard file, assign a weight to each kind of lexical token, and calculate from those weights the total weight score of the lexical tokens in the file. Judge whether the total weight score exceeds a specified threshold. If it does, judge the file to contain source code data and go to step (5); otherwise go to step (3).
(3) Syntax analysis filtering: intercept text of a specified length from the file as suspicious text and extract the grammatical phrases and expressions contained in it. Judge the significance of a grammatical phrase from its level in the syntax tree, and the significance of an expression from its operational complexity. If the significance of the grammatical phrases or expressions to constituting source code exceeds a specified threshold, judge the file to contain source code data and go to step (5); otherwise go to step (4).
(4) Semantic analysis filtering: extract the semantic features of the text and perform similarity analysis between them and the semantic features of designated core source code. If they are similar, judge the file to contain source code data and go to step (5); otherwise go to step (6).
(5) Apply sensitive data protection to the file containing source program data, and end detection.
(6) Mark the file as containing no source code, and end detection.
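The control flow of steps (1) to (6) is a cascade in which the first positive verdict ends detection. The following is a minimal sketch of that flow only, not the patented implementation; the function names, the suffix list, and the toy token check are assumptions made for illustration.

```python
from typing import Callable, List

# A stage takes (filename, text) and returns True when it judges that the
# input contains source code; False hands the input to the next stage.
Stage = Callable[[str, str], bool]

def detect(filename: str, text: str, stages: List[Stage]) -> str:
    """Run the stages in order; the first positive verdict ends detection."""
    for stage in stages:
        if stage(filename, text):
            return "protect"        # step (5): sensitive data protection
    return "no-source-mark"         # step (6): no stage flagged the input

# Hypothetical stand-ins for steps (1) and (2); steps (3) and (4) would
# plug in as further callables with the same signature.
def by_suffix(filename: str, text: str) -> bool:
    return filename.lower().endswith((".c", ".cpp", ".py", ".java"))

def by_tokens(filename: str, text: str) -> bool:
    return sum(text.count(tok) for tok in ("{", "}", ";", "==")) > 5

print(detect("notes.txt", "hello world", [by_suffix, by_tokens]))
```

Ordering the stages from cheapest to most expensive means most inputs are decided before the costly semantic comparison is reached.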
Further, the input file in step (1) is a file stored locally or sent out over the network.
Further, the lexical tokens include the keywords, identifiers, operators, and delimiters of a specific programming language.
Further, intercepting suspicious text of a specified length from the file in step (3) specifically includes: determining the specified length according to the performance of the equipment that executes the data detection method, and intercepting, as the suspicious text, text of the specified length following a lexical token whose weight exceeds a specified threshold; the intercepted suspicious text includes that lexical token.
Further, extracting the grammatical phrases and expressions contained in the suspicious text in step (3) specifically includes: performing syntax analysis on the suspicious text and combining adjacent lexical tokens into grammatical phrases and expressions that conform to the rules of the programming language.
Further, step (4) uses keyword word-frequency statistics or an LSH method to extract the semantic features of the standard file. The semantic features of the designated core source code are built into a sensitive-information digest sample library. Semantic features are extracted from the suspicious text to obtain its information digest, and this digest is compared for similarity with the samples in the sensitive-information digest sample library. If the digest of the suspicious text approximates a sample in the library, the file is judged to contain source code data.
To solve the above technical problems, the present invention provides a multi-stage filtering source code data detection device, which includes:
A file type detection filtering module, which judges whether the input file is of a source code file type;
A lexical analysis filtering module, which converts the file into a uniform standard file, extracts the lexical tokens in the standard file, assigns a weight to each kind of lexical token, calculates the total weight score of the lexical tokens in the file, and judges whether the total exceeds a specified threshold;
A syntax analysis filtering module, which intercepts text of a specified length from the file as suspicious text, extracts the grammatical phrases and expressions contained in the suspicious text, and judges how significant those phrases and expressions are to constituting meaningful source code;
A semantic analysis filtering module, which extracts the semantic features of the text and performs similarity analysis against the semantic features of designated core source code;
A sensitive data protection module, which applies sensitive data protection to files that contain source program data;
A no-source-code marking module, which marks the file as containing no source code.
Further, the input file is a file stored locally or sent out over the network.
Further, the lexical tokens include the keywords, identifiers, operators, and delimiters of a specific programming language.
Further, intercepting suspicious text of a specified length from the file specifically includes: determining the specified length according to the performance of the detection device, and intercepting, as the suspicious text, text of the specified length following a lexical token whose weight exceeds a specified threshold; the intercepted suspicious text includes that lexical token.
Further, extracting the grammatical phrases and expressions contained in the suspicious text specifically includes: performing syntax analysis on the suspicious text and combining adjacent lexical tokens into grammatical phrases and expressions that conform to the rules of the programming language.
Further, the semantic analysis filtering module uses keyword word-frequency statistics or an LSH method to extract the semantic features of the standard file. The semantic features of the designated core source code are built into a sensitive-information digest sample library. Semantic features are extracted from the suspicious text to obtain its information digest, and this digest is compared for similarity with the samples in the sensitive-information digest sample library. If the digest of the suspicious text approximates a sample in the library, the file is judged to contain source code data.
Technical effects of the present invention:
The invention improves the accuracy of source code data detection and strengthens the enterprise's ability to manage and control the security of source code data. The particular combination of multi-stage detection filtering modules can detect source code written in different programming languages and in different coding styles, which gives the detection method generality. The suspicious-text interception design avoids the excessive computation of the semantic analysis stage and speeds up the information processing of the detection procedure.
Brief description of the drawings
Fig. 1 is the flowchart of source code detection according to the present invention.
Detailed description
The multi-stage filtering source code data detection method proposed in the present invention addresses the need for source code data detection in enterprise data security management and control: deciding, among a large number of text documents and data streams, which are source code files or files that contain source code. The method starts from the programming language specification in which source code is written. Based on the lexical, syntactic, semantic, and file-format features of source code text, it adopts a multi-stage filtering data processing structure of file type detection, lexical analysis, syntax analysis, and semantic analysis, forming a new source code data detection solution.
First-stage filtering: file type detection filtering. File type detection is performed on files stored inside the enterprise or sent out over the network. Files are filtered by matching file suffixes, and a file with a source program suffix is judged to contain source program data.
Second-stage filtering: lexical analysis filtering. At the lexical level, the regular patterns in which keywords, operators, and delimiters occur in different programming languages can overcome the detection difficulties caused by differing coding styles and nonstandard identifier naming. First, the files or data streams that passed the first-stage filter are uniformly converted to txt format. Second, lexical analysis is performed to detect whether they contain source program keywords, special identifiers, operators, and delimiters. Then a weight is assigned to each kind of lexical token, and the total weight score of the lexical tokens in the txt file is calculated. Finally, a discrimination threshold is set: if the weight score of the lexical tokens appearing in a txt file exceeds the threshold, the file is judged to contain source program data.
Third-stage filtering: syntax analysis filtering. To avoid the heavy computation that semantic analysis of a long text would bring, suspicious text is intercepted before syntax analysis and only the intercepted segment is analyzed, which greatly reduces the amount of computation. Syntax analysis extracts specific grammatical phrases and expressions from the text, which serve as the criterion for whether the suspicious text segment contains source program data.
Fourth-stage filtering: semantic analysis filtering. Semantic analysis methods such as word-frequency statistics and LSH (locality-sensitive hashing) are used to extract the semantic features of the suspicious text segment, which are compared for similarity with the semantic features of the enterprise's core source code to determine whether the suspicious text contains locally modified source code.
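The word-frequency variant of the fourth stage can be illustrated with a small sketch. The text does not specify how word frequencies are compared, so cosine similarity is used here as an assumption, and the sample strings are invented for the example.

```python
import math
import re
from collections import Counter

def term_frequencies(text: str) -> Counter:
    """Count identifier-like words; the counts stand in for the semantic feature."""
    return Counter(re.findall(r"[A-Za-z_]\w*", text))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented samples: a core function, a locally renamed copy, and ordinary prose.
core = term_frequencies(
    "int checksum(int *buf, int n) { int s = 0; for (i = 0; i < n; i++) s += buf[i]; return s; }")
edited = term_frequencies(
    "int chk(int *buf, int n) { int total = 0; for (i = 0; i < n; i++) total += buf[i]; return total; }")
prose = term_frequencies("The quarterly report will be sent to all staff next week.")

print(cosine_similarity(core, edited) > cosine_similarity(core, prose))
```

Renaming a function and a variable changes only a few counts, so the edited copy stays close to the core sample while unrelated prose shares no terms with it.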
The information processing flow of the multi-stage filtering source code data detection method proposed by the present invention is shown in Fig. 1. It comprises 8 processing modules: file type detection filtering, file format conversion, lexical analysis filtering, suspicious text interception, syntax analysis filtering, semantic analysis filtering, source-code-containing data protection, and no-source-code marking. This series of processing steps detects whether a file or data stream contains source code data. The 8 information processing modules are introduced one by one below:
(1) File type detection filtering
For a file input to the detection device, check whether its suffix is one of the source code data file types marked in the enterprise data security management policy (for example *.c, *.cpp, *.h, *.hpp, *.py, *.vbs, *.java, *.jar). If so, judge the file to be a source code file; otherwise pass it to the next stage of information filtering.
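A suffix check of this kind is short to write. The sketch below uses the example suffixes listed above; in practice the list would come from the enterprise's own policy, and the function name is an assumption.

```python
from pathlib import PurePath

# Example suffixes from the text; a real policy list is enterprise-specific.
SOURCE_SUFFIXES = {".c", ".cpp", ".h", ".hpp", ".py", ".vbs", ".java", ".jar"}

def is_source_file(filename: str) -> bool:
    """Return True when the file suffix is on the source code type list."""
    return PurePath(filename).suffix.lower() in SOURCE_SUFFIXES

print(is_source_file("engine/Main.CPP"))
print(is_source_file("manual.pdf"))
```

Lower-casing the suffix keeps names such as Main.CPP from slipping through. A renamed file still passes this stage unnoticed, which is why the later content-based stages exist.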
(2) File format conversion
Text files (in formats such as doc, docx, pdf, and rtf) or text data streams input to the detection device are uniformly converted to txt text document format so that subsequent modules can process them uniformly.
(3) Lexical analysis filtering
Lexical analysis is performed on the text document converted by module (2), extracting the keywords, identifiers, operators, and delimiters of specific programming languages. Weights are assigned according to the significance of different lexical tokens (for example, identifiers from an enterprise's key library files should receive higher weights, while keywords such as "if" and "then", which also appear in ordinary text, should receive lower weights), and the total weight score of the lexical tokens in the txt file is calculated. A dynamic discrimination threshold can be set according to the length of the txt file (as the file grows longer, the threshold increases accordingly). If the weight score of the lexical tokens appearing in a txt file exceeds the threshold, the file is judged to contain source program data; otherwise it passes to the next stage of information filtering.
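The weighted scoring and length-dependent threshold described above can be sketched as follows. All weights, the identifier acme_crypto_init, the token pattern, and the linear threshold formula are assumptions chosen for illustration; the text gives no concrete values.

```python
import re

# Hypothetical weights: in-house library identifiers score high, words that
# also occur in ordinary prose score low.
WEIGHTS = {
    "acme_crypto_init": 5.0,   # made-up identifier from a key in-house library
    "#include": 3.0, "return": 1.0, "void": 1.0,
    "if": 0.2, "then": 0.2,
    "==": 1.5, "{": 1.0, "}": 1.0, ";": 0.5,
}

# Words (optionally with a leading '#'), two-character operators, single symbols.
TOKEN_RE = re.compile(r"#?[A-Za-z_]\w*|==|!=|<=|>=|&&|\|\||[{}();=+\-*/<>\[\]]")

def lexical_score(text: str) -> float:
    """Sum the weights of all recognised tokens; unknown tokens count as 0."""
    return sum(WEIGHTS.get(tok, 0.0) for tok in TOKEN_RE.findall(text))

def dynamic_threshold(text: str, base: float = 5.0, per_100_chars: float = 1.0) -> float:
    """The threshold grows with text length, as the description suggests."""
    return base + per_100_chars * len(text) / 100

def lexical_filter(text: str) -> bool:
    return lexical_score(text) > dynamic_threshold(text)

sample = "#include <a.h>\nvoid f() { if (x == 1) { acme_crypto_init(); } return; }"
print(lexical_filter(sample))
```

Tying the threshold to length keeps a long prose document, which accumulates many low-weight hits such as "if", from being flagged merely because of its size.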
(4) Suspicious text interception
The length of the suspicious text to intercept can be specified according to the memory size and computing speed of the equipment that runs the source code data detection device. Based on the lexical analysis result, text of the specified length is intercepted after the position where a high-weight lexical token appears and used as the suspicious text (the intercepted text includes that high-weight lexical token).
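One way to read this step is as a fixed-length slice starting at each high-weight token. The sketch below assumes lexical analysis has already produced (offset, weight) pairs; skipping hits that fall inside an already intercepted segment is an added assumption, not something the text specifies.

```python
from typing import List, Tuple

def intercept_suspicious(text: str, hits: List[Tuple[int, float]],
                         weight_threshold: float, length: int) -> List[str]:
    """Cut `length` characters starting at each high-weight token.

    `hits` holds (offset, weight) for the tokens found by lexical analysis.
    Each slice starts at the token itself, so the token is included.
    """
    segments = []
    covered_until = -1
    for offset, weight in sorted(hits):
        if weight > weight_threshold and offset >= covered_until:
            segments.append(text[offset:offset + length])
            covered_until = offset + length   # skip hits inside this segment
    return segments

doc = "intro prose. acme_init(); x = 1; more prose follows here"
hits = [(doc.index("acme_init"), 5.0), (doc.index("x = 1"), 0.5)]
print(intercept_suspicious(doc, hits, weight_threshold=2.0, length=20))
```

Because later stages only see these slices, their cost depends on the chosen length rather than on the size of the whole document.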
(5) Syntax analysis filtering
Syntax analysis is performed on the suspicious text, combining adjacent lexical tokens into grammatical phrases and expressions that conform to the rules of the programming language. The significance of a grammatical phrase is judged from its level in the syntax tree, and the significance of an expression from its operational complexity. If the significance of the grammatical phrases or expressions to constituting source code exceeds a specified threshold, the text is judged to contain source program data; otherwise it passes to the next stage of information filtering.
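A full parser is language-specific, so the sketch below substitutes a rough heuristic: it finds assignment-like statements with a regular expression, counts operators as a stand-in for operational complexity, and uses parenthesis nesting depth as a stand-in for syntax-tree level. The patterns, the scoring formula, and the default threshold are all assumptions, not the method's actual grammar rules.

```python
import re

# An identifier, '=' (but not '=='), then the expression up to ';' or end of line.
ASSIGN_RE = re.compile(r"[A-Za-z_]\w*\s*=(?!=)\s*([^;\n]+)")
OPERATOR_RE = re.compile(r"==|!=|<=|>=|&&|\|\||[+\-*/%<>]")

def paren_depth(expr: str) -> int:
    """Maximum nesting depth of parentheses, a rough proxy for tree level."""
    depth = best = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            best = max(best, depth)
        elif ch == ")":
            depth = max(depth - 1, 0)
    return best

def expression_complexity(expr: str) -> int:
    """Operator count plus nesting depth."""
    return len(OPERATOR_RE.findall(expr)) + paren_depth(expr)

def syntax_filter(suspicious: str, threshold: int = 3) -> bool:
    """Flag the text when any assigned expression exceeds the threshold."""
    return any(expression_complexity(m.group(1)) > threshold
               for m in ASSIGN_RE.finditer(suspicious))

print(syntax_filter("y = (a + b) * (c - d) / 2;"))
print(syntax_filter("The result = good"))
```

This stage does not depend on keywords, so it can still react to code whose keywords were altered, as in the scenario of embodiment 3.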
(6) Semantic analysis filtering
Methods such as keyword word-frequency statistics or LSH are used to extract the semantic features of text, and the semantic features of the enterprise's core source code are built into a sensitive-information digest sample library. Semantic features are extracted from the suspicious text to obtain its information digest, which is compared for similarity with the samples in the sample library. If the digest of the suspicious text approximates a sample in the library, the suspicious text is considered to contain core source code data; otherwise the suspicious text should be marked as containing no source code.
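The digest comparison can be sketched with SimHash, one well-known locality-sensitive hashing scheme. The text names LSH only in general, so the choice of SimHash, the 64-bit digest size, the tokenization, and the Hamming-distance cutoff are all assumptions.

```python
import hashlib
import re
from typing import Iterable

BITS = 64

def simhash(text: str) -> int:
    """64-bit SimHash over word tokens; similar texts tend to give nearby digests."""
    counts = [0] * BITS
    for token in re.findall(r"\w+", text):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A bit of the result is set when more tokens voted 1 than 0 for it.
    return sum(1 << bit for bit in range(BITS) if counts[bit] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two digests."""
    return bin(a ^ b).count("1")

def matches_library(text: str, library: Iterable[int], max_distance: int = 10) -> bool:
    """True when the digest of `text` is close to any sample digest."""
    digest = simhash(text)
    return any(hamming(digest, sample) <= max_distance for sample in library)

core_source = "int checksum(int *buf, int n) { int s = 0; return s; }"
library = [simhash(core_source)]   # the sensitive-information digest sample library
print(matches_library(core_source, library))
```

Only the digests of the core source code need to be stored and compared, which keeps the sample library small and the lookup cheap. The cutoff max_distance trades false positives against missed matches and would need tuning on real samples.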
(7) Source-code-containing data protection
If a file or data stream input to the detection device is judged to contain source code data, sensitive data protection (such as encryption, blocking, review, or logging) should be applied to the storage, saving, transmission, and outgoing transfer of that file or data stream.
(8) No-source-code marking
If a file or data stream input to the detection device is judged not to contain source code data, it can be given a no-source-code mark. As long as the file or data stream has not been edited, no source code data protection operation needs to be performed on it.
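One possible way to honour the "not edited" condition is to key the mark on a hash of the content, so that any edit invalidates it automatically. This is an illustrative assumption; the text does not say how the mark is stored, and the class name is invented.

```python
import hashlib
from typing import Set

class NoSourceMarks:
    """Remember content hashes of inputs already judged free of source code."""

    def __init__(self) -> None:
        self._marked: Set[str] = set()

    @staticmethod
    def _key(content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()

    def mark(self, content: bytes) -> None:
        self._marked.add(self._key(content))

    def is_marked(self, content: bytes) -> bool:
        # Any edit changes the hash, so an edited file is no longer marked.
        return self._key(content) in self._marked

marks = NoSourceMarks()
marks.mark(b"meeting notes, v1")
print(marks.is_marked(b"meeting notes, v1"), marks.is_marked(b"meeting notes, v2"))
```

A marked input can skip the whole pipeline on later scans, while an edited copy is detected again from the first stage.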
Specific embodiment 1
A developer downloads a CPP source code file from the SPID server to their own client computer, and the data security software on the client scans the newly added file.
When the file passes through the file type detection filtering module of the source code data detection device, its suffix is found to be ".cpp", so it is judged to be a source code file. Based on this result, the data security software allows the file to be saved only in the security area of the client computer.
Specific embodiment 2
While writing a user service manual, a developer pastes several segments of source code into a text document and saves it converted to PDF format. The data security software installed by the enterprise performs a full-disk scan and inputs this file into the device for detection.
The file type detection filtering module of the source code data detection device determines that the file is not a source code file but is in a text document format, so document format conversion and subsequent processing are performed on it. After lexical analysis of the converted TXT file, the weighted score sum of the lexical tokens it contains is found to exceed the discrimination threshold, so the file is judged to contain source code data. After obtaining this detection result, the enterprise should restrict the distribution scope of this file.
Specific embodiment 3
An employee who is about to leave the company wants to take some of the enterprise's core source code away, and to that end batch-modifies the suffixes of these source code files and the keywords inside them. When the files are copied to a USB flash drive, the data security software inputs them one by one into the device for detection.
The file type detection filtering module of the source code data detection device determines that the file is not a source code file but is in a text file format, so document format conversion and subsequent processing are performed on it. After lexical analysis of the converted TXT file, the weighted score sum of the lexical tokens it contains is found not to exceed the discrimination threshold, so processing moves on to syntactic analysis. In the syntactic analysis stage, the file is found to contain multiple expressions, so it is judged to contain source code data. After obtaining this detection result, the data security software should forbid copying the file to the USB flash drive.
Specific embodiment 4
While writing a paper, a developer converts the enterprise's core source code into a pseudocode description and sends the paper out by email. The corporate mail gateway inputs this data stream into the device for detection.
The file type detection filtering module of the source code data detection device determines that the data stream is not a source code file but is in a text format, so document format conversion and subsequent processing are performed on it. After lexical analysis of the converted TXT file, the weighted score sum of the lexical tokens it contains is found not to exceed the discrimination threshold, so processing moves on to syntactic analysis. In the syntactic analysis stage, no sensitive grammatical phrases or expressions are found, so processing moves on to semantic analysis. In the semantic analysis stage, because the identifiers were not renamed in the pseudocode description, semantic analysis based on word-frequency statistics over identifiers finds that the text is semantically similar to the enterprise's core source code, so the file is judged to contain source code data. After obtaining this detection result, the corporate mail gateway should forbid the email from being sent out directly.
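The identifier word-frequency comparison in this embodiment can be illustrated with cosine similarity over token counts. This is a minimal sketch under the assumption that cosine similarity is the similarity measure; the patent does not name one, and the function names and sample strings are hypothetical.

```python
import math
import re
from collections import Counter


def identifier_frequencies(text):
    """Count identifier-like tokens, ignoring letter case."""
    return Counter(tok.lower() for tok in re.findall(r"[A-Za-z_]\w*", text))


def cosine_similarity(a, b):
    """Cosine of the angle between two frequency vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

Pseudocode that keeps the original identifiers shares most of its vocabulary with the core source code, so its similarity stays high even though the keywords and syntax have changed.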
All currently known source code detection solutions use only one, or a combination of two, of the following technical means: file type detection, lexical analysis, syntactic analysis, and semantic analysis. None combines several of these means into a multistage filtering information processing scheme as the present invention does.
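The multistage arrangement can be pictured as a cascade in which each stage is tried in order and the first positive result ends detection. In the Python sketch below, the cascade function follows that idea, while the four stage checks are deliberately crude placeholders invented for illustration; none of the names or rules come from the patent.

```python
import re
from typing import Callable, Optional, Sequence, Tuple

Stage = Tuple[str, Callable[[str, str], bool]]  # (stage name, detector)


def run_filters(filename: str, text: str, stages: Sequence[Stage]) -> Optional[str]:
    """Return the name of the first stage that detects source code, or None."""
    for name, detects in stages:
        if detects(filename, text):
            return name  # caller applies sensitive-data protection
    return None  # caller applies the no-source-code mark


# Hypothetical placeholder detectors, ordered from cheapest to most expensive.
STAGES: Sequence[Stage] = (
    ("file type", lambda name, text: name.lower().endswith((".c", ".cpp", ".java"))),
    ("lexical", lambda name, text: text.count(";") + text.count("{") > 5),
    ("syntactic", lambda name, text: re.search(
        r"\w+\s*=\s*\w+\s*[-+*/]\s*\w+", text) is not None),
    ("semantic", lambda name, text: "ratelimiter" in text.lower()),
)
```

Ordering the stages by cost means that most inputs are resolved by an inexpensive check and only the ambiguous ones reach the more expensive analysis.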
The multistage-filtering source code detection solution proposed by the present invention improves the accuracy of source code data detection and strengthens an enterprise's ability to manage and control the security of its source code data. The specific combination of multistage detection filtering modules makes it possible to detect source code written in different programming languages and in different design styles, which makes the detection method general-purpose. The suspicious-text interception design effectively avoids the drawback of excessive computation in the semantic analysis stage and speeds up information processing during detection.
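The suspicious-text interception mentioned above limits how much text the later stages must process. A minimal Python sketch, assuming the interception starts at the first lexical token whose weight exceeds a threshold and runs for a fixed length, is shown below. The function name, the tokenizing pattern, and the parameter values are assumptions.

```python
import re


def extract_suspicious_text(text, token_weights, weight_threshold, length):
    """Return `length` characters starting at the first heavily weighted token.

    Returns None if no token exceeds the weight threshold.
    """
    for match in re.finditer(r"#?\w+|[^\w\s]", text):
        if token_weights.get(match.group(), 0.0) > weight_threshold:
            start = match.start()
            return text[start:start + length]  # the slice includes the token itself
    return None
```

In practice the `length` argument would be tuned to the performance of the device running the detection, so that a slower device analyzes a shorter window.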
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (13)
1. A multistage-filtering source code data detection method, the method comprising the following steps:
(1) file type detection filtering, comprising: judging whether the input file is of a source code file type; if so, judging the file to be a file containing source code data and proceeding to step (5); otherwise proceeding to step (2);
(2) lexical analysis filtering, comprising: uniformly converting the file into a standard document, extracting the lexical tokens in the standard document, assigning corresponding weights to the different lexical tokens, calculating from the weights the weighted score sum of the lexical tokens in the file, and judging whether the weighted score sum exceeds a specified threshold; if so, judging the file to be a file containing source code data and proceeding to step (5); otherwise proceeding to step (3);
(3) syntactic analysis filtering, comprising: intercepting text of a specified length from the file as suspicious text, extracting the grammatical phrases and expressions contained in the suspicious text, and judging whether the importance of the grammatical phrases or expressions to the composition of source code exceeds a specified threshold; if so, judging the file to be a file containing source code data and proceeding to step (5); otherwise proceeding to step (4);
(4) semantic analysis filtering, comprising: extracting the semantic features of the text and performing similarity analysis between them and the semantic features of specified core source code; if they are similar, judging the file to be a file containing source code data and proceeding to step (5); otherwise proceeding to step (6);
(5) applying sensitive-data protection to the file containing source program data, and ending detection;
(6) giving the file a no-source-code mark, and ending detection.
2. The method according to claim 1, wherein the input file in step (1) is a file stored locally or a file sent out over the network.
3. The method according to claim 1, wherein the lexical tokens include keywords, identifiers, operators, and delimiters of a specific programming language.
4. The method according to claim 1, wherein intercepting the suspicious text of the specified length from the file in step (3) specifically comprises: determining the specified length according to the performance of the device executing the data detection method, and intercepting, as the suspicious text, text of the specified length following a lexical token whose weight exceeds a specified threshold, the intercepted suspicious text including that lexical token.
5. The method according to claim 4, wherein extracting the grammatical phrases and expressions contained in the suspicious text in step (3) specifically comprises: performing syntactic analysis on the suspicious text, and combining adjacent lexical tokens, in order, into grammatical phrases and expressions that conform to the rules of the programming language.
6. The method according to claim 4 or 5, wherein judging the importance of the grammatical phrases or expressions to the composition of source code specifically comprises: judging the importance of a grammatical phrase according to its level in the syntax tree, and judging the importance of an expression according to its computational complexity.
7. The method according to claim 1, wherein in step (4) the semantic features of the standard document are extracted by keyword word-frequency statistics or by a locality-sensitive hashing (LSH) method; the semantic features of the specified core source code are built into a sensitive-information digest sample library; semantic feature extraction is performed on the suspicious text to obtain its information digest; the information digest is compared for similarity with the samples in the sensitive-information digest sample library; and if the information digest of the suspicious text approximates any sample in the information digest sample library, the file is judged to be a file containing source code data.
8. A multistage-filtering source code data detection device, the device comprising:
a file type detection filtering module, configured to judge whether the input file is of a source code file type and thereby judge whether the file contains source code data;
a lexical analysis filtering module, configured to uniformly convert the file into a standard document, extract the lexical tokens in the standard document, assign corresponding weights to the different lexical tokens, calculate from the weights the weighted score sum of the lexical tokens in the file, and judge whether the weighted score sum exceeds a specified threshold, thereby judging whether the file contains source code data;
a syntactic analysis filtering module, configured to intercept text of a specified length from the file as suspicious text, extract the grammatical phrases and expressions contained in the suspicious text, and judge whether the importance of the grammatical phrases or expressions to the composition of source code exceeds a specified threshold, thereby judging whether the file contains source code data;
a semantic analysis filtering module, configured to extract the semantic features of the text and perform similarity analysis between them and the semantic features of specified core source code, thereby judging whether the file contains source code data;
a sensitive-data protection module, configured to apply sensitive-data protection to a file containing source program data;
a no-source-code marking module, configured to give the file a no-source-code mark.
9. The device according to claim 8, wherein the input file is a file stored locally or a file sent out over the network.
10. The device according to claim 8, wherein the lexical tokens include keywords, identifiers, operators, and delimiters of a specific programming language.
11. The device according to claim 8, wherein intercepting the suspicious text of the specified length from the file specifically comprises: determining the specified length according to the performance of the detection device, and intercepting, as the suspicious text, text of the specified length following a lexical token whose weight exceeds a specified threshold, the intercepted suspicious text including that lexical token.
12. The device according to claim 10, wherein extracting the grammatical phrases and expressions contained in the suspicious text specifically comprises: performing syntactic analysis on the suspicious text, and combining adjacent lexical tokens, in order, into grammatical phrases and expressions that conform to the rules of the programming language.
13. The device according to claim 8, wherein the semantic analysis filtering module extracts the semantic features of the standard document by keyword word-frequency statistics or by a locality-sensitive hashing (LSH) method; builds the semantic features of the specified core source code into a sensitive-information digest sample library; performs semantic feature extraction on the suspicious text to obtain its information digest; compares the information digest for similarity with the samples in the sensitive-information digest sample library; and, if the information digest of the suspicious text approximates any sample in the information digest sample library, judges the file to be a file containing source code data.
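Claim 6 judges importance from a phrase's level in the syntax tree and from an expression's computational complexity. As one possible reading of that idea, the Python sketch below uses the standard `ast` module to measure the depth of a syntax tree and to count operator and call nodes. It assumes the suspicious text is valid Python, which the patent does not require, and both metrics and the function names are illustrative choices.

```python
import ast


def max_depth(node, depth=0):
    """Depth of the deepest node below `node` in the syntax tree."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return depth
    return max(max_depth(child, depth + 1) for child in children)


def operator_count(node):
    """Number of operator applications and calls, a rough complexity measure."""
    kinds = (ast.BinOp, ast.BoolOp, ast.Compare, ast.UnaryOp, ast.Call)
    return sum(isinstance(n, kinds) for n in ast.walk(node))


def importance(snippet):
    """Return (tree depth, operator count) for a snippet of Python source."""
    tree = ast.parse(snippet)
    return max_depth(tree), operator_count(tree)
```

A threshold on either number could then play the role of the specified threshold in the syntactic analysis step.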
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610618081.0A CN106250769B (en) | 2016-07-30 | 2016-07-30 | A kind of the source code data detection method and device of multistage filtering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610618081.0A CN106250769B (en) | 2016-07-30 | 2016-07-30 | A kind of the source code data detection method and device of multistage filtering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106250769A (en) | 2016-12-21 |
CN106250769B (en) | 2019-08-16 |
Family
ID=57606163
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610618081.0A Active CN106250769B (en) | 2016-07-30 | 2016-07-30 | A kind of the source code data detection method and device of multistage filtering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106250769B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN106250769B (en) | 2019-08-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||