CN113360522A - Method and device for quickly identifying sensitive data

Info

Publication number: CN113360522A
Application number: CN202010145893.4A
Authority: CN
Inventors: 于策; 冯昊
Original assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Current assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Priority date: 2020-03-05
Filing date: 2020-03-05
Publication date: 2021-09-07
Anticipated expiration: 2040-03-05
Also published as: CN113360522B

Abstract

The invention discloses a method and a device for quickly identifying sensitive data, relates to the technical field of data processing, and aims to solve the problem of low identification efficiency of sensitive data in the prior art. The method mainly comprises the following steps: generating an identification strategy of the data to be identified according to preset identification rules, and selecting a priority identification rule in the identification strategy; extracting, according to the rule type of the priority identification rule, rule data corresponding to the rule type from the data to be identified; scanning the rule data according to the priority identification rule to obtain a priority scanning result; and when the identification strategy result of the identification strategy can be calculated from the priority scanning result, determining the identification strategy result as the identification result of whether the data to be identified is sensitive data. The method is mainly applied to the process of monitoring the network environment.

Description

Method and device for quickly identifying sensitive data

Technical Field

The invention relates to the technical field of data processing, in particular to a method and a device for quickly identifying sensitive data.

Background

With the development of internet technology, data transmission through a network is a mainstream way for information transfer, and the transfer of data may involve sensitive data. Due to the particularity of the sensitive data, it is necessary to identify whether the transmission data includes the sensitive data, and perform operations such as transmission limitation, encryption, analysis, and the like on the transmission data according to the identification result.

In the prior art, the data is generally scanned in full for sensitive data. A large amount of CPU and memory resources is wasted in the scanning process, and the scanning result contains a large amount of unnecessary information, so that the efficiency of sensitive data identification is reduced.

Disclosure of Invention

In view of this, the present invention provides a method and an apparatus for quickly identifying sensitive data, and mainly aims to solve the problem of low efficiency of identifying sensitive data in the prior art.

According to an aspect of the present invention, there is provided a method for quickly identifying sensitive data, including:

generating an identification strategy of data to be identified according to a preset identification rule, and selecting a priority identification rule in the identification strategy;

extracting rule data corresponding to the rule type in the data to be identified according to the rule type of the priority identification rule;

scanning the rule data according to a priority identification rule to obtain a priority scanning result;

and when the identification strategy result of the identification strategy can be obtained by calculation according to the priority scanning result, determining the identification strategy result as the identification result of whether the data to be identified is sensitive data.

Further, the selecting a priority identification rule in the identification policy includes:

sequentially selecting the priority identification rules in the identification strategy in descending order of the execution efficiency of the preset identification rules.

Further, the extracting, according to the rule type of the priority identification rule, the rule data corresponding to the rule type in the data to be identified includes:

searching for a type identifier corresponding to the rule type of the priority identification rule, wherein the type identifier comprises one or more of the following: a channel protocol class identifier, an attribute class identifier and a content class identifier;

extracting rule data corresponding to the type identifier from the data to be identified, wherein the rule data is in one-to-one correspondence with the type identifier, and the rule data comprises one or more of the following: channel protocol data, attribute data and content data.

Further, the scanning the rule data according to the priority identification rule to obtain a priority scanning result includes:

scanning the rule data according to a preset scanning algorithm in the priority identification rule to obtain the priority scanning result, wherein the preset scanning algorithm is a character string matching algorithm or an artificial intelligence algorithm.

Further, before determining that the identification policy result is an identification result of whether the data to be identified is sensitive data, the method further includes:

extracting, according to an information digest algorithm, rule digest information of the preset scanning algorithm and the rule type in each preset identification rule, and establishing a mapping relation between the rule digest information and the preset identification rules;

searching, among the preset identification rules and according to the mapping relation, for a duplicate identification rule whose rule digest information is the same as that of the priority identification rule;

and assigning the priority scanning result to the duplicate identification rule as its scanning result.
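The reuse of scanning results between identical rules can be sketched as follows. This is a minimal illustration, not the claimed implementation: SHA-256 stands in for the unspecified information digest algorithm, the rule tuples and the names `rule_digest` and `propagate` are hypothetical, and including the match pattern in the digest is an added assumption so that rules with different patterns do not share a result.

```python
import hashlib


def rule_digest(algorithm: str, rule_type: str, pattern: str) -> str:
    """Digest of the rule parts that determine its scanning result."""
    key = "\x1f".join((algorithm, rule_type, pattern))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Hypothetical rules: (rule id, scanning algorithm, rule type, pattern).
rules = [
    ("r1", "string_match", "content", "secret"),
    ("r2", "string_match", "attribute", "secret"),
    ("r3", "string_match", "content", "secret"),  # same digest as r1
]

# Mapping relation: digest -> ids of all rules that share it.
digest_to_rules = {}
for rule_id, algorithm, rule_type, pattern in rules:
    digest = rule_digest(algorithm, rule_type, pattern)
    digest_to_rules.setdefault(digest, []).append(rule_id)


def propagate(rule_id, result, results):
    """Record a scanning result and copy it to every rule with the same digest."""
    _, algorithm, rule_type, pattern = next(r for r in rules if r[0] == rule_id)
    for twin in digest_to_rules[rule_digest(algorithm, rule_type, pattern)]:
        results[twin] = result
    return results


results = propagate("r1", "hit", {})
print(results)  # {'r1': 'hit', 'r3': 'hit'}
```

Here r3 receives the result of r1 without being scanned, while r2 differs in rule type and remains unscanned.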

Further, the logical operators in the identification strategy comprise an AND operation, and/or an OR operation, and/or a NOT operation; the priority scanning result is either "miss" or "hit"; the default scanning result of any preset identification rule in the identification strategy for which no scanning result has been acquired is "uncertain";

judging whether the priority scanning result can be calculated to obtain an identification strategy result of the identification strategy, wherein the judging step comprises the following steps:

in the identification strategy, according to a combination method of the logical operator, searching the logical operator corresponding to the preferential identification rule;

if the logical operator corresponding to the priority identification rule is an AND operation and the priority scanning result is "miss", determining that the logical operator corresponding to the priority scanning result can be calculated to obtain a logical operation identification result, and acquiring the logical operation identification result "miss"; and/or,

if the logical operator corresponding to the priority identification rule is an OR operation and the priority scanning result is "hit", determining that the logical operator corresponding to the priority scanning result can be calculated to obtain a logical operation identification result, and acquiring the logical operation identification result "hit"; and/or,

if the logical operator corresponding to the priority identification rule is a NOT operation: when the priority scanning result is "uncertain", determining that the logical operator corresponding to the priority scanning result can be calculated to obtain a logical operation identification result, and acquiring the logical operation identification result "miss"; when the priority scanning result is "miss", determining that the logical operator corresponding to the priority scanning result can be calculated to obtain a logical operation identification result, and acquiring the logical operation identification result "hit"; when the priority scanning result is "hit", determining that the logical operator corresponding to the priority scanning result can be calculated to obtain a logical operation identification result, and acquiring the logical operation identification result "miss"; and/or,

if it is not determined that the logical operator corresponding to the priority scanning result can be calculated to obtain a logical operation identification result, determining that the identification strategy result cannot be calculated according to the priority scanning result; and/or,

if the logical operator corresponding to the priority scanning result can be calculated to obtain a logical operation identification result, searching, according to the combination method of the logical operators in the identification strategy, for the logical operator corresponding to the logical operation identification result; and/or,

continuing to judge, according to the operator type of the logical operator corresponding to each logical operation identification result, whether a further logical operation identification result can be calculated; if this continues until a logical operation identification result is obtained that has no corresponding logical operator, determining that the identification strategy result can be calculated according to the priority scanning result; otherwise, determining that the identification strategy result cannot be calculated according to the priority scanning result.

Further, after determining that the identification policy result is an identification result of whether the data to be identified is sensitive data, the method further includes:

when the identification strategy result of the identification strategy cannot be obtained through calculation according to the priority scanning result, recording the priority scanning result corresponding to the priority identification rule, re-selecting a priority identification rule in the identification strategy, obtaining a secondary priority scanning result corresponding to the re-selected priority identification rule, and judging whether the identification strategy result of the identification strategy can be obtained through calculation according to the priority scanning result and the secondary priority scanning result;

if the identification strategy result is "miss", determining that the data to be identified is not sensitive data;

and if the identification strategy result is "hit", determining that the data to be identified is sensitive data, and processing the data to be identified by prohibiting its transmission, generating a pop-up alarm prompt, or tracing its source.

According to another aspect of the present invention, there is provided an apparatus for rapidly identifying sensitive data, comprising:

the generating module is used for generating an identification strategy of the data to be identified according to a preset identification rule and selecting a priority identification rule in the identification strategy;

the extraction module is used for extracting rule data corresponding to the rule type in the data to be identified according to the rule type of the priority identification rule;

the acquisition module is used for scanning the rule data according to a priority identification rule to acquire a priority scanning result;

and the determining module is used for determining the identification strategy result as the identification result of whether the data to be identified is sensitive data when the identification strategy result of the identification strategy can be obtained by calculation according to the priority scanning result.

According to still another aspect of the present invention, a storage medium is provided, and the storage medium stores at least one executable instruction, which causes a processor to perform operations corresponding to the method for rapidly identifying sensitive data as described above.

According to still another aspect of the present invention, there is provided a computer apparatus including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the method for quickly identifying the sensitive data.

According to a further aspect of the present invention, a computer program product is provided, comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions, characterized in that, when the program instructions are executed by a computer, the computer is caused to perform the method steps of quickly identifying sensitive data.

By the technical scheme, the technical scheme provided by the embodiment of the invention at least has the following advantages:

the invention provides a method and a device for quickly identifying sensitive data, in which an identification strategy of the data to be identified is generated according to preset identification rules, a priority identification rule in the identification strategy is selected, rule data corresponding to the rule type is extracted from the data to be identified according to the rule type of the priority identification rule, the rule data is scanned according to the priority identification rule to obtain a priority scanning result, and finally, when the identification strategy result of the identification strategy can be calculated from the priority scanning result, the identification strategy result is determined to be the identification result of whether the data to be identified is sensitive data. Compared with the prior art, the embodiment of the invention selects from the data to be identified only the rule data corresponding to the rule type; if the identification strategy result can be determined from that rule data, the other data in the data to be identified is not scanned. As little rule data as possible is extracted, as little of the data to be identified as possible is scanned, and the identification strategy result of whether the data to be identified is sensitive data is obtained from as few priority scanning results as possible, so that the resource occupation of the computer system is reduced and the identification efficiency of sensitive data is improved.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a flow chart illustrating a method for quickly identifying sensitive data according to an embodiment of the present invention;

FIG. 2 is a flow chart illustrating another method for quickly identifying sensitive data according to an embodiment of the present invention;

FIG. 3 is a block diagram illustrating an apparatus for quickly identifying sensitive data according to an embodiment of the present invention;

FIG. 4 is a block diagram illustrating another apparatus for fast identification of sensitive data according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Monitoring and judging whether network transmission data threatens national security, threatens network security, or affects the network environment can generally be summarized as identifying whether the transmission data includes sensitive data. As more and more data is transmitted through the network, the time required to monitor it grows longer and longer, which degrades the real-time performance of data transmission and affects the user experience. Improving the identification efficiency of sensitive data has therefore become a problem that urgently needs to be solved.

An embodiment of the present invention provides a method for quickly identifying sensitive data, as shown in fig. 1, where the method includes:

101. Generating an identification strategy of the data to be identified according to a preset identification rule, and selecting a priority identification rule in the identification strategy.

The preset identification rule includes a scanning algorithm for identifying whether the data to be identified is sensitive data, and a rule type. The preset identification rule may be, for example, a protocol rule, a file size rule or a keyword rule. Different preset identification rules have different preset scanning algorithms and rule types. Illustratively, if the preset identification rule is a protocol rule, the preset scanning algorithm is a character string matching algorithm and the rule type is set to the channel protocol class. If the preset identification rule is a keyword rule, the preset scanning algorithm may be a multi-pattern character string matching algorithm and the rule type is set to the content class.
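One possible in-memory shape for such a rule is sketched below. The class and field names (`PresetRule`, `RuleType`, `patterns`) and the algorithm labels are hypothetical; the source only states that a rule carries a scanning algorithm and a rule type.

```python
from dataclasses import dataclass
from enum import Enum


class RuleType(Enum):
    CHANNEL_PROTOCOL = "channel_protocol"
    ATTRIBUTE = "attribute"
    CONTENT = "content"


@dataclass(frozen=True)
class PresetRule:
    name: str
    rule_type: RuleType    # selects which part of the data is extracted
    algorithm: str         # label of the preset scanning algorithm
    patterns: tuple = ()   # hypothetical: strings the algorithm looks for


protocol_rule = PresetRule("protocol", RuleType.CHANNEL_PROTOCOL,
                           "string_match", ("HTTP",))
keyword_rule = PresetRule("keyword", RuleType.CONTENT,
                          "multi_pattern_match",
                          ("character string 1", "character string 2"))

for rule in (protocol_rule, keyword_rule):
    print(rule.name, rule.rule_type.value, rule.algorithm)
```

Keeping the rule type separate from the algorithm lets the extraction step and the scanning step each read only the field they need.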

The identification strategy comprises at least one preset identification rule. If the identification strategy comprises a plurality of preset identification rules, these may be completely different rules or partly identical rules; the embodiment of the invention does not limit how the preset identification rules contained in the identification strategy differ. The identification strategy is used for performing a logical operation on the identification results obtained by identifying the data to be identified according to the preset identification rules. The specific logical operation is set according to actual requirements. For example, if data to be identified that includes "character string 1" or "character string 2" is regarded as sensitive data, then an OR operation must be performed on the preset identification rule for identifying "character string 1" and the preset identification rule for identifying "character string 2" in order to conclude whether the data to be identified is sensitive data. The method of calculating the logical operation is not limited in the present application.

The identification strategy may contain a plurality of preset identification rules. The preset identification rules may form the identification strategy in a flat organization mode or in a layered organization mode; the different organization modes have little influence on the identification performance of sensitive data, and the organization mode of the preset identification rules in the identification strategy is not limited in the embodiment of the application. Assuming that the identification strategy includes three preset identification rules, namely rule 1, rule 2 and rule 3, and the logical operation between them is "rule 1 & rule 2 | !rule 3", this identification strategy uses the flat organization mode. Assuming that the identification strategy includes a rule group 1 and a rule group 2, where rule group 1 includes rule 1 and rule 2, the logical relationship between rule 1 and rule 2 is "&", and the logical relationship between rule group 1 and rule group 2 is "|", this identification strategy uses the layered organization mode. The different organization modes in the identification strategy amount to defining a calculation priority among the preset identification rules.
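The two organization modes can be illustrated with a small expression tree. This is a sketch under assumptions: the nested-tuple encoding and the `evaluate` helper are hypothetical, "&" is assumed to bind more tightly than "|" in the flat expression, the content of rule group 2 (a single rule 3 here) is invented for the example, and every rule is assumed to have been scanned already, with True standing for "hit" and False for "miss".

```python
def evaluate(node, results):
    """Evaluate a policy tree once every rule has a boolean scanning result."""
    if isinstance(node, str):            # leaf: a rule name
        return results[node]
    op, *operands = node
    if op == "not":
        return not evaluate(operands[0], results)
    values = [evaluate(child, results) for child in operands]
    return all(values) if op == "and" else any(values)


# Flat organization: "rule 1 & rule 2 | !rule 3".
flat_policy = ("or", ("and", "rule1", "rule2"), ("not", "rule3"))

# Layered organization: group 1 = (rule 1 & rule 2); policy = group 1 | group 2.
group1 = ("and", "rule1", "rule2")
group2 = "rule3"                          # hypothetical content of rule group 2
layered_policy = ("or", group1, group2)

results = {"rule1": True, "rule2": False, "rule3": True}
print(evaluate(flat_policy, results))     # False
print(evaluate(layered_policy, results))  # True
```

Both modes reduce to the same kind of tree; the grouping only fixes the order in which operators are applied.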

A priority identification rule is selected from the preset identification rules of the identification strategy according to a preset selection rule. The preset selection rule may select the rules in their left-to-right order in the identification strategy, in ascending order of the algorithm complexity of their scanning algorithms, or in ascending order of the data volume that generally corresponds to their rule types. In the embodiment of the present invention, the manner of selecting the priority identification rule is not limited.

102. And extracting the rule data corresponding to the rule type in the data to be identified according to the rule type of the priority identification rule.

The priority identification rule is itself a preset identification rule. The rule data comprises channel protocol data, attribute data and content data. If the data to be identified includes rule data corresponding to the rule type, the corresponding rule data is extracted; if the data to be identified does not include rule data corresponding to the rule type, the data content of the rule data is set to null.

103. And scanning the rule data according to the priority identification rule to obtain a priority scanning result.

The priority scanning result indicates the relationship between the rule data and sensitive data, and is either a miss or a hit. A miss means that the rule data is not sensitive data, and a hit means that the rule data is sensitive data. If the data content of the rule data is empty, the priority scanning result is a miss. The priority identification rule whose rule data has been scanned is marked, so that it can be determined which preset identification rules in the identification strategy have not yet been executed. The priority scanning result is obtained by scanning only the rule data in the data to be identified, not by scanning all of the data to be identified. The scanning result is the same in both cases, but the amount of data scanned differs: by scanning only part of the data to be identified, the present application obtains an accurate scanning result with the smallest amount of scanned data, improving scanning efficiency while preserving accuracy.

104. When the identification strategy result of the identification strategy can be calculated from the priority scanning result, determining the identification strategy result as the identification result of whether the data to be identified is sensitive data.

If the identification strategy result can be calculated according to the priority scanning result, then the identification strategy result is obtained by extracting and scanning only the rule data, and the identification strategy result is uniquely determined. Like the priority scanning result, the identification strategy result is either a miss or a hit. Assume that the identification strategy includes rule 1 and rule 2 and that the logical relationship between them is "&". If rule 1 is selected as the priority identification rule and the result of scanning the corresponding rule data according to rule 1 is "miss", then the identification strategy result is "miss" according to the "&" operation rule. Assume instead that the logical relationship between rule 1 and rule 2 is "|". If rule 1 is selected as the priority identification rule and the result of scanning the corresponding rule data according to rule 1 is "hit", then the identification strategy result is "hit" according to the "|" operation rule.
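The two examples above amount to short-circuit evaluation over three states. The following sketch assumes a hypothetical helper `combine` and covers only the "&" and "|" operators discussed in this step; rules that have not been scanned carry the default result "uncertain".

```python
HIT, MISS, UNCERTAIN = "hit", "miss", "uncertain"


def combine(operator, operand_results):
    """Combine rule results; 'uncertain' marks rules not scanned yet."""
    if operator == "&":
        if MISS in operand_results:       # one miss settles an AND
            return MISS
        return UNCERTAIN if UNCERTAIN in operand_results else HIT
    if operator == "|":
        if HIT in operand_results:        # one hit settles an OR
            return HIT
        return UNCERTAIN if UNCERTAIN in operand_results else MISS
    raise ValueError(f"unsupported operator: {operator}")


# Rule 1 scanned, rule 2 not scanned yet.
print(combine("&", [MISS, UNCERTAIN]))  # miss: rule 2 need not be scanned
print(combine("|", [HIT, UNCERTAIN]))   # hit: rule 2 need not be scanned
print(combine("&", [HIT, UNCERTAIN]))   # uncertain: rule 2 must be scanned
```

Only the third case requires extracting and scanning further rule data.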

The identification result, like the identification strategy result and the priority scanning result, is either a miss or a hit. A miss means that the data to be identified is not sensitive data, and a hit means that the data to be identified is sensitive data. If the identification strategy result of the identification strategy cannot be calculated according to the priority scanning result, the priority scanning result corresponding to the priority identification rule is recorded and a priority identification rule is re-selected from the identification strategy. That is, new priority scanning results are obtained and judged in a loop until the identification strategy result can be calculated. From the second priority scanning result onward, whether the identification strategy result can be calculated is judged according to all of the priority scanning results obtained so far.
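The loop described here can be sketched as follows. The nested-tuple policy encoding and the names `decide`, `identify` and `scan` are assumptions made for illustration, and only AND and OR nodes are handled; True stands for "hit", False for "miss", and None for a result that is not yet determined.

```python
def decide(node, known):
    """Return True/False if the result is already determined, else None."""
    if isinstance(node, str):              # leaf: a rule name
        return known.get(node)
    op, *children = node
    values = [decide(child, known) for child in children]
    if op == "and":
        if False in values:
            return False
        return None if None in values else True
    if op == "or":
        if True in values:
            return True
        return None if None in values else False
    raise ValueError(f"unsupported operator: {op}")


def identify(policy, ordered_rules, scan):
    """Scan rules one at a time and stop once the policy result is fixed."""
    known = {}
    for rule in ordered_rules:
        known[rule] = scan(rule)           # record this priority scanning result
        outcome = decide(policy, known)    # judge using all results so far
        if outcome is not None:
            return outcome, list(known)
    raise ValueError("policy refers to a rule that was never scanned")


policy = ("or", ("and", "r1", "r2"), "r3")
fake_results = {"r1": False, "r2": True, "r3": True}
outcome, scanned = identify(policy, ["r1", "r3", "r2"], fake_results.__getitem__)
print(outcome, scanned)  # True ['r1', 'r3']
```

In this run r2 is never scanned: the miss on r1 settles the AND group, and the hit on r3 then settles the whole strategy.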

The process of identifying sensitive data may involve a plurality of identification strategies, and either an interrupt mode or an audit mode may be adopted among them. The interrupt mode means that scanning stops as soon as any one strategy is hit; the audit mode means that all the identification strategies are scanned. The scanning mode can be selected according to actual requirements: if sensitive data must be identified quickly, the interrupt mode is used; if the positions of all the sensitive data in the data to be identified need to be identified, the audit mode is used.
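The difference between the two modes can be sketched as below. The function `run_policies` and the three keyword strategies are hypothetical; each strategy is reduced to a simple predicate for brevity.

```python
def run_policies(policies, data, mode="interrupt"):
    """Return the names of hit strategies; interrupt mode stops at the first hit."""
    hits = []
    for name, check in policies:
        if check(data):
            hits.append(name)
            if mode == "interrupt":
                break
    return hits


# Hypothetical strategies, each a (name, predicate) pair.
policies = [
    ("p1", lambda d: "secret" in d),
    ("p2", lambda d: "internal" in d),
    ("p3", lambda d: "password" in d),
]
data = "internal memo: password reset"

print(run_policies(policies, data, "interrupt"))  # ['p2']
print(run_policies(policies, data, "audit"))      # ['p2', 'p3']
```

Interrupt mode skips p3 entirely, which is where its speed advantage comes from; audit mode reports every strategy that matches.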

The invention provides a method for quickly identifying sensitive data, which comprises generating an identification strategy of the data to be identified according to preset identification rules, selecting a priority identification rule in the identification strategy, extracting rule data corresponding to the rule type from the data to be identified according to the rule type of the priority identification rule, scanning the rule data according to the priority identification rule to obtain a priority scanning result, and finally, when the identification strategy result of the identification strategy can be calculated from the priority scanning result, determining the identification strategy result as the identification result of whether the data to be identified is sensitive data. Compared with the prior art, the embodiment of the invention selects from the data to be identified only the rule data corresponding to the rule type; if the identification strategy result can be determined from that rule data, the other data in the data to be identified is not scanned. As little rule data as possible is extracted, as little of the data to be identified as possible is scanned, and the identification strategy result of whether the data to be identified is sensitive data is obtained from as few priority scanning results as possible, so that the resource occupation of the computer system is reduced and the identification efficiency of sensitive data is improved.

Another method for quickly identifying sensitive data is provided in an embodiment of the present invention, as shown in fig. 2, the method includes:

201. Generating an identification strategy of the data to be identified according to a preset identification rule, and selecting a priority identification rule in the identification strategy.

The identification strategy is used for carrying out logic operation on an identification result of identifying the data to be identified according to a preset identification rule. The preset identification rule comprises a preset scanning algorithm, and the rule type of the preset identification rule comprises a channel protocol class, an attribute class and a content class. Each rule type is provided with a corresponding preset scanning algorithm and rule data, the rule types are in one-to-one correspondence with the preset scanning algorithms, and the rule types are also in one-to-one correspondence with the rule data types in the same data to be identified. And selecting a proper scanning algorithm according to different rule types so as to efficiently identify the rule data in the data to be identified.

The priority identification rule refers to a preset identification rule in the identification strategy for which no scanning result has yet been obtained. The priority identification rules in the identification strategy are selected in descending order of the execution efficiency of the preset identification rules. The scanning algorithms of the preset identification rules differ in algorithm complexity. Generally, the lower the complexity of the scanning algorithm, the higher the execution efficiency of the corresponding preset identification rule. If the scanning algorithms of two preset identification rules have the same complexity, the rule whose rule type corresponds to the smaller amount of rule data has the higher execution efficiency. The data size of the rule data is judged according to the data size that theoretically corresponds to the rule type. For example, comparing the attribute class with the content class: the attribute class only describes file attributes, so its data size is generally considered small, whereas the content class relates to the specific content of the file, so its data size is considered larger than that of the attribute class. Therefore, ordering the preset identification rules by execution efficiency from high to low means ordering them by the algorithm complexity of the scanning algorithm from low to high and, where the algorithm complexity is the same, by the theoretical data size of the rule type from small to large.
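This ordering is a two-level sort, which can be sketched as follows. The numeric complexity ranks, the algorithm labels and the rank given to the channel protocol class are assumptions for illustration; the source only states that attribute data is generally smaller than content data.

```python
# Assumed ranks: lower number means cheaper algorithm or smaller rule data.
ALGORITHM_COST = {"string_match": 1, "multi_pattern_match": 2, "ai_model": 3}
TYPE_SIZE_RANK = {"channel_protocol": 0, "attribute": 1, "content": 2}


def efficiency_key(rule):
    """Sort key: algorithm complexity first, theoretical data size second."""
    _name, algorithm, rule_type = rule
    return (ALGORITHM_COST[algorithm], TYPE_SIZE_RANK[rule_type])


# Hypothetical rules: (name, scanning algorithm, rule type).
rules = [
    ("sentiment", "ai_model", "content"),
    ("keyword", "string_match", "content"),
    ("file_name", "string_match", "attribute"),
    ("protocol", "string_match", "channel_protocol"),
]

ordered = [name for name, _, _ in sorted(rules, key=efficiency_key)]
print(ordered)  # ['protocol', 'file_name', 'keyword', 'sentiment']
```

The three string matching rules tie on complexity and are separated by rule type; the AI-based rule is selected last.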

202. And extracting the rule data corresponding to the rule type in the data to be identified according to the rule type of the priority identification rule.

Specific data to be identified is generally described by data types such as the data transmission protocol, the encoding mode, the correction mode, the data file name, the memory size occupied by the data, the data format and the data content. These data types can be divided into three categories: attribute data, channel protocol data and content data. In order to cover all of the data content of the data to be identified, the rule data is correspondingly set to comprise attribute data, channel protocol data and content data, and a type identifier is set for each category of data.

According to the rule type of the priority identification rule, extracting the rule data corresponding to the rule type in the data to be identified specifically comprises the following steps: searching a type identifier corresponding to the rule type of the preferential identification rule, wherein the type identifier comprises at least one or more of the following combinations: a channel protocol class identifier, an attribute class identifier and a content class identifier; extracting rule data corresponding to the type identifier from the data to be identified, wherein the rule data is in one-to-one correspondence with the type identifier, and the rule data comprises at least one or more of the following combinations: channel protocol data, attribute data, content data. The data to be recognized is usually binary data, and for specific type identifications of the rule data corresponding to different rule types, the rule data corresponding to the type identifications are extracted based on the type identifications.

In the step, if the type identifier is a channel protocol type identifier, extracting channel protocol data in the data to be identified; if the type identification is an attribute type identification, extracting attribute data in the data to be identified; and if the type is the content type identification, extracting text content data in the data to be identified. Wherein the attribute data, i.e. metadata class, includes file name, file size, file binary format data, etc. For example, an HTTP protocol, a TCP protocol, a UDP protocol, or the like may be used in transmitting data, and if data to be identified transmitted using the HTTP protocol is set as sensitive data, only a transmission protocol used in the data to be identified needs to be extracted.
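As a minimal sketch of this step, the extraction can be modeled as a lookup from type identifier to an extractor function, so that only the requested class of data is touched. The `DataToIdentify` fields and the identifier strings are hypothetical; the text does not specify how the data to be identified is laid out.

```python
from dataclasses import dataclass


@dataclass
class DataToIdentify:
    # Hypothetical representation of one piece of transmitted data.
    protocol: str    # channel protocol data, e.g. "HTTP"
    file_name: str   # attribute data
    file_size: int   # attribute data
    content: bytes   # raw content, decoded only when a content rule asks for it


# One extractor per type identifier (identifier strings are assumptions).
EXTRACTORS = {
    "channel_protocol": lambda d: d.protocol,
    "attribute": lambda d: {"file_name": d.file_name, "file_size": d.file_size},
    "content": lambda d: d.content.decode("utf-8", errors="replace"),
}


def extract_rule_data(data, type_id):
    """Return only the rule data that matches the given type identifier."""
    return EXTRACTORS[type_id](data)
```

With this shape, a rule of the channel protocol class never triggers decoding of the content, which is the saving the step aims for.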

203. Scan the rule data according to the preset scanning algorithm in the priority identification rule to obtain a priority scanning result.

The preset scanning algorithm includes, but is not limited to, a string matching algorithm or an artificial intelligence algorithm. The string matching algorithm judges whether the rule data matches a preset character string. For example, if the preset character string is 'Beijing earthquake' and the rule data is attribute data, the priority scanning result is 'hit' when the four consecutive Chinese characters of 'Beijing earthquake' are detected in the attribute data of the data to be identified during scanning. Artificial intelligence algorithms are commonly used to determine the semantics, mood, or attitude of data content. Assuming that the rule data is content data and the rule tests whether the emotion is 'happy', the emotion of the content data of the data to be identified is extracted by the artificial intelligence algorithm during scanning, and if the emotion is 'happy', the priority scanning result is 'hit'. The embodiment of the application does not limit the specific method adopted by the artificial intelligence algorithm.

The priority scanning result indicates the relationship between the rule data and the sensitive data, and is either a miss or a hit.
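The string matching case can be illustrated with a substring test. This is a sketch only: the function name is hypothetical, the four characters 北京地震 are assumed to be the Chinese string behind the translated example 'Beijing earthquake', and the artificial intelligence case is not sketched because the text leaves its method open.

```python
def scan_string_match(rule_data: str, preset: str) -> str:
    """Return 'hit' if the preset string occurs in the rule data, else 'miss'."""
    return "hit" if preset in rule_data else "miss"


# Attribute data (here a file name) scanned against the preset string.
print(scan_string_match("北京地震快报.docx", "北京地震"))
print(scan_string_match("weekly-report.docx", "北京地震"))
```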

204. Extract, using an information summary algorithm, rule summary information from the preset scanning algorithm and the rule type of each preset identification rule, and establish a mapping relationship between the rule summary information and the preset identification rules.

The identification strategy contains at least two preset identification rules. Taking the preset scanning algorithm and the rule type of a preset identification rule as the basic information, the rule summary information of that rule is extracted by an information summary algorithm. After the rule summary information is extracted, a mapping relationship between the rule summary information and the preset identification rule is established. The information summary algorithm may be the MD5 algorithm.

205. Search the preset identification rules, according to the mapping relationship, for a duplicate identification rule whose rule summary information is the same as that of the priority identification rule.

A duplicate identification rule is a preset identification rule that is identical to the priority identification rule. First, the rule summary information of the priority identification rule is obtained and compared one by one with the rule summary information of the preset identification rules. When identical rule summary information is found, the preset identification rule corresponding to that summary information, other than the priority identification rule itself, is determined to be a duplicate identification rule.

206. Set the scanning result corresponding to the duplicate identification rule to the priority scanning result.

Duplicate identification rules need to be searched for only when the identification strategy contains at least two preset identification rules. Based on the preset scanning algorithm and the rule type of each preset identification rule, the rule summary information is extracted through the MD5 algorithm and mapped to the preset identification rule. A duplicate identification rule identical to the priority identification rule is then found in the identification strategy according to this mapping relationship, and its scanning result is set to the priority scanning result. This assignment avoids repeated scanning with the same preset identification rule, which improves the efficiency of identifying sensitive data.
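Steps 204 to 206 can be sketched with Python's standard `hashlib.md5`. The rule representation (a dictionary from rule name to a pair of scanning-algorithm description and rule type) and all function names are hypothetical, and the algorithm description string is assumed to include any parameters such as the preset character string.

```python
import hashlib
from collections import defaultdict


def rule_digest(scan_algorithm: str, rule_type: str) -> str:
    """MD5 summary of a rule's scanning algorithm and rule type."""
    basic = f"{scan_algorithm}\x00{rule_type}".encode("utf-8")
    return hashlib.md5(basic).hexdigest()


def build_digest_map(rules):
    """Map each summary to the names of all rules that share it (step 204)."""
    mapping = defaultdict(list)
    for name, (algorithm, rule_type) in rules.items():
        mapping[rule_digest(algorithm, rule_type)].append(name)
    return mapping


def record_result(priority_name, result, rules, mapping, results):
    """Store the priority result and copy it to every duplicate (steps 205-206)."""
    algorithm, rule_type = rules[priority_name]
    for name in mapping[rule_digest(algorithm, rule_type)]:
        results[name] = result  # duplicates are never scanned again
```

MD5 serves here only as a cheap equality fingerprint, not as a security measure; any digest with a negligible collision rate would fill the same role.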

207. When the identification strategy result of the identification strategy can be calculated from the priority scanning result, determine the identification strategy result to be the identification result indicating whether the data to be identified is sensitive data.

Judging whether the identification strategy result can be calculated from the priority scanning result resembles an ordinary logical operation. The logical operators in the identification strategy comprise the AND operation, the OR operation, and the NOT operation, and the priority scanning result is either a miss or a hit. A preset identification rule in the identification strategy for which no scanning result has been obtained has the default scanning result 'uncertain'. Because 'uncertain' is not an ordinary logical value, it requires special handling in the logical operation; when the scanning results involved are only hits or misses, the logical operation is the same as the ordinary one.

The judging process specifically comprises the following steps: in the identification strategy, the logical operator corresponding to the priority identification rule is found according to the way the logical operators are combined;

if the logical operator corresponding to the priority identification rule is an AND operation and the priority scanning result is a miss, it is determined that the logical operator corresponding to the priority scanning result can be evaluated to obtain a logical operation identification result, and the logical operation identification result is a miss;

if the logical operator corresponding to the priority identification rule is an OR operation and the priority scanning result is a hit, it is determined that the logical operator corresponding to the priority scanning result can be evaluated to obtain a logical operation identification result, and the logical operation identification result is a hit;

if the logical operator corresponding to the priority identification rule is a NOT operation, then when the priority scanning result is a hit, it is determined that the logical operator corresponding to the priority scanning result can be evaluated to obtain a logical operation identification result, and the logical operation identification result is a miss; and when the priority scanning result is a miss, it is determined that the logical operator can likewise be evaluated, and the logical operation identification result is a hit;

if the logical operator corresponding to the priority scanning result cannot be evaluated to obtain a logical operation identification result, it is determined that the identification strategy result cannot be calculated from the priority scanning result;

if the logical operator corresponding to the priority scanning result can be evaluated to obtain a logical operation identification result, the logical operator corresponding to that logical operation identification result is found in the identification strategy according to the way the logical operators are combined;

and the judgment continues according to the operator type of the logical operator corresponding to each logical operation identification result. If a logical operation identification result is obtained that has no further corresponding logical operator, it is determined that the identification strategy result can be calculated from the priority scanning result; otherwise, it is determined that the identification strategy result cannot be calculated from the priority scanning result.

If the identification strategy result is 'miss', the data to be identified is not sensitive data; if the identification strategy result is 'hit', the data to be identified is sensitive data. If the data to be identified is sensitive data, it is processed by prohibiting its transmission, generating an alarm prompt pop-up window, or tracing its source.
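The judging rules above behave like a three-valued logic in which 'uncertain' is the third value. The following sketch evaluates a whole strategy expression top-down instead of propagating upward from the scanned rule as the text describes; it is intended to reach the same conclusion, but it is a substitute technique, not the procedure of the embodiment. The nested-tuple encoding of the strategy and all names are hypothetical.

```python
HIT, MISS, UNCERTAIN = "hit", "miss", "uncertain"


def evaluate(node, results):
    """Evaluate a strategy expression under three-valued logic.

    node is a rule name (str) or a tuple ("not", a), ("and", a, b), ("or", a, b).
    Rules without a recorded result default to UNCERTAIN.
    """
    if isinstance(node, str):
        return results.get(node, UNCERTAIN)
    op, *operands = node
    values = [evaluate(operand, results) for operand in operands]
    if op == "not":
        return {HIT: MISS, MISS: HIT}.get(values[0], UNCERTAIN)
    if op == "and":
        if MISS in values:  # one miss decides an AND
            return MISS
        return UNCERTAIN if UNCERTAIN in values else HIT
    if op == "or":
        if HIT in values:   # one hit decides an OR
            return HIT
        return UNCERTAIN if UNCERTAIN in values else MISS
    raise ValueError(f"unknown operator: {op}")
```

The strategy result "can be calculated" exactly when `evaluate` returns something other than `UNCERTAIN`.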

208. When the identification strategy result of the identification strategy cannot be calculated from the priority scanning result, record the priority scanning result corresponding to the priority identification rule, re-extract a priority identification rule from the identification strategy, obtain the secondary priority scanning result corresponding to the re-extracted priority identification rule, and judge whether the identification strategy result can be calculated from the priority scanning result and the secondary priority scanning result.

If the first scanning result, obtained with the first selected priority identification rule, cannot determine the identification strategy result, the first scanning result is recorded, a priority identification rule is selected a second time, and a second scanning result is obtained. Whether the identification strategy result can be determined is then judged from the first and second scanning results together. If it still cannot, the priority scanning results obtained so far continue to be recorded and the scanning result of the next priority identification rule is obtained, until the identification strategy result can be obtained.

In summary, when determining whether the data to be identified is sensitive data, the application traverses a two-layer loop structure over the identification strategy and the data to be identified. Assume that the identification policy includes three preset identification rules, namely rule 1, rule 2 and rule 3, and that the logical operation between the preset identification rules is "rule 1 & (rule 2 | !rule 3)". Assume that the extracted priority identification rule is rule 3 and that the rule type of rule 3 corresponds to the channel protocol class identifier. The channel protocol data corresponding to the channel protocol class identifier is extracted from the data to be identified, and the scanning algorithm in rule 3 then judges whether the channel protocol data is sensitive data. At this time, the default scanning results of rule 1 and rule 2 are "uncertain". If the result of scanning the data to be identified according to rule 3 is a hit, the logical operation result of the identification policy is equivalent to calculating "uncertain & (uncertain | miss)"; it cannot be obtained, and priority identification rules must continue to be extracted from the identification policy. If the result of scanning the data to be identified according to rule 3 is a miss, the logical operation result of the identification policy is equivalent to calculating "uncertain & (uncertain | hit)"; the OR operation evaluates to a hit, but the AND operation with the uncertain result of rule 1 still cannot be evaluated, so the logical operation result of the identification policy cannot be obtained and priority identification rules must again continue to be extracted. The outer loop traverses the rules in the identification policy to extract the priority identification rule, and the inner loop traverses the data to be identified to obtain the rule data. A priority identification rule is then extracted from rule 1 and rule 2, and the process is repeated until the identification policy result is calculated. If the identification policy result is 'hit', the data to be identified is sensitive data, and if the identification policy result is 'miss', the data to be identified is not sensitive data.
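The overall loop can be sketched as follows, assuming the example strategy is grouped as rule 1 & (rule 2 | !rule 3). A compact three-valued evaluator is included only to keep the sketch self-contained; the tuple encoding, the `identify` function, and the fixed scan outcomes are all hypothetical.

```python
HIT, MISS, UNCERTAIN = "hit", "miss", "uncertain"


def evaluate(node, results):
    """Three-valued evaluation; rules without a result count as UNCERTAIN."""
    if isinstance(node, str):
        return results.get(node, UNCERTAIN)
    op, *operands = node
    values = [evaluate(operand, results) for operand in operands]
    if op == "not":
        return {HIT: MISS, MISS: HIT}.get(values[0], UNCERTAIN)
    if op not in ("and", "or"):
        raise ValueError(f"unknown operator: {op}")
    decisive, neutral = (MISS, HIT) if op == "and" else (HIT, MISS)
    if decisive in values:
        return decisive
    return UNCERTAIN if UNCERTAIN in values else neutral


def identify(policy, ordered_rules, scan):
    """Scan rules in priority order and stop once the strategy result is known."""
    results = {}
    for name in ordered_rules:
        results[name] = scan(name)           # record the priority scanning result
        outcome = evaluate(policy, results)
        if outcome != UNCERTAIN:             # strategy result can be calculated
            return outcome, list(results)
    return evaluate(policy, results), list(results)


# Assumed grouping of the example strategy: rule 1 & (rule 2 | !rule 3).
POLICY = ("and", "rule1", ("or", "rule2", ("not", "rule3")))

# Hypothetical scan outcomes; rule 3 is assumed to be the cheapest rule.
outcomes = {"rule1": MISS, "rule2": HIT, "rule3": HIT}
result, scanned = identify(POLICY, ["rule3", "rule1", "rule2"], outcomes.get)
print(result, scanned)
```

In this run rule 3 alone leaves the strategy undetermined, the miss of rule 1 then settles the AND operation, and rule 2 is never scanned.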

Further, as an implementation of the method shown in fig. 1, an embodiment of the present invention provides a first apparatus for quickly identifying sensitive data, as shown in fig. 3, where the apparatus includes:

the generating module 31 is configured to generate an identification policy of data to be identified according to a preset identification rule, and select a priority identification rule in the identification policy;

the extracting module 32 is configured to extract rule data corresponding to the rule type from the data to be identified according to the rule type of the priority identification rule;

an obtaining module 33, configured to scan the rule data according to a priority identification rule to obtain a priority scanning result;

a determining module 34, configured to determine, when the result of the priority scanning can calculate an identification policy result of the identification policy, that the identification policy result is an identification result of whether the data to be identified is sensitive data.

The invention provides a device for quickly identifying sensitive data. The device generates an identification strategy for the data to be identified according to preset identification rules, selects a priority identification rule in the identification strategy, extracts from the data to be identified the rule data corresponding to the rule type of the priority identification rule, scans the rule data according to the priority identification rule to obtain a priority scanning result, and finally, when the identification strategy result of the identification strategy can be calculated from the priority scanning result, determines the identification strategy result to be the identification result indicating whether the data to be identified is sensitive data. Compared with the prior art, the embodiment of the invention selects from the data to be identified only the rule data corresponding to the rule type; if the identification strategy result can be determined from that rule data, the other data in the data to be identified is not scanned. By extracting as little rule data as possible, scanning as little of the data to be identified as possible, and deriving the identification strategy result from as few priority scanning results as possible, the embodiment reduces the resource occupation of the computer system and improves the identification efficiency of sensitive data.

Further, as an implementation of the method shown in fig. 2, an embodiment of the present invention provides another apparatus for quickly identifying sensitive data, as shown in fig. 4, where the apparatus includes:

the generating module 41 is configured to generate an identification policy of data to be identified according to a preset identification rule, and select a priority identification rule in the identification policy;

an extracting module 42, configured to extract, according to a rule type of the priority identification rule, rule data corresponding to the rule type in the data to be identified;

an obtaining module 43, configured to scan the rule data according to a priority identification rule to obtain a priority scanning result;

a determining module 44, configured to determine, when the result of the priority scanning can calculate an identification policy result of the identification policy, that the identification policy result is an identification result of whether the data to be identified is sensitive data.

Further, the generating module 41 is configured to:

Further, the extraction module 42 includes:

a searching unit 421, configured to search for a type identifier corresponding to the rule type of the priority identification rule, where the type identifier includes at least one of the following: a channel protocol class identifier, an attribute class identifier and a content class identifier;

an extracting unit 422, configured to extract, from the data to be identified, rule data corresponding to the type identifier, where the rule data corresponds to the type identifier one to one, and the rule data includes at least one of the following: channel protocol data, attribute data, content data.

Further, the obtaining module 43 is configured to:

Further, the apparatus further comprises:

the summary extracting module 45 is configured to, before the identification policy result is determined to be the identification result of whether the data to be identified is sensitive data, extract rule summary information from the preset scanning algorithm and the rule type in each preset identification rule according to an information summary algorithm, and establish a mapping relationship between the rule summary information and the preset identification rule;

a relation searching module 46, configured to search, in the preset identification rule, a duplicate identification rule that is the same as the rule summary information of the priority identification rule according to the mapping relation;

a result assignment module 47, configured to set the scanning result corresponding to the duplicate identification rule to the priority scanning result;

and a recording module 48, configured to, when the identification policy result of the identification policy cannot be calculated from the priority scanning result, record the priority scanning result corresponding to the priority identification rule, re-extract a priority identification rule from the identification policy, obtain a secondary priority scanning result corresponding to the re-extracted priority identification rule, and determine whether the identification policy result of the identification policy can be calculated according to the priority scanning result and the secondary priority scanning result.

Further, the apparatus further comprises:

the determining module 44 is further configured to determine that the data to be identified is not sensitive data if the identification policy result is a miss after the identification policy result is determined to be the identification result of whether the data to be identified is sensitive data;

the determining module 44 is further configured to determine that the data to be identified is sensitive data if the identification policy result is a hit, and process the data to be identified by prohibiting transmission, generating an alarm prompt pop-up window, or tracing the source of the data to be identified.

According to an embodiment of the present invention, a storage medium is provided. The storage medium stores at least one executable instruction, and the executable instruction can execute the method for quickly identifying sensitive data in any of the above method embodiments.

Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the computer device.

As shown in fig. 5, the computer device may include: a processor 502, a communication interface 504, a memory 506, and a communication bus 508.

Wherein: the processor 502, communication interface 504, and memory 506 communicate with one another via a communication bus 508.

The communication interface 504 is used for communicating with network elements of other devices, such as clients or other servers.

The processor 502 is configured to execute the program 510, and may specifically execute the relevant steps in the above-described method embodiment for quickly identifying sensitive data.

In particular, program 510 may include program code that includes computer operating instructions.

The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The computer device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.

The memory 506 is used for storing a program 510. The memory 506 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.

The program 510 may specifically be used to cause the processor 502 to perform the following operations:

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for quickly identifying sensitive data, comprising:

2. The method of claim 1, wherein the selecting the priority identification rule in the identification policy comprises:

3. The method of claim 1, wherein the extracting, according to the rule type of the priority identification rule, rule data corresponding to the rule type in the data to be identified comprises:

4. The method of claim 1, wherein scanning the rule data according to the priority identification rule to obtain a priority scanning result comprises:

5. The method of claim 1, wherein before determining the identification policy result as the identification result of whether the data to be identified is sensitive data, the method further comprises:

6. The method of claim 1, wherein the logical operators in the identification policy include AND operations, and/or OR operations, and/or NOT operations; the priority scanning results include misses and hits; the default scanning result of a preset identification rule in the identification policy for which no scanning result has been obtained is uncertain;

7. The method according to claim 1, wherein after determining that the identification policy result is an identification result of whether the data to be identified is sensitive data, the method further comprises:

when the identification strategy result of the identification strategy cannot be obtained by calculation according to the priority scanning result, recording the priority scanning result corresponding to the priority identification rule, re-extracting the priority identification rule in the identification strategy, obtaining a secondary priority scanning result corresponding to the re-extracted priority identification rule, and judging whether the identification strategy result of the identification strategy can be obtained by calculation according to the priority scanning result and the secondary priority scanning result.

8. The method according to claim 1, wherein after determining that the identification policy result is an identification result of whether the data to be identified is sensitive data, the method further comprises:

9. An apparatus for rapidly identifying sensitive data, comprising:

10. A storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the method for rapidly identifying sensitive data according to any one of claims 1 to 8.

11. A computer device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the corresponding operation of the method for quickly identifying sensitive data according to any one of claims 1-8.

12. A computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions, characterized in that the program instructions, when executed by a computer, cause the computer to perform the method steps of quickly identifying sensitive data according to any of claims 1-8.