WO2023077815A1 - Method and device for processing sensitive data

Info

Publication number: WO2023077815A1
Application number: PCT/CN2022/099611
Authority: WO
Inventors: 彭永杰
Original assignee: 深圳前海微众银行股份有限公司
Priority date: 2021-11-03
Filing date: 2022-06-17
Publication date: 2023-05-11
Also published as: CN114048512A

Abstract

Disclosed in the present invention are a method and device for processing sensitive data. The method comprises: receiving sampling data sent by a plurality of application interfaces, performing hash processing on feature information corresponding to each application interface, and determining an interface identifier of each application interface; determining, on the basis of the maximum total processing data volume of the sampling data in a preset duration, the total data volume at a current moment in a current period duration, and a preset processing condition, sample data corresponding to each interface identifier; determining conversion data corresponding to each piece of data in each piece of sample data, the conversion data comprising field names and corresponding values corresponding to the field names; and performing sensitive type identification on the corresponding values in each piece of conversion data, and obtaining sensitive types corresponding to all the corresponding values in each piece of conversion data. The method can effectively reduce the influence of sudden increase or change of the data volume and types of application interfaces on the sensitive type identification of the data, and quickly and simply complete sensitive type sorting of the sampling data corresponding to each application interface.

Description

A method and device for processing sensitive data

Cross References to Related Applications

This application claims priority to Chinese patent application No. 202111294701.7, entitled "A Method and Device for Processing Sensitive Data", filed with the China Patent Office on November 3, 2021, the entire contents of which are incorporated herein by reference.

Technical Field

Embodiments of the present invention relate to the field of financial technology (Fintech), and in particular, to a method and device for processing sensitive data.

Background

With the development of computer technology, more and more technologies are applied in the financial field, and the traditional financial industry is gradually transforming into financial technology. However, due to the security and real-time requirements of the financial industry, higher requirements are also placed on technology.

With the rapid development of cloud computing and big data, Internet services have brought great convenience to people's lives, but they have also brought many security problems. Internet services currently expose various functional interfaces both internally and externally. If an interface that handles sensitive data is compromised, or a defect in the interface itself leaks sensitive data, users and enterprises may face serious security risks. Therefore, to strengthen the governance, operation and protection of sensitive data, identifying the risks and tracing the distribution and flow of sensitive data assets across interfaces becomes particularly important.

However, when processing sensitive data in the prior art, it is generally necessary to directly analyze the acquired sensitive data. In this way, a large amount of sensitive data needs to be processed, resulting in a slow overall processing speed. When the amount of data increases, it is impossible to process the newly added sensitive data in an accurate and timely manner, that is, the overall processing efficiency of sensitive data is low.

Summary of the Invention

The present invention provides a method and device for processing sensitive data, which effectively reduce the impact that sudden increases or changes in data volume and application interface categories have on sensitive type identification, and which quickly and simply complete the sorting of sensitive types in the sampled data corresponding to each application interface.

In a first aspect, the present invention provides a method for processing sensitive data. The method includes: receiving sampled data sent by multiple application interfaces, performing hash processing on the feature information corresponding to each application interface, and determining the interface identifier of each application interface; determining the sample data corresponding to each interface identifier based on the maximum total amount of data processed for the sampled data within a preset duration, the total amount of data at the current moment within the current cycle duration, and a preset processing condition; determining conversion data corresponding to each piece of data in each piece of sample data, where the conversion data includes a field name and a corresponding value for that field name; and performing sensitive type identification on the corresponding values in each piece of conversion data to obtain the sensitive types corresponding to all corresponding values in each piece of conversion data.

In the above method, an interface identifier is calculated for each application interface that sends sampled data, and the system services corresponding to different application interfaces are distinguished based on the interface identifier. This reduces the influence of the different data volumes of different system services and, to a certain extent, reduces the impact of data skew on the subsequent identification of sensitive types. In addition, the sampled data sent by multiple application interfaces is sorted using samples instead of the full data, which greatly reduces the amount of data to be processed and increases processing speed, thereby lowering the cost of manpower and machine resources and improving the efficiency of sensitive data identification.

In a possible implementation manner, the preset processing condition is expressed in the following manner:

Here, X denotes the number of application interface types, K_{App_ID} denotes the sample data volume of each type of application interface within the preset duration, and K_MAX denotes the maximum total amount of data processed for the sampled data within the preset duration.
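The expression itself does not appear in this text. One reading that is consistent with the three symbol definitions, offered only as an assumption and not as the formula of the original filing, is that the per-interface sample volumes together must not exceed the processing budget:

```latex
% Assumed reconstruction: the per-interface sample volumes, summed over
% all X application interface types, stay within the budget K_MAX.
\sum_{App\_ID = 1}^{X} K_{App\_ID} \le K_{MAX}
```

Whether the original condition is an inequality or an equality cannot be determined from the surrounding text.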

In the above method, the sample data volume of each type of application interface and the number of application interface types are constrained, so that the sample data corresponding to every type of application interface is covered as far as possible. This effectively keeps the basis for the subsequent identification of sensitive types stable.

In a possible implementation manner, determining the sample data corresponding to each interface identifier based on the maximum total amount of data processed for the sampled data within the preset duration, the total amount of data at the current moment within the current cycle duration, and the preset processing condition includes: determining whether the current cycle duration is the cycle duration in which the sample data of each interface identifier is determined for the first time; when it is, determining the ratio of the total amount of interface data corresponding to any interface identifier at the current moment within the current cycle duration to the total amount of data at the current moment within the current cycle duration; multiplying the ratio by the maximum total amount of processed data to obtain the total initial data amount of the initial sample data of that interface identifier; determining the first interface data total amount of the first interface data corresponding to that interface identifier at any moment after the current moment within the current cycle duration; when the first interface data total amount is not greater than the corresponding total initial data amount, determining a first probability that each piece of data in the first interface data is placed into the corresponding array, and obtaining first data in the corresponding array based on the first probability and the data in the first interface data; and using the first data as the sample data of that interface identifier, so as to determine the sample data of each interface identifier.

Based on the above method, when the first interface data total amount is not greater than the corresponding total initial data amount, the sample data can be determined to have comprehensive coverage. That is, a relatively small set of sample data with relatively comprehensive coverage is provided for the subsequent identification of the sensitive types of the sampled data sent by each application interface, which reduces the amount of data to be processed and thereby increases the processing speed of sensitive data.
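The quota calculation described above (each interface's share of the current data volume multiplied by the maximum total amount of processed data) can be sketched as follows. This is a minimal illustration, not the filing's implementation: the function name `initial_quotas`, the interface names, and the choice to round down are all assumptions.

```python
def initial_quotas(counts_by_interface, k_max):
    """Split the processing budget k_max across interfaces in proportion
    to the amount of data each interface has sent so far in the cycle."""
    total = sum(counts_by_interface.values())
    if total == 0:
        # No data received yet: every interface gets an empty quota.
        return {app_id: 0 for app_id in counts_by_interface}
    # ratio * k_max, computed with integers. Rounding down is an assumption;
    # it guarantees that the quotas never sum to more than k_max.
    return {
        app_id: (count * k_max) // total
        for app_id, count in counts_by_interface.items()
    }


# Hypothetical per-interface record counts at the current moment.
counts = {"app_a": 600, "app_b": 300, "app_c": 100}
quotas = initial_quotas(counts, k_max=50)
print(quotas)
```

With these hypothetical counts, the interface that contributed 60% of the data receives 60% of the budget.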

In a possible implementation manner, the method further includes: when it is determined that the first interface data total amount of any interface identifier is greater than the total data amount of the initial sample data, determining a second probability that each piece of data in the first interface data is placed into the corresponding array, where the second probability is different from the first probability; obtaining second data in the corresponding array based on the second probability and the data in the first interface data; and using the second data as the sample data of that interface identifier, so as to determine the sample data of each interface identifier.

Based on the above method, when the first interface data total amount is greater than the corresponding total initial data amount, the sample data can still be determined to have comprehensive coverage. That is, a relatively small set of sample data with relatively comprehensive coverage is provided for the subsequent identification of the sensitive types of the sampled data sent by the application interfaces, which reduces the amount of data to be processed and thereby increases the processing speed of sensitive data.
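The text above does not state the values of the first and second probabilities. One common scheme that fits the description is reservoir sampling, sketched below under that assumption: while the array holds fewer items than the quota, each item is kept with probability 1 (standing in for the first probability); afterwards the n-th item is kept with probability quota / n (standing in for the second probability). The function name and parameters are hypothetical.

```python
import random


def reservoir_sample(stream, capacity, rng=None):
    """Keep at most `capacity` items from `stream`, each with equal chance."""
    rng = rng or random.Random()
    array = []
    for seen, item in enumerate(stream, start=1):
        if seen <= capacity:
            # Data total not greater than the quota: keep the item
            # (assumed first probability = 1).
            array.append(item)
        elif rng.random() < capacity / seen:
            # Data total greater than the quota: keep the item with
            # probability capacity / seen (assumed second probability)
            # and overwrite a uniformly chosen slot.
            array[rng.randrange(capacity)] = item
    return array


sample = reservoir_sample(range(1000), capacity=10, rng=random.Random(0))
print(len(sample))
```

Under this scheme the array size never exceeds the quota, regardless of how much data the interface sends during the cycle.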

In a possible implementation manner, determining the sample data corresponding to each interface identifier based on the maximum total amount of data processed for the sampled data within the preset duration, the total amount of data at the current moment within the current cycle duration, and the preset processing condition includes: when it is determined that the current cycle duration is not the cycle duration in which the sample data of each interface identifier is determined for the first time, and that historical sample data is stored in the array corresponding to any interface identifier, processing the historical sample data to obtain a sample identifier of each piece of historical sample data; determining the total data amount of the historical sample data corresponding to that interface identifier and the total data amount corresponding to any sample identifier, and determining, based on these two amounts, the weight coefficient corresponding to that sample identifier; when it is determined that the first interface data total amount of that interface identifier is greater than the total data amount of the initial sample data, determining a third probability that each piece of data in the first interface data is placed into the corresponding array; and obtaining third data in the corresponding array based on the third probability and the data in the first interface data, and using the third data as the sample data of that interface identifier, so as to determine the sample data of each interface identifier, where the third probability is the product of the second probability and the weight coefficient.

In the above method, introducing the weight coefficient lowers the probability that data of a kind already selected as a sample is selected again, which minimizes data skew and improves the coverage of the sample data.
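The text states only that the weight coefficient is derived from the two totals and that the third probability is the second probability multiplied by it. The sketch below assumes one simple choice, weight = 1 - (count for the sample identifier / total historical samples), so that kinds of data already well represented are less likely to be kept again. The formula, function names, and sample identifiers are assumptions for illustration.

```python
from collections import Counter


def weight_coefficients(history_sample_ids):
    """Map each sample identifier to an assumed weight in [0, 1)."""
    total = len(history_sample_ids)
    counts = Counter(history_sample_ids)
    # Assumed formula: the larger a sample identifier's share of the
    # historical samples, the smaller its weight.
    return {sid: 1 - n / total for sid, n in counts.items()}


def third_probability(second_probability, sample_id, weights):
    """Third probability = second probability * weight coefficient."""
    # Sample identifiers never seen before keep the full second
    # probability (weight 1.0), which is also an assumption.
    return second_probability * weights.get(sample_id, 1.0)


history = ["s1", "s1", "s1", "s2"]  # hypothetical sample identifiers
weights = weight_coefficients(history)
print(third_probability(0.4, "s1", weights))
```

In this example the over-represented identifier `s1` is kept far less often than the rarer `s2`, which is the skew-reducing effect the paragraph above describes.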

In a possible implementation manner, performing sensitive type identification on the corresponding values in each piece of conversion data and obtaining the sensitive types corresponding to all corresponding values in each piece of conversion data includes: performing initial identification processing on all corresponding values in the conversion data to obtain the total number of times that all corresponding values have been identified and the number of times that they have been identified as each sensitive type; performing identification matching on any corresponding value in each piece of conversion data based on a preset regular expression or preset metadata keywords; when the match passes, verifying that corresponding value based on a preset algorithm; when the verification passes, accumulating the total number of times and the number of times for the sensitive type to which that corresponding value belongs, to obtain a first total number of times and a first number of times; obtaining a first recognition rate based on the first total number of times and the first number of times, where the recognition rate characterizes the probability that the type of that corresponding value is a specific sensitive type; and when it is determined that the first recognition rate is not less than the corresponding preset threshold, adding a label to that corresponding value, where the label indicates that the type of that corresponding value is the specific sensitive type.

Based on the above method, when the first recognition rate corresponding to any corresponding value is not less than the corresponding preset threshold, the sensitivity type of the field corresponding to the corresponding value can be accurately determined.

In a possible implementation manner, the method further includes: when any corresponding value in each piece of conversion data fails the identification matching and/or fails the verification, accumulating the total number of times to obtain a second total number of times; obtaining a second recognition rate based on the second total number of times and the number of times for the sensitive type to which that corresponding value belongs; and when it is determined that the second recognition rate is not less than the preset threshold, keeping the label corresponding to that corresponding value unchanged.

Based on the above method, it can be more accurately determined whether the label corresponding to the field marked with the label is accurate, and the accuracy rate of the label corresponding to the field is improved.
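The match, verify, and count procedure above can be sketched for a single sensitive type. The example assumes a bank card number as the sensitive type, a digit-run regular expression as the preset regular expression, and the Luhn checksum as the preset verification algorithm; none of these specifics come from the text. Removing the label when the rate falls below the threshold is also an assumption, since the text only says the label is kept while the rate stays at or above the threshold.

```python
import re
from collections import defaultdict

CARD_PATTERN = re.compile(r"\d{13,19}")  # assumed preset regular expression


def luhn_valid(number):
    """Luhn checksum, standing in for the preset verification algorithm."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


class FieldTagger:
    """Tracks, per field name, how often its values look like card numbers."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.total = defaultdict(int)  # total number of times identified
        self.hits = defaultdict(int)   # times identified as the sensitive type
        self.labels = {}

    def observe(self, field_name, value):
        self.total[field_name] += 1
        if CARD_PATTERN.fullmatch(value) and luhn_valid(value):
            self.hits[field_name] += 1
        rate = self.hits[field_name] / self.total[field_name]
        if rate >= self.threshold:
            self.labels[field_name] = "bank_card"
        else:
            # Assumption: a rate below the threshold removes the label.
            self.labels.pop(field_name, None)
        return rate


tagger = FieldTagger(threshold=0.5)
tagger.observe("acct", "4111111111111111")  # passes match and verification
tagger.observe("acct", "n/a")               # fails the match
print(tagger.labels)
```

Because the label depends on the running rate rather than on a single value, one malformed value does not immediately strip a correctly labeled field.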

In a second aspect, the present invention provides a device for processing sensitive data, the device comprising:

The first processing unit is configured to receive sampled data sent by multiple application interfaces, perform hash processing on feature information corresponding to each of the application interfaces, and determine an interface identifier of each of the application interfaces;

A determining unit, configured to determine the sample data corresponding to each of the interface identifiers based on the maximum total amount of processed data for the sampled data within a preset time period, the total amount of data at the current moment within the current cycle time period, and preset processing conditions;

A second processing unit, configured to determine conversion data corresponding to each piece of data in each of the sample data, where the conversion data includes a field name and a corresponding value corresponding to the field name;

The obtaining unit is configured to perform sensitivity type identification on corresponding values in each piece of converted data, and obtain sensitive types corresponding to all corresponding values in each piece of converted data.

In a possible manner, the preset processing condition is expressed based on the following manner:

In a possible implementation manner, the determining unit is specifically configured to: determine whether the current cycle duration is the cycle duration in which the sample data of each interface identifier is determined for the first time; when it is, determine the ratio of the total amount of interface data corresponding to any interface identifier at the current moment within the current cycle duration to the total amount of data at the current moment within the current cycle duration; multiply the ratio by the maximum total amount of processed data to obtain the total initial data amount of the initial sample data of that interface identifier; determine the first interface data total amount of the first interface data corresponding to that interface identifier at any moment after the current moment within the current cycle duration; when it is determined that the first interface data total amount is not greater than the corresponding total initial data amount, determine a first probability that each piece of data in the first interface data is placed into the corresponding array, and obtain first data in the corresponding array based on the first probability and the data in the first interface data; and use the first data as the sample data of that interface identifier, so as to determine the sample data of each interface identifier.

In a possible implementation manner, the determining unit is further configured to: when it is determined that the first interface data total amount of any interface identifier is greater than the total data amount of the initial sample data, determine a second probability that each piece of data in the first interface data is placed into the corresponding array, where the second probability is different from the first probability; obtain second data in the corresponding array based on the second probability and the data in the first interface data; and use the second data as the sample data of that interface identifier, so as to determine the sample data of each interface identifier.

In a possible implementation manner, the determining unit is specifically configured to: when it is determined that the current cycle duration is not the cycle duration in which the sample data of each interface identifier is determined for the first time, and that historical sample data is stored in the array corresponding to any interface identifier, process the historical sample data to obtain a sample identifier of each piece of historical sample data; determine the total data amount of the historical sample data corresponding to that interface identifier and the total data amount corresponding to any sample identifier, and determine, based on these two amounts, the weight coefficient corresponding to that sample identifier; when it is determined that the first interface data total amount of that interface identifier is greater than the total data amount of the initial sample data, determine a third probability that each piece of data in the first interface data is placed into the corresponding array; and obtain third data in the corresponding array based on the third probability and the data in the first interface data, and use the third data as the sample data of that interface identifier, so as to determine the sample data of each interface identifier, where the third probability is the product of the second probability and the weight coefficient.

In a possible implementation manner, the obtaining unit is specifically configured to: perform initial identification processing on all corresponding values in the conversion data to obtain the total number of times that all corresponding values have been identified and the number of times that they have been identified as each sensitive type; perform identification matching on any corresponding value in each piece of conversion data based on a preset regular expression or preset metadata keywords; when the match passes, verify that corresponding value based on a preset algorithm; when the verification passes, accumulate the total number of times and the number of times for the sensitive type to which that corresponding value belongs, to obtain a first total number of times and a first number of times; obtain a first recognition rate based on the first total number of times and the first number of times, where the recognition rate characterizes the probability that the type of that corresponding value is a specific sensitive type; and when it is determined that the first recognition rate is not less than the corresponding preset threshold, add a label to that corresponding value, where the label indicates that the type of that corresponding value is the specific sensitive type.

In a possible implementation manner, the obtaining unit is further configured to: when any corresponding value in each piece of conversion data fails the identification matching and/or fails the verification, accumulate the total number of times to obtain a second total number of times; obtain a second recognition rate based on the second total number of times and the number of times for the sensitive type to which that corresponding value belongs; and when it is determined that the second recognition rate is not less than the preset threshold, keep the label corresponding to that corresponding value unchanged.

For the beneficial effects of the above-mentioned second aspect and each optional device of the second aspect, reference may be made to the beneficial effects of the above-mentioned first aspect and each optional method of the first aspect, which will not be repeated here.

In a third aspect, the present invention provides a computer device including a program or instructions which, when executed, perform the method of the first aspect and each optional method of the first aspect.

In a fourth aspect, the present invention provides a storage medium including a program or instructions which, when executed, perform the method of the first aspect and each optional method of the first aspect.

Brief Description of the Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the following will briefly introduce the drawings that need to be used in the description of the embodiments.

FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present invention;

FIG. 2 is a schematic flowchart of steps of a method for processing sensitive data provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an apparatus for processing sensitive data provided by an embodiment of the present invention.

Detailed Description

In order to better understand the above technical solutions, they are described in detail below with reference to the accompanying drawings and specific implementations. It should be understood that the embodiments of the present invention and the specific features in the embodiments are detailed descriptions of the technical solutions of the present invention rather than limitations on them, and that the embodiments and the technical features in the embodiments may be combined with one another where no conflict arises.

It should be noted that the terms "first" and "second" in the specification and claims of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It is to be understood that the terms so used are interchangeable under appropriate circumstances, such that the embodiments of the invention described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with aspects of the invention as recited in the appended claims.

At present, with the explosive growth of Internet business, the number of distributed system services has increased sharply, and the call relationships between service interfaces are diverse and complicated. It is therefore necessary to strengthen the governance, operation and protection of the sensitive data carried by each interface, and identifying the sensitive data within that data becomes particularly important.

However, in the prior art, the sensitive types of the acquired data are generally identified directly, that is, the entire amount of data is identified and processed. In this way, not only the identification efficiency is low, but also more memory resources are consumed. And as the source and data volume of sensitive data increase, it is impossible to process the newly added sensitive data in an accurate and timely manner, that is, the overall processing efficiency of sensitive data is low.

In view of this, the embodiment of the present invention provides a method for processing sensitive data. With this method, an interface identifier is calculated for each application interface, and the system services corresponding to different application interfaces are distinguished based on the interface identifier. This reduces the influence of the different data volumes of different system services and, to a certain extent, reduces the impact of data skew on the subsequent identification of sensitive types. In addition, the sampled data sent by multiple application interfaces is sorted using samples instead of the full data, which greatly reduces the amount of data to be processed and increases processing speed, thereby lowering the cost of manpower and machine resources and improving the efficiency of sensitive data identification.

Having introduced the design idea of the embodiments of the present invention, the following briefly describes the application scenarios to which the technical solution for processing sensitive data is applicable. It should be noted that the application scenarios described here are intended to explain the technical solutions of the embodiments more clearly and do not limit them. Those of ordinary skill in the art will appreciate that, as new application scenarios emerge, the technical solutions provided by the embodiments of the present invention are equally applicable to similar technical problems.

In the embodiment of the present invention, please refer to the schematic diagram of an application scenario shown in FIG. 1 , which includes a computer device 101 and an application server 102 , and the computer device 101 can communicate with the application server 102 . Specifically, for example, direct or indirect connection is performed through wired or wireless communication, which is not limited in the present invention. Wherein, the application server 102 includes an application server 102-1, an application server 102-2, . . . , and an application server 102-n, where n is a positive integer greater than 2.

In this scenario, the application server 102 can send data containing sensitive data to the computer device 101, so that the computer device 101 can process the received data, thereby obtaining the data type of the sensitive data in the received data, and realizing sorting out the sensitive data . In a specific implementation process, the computer device 101 may store the processing result of the received data in a corresponding database, and may also send the processing result of the received data to a data security platform deployed on other computer devices.

The computer device 101 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms, but is not limited thereto. The application server 102 may be a server deployed in a distributed system.

To further illustrate the method for processing sensitive data provided by the embodiment of the present invention, it is described in detail below with reference to the accompanying drawings and specific implementations. Although the embodiments of the present invention provide the method operation steps shown in the following embodiments or drawings, the method may include more or fewer operation steps based on conventional or non-creative effort. For steps that have no logically necessary causal relationship, the execution order is not limited to the order provided in the embodiment of the present invention. During actual processing, or when executed by a device (for example, a parallel processor or a multi-threaded application environment), the method may be executed sequentially or in parallel according to the methods shown in the embodiments or drawings.

The method for processing sensitive data in the embodiment of the present invention is described below with reference to the method flowchart shown in FIG. 2.

Step 201: Receive sampling data sent by multiple application interfaces, perform hash processing on feature information corresponding to each application interface, and determine an interface identifier of each application interface.

In the embodiment of the present invention, the computer device may receive sampled data sent by multiple application interfaces. Specifically, the multiple application interfaces may all be of different types, or some may be of the same type and others of different types, which is not limited in this embodiment of the present invention.

In addition, in an actual implementation, the number of application interfaces may change over time. For example, at 9:31 am on June 17, 2021, 4 application interfaces send sampled data to the computer device, and at 9:32 am on June 17, 2021, 8 application interfaces send sampled data to the computer device.

In the embodiment of the present invention, the computer device may determine the feature information of each of the multiple application interfaces, so as to determine the feature values corresponding to the feature information. Specifically, the feature information may be obtained because each application interface carries its feature information when sending sampling data, or the computer device may send a request for the feature information to the application servers corresponding to the multiple application interfaces and obtain the feature information from the feedback of those servers. This is not limited in this embodiment of the present invention.

Specifically, the feature information may at least include: the service ID corresponding to the application interface; the scene ID, where the scene ID is, for example, the ID of the update scene; the packet type of the data, such as synchronous or asynchronous; the system number of the requester; and the system number of the responder.

In the embodiment of the present invention, a hash operation may be performed on the feature value corresponding to each application interface, so as to obtain the interface identifier of each application interface. It should be noted that each interface identifier is unique, that is, the corresponding application interface can be determined based on the interface identifier.

For example, assuming that the characteristic values corresponding to application interface 1 are: V1, V2, ..., Vn, where n is a positive integer greater than 2, it can be determined that the interface identifier corresponding to application interface 1 can be expressed as: APP_ID=HASH(V1 +V2+...+Vn).
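The derivation of APP_ID can be sketched as follows. This is only an illustration: the source does not name the hash function, so SHA-256 is assumed here, and the separator used when joining the values, the function name `interface_id` and the sample feature values are all hypothetical.

```python
import hashlib

def interface_id(feature_values):
    """Return a hex digest that identifies one application interface."""
    # Assumption: the values are joined with a separator that does not occur
    # inside them, so ("ab", "c") and ("a", "bc") hash to different identifiers.
    joined = "\x1f".join(str(v) for v in feature_values)
    # Assumption: SHA-256 stands in for the unspecified HASH function.
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hypothetical feature values: service ID, scene ID, packet type,
# requester system number, responder system number.
app_id_1 = interface_id(["svc-001", "scene-update", "sync", "SYS-A", "SYS-B"])
app_id_2 = interface_id(["svc-002", "scene-update", "sync", "SYS-A", "SYS-B"])
```

Interfaces whose feature values are identical map to the same identifier, which is what allows their data to be grouped into one sample array later on.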

It can be seen that in the embodiment of the present invention, the system services corresponding to different application interfaces are distinguished based on the interface identifier, so that the impact of the different data volumes of different system services can be reduced, and the impact of data skew on the subsequent sensitive type identification can be reduced to a certain extent.

Step 202: Determine the sample data corresponding to each interface identifier based on the maximum total amount of data processed for the sampled data within the preset time period, the total amount of data at the current moment in the current cycle time period, and the preset processing conditions.

In the embodiment of the present invention, the computer device may periodically process the received sampling data based on a preset duration. For example, assuming that the preset duration is 1 minute, the received sampling data may be processed at a period of 1 minute. It should be noted that the preset duration may be determined based on actual implementation, which is not limited in this embodiment of the present invention.

In the embodiment of the present invention, in order to make the coverage of the finally determined sample data more comprehensive and to adapt to changes in the data source, that is, changes in the number of application interfaces or in the total amount of data provided by each application interface, it may be determined, before the sample data corresponding to each application interface is determined, whether the current cycle duration is the cycle duration in which the sample data of each interface identifier is determined for the first time.

In a possible implementation manner, when the current cycle duration is determined to be the cycle duration in which the sample data of each interface identifier is determined for the first time, the following steps may be used, without limitation, to determine the initial sample data of any interface identifier:

Step a: determine the ratio of the total amount of interface data corresponding to any interface identifier at the current moment in the current cycle duration to the total amount of data at the current moment in the current cycle duration;

Step b: Multiply the ratio by the maximum total amount of data processed for the sampled data within a preset period of time to obtain the total amount of initial data of the initial sample data identified by any interface;

In the embodiment of the present invention, assume that the maximum total amount of data processed for the sampled data within the preset duration is K_MAX, the total amount of data at the current moment in the current cycle duration is N, and the total amount of interface data corresponding to any application interface at the current moment in the current cycle duration is N_{APP_ID}. The total amount of initial data of each interface identifier can then be determined as:

K_{APP_ID} = N_{APP_ID} / N × K_MAX

Step c: determine the first interface data total amount of the first interface data corresponding to any interface identifier at any moment after the current moment in the current cycle duration;

Step d: When it is determined that the total amount of the first interface data is not greater than the corresponding total amount of initial data, determine the first probability that each piece of data in the first interface data is returned to the corresponding array, and obtain the first data in the corresponding array based on the first probability and the data in the first interface data;

Step e: use the first data as the sample data of the interface identifier to determine the sample data of any interface identifier.

In the embodiment of the present invention, since the total amount of interface data N_{APP_ID} increases as the application interface corresponding to the interface identifier sends more data, it is assumed that at any moment after the current moment in the current cycle duration, the total amount of first interface data corresponding to any interface identifier is N′_{APP_ID}.

Specifically, when N′_{APP_ID(x)} ≤ K_{APP_ID(x)}, the first probability is 1, that is, every piece of data in the first interface data is returned to the corresponding array.

Wherein, x may represent the sequence identifier of the application interface, for example, if the sequence identifier of the first application interface is 1, then the interface identifier of the first application interface is APP_ID(1).

Further, the computer device may obtain the first data in the corresponding array based on the first probability and the data in the first interface data. Then use the first data as the sample data of the interface identifier to determine the sample data of any interface identifier.

Step f: When it is determined that the total amount of first interface data identified by any interface is greater than the total amount of data of the initial sample data, determine the second probability that each piece of data in the first interface data is returned to the corresponding array.

Step g: based on the second probability and the data in the first interface data, obtain the second data in the corresponding array, the second probability is different from the first probability;

Step h: use the second data as the sample data of any interface identifier to determine the sample data of each interface identifier.

In the embodiment of the present invention, when N′_{APP_ID(x)} > K_{APP_ID(x)}, the second probability can be determined as:

K_{APP_ID(x)} / N′_{APP_ID(x)}

Specifically, if the current data is taken out with a probability of K_{APP_ID(x)} / N′_{APP_ID(x)}, it then replaces an existing piece of data in the corresponding array, each existing piece being selected with a probability of 1 / K_{APP_ID(x)}; otherwise the array data remains unchanged. Therefore, the probability of retaining the current data is K_{APP_ID(x)} / N′_{APP_ID(x)}.
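The two cases above follow the shape of classical reservoir sampling, and a minimal sketch of one sampling step might look like this. The function name `offer` and its parameters are hypothetical, and the sketch assumes that an array holding fewer pieces than its total amount of initial data always accepts the incoming piece, as in the worked example later in this description.

```python
import random

def offer(reservoir, item, quota, seen, rng=random):
    """Offer one piece of data of an interface to its sample array.

    reservoir: list holding the current sample data (modified in place)
    quota:     K_APP_ID(x), the total amount of initial data
    seen:      N'_APP_ID(x), pieces received so far, including this one (>= 1)
    Returns True when the piece ends up in the array.
    """
    if len(reservoir) < quota:
        reservoir.append(item)              # first case: kept with probability 1
        return True
    if rng.random() < quota / seen:         # second case: kept with K / N'
        victim = rng.randrange(len(reservoir))  # each slot chosen with 1 / K
        reservoir[victim] = item
        return True
    return False                            # array left unchanged

# Usage: stream 1000 pieces through an array with a quota of 100.
rng = random.Random(0)
sample = []
for n in range(1, 1001):
    offer(sample, n, quota=100, seen=n, rng=rng)
```

After the loop the array still holds exactly 100 pieces, each drawn from the whole stream rather than only from its beginning or end.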

In the embodiment of the present invention, the aforementioned solution for determining the sample data corresponding to the interface identifier needs to meet the preset processing conditions. Specifically, the preset processing conditions can be expressed in the following manner:

K_{APP_ID(1)} + K_{APP_ID(2)} + ... + K_{APP_ID(X)} ≤ K_MAX

Here, X represents the number of types of application interfaces, K_{APP_ID(x)} represents the sample data volume of the x-th type of application interface within the preset duration, and K_MAX represents the maximum total amount of data processed for the sampled data within the preset duration.

For example, assuming that the number of types of application interfaces is 3, the sum of K _{APP_ID(1)} , K _{APP_ID(2)} and K _{APP_ID(3)} is not greater than K _MAX .

It can be seen that when an application interface is added, the total amount of interface data of each already identified application interface remains unchanged while the total amount of data becomes larger, so the amount of sample data allotted to each of those application interfaces becomes smaller.

Specifically, assume that application interface x is an application interface whose sample data has already been determined. If the total amount of sample data held for application interface x is not greater than the reduced total amount of sample data determined at a moment after the current moment, the previously determined sample data does not need to be adjusted, and data subsequently sent by application interface x is returned to the array as before: with the first probability when the total amount of interface data of application interface x is not greater than the reduced total amount of sample data, and with the second probability when it is greater.

In addition, if the total amount of sample data held for application interface x is greater than the reduced total amount of sample data determined at a moment after the current moment, the existing sample data needs to be reduced to the reduced total amount of sample data, and the probability of returning data to the array is unchanged.
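The reduction of an existing sample array to a smaller total can be sketched as below. The source does not say how the pieces to keep are chosen; this sketch assumes they are drawn uniformly at random without replacement, and the name `shrink_to_quota` is hypothetical.

```python
import random

def shrink_to_quota(reservoir, new_quota, rng=random):
    """Reduce a sample array in place to at most new_quota pieces."""
    if len(reservoir) > new_quota:
        # Assumption: every held piece is equally likely to survive.
        reservoir[:] = rng.sample(reservoir, new_quota)
    return reservoir

# Usage: an array holding 100 pieces is cut back to a reduced total of 90.
held = list(range(100))
shrink_to_quota(held, 90, random.Random(1))
```

An array that already holds no more than the reduced total is returned unchanged, which matches the no-adjustment case described above.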

In a possible implementation manner, when the current cycle duration is determined not to be the cycle duration in which the sample data of each interface identifier is determined for the first time, the solution for determining the sample data of each interface identifier may include, without limitation, the following steps:

Step A: When historical sample data is stored in the array corresponding to any interface identifier, process the historical sample data to obtain the sample identifier of each piece of historical sample data;

Step B: Determine the total amount of historical sample data corresponding to any interface identifier and the total amount of data corresponding to any sample identifier, and determine the weight coefficient corresponding to each sample identifier based on the total amount of historical sample data and the total amount of data corresponding to the sample identifier;

Step C: When it is determined that the total amount of first interface data identified by any interface is greater than the total amount of data in the initial sample data, determine the third probability that each piece of data in the first interface data is returned to the corresponding array;

Step D: Obtain the third data in the corresponding array based on the third probability and the data in the first interface data, and use the third data as the sample data of any interface identifier to determine the sample data of each interface identifier, where the third probability is the product of the second probability and the weight coefficient.

In the embodiment of the present invention, the APP_ID is calculated from the feature information or attribute values of the application interface, whereas the identification of the sensitive type has to be performed on each piece of data sent by the application interface. Therefore, when determining the sample data corresponding to the current interface, the probability that data of a kind already selected as a sample is selected again is reduced, so as to minimize data skew and improve the coverage of the sample data.

Specifically, the data content of each piece of sample data in the historical sample data can be analyzed to obtain attributes such as the parameter list P and message length L of the message content, and the unique identifier of the message content can be calculated through a hash algorithm. The unique identifier may be called a sample identifier, and may be expressed as: BODY_ID=HASH(P+...+L).

Assuming that the total amount of historical sample data corresponding to any interface identifier is expressed as K_{APP_ID(ALL)}, and the total amount of data corresponding to any sample identifier is expressed as V_{BODY_ID}, the weight coefficient corresponding to each sample identifier can be determined as: W_{BODY_ID} = 1 - V_{BODY_ID} / K_{APP_ID(ALL)}. It can be seen that when V_{BODY_ID} is 0, W_{BODY_ID} is 1.
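As a rough sketch of the sample identifier and the weight coefficient: the hash function (SHA-256), the way the parameter list and message length are joined, and the names `body_id`, `weight_coefficients` and `weight_for` are all assumptions made for illustration, since the source only gives BODY_ID=HASH(P+...+L) and the formula for W_{BODY_ID}.

```python
import hashlib
from collections import Counter

def body_id(param_names, message_length):
    """Hash the parameter list P and message length L into a sample identifier."""
    # Assumption: parameter order does not matter, so the names are sorted.
    joined = ",".join(sorted(param_names)) + "|" + str(message_length)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def weight_coefficients(history_body_ids):
    """Return W = 1 - V / K_ALL for every sample identifier in the history."""
    total = len(history_body_ids)          # K_APP_ID(ALL)
    counts = Counter(history_body_ids)     # V_BODY_ID per sample identifier
    return {bid: 1 - count / total for bid, count in counts.items()}

def weight_for(weights, bid):
    """A sample identifier never seen before has V = 0 and therefore W = 1."""
    return weights.get(bid, 1.0)

# Usage: three historical samples of kind "a" and one of kind "b".
weights = weight_coefficients(["a", "a", "a", "b"])
```

In this usage the over-represented kind "a" receives the lower weight, so new data of that kind is less likely to be sampled again.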

In the embodiment of the present invention, when it is determined that N′_{APP_ID(x)} > K_{APP_ID(x)}, the third probability can be determined as:

K_{APP_ID(x)} / N′_{APP_ID(x)} × W_{BODY_ID}

Specifically, if the current data is taken out with a probability of K_{APP_ID(x)} / N′_{APP_ID(x)} × W_{BODY_ID}, it then replaces an existing element in the corresponding array, each existing element being selected with a probability of 1 / K_{APP_ID(x)}; otherwise the array elements remain unchanged. Therefore, the probability of retaining the current data is K_{APP_ID(x)} / N′_{APP_ID(x)} × W_{BODY_ID}.
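A sampling step that uses the third probability could be sketched as follows, assuming the third probability is the second probability multiplied by the weight coefficient as stated in step D. The function name `offer_weighted` and its parameters are hypothetical.

```python
import random

def offer_weighted(reservoir, item, quota, seen, weight, rng=random):
    """Offer one piece of data using the weighted (third) probability.

    quota:  K_APP_ID(x);  seen: N'_APP_ID(x) (>= 1);  weight: W_BODY_ID in [0, 1]
    Returns True when the piece ends up in the array.
    """
    if len(reservoir) < quota:
        reservoir.append(item)                  # array not full yet: always kept
        return True
    if rng.random() < (quota / seen) * weight:  # third probability
        reservoir[rng.randrange(len(reservoir))] = item  # uniform replacement
        return True
    return False

# Usage: a piece whose kind has weight 0 can never displace existing samples.
rng = random.Random(0)
array = [1, 2, 3]
results = [offer_weighted(array, n, 3, n, 0.0, rng) for n in range(4, 100)]
```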

In order to better illustrate the process of determining sample data, a specific processing procedure is taken as an example below to describe the manner of determining sample data provided in step 202.

In the embodiment of the present invention, it is assumed that the preset duration is a unit time such as 1 minute, that the current cycle duration is the cycle duration in which the sample data of application interface A is determined for the first time, that the maximum total amount of data processed is 100 pieces, and that the total amount of data corresponding to application interface A is 0 pieces.

Then, at the first moment after the current moment in the current cycle duration, for example 15:06:01, if the first piece of data sent by application interface A is received, the total amount of initial data of application interface A can be determined as 1/1 × 100 = 100 pieces. The total amount of first interface data of application interface A, 1 piece, is not greater than the total amount of initial data of application interface A, 100 pieces, so the first piece of data of application interface A can be returned to the corresponding array with the first probability, that is, 1/1 = 1.

At 15:06:02, if the second piece of data sent by application interface A is received, the total amount of initial data of application interface A can be determined as 2/2 × 100 = 100 pieces. The total amount of first interface data of application interface A, 2 pieces, is not greater than the total amount of initial data of application interface A, 100 pieces, so the second piece of data of application interface A can be returned to the corresponding array with the first probability, that is, 2/2 = 1.

At 15:06:13, if the 100th piece of data sent by application interface A is received, the total amount of initial data of application interface A can be determined as 100/100 × 100 = 100 pieces. The total amount of first interface data of application interface A, 100 pieces, is not greater than the total amount of initial data of application interface A, 100 pieces, so the 100th piece of data of application interface A can be returned to the corresponding array with the first probability, that is, 100/100 = 1.

At 15:06:15, if the 101st piece of data sent by application interface A is received, the total amount of initial data of application interface A can be determined as 101/101 × 100 = 100 pieces. The total amount of first interface data of application interface A, 101 pieces, is greater than the total amount of initial data of application interface A, 100 pieces, so the 101st piece of data is retained in the array with a probability of 100/101, and each of the original 100 pieces of data in the array is selected for replacement with a probability of 1/100.

Further, at 15:06:16, if the first piece of data sent by application interface B is received, the total amount of initial data of application interface B can be determined as 1/102 × 100 = 1 piece. It should be noted that, in the actual calculation process, the final quantity can be determined by rounding up. It can be seen that the total amount of first interface data of application interface B, 1 piece, is not greater than the corresponding total amount of initial data, 1 piece, so the first piece of data of application interface B can be returned to the corresponding array.

At 15:06:17, if the second piece of data sent by application interface B is received, the total amount of initial data of application interface B can be determined as 2/103 × 100 = 2 pieces (rounded up). It can be seen that the total amount of first interface data of application interface B, 2 pieces, is not greater than the corresponding total amount of initial data, 2 pieces, so the second piece of data of application interface B can be returned to the corresponding array with the first probability, that is, 2/2 = 1.

At 15:06:19, if the 11th piece of data sent by application interface B is received, the total amount of initial data of application interface B can be determined as 11/112 × 100 = 10 pieces (rounded up). It can be seen that the total amount of first interface data of application interface B, 11 pieces, is greater than the corresponding total amount of initial data, 10 pieces, so the 11th piece of data is retained in the array with a probability of 10/11, and each of the original 10 pieces of data in the array is selected for replacement with a probability of 1/10.

At 15:06:20, if the 12th piece of data sent by application interface B is received, the total amount of initial data of application interface B can be determined as 12/113 × 100 = 11 pieces (rounded up). It can be seen that the total amount of initial data has become 11 pieces, while the 11th piece of data received from application interface B replaced one of the 10 pieces of data in the original array. The number of pieces of data in the array corresponding to application interface B is therefore less than the total amount of initial data corresponding to application interface B, so the 12th piece of data of application interface B can be returned to the corresponding array with the first probability of 1.

At 15:06:35, if the 102nd piece of data sent by application interface A is received, the total amount of initial data of application interface A can be determined as 102/114 × 100 = 90 pieces (rounded up), while 100 pieces of data have already been saved in the array corresponding to application interface A, that is, the historical sample data is greater than the reduced total amount of initial data. Therefore, 90 of the elements of the array are first retained, each with a probability of 90/100; the 102nd piece of data is then retained with a probability of 90/102, and each element in the array is selected for replacement with a probability of 1/90.

It can be seen that the above method, which is based on an improved reservoir sampling method, can perform strong random sampling of streaming data, making the sample data coverage more comprehensive and more adaptable to changes in data sources, and improving the effectiveness and stability of sensitive data identification and sorting.
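The totals quoted in this worked example can be re-derived with a few lines of arithmetic. The sketch below assumes rounding up, as the example states, and uses integer arithmetic to avoid floating-point edge cases; the name `initial_total` is hypothetical.

```python
def initial_total(n_app, n_total, k_max):
    """Return ceil(n_app / n_total * k_max) using integers only."""
    return (n_app * k_max + n_total - 1) // n_total

K_MAX = 100
# (interface, pieces received from that interface, pieces received overall),
# taken from the moments described in the worked example.
steps = [
    ("A", 1, 1), ("A", 100, 100), ("A", 101, 101),
    ("B", 1, 102), ("B", 2, 103), ("B", 11, 112), ("B", 12, 113),
    ("A", 102, 114),
]
totals = [initial_total(n_app, n_total, K_MAX) for _, n_app, n_total in steps]
# totals is [100, 100, 100, 1, 2, 10, 11, 90], matching the figures in the text.
```

One consequence of rounding up is that the per-interface totals can add up to slightly more than K_MAX (with 114 pieces overall, interface A gets 90 and interface B gets 11), so an implementation would presumably need a further adjustment to satisfy the preset processing condition; the source does not specify one.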

Step 203: Determine the conversion data corresponding to each piece of data in each sample data, the conversion data includes field names and corresponding values corresponding to the field names.

In the embodiment of the present invention, after determining the sample data corresponding to the multiple interface identifiers, the computer device can parse each piece of data in each sample data. Specifically, messages in formats such as JSON and XML can be converted into KEY-VALUE pairs, that is, conversion data including the field name and the corresponding value.
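For a JSON message, the conversion into field name and value pairs might be sketched as below. The dotted and indexed naming of nested fields is an assumption made for illustration, as are the function name `flatten` and the sample message; XML handling is not shown.

```python
import json

def flatten(value, prefix=""):
    """Yield (field_name, leaf_value) pairs from nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value   # a leaf: this is one KEY-VALUE pair

# Hypothetical message body with made-up values.
message = '{"user": {"name": "Li", "phone": "13800000000"}, "cards": ["4111111111111111"]}'
pairs = dict(flatten(json.loads(message)))
```

Each resulting pair can then be passed to sensitive type identification on its own, independent of how deeply it was nested in the original message.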

Step 204: Perform sensitive type identification on corresponding values in each piece of converted data, and obtain sensitive types corresponding to all corresponding values in each piece of converted data.

In the embodiment of the present invention, the computer device may perform initial identification processing on the corresponding values in all the converted data, and obtain the total number of times that all corresponding values are identified and the number of times that all corresponding values are identified as each sensitive type.

Further, the computer device can identify the sensitive type of each piece of data based on an identification strategy. Specifically, the identification strategy is based on metadata keyword matching and preset algorithm verification, or on preset regular expression matching and preset algorithm verification.

In the embodiment of the present invention, the computer device can identify and match any corresponding value in each piece of converted data based on a preset regular expression or a preset metadata keyword. The preset regular expression may be a VALUE regular expression, and the preset metadata keyword may be determined based on the actual implementation, which is not limited in this embodiment of the present invention. When the matching is passed, the corresponding value is verified based on the preset algorithm. When the verification is passed, the total number of times and the number of times of the sensitive type to which the corresponding value belongs are each accumulated, to obtain the first total number and the first number. The preset algorithm may be the VALUE algorithm; other algorithms may also be used, which is not limited in this embodiment of the present invention.

Further, the first recognition rate can be obtained based on the first total number and the first number, where the recognition rate is used to characterize the probability that the type of any corresponding value is a specific sensitive type. When it is determined that the first recognition rate is not less than the corresponding preset threshold, a label is added to the corresponding value, and the label is used to indicate that the type of the corresponding value is the specific sensitive type.

In the embodiment of the present invention, it is assumed that the total number of times that all corresponding values are identified is expressed as N_{APP_ID_FIELD}, and the number of times that a corresponding value is identified as a sensitive type is expressed as N_X, where X is a sensitive label such as certificate number (ID), mobile phone number (PHONE) or bank card number (BANK).

In the embodiment of the present invention, assume that the preset threshold corresponding to any corresponding value is expressed as R_ERROR, and that the field name corresponding to the value is expressed as F. When the corresponding value is matched as a bank card number by a preset regular expression and passes the algorithm check, the values of N_{APP_ID_FIELD(F)} and N_{BANK(F)} are each increased by one to obtain the first total number and the first number, so that the first recognition rate can be determined as: R_S(BANK) = N′_{BANK(F)} / N′_{APP_ID_FIELD(F)}.

Specifically, if R _S(BANK) is not less than R _ERROR , then add the BANK label to field F, and determine that the application interface corresponding to any corresponding value is a sensitive interface "involving bank card numbers". If R _S(BANK) is less than R _ERROR , then no BANK tag is added to field F, and if the field already has a BANK tag, it is cleared.

It should be noted that, in the embodiment of the present invention, the preset algorithm for verifying the bank card number may be a modulo 10 algorithm, of course, it may also be other algorithms, which are not limited in the embodiment of the present invention. It can be seen that different preset algorithms can be used for different specific sensitive types.
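The modulo 10 check mentioned here is commonly known as the Luhn algorithm. A minimal sketch of regular expression matching followed by that check is given below; the 13 to 19 digit pattern and the names `BANK_CARD_PATTERN`, `luhn_valid` and `looks_like_bank_card` are assumptions, not details taken from the source.

```python
import re

# Assumption: a candidate bank card number is a run of 13 to 19 digits.
BANK_CARD_PATTERN = re.compile(r"\d{13,19}")

def luhn_valid(digits):
    """Return True when a string of digits passes the modulo 10 (Luhn) check."""
    total = 0
    for position, ch in enumerate(reversed(digits)):
        d = int(ch)
        if position % 2 == 1:   # every second digit, counted from the right
            d *= 2
            if d > 9:
                d -= 9          # same as adding the two digits of the product
        total += d
    return total % 10 == 0

def looks_like_bank_card(value):
    """Regular expression match first, then algorithm verification."""
    return bool(BANK_CARD_PATTERN.fullmatch(value)) and luhn_valid(value)
```

For instance, the widely used test number 4111111111111111 passes both stages, while changing its last digit to 2 makes the checksum fail.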

In a possible implementation manner, when the computer device determines that any corresponding value in each piece of converted data fails the identification matching and/or fails the verification, the total number of times is accumulated to obtain the second total number of times, and a second recognition rate can then be obtained based on the second total number of times and the number of times of the sensitive type to which the corresponding value belongs. Further, when it is determined that the second recognition rate is not less than the preset threshold, the label corresponding to the corresponding value remains unchanged.

In this embodiment of the present invention, the processing of the aforementioned field F is again taken as an example. Specifically, when field F does not match any regular expression, or the algorithm check fails, the value of N_{APP_ID_FIELD(F)} is increased by one to obtain the second total number of times, so that the second recognition rate can be determined as: R′_S(BANK) = N_{BANK(F)} / N′_{APP_ID_FIELD(F)}. If the second recognition rate R′_S(BANK) is not less than R_ERROR, the label of field F remains unchanged; if the second recognition rate R′_S(BANK) is less than R_ERROR, the label corresponding to field F is cleared.

It should be noted that, in the embodiment of the present invention, if any field has multiple meanings, that is, it cannot pass the verification and has no previous label, a prompt is output, and the user of the computer device can manually mark the field so that its label is determined.
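The counting and labelling logic for one field can be sketched as follows. The sketch assumes that the recognition rate is the number of times the field was identified as a given sensitive type divided by the total number of identifications, so that it behaves as a probability; the class name `FieldStats` and its attributes are hypothetical.

```python
class FieldStats:
    """Counters and labels for one field F of one application interface."""

    def __init__(self, threshold):
        self.threshold = threshold   # R_ERROR
        self.total = 0               # N_APP_ID_FIELD(F)
        self.hits = {}               # N_X(F) for each sensitive type X
        self.labels = set()

    def record(self, matched_type=None):
        """Record one identification; matched_type is None when matching or verification failed."""
        self.total += 1
        if matched_type is not None:
            self.hits[matched_type] = self.hits.get(matched_type, 0) + 1
            # First recognition rate: add the label or clear it.
            if self.hits[matched_type] / self.total >= self.threshold:
                self.labels.add(matched_type)
            else:
                self.labels.discard(matched_type)
        else:
            # Second recognition rate: keep existing labels or clear them.
            for label in list(self.labels):
                if self.hits[label] / self.total < self.threshold:
                    self.labels.discard(label)
        return set(self.labels)

# Usage with an assumed threshold of 0.75.
stats = FieldStats(threshold=0.75)
for _ in range(3):
    stats.record("BANK")          # rate 1.0, label added
after_one_failure = stats.record()    # rate 3/4, label kept
after_two_failures = stats.record()   # rate 3/5, label cleared
```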

It can be seen that in the embodiment of the present invention, first, using samples instead of the full data to sort out the sensitive assets of the overall service interfaces can greatly reduce the amount of data to be processed, improve the speed of data processing, and reduce manpower and machine resource costs. Second, calculating the unique identifier of each application interface and classifying the data by it distinguishes different system services, reduces the impact of the different request volumes of different system services, alleviates data skew to a certain extent, and makes the sample data fit the characteristics of the overall service interfaces better. Third, the actual application scenario is real-time data processing. Based on the improved reservoir sampling method, strong random sampling can be performed on streaming data, and multiplying by the weight coefficient reduces the probability that data of an already sampled kind is sampled again (which in effect raises the sampling probability of data with a small volume). This makes the coverage of the sample data more comprehensive and more adaptable to changes in data sources, and improves the effectiveness and stability of sorting out sensitive assets.

As shown in Figure 3, the present invention provides a device for processing sensitive data, the device includes a first processing unit 301, a determination unit 302, a second processing unit 303 and an obtaining unit 304, wherein:

The first processing unit 301 is configured to receive sampled data sent by multiple application interfaces, perform hash processing on feature information corresponding to each of the application interfaces, and determine an interface identifier of each of the application interfaces;

The determination unit 302 is configured to determine the sample data corresponding to each of the interface identifiers based on the maximum total amount of processed data for the sampled data within a preset time period, the total amount of data at the current moment within the current cycle time period, and preset processing conditions;

The second processing unit 303 is configured to determine the conversion data corresponding to each piece of data in each of the sample data, the conversion data includes a field name and a corresponding value corresponding to the field name;

The obtaining unit 304 is configured to perform sensitivity type identification on corresponding values in each piece of converted data, and obtain sensitive types corresponding to all corresponding values in each piece of converted data.

The preset processing conditions can be expressed as K_{APP_ID(1)} + K_{APP_ID(2)} + ... + K_{APP_ID(I)} ≤ K_MAX, where I represents the number of types of the application interfaces, K_{APP_ID(i)} represents the sample data volume of the i-th type of application interface within the preset duration, and K_MAX represents the maximum total amount of data processed for the sampled data within the preset duration.

In a possible implementation manner, the determining unit 302 is specifically configured to: determine whether the current cycle duration is the cycle duration for determining the sample data of each of the interface identifiers for the first time; When determining the cycle duration of the sample data of each of the interface identifiers for the first time, determine the total amount of interface data corresponding to any of the interface identifiers at the current moment in the current cycle duration, and the total amount of data at the current moment in the current cycle duration Ratio; multiply the ratio by the total amount of maximum processed data to obtain the total amount of initial data of any initial sample data identified by the interface, and store the initial sample data in a corresponding array; determine At any time after the current moment within the duration of the current period, the first interface data total amount of the first interface data corresponding to any one of the interface identifiers; when it is determined that the total amount of any one of the first interface data is not greater than the corresponding When the total amount of initial data is used, determine the first probability that each piece of data in the first interface data is returned to the corresponding array, and based on the first probability and the data in the first interface data, obtain The first data in the corresponding array; using the first data as the sample data of any one of the interface identifiers to determine the sample data of each of the interface identifiers.

In a possible implementation manner, the determining unit 302 is further configured to: when it is determined that the total amount of the first interface data of any one of the interface identifiers is greater than the total amount of data of the initial sample data, determine the The second probability that each piece of data in the first interface data is returned to the corresponding array; based on the second probability and the data in the first interface data, obtain the second probability in the corresponding array data, the second probability is different from the first probability; the second data is used as sample data of any one of the interface identifiers to determine the sample data of each of the interface identifiers.

In a possible implementation manner, the determining unit 302 is specifically configured to: determine that the any When the historical sample data is stored in the array corresponding to the interface identifier, the historical sample data is processed to obtain the sample identifier of each piece of historical sample data; determine the total amount of historical sample data corresponding to any one of the interface identifiers , and the total amount of data corresponding to any of the sample identifiers, and based on the total amount of data of the historical sample data and the total amount of data corresponding to the sample identifier, determine the weight coefficient corresponding to any sample identifier; when determining any When the total amount of the first interface data identified by the interface is greater than the total amount of data of the initial sample data, determine a third probability that each piece of data in the first interface data is returned to the corresponding array; Based on the third probability and the data in the first interface data, the third data in the corresponding array is obtained, and the third data is used as the sample data of any of the interface identifiers to determine each For the sample data identified by the interface, the third probability is a product of the second probability and a weight coefficient.

In a possible implementation manner, the obtaining unit 304 is specifically configured to: perform initial identification processing on all corresponding values in the converted data, to obtain the total number of times all corresponding values have been identified and the number of times they have been identified as each sensitive type; identify and match any corresponding value in each piece of converted data based on a preset regular expression or a preset metadata keyword, verify that corresponding value based on a preset algorithm when the match succeeds, and, when the verification succeeds, increment both the total number of times and the number of times for the sensitive type to which that corresponding value belongs, to obtain a first total number of times and a first number of times; obtain a first recognition rate based on the first total number of times and the first number of times, the recognition rate characterizing the probability that the type of that corresponding value is a specific sensitive type; and, when it is determined that the first recognition rate is not less than a corresponding preset threshold, add a label to that corresponding value, the label indicating that the type of that corresponding value is the specific sensitive type.
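
The match, verify and count sequence can be sketched as follows. This is an illustration under assumptions: the preset regular expression is taken to be a 13 to 19 digit pattern, the preset algorithm is taken to be the Luhn checksum, and the label name `BANK_CARD` and the function names are hypothetical.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical preset regular expression: 13 to 19 consecutive digits.
CARD_PATTERN = re.compile(r"[0-9]{13,19}")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, used here as a stand-in for the preset algorithm."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def recognise(values: Iterable[str], threshold: float) -> Tuple[float, Optional[str]]:
    """Return (recognition rate, label) for the values of one field."""
    total = hits = 0
    for value in values:
        total += 1  # every identification attempt counts toward the total
        if CARD_PATTERN.fullmatch(value) and luhn_valid(value):
            hits += 1  # matched and verified: count toward the sensitive type
    rate = hits / total if total else 0.0
    label = "BANK_CARD" if total and rate >= threshold else None
    return rate, label
```

For example, four verified values out of five give a rate of 0.8, which is not less than a threshold of 0.8, so the label is added.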

In a possible implementation manner, the obtaining unit 304 is further configured to: when any corresponding value in each piece of converted data fails the identification match and/or fails the verification, increment the total number of times to obtain a second total number of times; obtain a second recognition rate based on the second total number of times and the number of times for the sensitive type to which that corresponding value belongs; and, when it is determined that the second recognition rate is not less than the preset threshold, keep the label of that corresponding value unchanged.
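
A small worked sketch of the failure branch follows. Only the arithmetic described above is shown; the function name `after_failed_check` and the returned tuple layout are hypothetical.

```python
from typing import Tuple


def after_failed_check(total: int, type_count: int,
                       threshold: float) -> Tuple[int, float, bool]:
    """Update the counters after a value fails matching or verification.

    Only the total is incremented; the per-type count is left as it was.
    Returns (second total, second recognition rate, keep_label).
    """
    second_total = total + 1
    second_rate = type_count / second_total
    return second_total, second_rate, second_rate >= threshold


result = after_failed_check(total=9, type_count=9, threshold=0.8)
```

With nine earlier identifications, all of the sensitive type, one failure gives a second recognition rate of 9/10 = 0.9, so a label set under a threshold of 0.8 is kept. What happens when the rate falls below the threshold is not specified in this passage; the sketch only reports that case as `False`.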

An embodiment of the present invention provides a computer device, including a program or instructions which, when executed, perform the method for processing sensitive data provided in the embodiments of the present invention and any of its optional implementations.

An embodiment of the present invention provides a storage medium, including a program or instructions which, when executed, perform the method for processing sensitive data provided in the embodiments of the present invention and any of its optional implementations.

Finally, it should be noted that: those skilled in the art should understand that the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, optical storage, etc.) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the invention. It should be understood that each procedure and/or block in the flowchart and/or block diagram, and any combination of procedures and/or blocks in the flowchart and/or block diagram, can be realized by computer program instructions. These computer program instructions may be provided to a general purpose computer, special purpose computer, embedded processor, or processor of other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce an apparatus for realizing the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction device, and the instruction device realizes the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from the scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention also intends to include these modifications and variations.

Claims

A method for processing sensitive data, characterized in that the method comprises:

receiving sampling data sent by multiple application interfaces, performing hash processing on the feature information corresponding to each of the application interfaces, and determining the interface identifier of each of the application interfaces;

determining the sample data corresponding to each of the interface identifiers based on the maximum total amount of data processed for the sampled data within a preset time period, the total amount of data at the current moment within the current cycle duration, and a preset processing condition;

determining the conversion data corresponding to each piece of data in each of the sample data, the conversion data including a field name and a corresponding value for the field name; and

performing sensitive type identification on the corresponding values in each piece of conversion data, and obtaining the sensitive types corresponding to all the corresponding values in each piece of conversion data.
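
For illustration only, the hashing step and the conversion step can be sketched as below. The composition of the feature information is not listed in this passage, so the features are passed in as arbitrary strings; the use of SHA-256, the dotted naming of nested fields, and the names `interface_id` and `to_conversion_data` are assumptions.

```python
import hashlib
from typing import Any, Dict, Iterator, Tuple


def interface_id(*features: str) -> str:
    """Hash the (assumed) feature strings of an interface into a fixed-length identifier."""
    joined = "\x1f".join(features)  # a separator avoids ambiguous concatenation
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def to_conversion_data(record: Dict[str, Any],
                       prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Flatten a nested record into (field name, corresponding value) pairs."""
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from to_conversion_data(value, name)
        else:
            yield name, value


pairs = list(to_conversion_data({"user": {"name": "x", "phone": "1"}, "id": 3}))
```

Here `pairs` holds one (field name, value) pair per leaf of the record, which is the form on which the sensitive type identification then operates.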
The method according to claim 1, wherein the preset processing condition is expressed in the following manner:

where X denotes the number of types of the application interface, K_App_ID denotes the sample data volume of each type of application interface within the preset duration, and K_MAX denotes the maximum total amount of data processed for the sampled data within the preset duration.
The method according to claim 1 or 2, characterized in that determining the sample data corresponding to each of the interface identifiers based on the maximum total amount of data processed for the sampled data within the preset time period, the total amount of data at the current moment within the current cycle duration, and the preset processing condition comprises:

determining whether the current cycle duration is the first cycle duration in which the sample data of each of the interface identifiers is determined;

when it is determined that the current cycle duration is the first cycle duration in which the sample data of each of the interface identifiers is determined, determining the ratio of the total amount of interface data corresponding to any one of the interface identifiers at the current moment within the current cycle duration to the total amount of data at the current moment within the current cycle duration;

multiplying the ratio by the maximum total amount of processed data to obtain the initial data total of the initial sample data of that interface identifier;

determining the total amount of first interface data corresponding to that interface identifier at any moment after the current moment within the current cycle duration;

when it is determined that the total amount of the first interface data is not greater than the corresponding initial data total, determining a first probability that each piece of data in the first interface data is returned to the corresponding array, and obtaining first data in the corresponding array based on the first probability and the data in the first interface data; and

using the first data as the sample data of that interface identifier, so as to determine the sample data of each of the interface identifiers.
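
The ratio step can be illustrated with a short sketch. The claim states that the ratio is multiplied by the maximum total amount of processed data; rounding the product down to a whole number of records, and the names `initial_sample_sizes` and `k_max`, are assumptions of this example.

```python
from typing import Dict


def initial_sample_sizes(counts: Dict[str, int], k_max: int) -> Dict[str, int]:
    """Split a processing budget across interfaces in proportion to observed volume.

    counts: records seen so far in the current cycle, per interface identifier.
    k_max:  maximum total amount of sampled data processed per preset duration.
    """
    total = sum(counts.values())
    if total == 0:
        return {iface: 0 for iface in counts}
    # ratio * k_max, rounded down (the rounding rule is an assumption).
    return {iface: (count * k_max) // total for iface, count in counts.items()}


sizes = initial_sample_sizes({"iface_a": 300, "iface_b": 100}, k_max=1000)
```

In this example interface `iface_a` contributes 300 of 400 records, so its ratio is 0.75 and its initial sample size is 750; `iface_b` receives 250. Because the product is rounded down, the sizes in this sketch never add up to more than `k_max`.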
The method of claim 3, further comprising:

when it is determined that the total amount of the first interface data of any one of the interface identifiers is greater than the total amount of data of the initial sample data, determining a second probability that each piece of data in the first interface data is returned to the corresponding array;

obtaining second data in the corresponding array based on the second probability and the data in the first interface data, the second probability being different from the first probability; and

using the second data as the sample data of that interface identifier, so as to determine the sample data of each of the interface identifiers.
The method according to claim 3, characterized in that determining the sample data corresponding to each of the interface identifiers based on the maximum total amount of data processed for the sampled data within the preset time period, the total amount of data at the current moment within the current cycle duration, and the preset processing condition comprises:

when it is determined that the current cycle duration is not the first cycle duration in which the sample data of each of the interface identifiers is determined, and that historical sample data is stored in the array corresponding to any one of the interface identifiers, processing the historical sample data to obtain a sample identifier of each piece of historical sample data;

determining the total amount of historical sample data corresponding to that interface identifier and the total amount of data corresponding to any one of the sample identifiers, and determining, based on the total amount of the historical sample data and the total amount of data corresponding to the sample identifier, the weight coefficient corresponding to that sample identifier;

when it is determined that the total amount of the first interface data of that interface identifier is greater than the total amount of data of the initial sample data, determining a third probability that each piece of data in the first interface data is returned to the corresponding array; and

obtaining third data in the corresponding array based on the third probability and the data in the first interface data, and using the third data as the sample data of that interface identifier, so as to determine the sample data of each of the interface identifiers, the third probability being the product of the second probability and the weight coefficient.
The method according to claim 1, wherein performing sensitive type identification on the corresponding values in each piece of converted data, and obtaining the sensitive types corresponding to all the corresponding values in each piece of converted data, comprises:

performing initial identification processing on all corresponding values in the converted data, to obtain the total number of times all corresponding values have been identified and the number of times they have been identified as each sensitive type;

identifying and matching any corresponding value in each piece of converted data based on a preset regular expression or a preset metadata keyword, verifying that corresponding value based on a preset algorithm when the match succeeds, and, when the verification succeeds, incrementing both the total number of times and the number of times for the sensitive type to which that corresponding value belongs, to obtain a first total number of times and a first number of times;

obtaining a first recognition rate based on the first total number of times and the first number of times, the recognition rate characterizing the probability that the type of that corresponding value is a specific sensitive type; and

when it is determined that the first recognition rate is not less than the corresponding preset threshold, adding a label to that corresponding value, the label indicating that the type of that corresponding value is the specific sensitive type.
The method of claim 6, further comprising:

when any corresponding value in each piece of converted data fails the identification match and/or fails the verification, incrementing the total number of times to obtain a second total number of times;

obtaining a second recognition rate based on the second total number of times and the number of times for the sensitive type to which that corresponding value belongs; and

when it is determined that the second recognition rate is not less than the preset threshold, keeping the label of that corresponding value unchanged.
A device for processing sensitive data, characterized in that the device includes:

a first processing unit, configured to receive sampled data sent by multiple application interfaces, perform hash processing on feature information corresponding to each of the application interfaces, and determine an interface identifier of each of the application interfaces;

a determining unit, configured to determine the sample data corresponding to each of the interface identifiers based on the maximum total amount of processed data for the sampled data within a preset time period, the total amount of data at the current moment within the current cycle duration, and a preset processing condition;

a second processing unit, configured to determine conversion data corresponding to each piece of data in each of the sample data, where the conversion data includes a field name and a corresponding value for the field name; and

an obtaining unit, configured to perform sensitive type identification on the corresponding values in each piece of converted data, and obtain the sensitive types corresponding to all the corresponding values in each piece of converted data.
The device according to claim 8, wherein the preset processing condition is expressed in the following manner:

where X denotes the number of types of the application interface, K_App_ID denotes the sample data volume of each type of application interface within the preset duration, and K_MAX denotes the maximum total amount of data processed for the sampled data within the preset duration.
The device according to claim 8 or 9, wherein the determining unit is specifically configured to:

determine whether the current cycle duration is the first cycle duration in which the sample data of each of the interface identifiers is determined;

when it is determined that the current cycle duration is the first cycle duration in which the sample data of each of the interface identifiers is determined, determine the ratio of the total amount of interface data corresponding to any one of the interface identifiers at the current moment within the current cycle duration to the total amount of data at the current moment within the current cycle duration;

multiply the ratio by the maximum total amount of processed data to obtain the initial data total of the initial sample data of that interface identifier;

determine the total amount of first interface data corresponding to that interface identifier at any moment after the current moment within the current cycle duration;

when it is determined that the total amount of the first interface data is not greater than the corresponding initial data total, determine a first probability that each piece of data in the first interface data is returned to the corresponding array, and obtain first data in the corresponding array based on the first probability and the data in the first interface data; and

use the first data as the sample data of that interface identifier, so as to determine the sample data of each of the interface identifiers.
The device according to claim 10, wherein the determining unit is further configured to:

when it is determined that the total amount of the first interface data of any one of the interface identifiers is greater than the total amount of data of the initial sample data, determine a second probability that each piece of data in the first interface data is returned to the corresponding array;

obtain second data in the corresponding array based on the second probability and the data in the first interface data, the second probability being different from the first probability; and

use the second data as the sample data of that interface identifier, so as to determine the sample data of each of the interface identifiers.
The device according to claim 10, wherein the determining unit is specifically configured to:

when it is determined that the current cycle duration is not the first cycle duration in which the sample data of each of the interface identifiers is determined, and that historical sample data is stored in the array corresponding to any one of the interface identifiers, process the historical sample data to obtain a sample identifier of each piece of historical sample data;

determine the total amount of historical sample data corresponding to that interface identifier and the total amount of data corresponding to any one of the sample identifiers, and determine, based on the total amount of the historical sample data and the total amount of data corresponding to the sample identifier, the weight coefficient corresponding to that sample identifier;

when it is determined that the total amount of the first interface data of that interface identifier is greater than the total amount of data of the initial sample data, determine a third probability that each piece of data in the first interface data is returned to the corresponding array; and

obtain third data in the corresponding array based on the third probability and the data in the first interface data, and use the third data as the sample data of that interface identifier, so as to determine the sample data of each of the interface identifiers, the third probability being the product of the second probability and the weight coefficient.
The device according to claim 8, wherein the obtaining unit is specifically used for:

perform initial identification processing on all corresponding values in the converted data, to obtain the total number of times all corresponding values have been identified and the number of times they have been identified as each sensitive type;

identify and match any corresponding value in each piece of converted data based on a preset regular expression or a preset metadata keyword, verify that corresponding value based on a preset algorithm when the match succeeds, and, when the verification succeeds, increment both the total number of times and the number of times for the sensitive type to which that corresponding value belongs, to obtain a first total number of times and a first number of times;

obtain a first recognition rate based on the first total number of times and the first number of times, the recognition rate characterizing the probability that the type of that corresponding value is a specific sensitive type; and

when it is determined that the first recognition rate is not less than the corresponding preset threshold, add a label to that corresponding value, the label indicating that the type of that corresponding value is the specific sensitive type.
The device according to claim 13, wherein the obtaining unit is further configured to:

when any corresponding value in each piece of converted data fails the identification match and/or fails the verification, increment the total number of times to obtain a second total number of times;

obtain a second recognition rate based on the second total number of times and the number of times for the sensitive type to which that corresponding value belongs; and

when it is determined that the second recognition rate is not less than the preset threshold, keep the label of that corresponding value unchanged.
A computer device, characterized by comprising a program or instructions which, when executed, perform the method according to any one of claims 1 to 7.
A storage medium, characterized by comprising a program or instructions which, when executed, perform the method according to any one of claims 1 to 7.