WO2023077815A1 - Method and device for processing sensitive data - Google Patents

Method and device for processing sensitive data

Info

Publication number
WO2023077815A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
interface
total amount
sample
probability
Prior art date
Application number
PCT/CN2022/099611
Other languages
French (fr)
Chinese (zh)
Inventor
彭永杰
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2023077815A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Definitions

  • Embodiments of the present invention relate to the field of financial technology (Fintech), and in particular, to a method and device for processing sensitive data.
  • Fintech: financial technology
  • Internet services have brought great convenience to people's lives, but at the same time, they have also brought many security problems.
  • Internet services provide various functional interfaces both internally and externally. If some interfaces involving sensitive data are compromised or their own problems lead to sensitive data leakage, it may cause huge security risks to users and enterprises. Therefore, in order to strengthen the governance, operation and protection of sensitive data, the risk identification and distribution flow of interface sensitive data assets become particularly important.
  • The present invention provides a method and device for processing sensitive data, which effectively reduce the impact that sudden increases or changes in data volume and application interface categories have on sensitive-type identification, and which quickly and simply sort out the sensitive types of the sampled data corresponding to each application interface.
  • The present invention provides a method for processing sensitive data.
  • The method includes: receiving sampled data sent by multiple application interfaces, performing hash processing on the feature information corresponding to each application interface, and determining the interface identifier of each application interface; determining the sample data corresponding to each interface identifier based on the maximum total amount of data processed for the sampled data within a preset duration, the total amount of data at the current moment within the current cycle duration, and preset processing conditions; determining the conversion data corresponding to each piece of data in each sample data, where the conversion data includes a field name and the value corresponding to that field name; and performing sensitive type identification on the corresponding values in each piece of conversion data to obtain the sensitive types of all corresponding values in each piece of conversion data.
  • With this method, the interface identifier of each application interface that sends sampled data is calculated, and the system services behind different application interfaces are distinguished by interface identifier. This reduces the influence of differing data volumes across system services and, to a certain extent, the impact of data skew on subsequent sensitive-type identification. Sorting out the sampled data sent by multiple application interfaces from samples rather than from the full data greatly reduces the amount of data to be processed and speeds up data processing, which lowers manpower and machine costs and improves the efficiency of sensitive data identification.
  • The preset processing condition is expressed as follows:
  • $\sum_{x=1}^{X} K_{APP\_ID(x)} \le K_{MAX}$
  • where $X$ represents the number of types of application interface, $K_{APP\_ID(x)}$ represents the sample data volume of the x-th type of application interface within the preset duration, and $K_{MAX}$ represents the maximum total amount of data processed for the sampled data within the preset duration.
  • By constraining the sample data volume of each type of application interface and the number of application interface types, the sample data corresponding to each type of application interface is covered as fully as possible, which effectively keeps the basis for subsequent sensitive-type identification stable.
  • Determining the sample data of each interface identifier based on the maximum total amount of data processed for the sampled data within the preset duration, the total amount of data at the current moment within the current cycle duration, and the preset processing conditions includes: determining whether the current cycle duration is the cycle duration in which the sample data of each interface identifier is determined for the first time; when it is, determining the ratio of the total amount of interface data corresponding to any interface identifier at the current moment within the current cycle duration to the total amount of data at the current moment within the current cycle duration; multiplying the ratio by the maximum total amount of processed data to obtain the total amount of initial data of the initial sample data of that interface identifier; determining the total amount of first interface data corresponding to that interface identifier at any moment after the current moment within the current cycle duration; when it is determined that the total
  • In this way the sample data achieves broader coverage; that is, a relatively small amount of sample data with relatively comprehensive coverage is provided for the subsequent identification of the sensitive types of the sampled data sent by each application interface, which reduces the amount of data to be processed and thereby improves the processing speed of the sensitive data.
  • The method further includes: when it is determined that the total amount of first interface data of any interface identifier is greater than the total amount of data of the initial sample data, determining a second probability that each piece of data in the first interface data is returned to the corresponding array; obtaining second data in the corresponding array based on the second probability and the data in the first interface data, the second probability being different from the first probability; and using the second data as the sample data of that interface identifier, so as to determine the sample data of each interface identifier.
  • In this way the sample data achieves broader coverage; that is, a relatively small amount of sample data with relatively comprehensive coverage is provided for the subsequent identification of the sensitive types of the sampled data sent by the application interface, which reduces the amount of data to be processed and thereby improves the processing speed of the sensitive data.
  • Determining the sample data of each interface identifier based on the maximum total amount of data processed for the sampled data within the preset duration, the total amount of data at the current moment within the current cycle duration, and the preset processing conditions includes: when it is determined that the current cycle duration is not the cycle duration in which the sample data of each interface identifier is determined for the first time, and that historical sample data is stored in the array corresponding to any interface identifier, processing the historical sample data to obtain the sample identifier of each piece of historical sample data; determining the total amount of data of the historical sample data corresponding to that interface identifier and the total amount of data corresponding to any sample identifier, and determining the weight coefficient corresponding to that sample identifier based on these two totals; when it is determined that the total amount of first interface data of that interface identifier is greater than the total amount of data of the initial sample data, determining a third probability that each piece of data in the first interface data is returned to the corresponding array; based on the
  • Performing sensitive type identification on the corresponding values in each piece of conversion data, and obtaining the sensitive types of all corresponding values in each piece of conversion data, includes: performing initial identification processing on all corresponding values in the conversion data, and obtaining the total number of times that all corresponding values are identified and the number of times that they are identified as corresponding to each sensitive type; and, based on preset regular expressions or preset metadata keywords, performing identification matching on any corresponding value in each piece of conversion data and, when the matching passes, verifying that corresponding value based on a preset algorithm.
  • In this way, the sensitive type of the field corresponding to the corresponding value can be accurately determined.
  • The method further includes: when any corresponding value in each piece of conversion data fails the identification matching and/or fails the verification, accumulating the total number of times to obtain a second total number of times; obtaining a second recognition rate based on the second total number of times and the number of times of the sensitive type to which that corresponding value belongs; and, when it is determined that the second recognition rate is not less than the preset threshold, keeping the label corresponding to that corresponding value unchanged.
  • The present invention provides a device for processing sensitive data, the device comprising:
  • the first processing unit is configured to receive sampled data sent by multiple application interfaces, perform hash processing on feature information corresponding to each of the application interfaces, and determine an interface identifier of each of the application interfaces;
  • a determining unit configured to determine the sample data corresponding to each of the interface identifiers based on the maximum total amount of processed data for the sampled data within a preset time period, the total amount of data at the current moment within the current cycle time period, and preset processing conditions;
  • a second processing unit configured to determine conversion data corresponding to each piece of data in each of the sample data, where the conversion data includes a field name and a corresponding value corresponding to the field name;
  • the obtaining unit is configured to perform sensitivity type identification on corresponding values in each piece of converted data, and obtain sensitive types corresponding to all corresponding values in each piece of converted data.
  • The preset processing condition is expressed as follows:
  • $\sum_{x=1}^{X} K_{APP\_ID(x)} \le K_{MAX}$
  • where $X$ represents the number of types of application interface, $K_{APP\_ID(x)}$ represents the sample data volume of the x-th type of application interface within the preset duration, and $K_{MAX}$ represents the maximum total amount of data processed for the sampled data within the preset duration.
  • the determining unit is specifically configured to: determine whether the current cycle duration is the cycle duration for determining the sample data of each of the interface identifiers for the first time; When determining the cycle duration of the sample data of each of the interface identifiers, determine the total amount of interface data corresponding to any of the interface identifiers at the current moment within the current cycle duration, and the ratio of the total amount of data at the current moment within the current cycle duration Ratio; multiply the ratio by the total amount of maximum processed data to obtain the total amount of initial data of any initial sample data identified by the interface; determine any moment after the current moment within the duration of the current cycle, any A first interface data total amount of the first interface data corresponding to the interface identifier; when it is determined that the first interface data total amount is not greater than the corresponding initial data total amount, determine that each of the first interface data The piece of data is returned to the first probability in the corresponding array, and based on the first probability and the data in the first interface data, the first data in the corresponding array is obtained;
  • the determining unit is further configured to: when it is determined that the total amount of the first interface data of any one of the interface identifiers is greater than the total amount of data of the initial sample data, determine that the Each piece of data in the first interface data is returned to the second probability in the corresponding array; based on the second probability and the data in the first interface data, obtain the second data in the corresponding array , the second probability is different from the first probability; the second data is used as the sample data of any one of the interface identifiers to determine the sample data of each of the interface identifiers.
  • the determining unit is specifically configured to: determine that any interface When the historical sample data is stored in the array corresponding to the identifier, the historical sample data is processed to obtain a sample identifier of each piece of historical sample data; determine the total amount of historical sample data corresponding to any one of the interface identifiers, and the total amount of data corresponding to any of the sample identifiers, and based on the total amount of data of the historical sample data and the total amount of data corresponding to the sample identifier, determine the weight coefficient corresponding to any sample identifier; When the total amount of the first interface data identified by the interface is greater than the total amount of data of the initial sample data, determine the third probability that each piece of data in the first interface data is returned to the corresponding array; based on The third probability and the data in the first interface data, obtain the third data in the corresponding array, and use the third data as the sample data of any of the interface identifiers to determine each of the The sample data identified by the interface, the third probability is the product of the second
  • The obtaining unit is specifically configured to: perform initial identification processing on all corresponding values in the conversion data, and obtain the total number of times that all corresponding values are identified and the number of times that they are identified as corresponding to each sensitive type; based on preset regular expressions or preset metadata keywords, perform identification matching on any corresponding value in each piece of conversion data and, when the matching passes, verify that corresponding value based on a preset algorithm; when the verification passes, accumulate the total number of times and the number of times of the sensitive type to which that corresponding value belongs, to obtain a first total number of times and a first number of times; obtain a first recognition rate based on the first total number of times and the first number of times, where the recognition rate characterizes the probability that the type of that corresponding value is a specific sensitive type; and, when it is determined that the first recognition rate is not less than the corresponding preset threshold, add a label to that corresponding value, where the label is used to indicate that the type corresponding to the
  • The obtaining unit is further configured to: when any corresponding value in each piece of conversion data fails the identification matching and/or fails the verification, accumulate the total number of times to obtain a second total number of times; obtain a second recognition rate based on the second total number of times and the number of times of the sensitive type to which that corresponding value belongs; and, when it is determined that the second recognition rate is not less than the preset threshold, keep the label corresponding to that corresponding value unchanged.
  • the present invention provides a computer device, including a program or an instruction, and when the program or instruction is executed, is used to execute the above-mentioned first aspect and each optional method of the first aspect.
  • the present invention provides a storage medium, including a program or an instruction, and when the program or instruction is executed, is used to execute the above-mentioned first aspect and each optional method of the first aspect.
  • FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of steps of a method for processing sensitive data provided by an embodiment of the present invention
  • Fig. 3 is a schematic structural diagram of an apparatus for processing sensitive data provided by an embodiment of the present invention.
  • the sensitive types of the acquired data are generally identified directly, that is, the entire amount of data is identified and processed. In this way, not only the identification efficiency is low, but also more memory resources are consumed. And as the source and data volume of sensitive data increase, it is impossible to process the newly added sensitive data in an accurate and timely manner, that is, the overall processing efficiency of sensitive data is low.
  • the embodiment of the present invention provides a method for processing sensitive data.
  • In this method, the interface identifier corresponding to each application interface can be calculated, and the system services behind different application interfaces can be distinguished by interface identifier. This reduces the impact of differing data volumes across system services and, to a certain extent, the impact of data skew on subsequent sensitive-type identification. Sorting out the sampled data sent by multiple application interfaces from samples rather than from the full data greatly reduces the amount of data to be processed and speeds up data processing, which lowers manpower and machine costs and improves the efficiency of sensitive data identification.
  • Refer to the schematic diagram of an application scenario shown in FIG. 1, which includes a computer device 101 and an application server 102; the computer device 101 can communicate with the application server 102.
  • the application server 102 includes an application server 102-1, an application server 102-2, . . . , and an application server 102-n, where n is a positive integer greater than 2.
  • the application server 102 can send data containing sensitive data to the computer device 101, so that the computer device 101 can process the received data, thereby obtaining the data type of the sensitive data in the received data, and realizing sorting out the sensitive data .
  • the computer device 101 may store the processing result of the received data in a corresponding database, and may also send the processing result of the received data to a data security platform deployed on other computer devices.
  • The computer device 101 can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms, but is not limited to these.
  • the application server 102 may be a server deployed in a distributed system.
  • Step 201 Receive sampling data sent by multiple application interfaces, perform hash processing on feature information corresponding to each application interface, and determine an interface identifier of each application interface.
  • the computer device may receive sampling data sent by multiple application interfaces.
  • The multiple application interfaces may all be of different types, or some may share a type while others differ; this is not limited in the embodiments of the present invention.
  • The number of application interfaces may also change over time. For example, at 9:31 am on June 17, 2021, 4 application interfaces send sampling data to the computer device, and at 9:32 am on June 17, 2021, 8 application interfaces send sampling data to the computer device.
  • the computer device may determine the characteristic information of each application interface in the plurality of application interfaces, so as to determine the characteristic value corresponding to the characteristic information.
  • the method of determining the characteristic information may be determined based on the fact that multiple application interfaces carry their corresponding characteristic information when sending sampled data, or the computer device may send a request for acquiring characteristic information to the application server corresponding to the multiple application interfaces, Therefore, the characteristic information is obtained based on the feedback information of the corresponding application server, which is not limited in this embodiment of the present invention.
  • The feature information may include at least: the service ID corresponding to the application interface; the scene ID, for example the ID of an update scene; the packet type of the data, such as synchronous or asynchronous; the system number of the requester; and the system number of the responder.
  • a hash operation may be performed on the feature value corresponding to each application interface, so as to obtain the interface identifier of each application interface. It should be noted that each interface identifier is unique, that is, the corresponding application interface can be determined based on the interface identifier.
  • the system services corresponding to different application interfaces are distinguished based on the interface identifier, so that the impact of different data volumes corresponding to different system services can be reduced, and the impact of data skew on subsequent data can be reduced to a certain extent.
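Step 201 can be illustrated with a minimal sketch. The text does not name the hash algorithm or the way the feature fields are combined, so the choice of SHA-256, the `|` separator, and the function and parameter names below are all assumptions made for illustration.

```python
import hashlib


def interface_id(service_id, scene_id, packet_type, requester_sys, responder_sys):
    """Hash the feature information of an application interface into an identifier.

    Assumption: the five feature fields are joined with '|' (assumed not to occur
    inside any field) and hashed with SHA-256; the source does not fix either choice.
    """
    feature_value = "|".join(
        [service_id, scene_id, packet_type, requester_sys, responder_sys]
    )
    return hashlib.sha256(feature_value.encode("utf-8")).hexdigest()


# Two interfaces that differ only in packet type receive different identifiers.
sync_id = interface_id("svc-001", "update", "sync", "1001", "2002")
async_id = interface_id("svc-001", "update", "async", "1001", "2002")
```

Because the identifier depends only on the feature information, the same interface maps to the same identifier in every cycle, which is what allows per-interface sample arrays to be kept across cycles.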
  • Step 202 Determine the sample data corresponding to each interface identifier based on the maximum total amount of data processed for the sampled data within the preset time period, the total amount of data at the current moment in the current cycle time period, and the preset processing conditions.
  • the computer device may periodically process the received sampling data based on a preset duration. For example, assuming that the preset duration is 1 minute, the received sampling data may be processed at a period of 1 minute. It should be noted that the preset duration may be determined based on actual implementation, which is not limited in this embodiment of the present invention.
  • When the current cycle duration is the cycle duration in which the sample data of each interface identifier is determined for the first time, the following steps may be used, without limitation, to determine the initial sample data of any interface identifier:
  • Step a determine the total amount of interface data corresponding to any interface identifier at the current moment in the current cycle duration, and the ratio of the total amount of data at the current moment in the current cycle duration;
  • Step b Multiply the ratio by the maximum total amount of data processed for the sampled data within a preset period of time to obtain the total amount of initial data of the initial sample data identified by any interface;
  • Suppose the maximum total amount of data processed for the sampled data within the preset duration is $K_{MAX}$,
  • the total amount of data at the current moment in the current cycle duration is $N$,
  • and the total amount of interface data of any application interface at the current moment in the current cycle duration is $N_{APP\_ID}$. The total amount of initial data of each interface identifier can then be determined as:
  • $K_{APP\_ID} = \frac{N_{APP\_ID}}{N} \times K_{MAX}$
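Steps a and b amount to a proportional allocation of the processing budget. The sketch below assumes integer counts and rounds up, as the worked example later in the description does; the function name and the zero-traffic guard are illustrative additions.

```python
def initial_sample_quota(n_app, n_total, k_max):
    """Initial sample quota of one interface: ceil(n_app / n_total * k_max).

    n_app   : interface data seen so far for this interface identifier
    n_total : data seen so far across all interfaces in the current cycle
    k_max   : maximum total amount of data processed per preset duration
    Integer arithmetic is used so that rounding up is exact.
    """
    if n_total <= 0:
        return 0  # assumption: no traffic yet means no quota
    return (n_app * k_max + n_total - 1) // n_total


quota_a = initial_sample_quota(99, 100, 100)  # 99
quota_b = initial_sample_quota(1, 100, 100)   # 1
```

Rounding up guarantees that a low-traffic interface still receives at least one sample slot, but it can push the sum of quotas slightly above the budget, so the preset processing condition still has to be checked separately.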
  • Step c determine the first interface data total amount of the first interface data corresponding to any interface identifier at any moment after the current moment in the current cycle duration;
  • Step d When it is determined that the total amount of the first interface data is not greater than the corresponding initial data amount, determine the first probability that each piece of data in the first interface data is returned to the corresponding array, and based on the first probability and the first For the data in the interface data, obtain the first data in the corresponding array;
  • Step e use the first data as the sample data of the interface identifier to determine the sample data of any interface identifier.
  • the first probability can be determined as:
  • x may represent the sequence identifier of the application interface, for example, if the sequence identifier of the first application interface is 1, then the interface identifier of the first application interface is APP_ID(1).
  • the computer device may obtain the first data in the corresponding array based on the first probability and the data in the first interface data. Then use the first data as the sample data of the interface identifier to determine the sample data of any interface identifier.
  • Step f When it is determined that the total amount of first interface data identified by any interface is greater than the total amount of data of the initial sample data, determine the second probability that each piece of data in the first interface data is returned to the corresponding array.
  • Step g based on the second probability and the data in the first interface data, obtain the second data in the corresponding array, the second probability is different from the first probability;
  • Step h use the second data as the sample data of any interface identifier to determine the sample data of each interface identifier.
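  • The second probability can be determined as the ratio of the total amount of initial data of the interface identifier to the total amount of first interface data. Specifically, if the current data is taken out with this probability, it then replaces existing data in the corresponding array, each existing piece being selected for replacement with a probability equal to the reciprocal of the total amount of initial data; otherwise the array data remains unchanged. Therefore, the probability of retaining the current data is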
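Steps c to h follow the shape of classic reservoir sampling, sketched below for a single interface identifier. The exact first and second probabilities are not recoverable from this text; the sketch assumes the standard choices (always keep while the array is not full, then keep with probability quota / seen and overwrite a uniformly chosen slot), which agree with the 100/101 and 1/100 figures in the worked example. The name `offer` and its parameters are hypothetical.

```python
import random


def offer(reservoir, item, seen, quota, rng=random):
    """Offer the `seen`-th item (1-based) of one interface to its sample array.

    Returns True if the item was stored in the array.
    """
    if seen <= quota:
        reservoir.append(item)  # array not yet full: keep the item (assumed probability 1)
        return True
    if rng.random() < quota / seen:
        # Keep with probability quota / seen, overwriting a uniformly chosen slot,
        # so each existing element is selected for replacement with probability 1 / quota.
        reservoir[rng.randrange(len(reservoir))] = item
        return True
    return False


# Stream 1000 items through a reservoir with a quota of 100.
rng = random.Random(0)
sample = []
for i in range(1, 1001):
    offer(sample, i, i, 100, rng)
```

Under these assumptions every item in the stream ends up in the array with the same probability, regardless of when it arrived, which is the property that keeps the sample representative as the data volume grows.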
  • the aforementioned solution for determining the sample data corresponding to the interface identifier needs to meet the preset processing conditions.
  • The preset processing conditions can be expressed as follows:
  • $\sum_{x=1}^{X} K_{APP\_ID(x)} \le K_{MAX}$
  • where $X$ represents the number of types of application interfaces, $K_{APP\_ID(x)}$ represents the sample data volume of the x-th type of application interface within the preset duration, and $K_{MAX}$ represents the maximum total amount of data processed for the sampled data within the preset duration.
  • For example, with three types of application interfaces, the sum of $K_{APP\_ID(1)}$, $K_{APP\_ID(2)}$ and $K_{APP\_ID(3)}$ is not greater than $K_{MAX}$.
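The preset processing condition, that the sample volumes of all interface types together do not exceed the processing budget, can be checked directly. This is a minimal sketch; the dictionary layout and the function name are assumptions.

```python
def satisfies_preset_condition(quotas, k_max):
    """Return True when the per-interface sample volumes sum to at most k_max.

    quotas: mapping from interface identifier to its sample data volume
    """
    return sum(quotas.values()) <= k_max


within_budget = satisfies_preset_condition(
    {"APP_ID(1)": 40, "APP_ID(2)": 35, "APP_ID(3)": 25}, 100
)
over_budget = satisfies_preset_condition(
    {"APP_ID(1)": 60, "APP_ID(2)": 35, "APP_ID(3)": 25}, 100
)
```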
  • application interface x is an application interface with confirmed sample data
  • the data probability of the subsequent feedback of the application interface x is: when the total amount of initial data corresponding to the application interface x is not greater than the total amount of interface data after the reduction, then based on Return data to the array; when the total amount of initial data corresponding to the application interface x is greater than the total amount of interface data after reduction, based on to return data to an array.
  • the solution for determining the sample data of each interface identifier may include but not limited to the following steps:
  • Step A When historical sample data is stored in the array corresponding to any interface identifier, process the historical sample data to obtain the sample identifier of each piece of historical sample data;
  • Step B Determine the total amount of historical sample data corresponding to any interface identifier, and the total amount of data corresponding to any sample identifier, and based on the total amount of historical sample data and the total amount of data corresponding to the sample identifier, determine each The weight coefficient corresponding to the sample ID;
  • Step C When it is determined that the total amount of first interface data identified by any interface is greater than the total amount of data in the initial sample data, determine the third probability that each piece of data in the first interface data is returned to the corresponding array;
  • Step D Obtain the third data in the corresponding array based on the third probability and the data in the first interface data, and use the third data as the sample data of any interface identifier to determine the sample data of each interface identifier, where the third probability is the product of the second probability and the weight coefficient.
  • The APP_ID is calculated from the feature information or attribute values of the application interface, whereas sensitive-type identification has to be performed on each piece of data sent by the application interface. Therefore, when determining the sample data corresponding to the current interface, the probability that data of a type already selected as a sample is selected again should be reduced, which minimizes data skew and improves the coverage of the sample data.
  • the data content of each piece of sample data in the historical sample data can be analyzed to obtain attributes such as the parameter list P and message length L of the message content, and the unique identifier of the message content can be calculated through a hash algorithm.
  • the third probability can be determined as: Specifically, if the current data starts with The probability is taken out, then continue to The probability of replacing the existing elements in the corresponding array, otherwise the array elements remain unchanged. Therefore, the probability of retaining the current data is
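Steps A to D can be sketched as follows. The text states only that the weight coefficient is derived from the total amount of historical sample data and the amount carrying a given sample identifier, and that the third probability is the second probability multiplied by that coefficient. The concrete weight used here (one minus the identifier's share of the history) is an assumption chosen so that frequently sampled message shapes become less likely to be sampled again; the fingerprint construction and all names are likewise hypothetical.

```python
import hashlib
import json
from collections import Counter


def sample_identifier(message):
    """Fingerprint a message by its parameter list P and message length L."""
    params = sorted(message.keys())                    # parameter list P
    length = len(json.dumps(message, sort_keys=True))  # message length L
    return hashlib.sha256(f"{params}|{length}".encode("utf-8")).hexdigest()


def weight_coefficients(history):
    """Assumed weight: 1 - (count of this identifier / total historical samples)."""
    counts = Counter(sample_identifier(m) for m in history)
    total = len(history)
    return {sid: 1 - c / total for sid, c in counts.items()}


def third_probability(second_probability, weight):
    """Third probability = second probability x weight coefficient (per Step D)."""
    return second_probability * weight


history = [{"name": "a", "id": "1"}] * 3 + [{"phone": "1"}]
weights = weight_coefficients(history)
common = weights[sample_identifier({"name": "a", "id": "1"})]  # 0.25
rare = weights[sample_identifier({"phone": "1"})]              # 0.75
```

A message whose identifier does not appear in the history has no entry in `weights`; treating its weight as 1 (for example with `weights.get(sid, 1.0)`) leaves its probability equal to the second probability.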
  • the unit time such as 1 minute
  • the current cycle duration is the cycle duration in which the sample data of application interface A is determined for the first time; assume that the maximum total amount of processed data is 100 pieces and that the total amount of data corresponding to application interface A is 0 pieces.
  • when the total amount of interface data, that is, 101 pieces, is greater than the initial total amount of data of application interface A, that is, 100 pieces, the 101st piece of data is kept in the array with a probability of 100/101, and each of the original 100 pieces of data in the array has a probability of 1/100 of being selected for replacement.
  • the final quantity can be determined by rounding up. It can be seen that the total amount of the first interface data of application interface B, i.e. 1 piece, is not greater than the corresponding initial data amount of 1 piece, so the first piece of data of application interface B can be returned to the corresponding array.
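The first-cycle allocation described in this document (an interface's share of the data seen so far, multiplied by the maximum total amount of processed data, rounded up) can be sketched as below. `initial_quota` is a hypothetical name, and integer arithmetic is used for the ceiling to avoid floating-point rounding surprises.

```python
def initial_quota(interface_count: int, total_count: int, k_max: int) -> int:
    """Initial sample quota for one interface identifier.

    quota = ceil(interface_count / total_count * k_max)
    """
    if total_count == 0:
        return 0
    # Integer ceiling division of (interface_count * k_max) by total_count.
    return (interface_count * k_max + total_count - 1) // total_count
```

For instance, an interface that contributed 1 of 200 pieces when the maximum is 100 would get a quota of 1 rather than 0, so low-traffic interfaces still receive a sample slot.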
  • the above-mentioned method, which is based on an improved reservoir sampling method, can perform strongly randomized sampling of streaming data, making the sample data coverage more comprehensive and more adaptable to changes in data sources, and improving the effectiveness and stability of sensitive data identification and sorting.
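For reference, plain reservoir sampling (the classic algorithm the text builds on, not the patent's improved variant) can be sketched as follows. It assumes that data arriving while the array is not yet full is stored with probability 1; once the array holds k items, the n-th item is accepted with probability k/n and replaces a uniformly chosen slot.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from a stream."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Array not full yet: store the item directly.
            reservoir.append(item)
        else:
            # j is uniform over [0, n); j < k happens with probability k/n,
            # and the slot j to overwrite is then uniform over the k slots.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir
```

With k = 100 and n = 101 this gives the figures used in the example: the new item is accepted with probability 100/101, and each existing element is the one chosen for replacement with probability 1/100.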
  • Step 203 Determine the conversion data corresponding to each piece of data in each sample data, the conversion data includes field names and corresponding values corresponding to the field names.
  • the computer device can analyze and process each piece of data in each sample data.
  • message formats such as JSON and XML can be converted into KEY-VALUE pairs, that is, conversion data including the field name and the corresponding value.
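As an illustration of the conversion step, the sketch below flattens a JSON message into (field name, value) pairs. The dotted-path naming for nested fields and the function name `to_key_values` are assumptions made for this example; the text does not say how nested structures are named.

```python
import json


def to_key_values(message: str):
    """Flatten a JSON message into a list of (field_name, value) pairs."""
    pairs = []

    def walk(node, prefix):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{prefix}.{key}" if prefix else key)
        elif isinstance(node, list):
            # List elements share the field name of the list itself.
            for value in node:
                walk(value, prefix)
        else:
            pairs.append((prefix, node))

    walk(json.loads(message), "")
    return pairs
```

Each resulting pair can then be passed to sensitive type identification independently of the original message structure.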
  • Step 204 Perform sensitive type identification on corresponding values in each piece of converted data, and obtain sensitive types corresponding to all corresponding values in each piece of converted data.
  • the computer device may perform initial recognition processing on the corresponding values in all the converted data, and obtain the total number of times that all corresponding values are recognized and the number of times that all corresponding values are recognized as corresponding to each sensitive type .
  • the computer device can identify the sensitive type of each piece of data based on the identification strategy.
  • the identification strategy is based on metadata keyword matching and preset algorithm verification, or the identification strategy is based on preset regular expression matching and preset algorithm verification.
  • the computer device can identify and match any corresponding value in each piece of converted data based on a preset regular expression or a preset metadata keyword.
  • the preset regular expression may be a VALUE regular expression
  • the preset metadata keyword may be correspondingly determined based on an actual implementation situation, which is not limited in this embodiment of the present invention.
  • When the matching is passed, the corresponding value is verified based on the preset algorithm.
  • When the verification is passed, the total number of times and the number of times of the sensitive type to which the corresponding value belongs are accumulated to obtain the first total number and the first number.
  • the preset algorithm may be the VALUE algorithm, and of course, other algorithms may also be used, which is not limited in this embodiment of the present invention.
  • the first recognition rate can be obtained based on the first total number and the first number, wherein the recognition rate is used to characterize the probability that the type of any corresponding value is a specific sensitive type; when it is determined that the first recognition rate is not less than the corresponding preset threshold, a label is added to the corresponding value, and the label is used to indicate that the type of the corresponding value is the specific sensitive type.
  • N_APP_ID_FIELD denotes the total number of times that all corresponding values are identified
  • N_X denotes the number of times that the corresponding value is identified as sensitive type X
  • X is a sensitive label such as document number (ID), mobile phone number (PHONE), or bank card number (BANK).
  • the preset threshold corresponding to any corresponding value is denoted R_ERROR
  • the field name corresponding to any corresponding value is denoted F
  • When the algorithm check is passed, add one to the values of N_APP_ID_FIELD(F) and N_BANK(F) to obtain the first total number N'_APP_ID_FIELD(F) and the first number N'_BANK(F), from which the first recognition rate can be determined: R_S(BANK) = N'_BANK(F) / N'_APP_ID_FIELD(F).
  • If R_S(BANK) is not less than R_ERROR, the BANK label is added to field F, and the application interface corresponding to the value is determined to be a sensitive interface "involving bank card numbers". If R_S(BANK) is less than R_ERROR, no BANK label is added to field F, and if the field already has a BANK label, it is cleared.
  • the preset algorithm for verifying the bank card number may be a modulo 10 algorithm, of course, it may also be other algorithms, which are not limited in the embodiment of the present invention. It can be seen that different preset algorithms can be used for different specific sensitive types.
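The modulo 10 check mentioned above is commonly known as the Luhn algorithm. A minimal sketch, assuming the value is a plain string of ASCII digits with no separators, is:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the modulo 10 (Luhn) check."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0
```

A value that matches a card-number pattern but fails this check would not be counted as a bank card number.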
  • When the computer device determines that any corresponding value in each piece of conversion data fails the identification matching and/or the verification, the total number of times is accumulated to obtain the second total number of times, and the second recognition rate can then be obtained based on the second total number of times and the number of times of the sensitive type to which the corresponding value belongs.
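The counting and labelling logic for one field can be sketched as below. The class name `FieldStats` and its attributes are hypothetical; the rate is taken as matches divided by total identifications so that it reads as a probability in [0, 1], as the text describes, and `observe` is assumed to be called once per value and label.

```python
from collections import defaultdict


class FieldStats:
    """Per-field counters for sensitive type identification (illustrative)."""

    def __init__(self, threshold: float):
        self.threshold = threshold          # preset threshold
        self.total = defaultdict(int)       # times a field's values were identified
        self.hits = defaultdict(int)        # times (field, label) passed both checks
        self.labels = defaultdict(set)      # labels currently attached to a field

    def observe(self, field: str, label: str, passed: bool) -> float:
        """Record one value of `field`; `passed` means match and check succeeded."""
        self.total[field] += 1
        if passed:
            self.hits[(field, label)] += 1
        rate = self.hits[(field, label)] / self.total[field]
        if rate >= self.threshold:
            self.labels[field].add(label)       # add or keep the label
        else:
            self.labels[field].discard(label)   # clear it if present
        return rate
```

A passing value raises both counters, while a failing value raises only the total, so the same update covers the first and the second recognition rate.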
  • When any field has multiple meanings, that is, it cannot pass the verification and has no previous label, a prompt is output so that the user of the computer device can manually mark the field to determine its label.
  • the present invention provides a device for processing sensitive data, the device includes a first processing unit 301, a determination unit 302, a second processing unit 303 and an obtaining unit 304, wherein:
  • the first processing unit 301 is configured to receive sampled data sent by multiple application interfaces, perform hash processing on feature information corresponding to each of the application interfaces, and determine an interface identifier of each of the application interfaces;
  • the determination unit 302 is configured to determine the sample data corresponding to each of the interface identifiers based on the maximum total amount of processed data for the sampled data within a preset time period, the total amount of data at the current moment within the current cycle time period, and preset processing conditions;
  • the second processing unit 303 is configured to determine the conversion data corresponding to each piece of data in each of the sample data, the conversion data includes a field name and a corresponding value corresponding to the field name;
  • the obtaining unit 304 is configured to perform sensitivity type identification on corresponding values in each piece of converted data, and obtain sensitive types corresponding to all corresponding values in each piece of converted data.
  • the preset processing condition is expressed based on the following manner:
  • where X represents the number of types of application interface, K_App_ID represents the sample data volume of each type of application interface within the preset time length, and K_MAX represents the maximum total amount of data processed for the sampled data within the preset time length.
  • the determining unit 302 is specifically configured to: determine whether the current cycle duration is the cycle duration in which the sample data of each of the interface identifiers is determined for the first time; when it is, determine the ratio of the total amount of interface data corresponding to any of the interface identifiers at the current moment in the current cycle duration to the total amount of data at the current moment in the current cycle duration; multiply the ratio by the maximum total amount of processed data to obtain the initial total amount of data of the initial sample data of that interface identifier, and store the initial sample data in a corresponding array; determine, at any moment after the current moment within the current cycle duration, the total amount of first interface data corresponding to any of the interface identifiers; when it is determined that the total amount of any first interface data is not greater than the corresponding initial total amount of data, determine the first probability that each piece of data in the first interface data is returned to the corresponding array
  • the determining unit 302 is further configured to: when it is determined that the total amount of first interface data of any of the interface identifiers is greater than the total amount of data of the initial sample data, determine the second probability that each piece of data in the first interface data is returned to the corresponding array; based on the second probability and the data in the first interface data, obtain the second data in the corresponding array, the second probability being different from the first probability; and use the second data as the sample data of that interface identifier to determine the sample data of each of the interface identifiers.
  • the determining unit 302 is specifically configured to: when it is determined that the current cycle duration is not the cycle duration in which the sample data of each of the interface identifiers is determined for the first time, and that historical sample data is stored in the array corresponding to any interface identifier, process the historical sample data to obtain the sample identifier of each piece of historical sample data; determine the total amount of historical sample data corresponding to any of the interface identifiers and the total amount of data corresponding to any of the sample identifiers, and determine the weight coefficient corresponding to any sample identifier based on these two totals; when it is determined that the total amount of first interface data of any of the interface identifiers is greater than the total amount of data of the initial sample data, determine a third probability that each piece of data in the first interface data is returned to the corresponding array; based on the third probability and the data in the first interface data, obtain the third data in the corresponding array, and use the third data as the sample data of that interface identifier to determine the sample data of each of the interface identifiers, the
  • the obtaining unit 304 is specifically configured to: perform initial identification processing on all corresponding values in the converted data, and obtain the total number of times that all corresponding values are identified and all corresponding values The number of times identified as corresponding to each sensitive type; based on preset regular expressions or preset metadata keywords, identify and match any corresponding value in each piece of conversion data, and when the match is passed, based on the preset algorithm Verify any corresponding value, and when the verification is passed, accumulate the total number of times and the number of times of the sensitive type to which any corresponding value belongs to obtain the first total number and the first time; based on The first total number and the first number are used to obtain a first recognition rate; the recognition rate is used to characterize the probability that the type of any corresponding value is a specific sensitive type; when it is determined that the first recognition rate is not less than A corresponding preset threshold value, then add a label to the any corresponding value, and the label is used to indicate that the type corresponding to the any corresponding value is the specific sensitive type
  • the obtaining unit 304 is further configured to: when any corresponding value in each piece of converted data fails the identification matching and/or the verification, accumulate the total number of times to obtain a second total number of times; obtain a second recognition rate based on the second total number of times and the number of times of the sensitive type to which the corresponding value belongs; and when it is determined that the second recognition rate is not less than the preset threshold, keep the label corresponding to the corresponding value unchanged.
  • An embodiment of the present invention provides a computer device, including a program or an instruction. When the program or instruction is executed, it is used to execute the method for processing sensitive data and any optional method provided in the embodiments of the present invention.
  • An embodiment of the present invention provides a storage medium, including a program or an instruction. When the program or instruction is executed, it is used to execute the method for processing sensitive data and any optional method provided in the embodiments of the present invention.
  • the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instruction means implementing the function specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Abstract

Disclosed in the present invention are a method and device for processing sensitive data. The method comprises: receiving sampling data sent by a plurality of application interfaces, performing hash processing on feature information corresponding to each application interface, and determining an interface identifier of each application interface; determining, on the basis of the maximum total processing data volume of the sampling data in a preset duration, the total data volume at a current moment in a current period duration, and a preset processing condition, sample data corresponding to each interface identifier; determining conversion data corresponding to each piece of data in each piece of sample data, the conversion data comprising field names and corresponding values corresponding to the field names; and performing sensitive type identification on the corresponding values in each piece of conversion data, and obtaining sensitive types corresponding to all the corresponding values in each piece of conversion data. The method can effectively reduce the influence of sudden increase or change of the data volume and types of application interfaces on the sensitive type identification of the data, and quickly and simply complete sensitive type sorting of the sampling data corresponding to each application interface.

Description

A method and device for processing sensitive data
Cross References to Related Applications
This application claims priority to Chinese patent application No. 202111294701.7, entitled "A Method and Device for Processing Sensitive Data", filed with the China Patent Office on November 03, 2021, the entire contents of which are incorporated herein by reference.
Technical field
Embodiments of the present invention relate to the field of financial technology (Fintech), and in particular, to a method and device for processing sensitive data.
Background
With the development of computer technology, more and more technologies are applied in the financial field, and the traditional financial industry is gradually transforming into financial technology. However, the security and real-time requirements of the financial industry also place higher demands on technology.
At present, with the rapid development of cloud computing and big data, Internet services have brought great convenience to people's lives, but they have also brought many security problems. Internet services provide various functional interfaces both internally and externally. If interfaces involving sensitive data leak that data because of intrusion or their own defects, this may pose a serious security risk to users and enterprises. Therefore, in order to strengthen the governance, operation and protection of sensitive data, identifying the risk of sensitive data assets on interfaces and tracking their distribution and flow has become particularly important.
However, when processing sensitive data in the prior art, the acquired sensitive data generally has to be analyzed directly. A large amount of sensitive data therefore needs to be processed, which makes overall processing slow; and as the sources and volume of sensitive data increase, newly added sensitive data cannot be processed accurately and promptly, that is, the overall processing efficiency for sensitive data is low.
Summary of the invention
The present invention provides a method and device for processing sensitive data, which effectively reduce the impact of a sudden increase or change in data volume and application interface categories on sensitive type identification of the data, and quickly and simply complete the sorting of sensitive types of the sampled data corresponding to each application interface.
In a first aspect, the present invention provides a method for processing sensitive data. The method includes: receiving sampled data sent by multiple application interfaces, performing hash processing on the feature information corresponding to each of the application interfaces, and determining an interface identifier of each of the application interfaces; determining the sample data corresponding to each of the interface identifiers based on the maximum total amount of data processed for the sampled data within a preset time period, the total amount of data at the current moment within the current cycle duration, and a preset processing condition; determining conversion data corresponding to each piece of data in each of the sample data, the conversion data including a field name and a corresponding value corresponding to the field name; and performing sensitive type identification on the corresponding values in each piece of conversion data to obtain the sensitive types corresponding to all corresponding values in each piece of conversion data.
In the above method, an interface identifier is calculated for each application interface that sends sampled data, and the system services corresponding to different application interfaces are distinguished based on the interface identifier. This reduces the influence of differing data volumes across system services, and thus to some extent reduces the impact of data skew on subsequent sensitive type identification. In addition, samples rather than the full data can be used to sort the sampled data sent by multiple application interfaces, which greatly reduces the amount of data to be processed and increases processing speed, thereby reducing manpower and machine resource costs and improving the efficiency of identifying sensitive data.
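As a rough sketch of the hashing step, an interface identifier could be derived from the interface's feature information as shown below. The feature names used in the usage note (`host`, `path`, `method`), the choice of SHA-256, and the function name `interface_id` are assumptions for illustration; the text does not specify which features or which hash are used.

```python
import hashlib


def interface_id(features: dict) -> str:
    """Hash an interface's feature information into a stable identifier.

    Features are serialized in sorted key order so that the same
    feature set always yields the same identifier.
    """
    canonical = "&".join(f"{key}={features[key]}" for key in sorted(features))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

For example, `interface_id({"host": "pay.example", "path": "/transfer", "method": "POST"})` yields the same value no matter how the dictionary was built, so data from one interface is always grouped under one identifier.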
In a possible implementation manner, the preset processing condition is expressed in the following manner:
Figure PCTCN2022099611-appb-000001
where X represents the number of types of application interface, K_App_ID represents the sample data volume of each type of application interface within the preset time length, and K_MAX represents the maximum total amount of data processed for the sampled data within the preset time length.
In the above method, the sample data volume of each type of application interface and the number of application interface types are constrained. This ensures, as far as possible, that the sample data corresponding to each type of application interface has fuller coverage, and keeps the basis for subsequent sensitive type identification stable.
In a possible implementation manner, determining the sample data of each of the interface identifiers based on the maximum total amount of data processed for the sampled data within the preset time period, the total amount of data at the current moment within the current cycle duration, and the preset processing condition includes: determining whether the current cycle duration is the cycle duration in which the sample data of each of the interface identifiers is determined for the first time; when it is, determining the ratio of the total amount of interface data corresponding to any of the interface identifiers at the current moment within the current cycle duration to the total amount of data at the current moment within the current cycle duration; multiplying the ratio by the maximum total amount of processed data to obtain the initial total amount of data of the initial sample data of that interface identifier; determining, at any moment after the current moment within the current cycle duration, the total amount of first interface data corresponding to any of the interface identifiers; when it is determined that the total amount of any first interface data is not greater than the corresponding initial total amount of data, determining a first probability that each piece of data in the first interface data is returned to the corresponding array, and obtaining first data in the corresponding array based on the first probability and the data in the first interface data; and using the first data as the sample data of that interface identifier, so as to determine the sample data of each of the interface identifiers.
Based on the above method, when the total amount of first interface data is not greater than the corresponding initial total amount of data, sample data with more comprehensive coverage can be determined; that is, a relatively small amount of sample data with relatively comprehensive coverage is provided, which reduces the amount of data to be processed in the subsequent identification of sensitive types of the sampled data sent by each application interface, thereby increasing the processing speed for sensitive data.
In a possible implementation manner, the method further includes: when it is determined that the total amount of first interface data of any of the interface identifiers is greater than the total amount of data of the initial sample data, determining a second probability that each piece of data in the first interface data is returned to the corresponding array; obtaining second data in the corresponding array based on the second probability and the data in the first interface data, the second probability being different from the first probability; and using the second data as the sample data of that interface identifier, so as to determine the sample data of each of the interface identifiers.
Based on the above method, when the total amount of first interface data is greater than the corresponding initial total amount of data, sample data with more comprehensive coverage can be determined; that is, a relatively small amount of sample data with relatively comprehensive coverage is provided, which reduces the amount of data to be processed in the subsequent identification of sensitive types of the sampled data sent by each application interface, thereby increasing the processing speed for sensitive data.
In a possible implementation manner, determining the sample data of each of the interface identifiers based on the maximum total amount of data processed for the sampled data within the preset time period, the total amount of data at the current moment within the current cycle duration, and the preset processing condition includes: when it is determined that the current cycle duration is not the cycle duration in which the sample data of each of the interface identifiers is determined for the first time, and that historical sample data is stored in the array corresponding to any interface identifier, processing the historical sample data to obtain a sample identifier of each piece of historical sample data; determining the total amount of data of the historical sample data corresponding to any of the interface identifiers and the total amount of data corresponding to any of the sample identifiers, and determining the weight coefficient corresponding to any sample identifier based on these two totals; when it is determined that the total amount of first interface data of any of the interface identifiers is greater than the total amount of data of the initial sample data, determining a third probability that each piece of data in the first interface data is returned to the corresponding array; and obtaining third data in the corresponding array based on the third probability and the data in the first interface data, and using the third data as the sample data of that interface identifier, so as to determine the sample data of any of the interface identifiers, the third probability being the product of the second probability and the weight coefficient.
In the above method, a weight coefficient is added to reduce the probability that data of the same type as data already determined as a sample is determined as a sample again, minimizing data skew and improving the coverage of the sample data.
In a possible implementation manner, performing sensitive type identification on the corresponding values in each piece of conversion data to obtain the sensitive types corresponding to all corresponding values in each piece of conversion data includes: performing initial identification processing on the corresponding values in all the conversion data to obtain the total number of times that all corresponding values are identified and the number of times that they are identified as each sensitive type; identifying and matching any corresponding value in each piece of conversion data based on a preset regular expression or a preset metadata keyword; when the matching is passed, verifying the corresponding value based on a preset algorithm; when the verification is passed, accumulating the total number of times and the number of times of the sensitive type to which the corresponding value belongs to obtain a first total number and a first number; obtaining a first recognition rate based on the first total number and the first number, the recognition rate characterizing the probability that the type of the corresponding value is a specific sensitive type; and when it is determined that the first recognition rate is not less than the corresponding preset threshold, adding a label to the corresponding value, the label indicating that the type of the corresponding value is the specific sensitive type.
Based on the above method, when the first recognition rate corresponding to any corresponding value is not less than the corresponding preset threshold, the sensitive type of the field corresponding to that value can be accurately determined.
在一种可能的实施方式中,所述方法还包括:当每条所述转换数据中的任一对应值识别匹配和/或校验未通过时,对所述总次数进行累加,获得第二总次数;基于所述第二总次数和所述任一对应值所属的敏感类型的次数,获得第二识别率;当确定所述第二识别率不小于所述预设阈值,则保持所述任一对应值对应的标签不变。In a possible implementation manner, the method further includes: when any corresponding value in each piece of converted data identifies a match and/or fails the check, accumulating the total times to obtain the second The total number of times; based on the second total number of times and the number of times of the sensitive type to which any corresponding value belongs, a second recognition rate is obtained; when it is determined that the second recognition rate is not less than the preset threshold value, then keep the The label corresponding to any corresponding value remains unchanged.
基于上述方法,可以较为准确的确定已标记有标签的字段对应的标签是否准确,提高字段对应的标签的准确率。Based on the above method, it can be more accurately determined whether the label corresponding to the field marked with the label is accurate, and the accuracy rate of the label corresponding to the field is improved.
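The counting scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the 16-digit card pattern, the Luhn checksum standing in for the "preset algorithm", the `FieldStats` class name and the default threshold are all hypothetical choices made for the example.

```python
import re
from collections import defaultdict

# Hypothetical "preset regular expression": a 16-digit card number.
CARD_PATTERN = re.compile(r"\d{16}")


def luhn_valid(number: str) -> bool:
    """Luhn checksum, used here as a stand-in for the unspecified preset algorithm."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


class FieldStats:
    """Per-field counters: total identifications and hits per sensitive type."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold      # hypothetical preset threshold
        self.total = 0                  # total number of identifications
        self.hits = defaultdict(int)    # count per sensitive type
        self.label = None

    def observe(self, value: str, sensitive_type: str = "bank_card") -> float:
        """Update the counters for one value and return the recognition rate."""
        self.total += 1
        passed = CARD_PATTERN.fullmatch(value) is not None and luhn_valid(value)
        if passed:
            # Match and verification passed: the type count is incremented too.
            self.hits[sensitive_type] += 1
        rate = self.hits[sensitive_type] / self.total
        if passed and rate >= self.threshold:
            self.label = sensitive_type
        # On failure only the total grows. The text keeps the label while the
        # rate stays at or above the threshold; it does not say what happens
        # below it, so this sketch leaves the label untouched in that case.
        return rate
```

Called once per sampled value of a field, `observe` returns the current recognition rate, so a caller can log how the rate drifts as unmatched values arrive.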
In a second aspect, the present invention provides a device for determining an access token, the device comprising:
a first processing unit, configured to receive sampled data sent by multiple application interfaces, hash the feature information corresponding to each application interface, and determine an interface identifier for each application interface;
a determining unit, configured to determine the sample data corresponding to each interface identifier based on the maximum total amount of sampled data that can be processed within a preset duration, the total amount of data at the current moment within the current cycle, and a preset processing condition;
a second processing unit, configured to determine the converted data corresponding to each piece of data in each set of sample data, where the converted data includes a field name and the value corresponding to that field name;
an obtaining unit, configured to perform sensitive-type identification on the corresponding values in each piece of converted data, and obtain the sensitive types of all corresponding values in each piece of converted data.
In a possible implementation, the preset processing condition is expressed as follows:
Figure PCTCN2022099611-appb-000002
where X is the number of application interface types, K_App_ID is the amount of sample data for each type of application interface within the preset duration, and K_MAX is the maximum total amount of sampled data that can be processed within the preset duration.
In a possible implementation, the determining unit is specifically configured to: determine whether the current cycle is the first cycle in which the sample data of each interface identifier is determined; when the current cycle is determined to be the first such cycle, determine the ratio of the total amount of interface data corresponding to any interface identifier at the current moment within the current cycle to the total amount of data at the current moment within the current cycle; multiply the ratio by the maximum total amount of processed data to obtain the initial data total of the initial sample data for that interface identifier; determine, for any moment after the current moment within the current cycle, the first interface data total of the first interface data corresponding to that interface identifier; when the first interface data total is determined to be not greater than the corresponding initial data total, determine a first probability with which each piece of the first interface data is returned to the corresponding array, and obtain first data in the corresponding array based on the first probability and the first interface data; and use the first data as the sample data of that interface identifier, thereby determining the sample data of any interface identifier.
In a possible implementation, the determining unit is further configured to: when the first interface data total of any interface identifier is determined to be greater than the data total of the initial sample data, determine a second probability with which each piece of the first interface data is returned to the corresponding array; obtain second data in the corresponding array based on the second probability and the first interface data, the second probability being different from the first probability; and use the second data as the sample data of that interface identifier, thereby determining the sample data of each interface identifier.
In a possible implementation, the determining unit is specifically configured to: when the current cycle is determined not to be the first cycle in which the sample data of each interface identifier is determined, and historical sample data is determined to be stored in the array corresponding to any interface identifier, process the historical sample data to obtain a sample identifier for each piece of historical sample data; determine the data total of the historical sample data corresponding to that interface identifier and the data total corresponding to any sample identifier, and determine a weight coefficient for that sample identifier based on these two totals; when the first interface data total of that interface identifier is determined to be greater than the data total of the initial sample data, determine a third probability with which each piece of the first interface data is returned to the corresponding array; and obtain third data in the corresponding array based on the third probability and the first interface data, and use the third data as the sample data of that interface identifier, thereby determining the sample data of each interface identifier, where the third probability is the product of the second probability and the weight coefficient.
In a possible implementation, the obtaining unit is specifically configured to: perform initial identification processing on the corresponding values in all of the converted data, to obtain the total number of times all corresponding values have been identified and the number of times the corresponding values have been identified as each sensitive type; identify and match any corresponding value in each piece of converted data based on a preset regular expression or a preset metadata keyword; when the match succeeds, verify that corresponding value based on a preset algorithm; when the verification passes, increment both the total count and the count of the sensitive type to which that corresponding value belongs, to obtain a first total count and a first count; obtain a first recognition rate based on the first total count and the first count, where the recognition rate represents the probability that the type of that corresponding value is a specific sensitive type; and when the first recognition rate is determined to be not less than the corresponding preset threshold, add a label to that corresponding value, the label indicating that the type of that corresponding value is the specific sensitive type.
In a possible implementation, the obtaining unit is further configured to: when identification matching and/or verification fails for any corresponding value in each piece of converted data, increment the total count to obtain a second total count; obtain a second recognition rate based on the second total count and the count of the sensitive type to which that corresponding value belongs; and when the second recognition rate is determined to be not less than the preset threshold, keep the label of that corresponding value unchanged.
For the beneficial effects of the second aspect and its optional devices, refer to the beneficial effects of the first aspect and its optional methods; they are not repeated here.
In a third aspect, the present invention provides a computer device comprising a program or instructions which, when executed, perform the method of the first aspect and any of its optional methods.
In a fourth aspect, the present invention provides a storage medium comprising a program or instructions which, when executed, perform the method of the first aspect and any of its optional methods.
Description of drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below.
FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present invention;
FIG. 2 is a schematic flowchart of the steps of a method for processing sensitive data provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a device for processing sensitive data provided by an embodiment of the present invention.
Detailed description
To aid understanding of the above technical solutions, they are described in detail below with reference to the accompanying drawings and specific implementations. It should be understood that the embodiments of the present invention and the specific features in the embodiments are detailed descriptions of the technical solutions of the present invention rather than limitations on them, and that, where no conflict arises, the embodiments and the technical features in them may be combined with one another.
It should be noted that the terms "first", "second" and the like in the specification and claims of the present invention are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that objects so designated are interchangeable where appropriate, so that the embodiments of the invention described herein can be practiced in orders other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of devices and methods consistent with some aspects of the invention as detailed in the appended claims.
At present, with the explosive growth of Internet services, the number of distributed system services has surged and the interface call relationships between services have become diverse and complex. It is therefore necessary to strengthen the governance, operation and protection of the sensitive data in the data corresponding to each interface, which makes identifying sensitive data particularly important.
However, the prior art generally performs sensitive-type identification directly on the acquired data, that is, on the full data set. This makes identification inefficient and consumes considerable memory. Moreover, as the sources and volume of sensitive data grow, newly added sensitive data cannot be processed accurately and promptly; in other words, the overall efficiency of processing sensitive data is low.
In view of this, an embodiment of the present invention provides a method for processing sensitive data. With this method, an interface identifier can be computed for each application interface, and the system services corresponding to different application interfaces can be distinguished based on these identifiers. This reduces the effect of different system services producing different volumes of data and, to some extent, the effect of data skew on subsequent sensitive-type identification. In addition, the sampled data sent by multiple application interfaces can be sorted out using samples rather than the full data, which substantially reduces the amount of data to be processed and speeds up processing, thereby lowering labor and machine costs and improving the efficiency of sensitive data identification.
Having introduced the design idea of the embodiments of the present invention, the following briefly describes application scenarios to which the technical solution for processing sensitive data applies. Note that the application scenarios described here are intended to explain the technical solutions of the embodiments more clearly and do not limit them; a person of ordinary skill in the art will appreciate that, as new application scenarios emerge, the technical solutions provided by the embodiments are equally applicable to similar technical problems.
In this embodiment of the present invention, refer to the application scenario shown in FIG. 1. The scenario includes a computer device 101 and an application server 102, and the computer device 101 can communicate with the application server 102, for example through a direct or indirect connection over wired or wireless communication; the present invention places no limitation on this. The application server 102 includes application server 102-1, application server 102-2, ..., application server 102-n, where n is a positive integer greater than 2.
In this scenario, the application server 102 may send data containing sensitive data to the computer device 101, and the computer device 101 may process the received data to obtain the data types of the sensitive data it contains, thereby sorting out the sensitive data. In a specific implementation, the computer device 101 may store the processing results in a corresponding database, or send them to a data security platform deployed on another computer device.
The computer device 101 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms, but is not limited to these. The application server 102 may be a server deployed in a distributed system.
To further explain the method for processing sensitive data provided by the embodiments of the present invention, it is described in detail below with reference to the accompanying drawings and specific implementations. Although the embodiments provide the method steps shown in the following embodiments or drawings, the method may include more or fewer steps based on routine or non-inventive effort. For steps that have no necessary logical causal relationship, the order of execution is not limited to the order given in the embodiments. In actual processing or when executed by a device, the method may be executed sequentially in the order shown in the embodiments or drawings, or in parallel (for example, in a parallel-processor or multi-threaded environment).
The method for processing sensitive data in this embodiment of the present invention is described below with reference to the method flowchart shown in FIG. 2.
Step 201: Receive sampled data sent by multiple application interfaces, hash the feature information corresponding to each application interface, and determine an interface identifier for each application interface.
In this embodiment of the present invention, the computer device may receive sampled data sent by multiple application interfaces. Specifically, the application interfaces may all be of different types, or some may be of the same type and others of different types; this embodiment places no limitation on this.
In addition, in actual implementation, the number of application interfaces may change over time. For example, at 9:31 a.m. on June 17, 2021, 4 application interfaces send sampled data to the computer device, and at 9:32 a.m. on the same day, 8 application interfaces do so.
In this embodiment of the present invention, the computer device may determine the feature information of each of the application interfaces and, from it, the corresponding feature values. Specifically, the feature information may be carried by the application interfaces when they send the sampled data, or the computer device may send a request for feature information to the application servers corresponding to the application interfaces and obtain it from their responses; this embodiment places no limitation on this.
Specifically, the feature information may include at least: the service ID corresponding to the application interface; a scenario ID, for example the ID of an update scenario; the message type of the data, for example synchronous or asynchronous; the requester system number; and the responder system number.
In this embodiment of the present invention, a hash operation may be performed on the feature values corresponding to each application interface to obtain that interface's identifier. Note that each interface identifier is unique, so the corresponding application interface can be determined from its identifier.
For example, suppose the feature values corresponding to application interface 1 are V1, V2, ..., Vn, where n is a positive integer greater than 2. The interface identifier of application interface 1 can then be expressed as: APP_ID=HASH(V1+V2+...+Vn).
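As a rough illustration of APP_ID=HASH(V1+V2+...+Vn), the sketch below hashes the concatenated feature values with SHA-256. The text does not name a hash function, so SHA-256, the `interface_id` function name, the sample feature values and the separator placed between values are all assumptions made for the example.

```python
import hashlib


def interface_id(feature_values: list[str]) -> str:
    """Derive an interface identifier by hashing the concatenated feature values."""
    # Assumption: a unit-separator character is inserted between values so
    # that ("ab", "c") and ("a", "bc") do not produce the same concatenation.
    joined = "\x1f".join(feature_values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical feature values: service ID, scenario ID, message type,
# requester system number, responder system number.
app_id = interface_id(["svc-001", "scene-update", "sync", "1001", "2002"])
```

With a cryptographic hash, identifiers are unique only up to collisions, which are negligible in practice for a digest of this size.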
It can be seen that, in this embodiment of the present invention, the system services corresponding to different application interfaces are distinguished based on the interface identifiers. This reduces the effect of different system services producing different volumes of data and, to some extent, the effect of data skew on subsequent sensitive-type identification.
Step 202: Determine the sample data corresponding to each interface identifier based on the maximum total amount of sampled data that can be processed within a preset duration, the total amount of data at the current moment within the current cycle, and a preset processing condition.
In this embodiment of the present invention, the computer device may process the received sampled data periodically based on the preset duration. For example, if the preset duration is 1 minute, the received sampled data is processed in 1-minute cycles. Note that the preset duration may be chosen according to the actual implementation; this embodiment places no limitation on this.
In this embodiment of the present invention, so that the finally determined sample data has broader coverage and adapts to changes in the data sources, that is, changes in the number of application interfaces supplying data or in the total amount of data from each application interface, it may first be determined, before the sample data of each application interface is determined, whether the current cycle is the first cycle in which the sample data of each interface identifier is determined.
In a possible implementation, when the current cycle is determined to be the first cycle in which the sample data of each interface identifier is determined, the initial sample data of any interface identifier may be determined using, but not limited to, the following steps:
Step a: determine the ratio of the total amount of interface data corresponding to any interface identifier at the current moment within the current cycle to the total amount of data at the current moment within the current cycle;
Step b: multiply the ratio by the maximum total amount of sampled data that can be processed within the preset duration, to obtain the initial data total of the initial sample data for that interface identifier;
In this embodiment of the present invention, suppose the maximum total amount of sampled data that can be processed within the preset duration is K_MAX, the total amount of data at the current moment within the current cycle is N, and the total amount of interface data for any application interface at the current moment within the current cycle is N_APP_ID. The initial data total for each interface identifier can then be determined as:
Figure PCTCN2022099611-appb-000003
Figure PCTCN2022099611-appb-000004
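Steps a and b amount to a proportional allocation, K_APP_ID = (N_APP_ID / N) × K_MAX. The sketch below shows that calculation under two assumptions not stated in the text: N is taken to be the sum of the per-interface totals, and fractional results are rounded down so that the quotas never exceed K_MAX in total. The function name and sample numbers are hypothetical.

```python
def initial_quotas(interface_totals: dict[str, int], k_max: int) -> dict[str, int]:
    """Allocate each interface a share of k_max proportional to its data volume."""
    n = sum(interface_totals.values())  # assumed: N is the sum over interfaces
    if n == 0:
        return {app_id: 0 for app_id in interface_totals}
    # Integer floor division keeps the sum of quotas at or below k_max.
    return {app_id: (count * k_max) // n for app_id, count in interface_totals.items()}


# Three interfaces have sent 600, 300 and 100 records so far; K_MAX is 50.
quotas = initial_quotas({"app_a": 600, "app_b": 300, "app_c": 100}, k_max=50)
```

Rounding down can leave a few slots of K_MAX unused; how any remainder is distributed is not addressed in the text.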
Step c: determine, for any moment after the current moment within the current cycle, the first interface data total of the first interface data corresponding to any interface identifier;
Step d: when the first interface data total is determined to be not greater than the corresponding initial data total, determine a first probability with which each piece of the first interface data is returned to the corresponding array, and obtain first data in the corresponding array based on the first probability and the first interface data;
Step e: use the first data as the sample data of the interface identifier, thereby determining the sample data of any interface identifier.
In this embodiment of the present invention, because the interface data total N_APP_ID grows as the application interface corresponding to the interface identifier sends more data, suppose that at any moment after the current moment within the current cycle, the first interface data total of any interface identifier is N′_APP_ID.
Specifically, when N′_APP_ID(x) ≤ K_APP_ID(x), the first probability can be determined as:
Figure PCTCN2022099611-appb-000005
where x denotes the sequence number of the application interface; for example, if the sequence number of the first application interface is 1, the interface identifier of the first application interface is APP_ID(1).
Further, the computer device may obtain the first data in the corresponding array based on the first probability and the first interface data, and then use the first data as the sample data of the interface identifier, thereby determining the sample data of any interface identifier.
Step f: when the first interface data total of any interface identifier is determined to be greater than the data total of the initial sample data, determine a second probability with which each piece of the first interface data is returned to the corresponding array;
Step g: obtain second data in the corresponding array based on the second probability and the first interface data, the second probability being different from the first probability;
Step h: use the second data as the sample data of that interface identifier, thereby determining the sample data of each interface identifier.
In this embodiment of the present invention, when N′_APP_ID(x) > K_APP_ID(x), the second probability can be determined as:
Figure PCTCN2022099611-appb-000006
Specifically, if the current data is selected with probability
Figure PCTCN2022099611-appb-000007
it then replaces data already in the corresponding array with probability
Figure PCTCN2022099611-appb-000008
otherwise the array is left unchanged. The probability of retaining the current data is therefore
Figure PCTCN2022099611-appb-000009
In this embodiment of the present invention, the foregoing scheme for determining the sample data corresponding to an interface identifier must satisfy the preset processing condition. Specifically, the preset processing condition may be expressed as follows:
Figure PCTCN2022099611-appb-000010
where X is the number of application interface types, K_App_ID is the amount of sample data for each type of application interface within the preset duration, and K_MAX is the maximum total amount of sampled data that can be processed within the preset duration.
For example, if the number of application interface types is 3, the sum of K_APP_ID(1), K_APP_ID(2) and K_APP_ID(3) is not greater than K_MAX.
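Judging from the definitions of X, K_App_ID and K_MAX and from the three-interface example, the condition appears to be a bound on the sum of the per-interface sample amounts. The block below writes that reading in LaTeX and checks it against the proportional allocation of steps a and b. It is a reconstruction, not a transcription of the figure, and the second line assumes that N is the sum of the per-interface totals and that quotas are rounded down.

```latex
% Reconstructed reading of the preset processing condition
\sum_{x=1}^{X} K_{APP\_ID(x)} \le K_{MAX}

% Check against the proportional allocation, assuming
% N = \sum_{x=1}^{X} N_{APP\_ID(x)} and rounding down:
\sum_{x=1}^{X} \left\lfloor \frac{N_{APP\_ID(x)}}{N}\, K_{MAX} \right\rfloor
  \le \frac{K_{MAX}}{N} \sum_{x=1}^{X} N_{APP\_ID(x)}
  = K_{MAX}
```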
It follows that when a new application interface is added, the total interface data of each already-identified application interface is unchanged while the overall data total grows, so the sample data quota of each existing application interface becomes smaller.
Specifically, suppose application interface x is an application interface whose sample data has already been determined. If the total amount of sample data held for application interface x is not greater than the reduced sample data total determined at a moment after the current moment, the previously determined sample data total does not need to be adjusted, and subsequent data from application interface x is returned to the array as follows: when the initial data total of application interface x is not greater than the reduced interface data total, data is returned to the array based on
Figure PCTCN2022099611-appb-000011
and when the initial data total of application interface x is greater than the reduced interface data total, data is returned to the array based on
Figure PCTCN2022099611-appb-000012
Conversely, if the total amount of sample data held for application interface x is greater than the reduced sample data total determined at a moment after the current moment, the existing sample data must be cut down to the reduced sample data total, and the probability of returning data to the array is unchanged.
In a possible implementation, when the current cycle duration is determined to be the cycle duration in which the sample data of each interface identifier is determined for the first time, the scheme for determining the sample data of each interface identifier may include, but is not limited to, the following steps:
Step A: When historical sample data is stored in the array corresponding to any interface identifier, process the historical sample data to obtain the sample identifier of each piece of historical sample data;
Step B: Determine the total amount of historical sample data corresponding to that interface identifier and the total amount of data corresponding to each sample identifier, and, based on these two totals, determine the weight coefficient corresponding to each sample identifier;
Step C: When the first interface data total of any interface identifier is determined to be greater than the data total of the initial sample data, determine the third probability that each piece of data in the first interface data is returned to the corresponding array;
Step D: Based on the third probability and the data in the first interface data, obtain the third data in the corresponding array, and use the third data as the sample data of that interface identifier, thereby determining the sample data of each interface identifier, where the third probability is the product of the second probability and the weight coefficient.
In the embodiment of the present invention, APP_ID is computed from the feature information or attribute values of the application interface, whereas sensitive-type identification must examine each piece of data sent by the application interface, that is, the message content of each piece of data. Therefore, when determining the sample data for the current interface, the probability that data of a type already selected as a sample is selected again can be lowered, which reduces data skew as far as possible and improves the coverage of the sample data.
Specifically, the content of each piece of historical sample data can be parsed to obtain attributes of the message content such as the parameter list P and the message length L, and a hash algorithm is used to compute a unique identifier for the message content. This unique identifier may be called the sample identifier and can be expressed as: BODY_ID = HASH(P + … + L).
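A minimal sketch of computing such a sample identifier follows. The text does not name the hash algorithm or list every attribute that enters the hash, so SHA-256, the `|` separator, the sorting of parameter names and the function name `body_id` are all assumptions made for illustration; only the parameter list P and the message length L are used here.

```python
import hashlib


def body_id(param_names, message_length):
    """Return a hex digest shared by messages with the same parameters and length.

    Assumptions: SHA-256 as the hash, parameter names joined with '|'.
    """
    # Sort the names so that parameter order does not change the identifier.
    parts = sorted(param_names) + [str(message_length)]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```

Two messages that carry the same parameter names and length map to the same identifier, which is what allows repeated message shapes to be counted and down-weighted later.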
Suppose the total amount of historical sample data corresponding to any interface identifier is denoted K_APP_ID(ALL), and the total amount of data corresponding to any sample identifier is denoted V_BODY_ID. The weight coefficient of each sample identifier can then be determined as: W_BODY_ID = 1 - V_BODY_ID / K_APP_ID(ALL). Thus, when V_BODY_ID is 0, W_BODY_ID is 1.
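The weight formula can be sketched directly. The function name and the handling of an empty history (returning 1, meaning no down-weighting) are assumptions; the text only covers the case where historical sample data exists.

```python
def weight_coefficient(v_body_id, k_app_id_all):
    """W_BODY_ID = 1 - V_BODY_ID / K_APP_ID(ALL)."""
    if k_app_id_all <= 0:
        # Assumption: with no historical samples there is nothing to down-weight.
        return 1.0
    return 1.0 - v_body_id / k_app_id_all
```

A message shape that already fills a quarter of the historical samples gets weight 0.75, while a shape never seen before keeps weight 1.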
In the embodiment of the present invention, when N′_APP_ID(x) > K_APP_ID(x) is determined, the third probability can be determined as:
Figure PCTCN2022099611-appb-000013
Specifically, if the current data is selected with probability
Figure PCTCN2022099611-appb-000014
it then replaces an existing element in the corresponding array with probability
Figure PCTCN2022099611-appb-000015
Figure PCTCN2022099611-appb-000016
otherwise the array elements remain unchanged. Therefore, the probability of retaining the current data is
Figure PCTCN2022099611-appb-000017
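As a rough illustration of Step D, the third probability can be sketched as the second probability multiplied by the weight coefficient. The formula images are not reproduced in this text, so taking the second probability to be K_APP_ID(x) / N′_APP_ID(x) is an assumption based on the 100/101 value used in the worked example of this description; the function and parameter names are hypothetical.

```python
def third_probability(k_quota, n_seen, v_body_id, k_app_id_all):
    """Second probability times the weight coefficient W_BODY_ID."""
    # Assumed second probability: quota divided by the number of items seen.
    second = k_quota / n_seen
    # Weight coefficient as defined in the text: 1 - V_BODY_ID / K_APP_ID(ALL).
    weight = 1.0 - v_body_id / k_app_id_all
    return second * weight
```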
To better illustrate the process of determining the sample data, the manner of determining sample data provided in step 202 is described below using a specific processing procedure as an example.
In the embodiment of the present invention, suppose the preset duration is a unit of time such as 1 minute, the current cycle duration is the one in which the sample data of application interface A is determined for the first time, the maximum total amount of processed data is 100 pieces of data, and the data total of application interface A is 0.
Then, at a first moment after the current moment within the current cycle duration, for example 15:06:01, if the 1st piece of data sent by application interface A is received, the initial data total of application interface A can be determined as 1/1 * 100 = 100. The first interface data total of application interface A (1 piece) is not greater than its initial data total (100 pieces), so the 1st piece of data is returned to the corresponding array with the first probability, 1/1 = 1.
At 15:06:02, if the 2nd piece of data sent by application interface A is received, the initial data total of application interface A can be determined as 2/2 * 100 = 100. The first interface data total of application interface A (2 pieces) is not greater than its initial data total (100 pieces), so the 2nd piece of data is returned to the corresponding array with the first probability, 2/2 = 1.
At 15:06:13, if the 100th piece of data sent by application interface A is received, the initial data total of application interface A can be determined as 100/100 * 100 = 100. The first interface data total of application interface A (100 pieces) is not greater than its initial data total (100 pieces), so the 100th piece of data is returned to the corresponding array with the first probability, 100/100 = 1.
At 15:06:15, if the 101st piece of data sent by application interface A is received, the initial data total of application interface A can be determined as 101/101 * 100 = 100. The first interface data total of application interface A (101 pieces) is now greater than its initial data total (100 pieces), so the 101st piece of data is first kept for the array with probability 100/101, and one of the 100 pieces already in the array is chosen with probability 1/100 to be replaced.
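The behaviour in this example matches a classic reservoir sampling step, which can be sketched as follows. This is an illustrative sketch rather than the patented procedure itself; the function name `offer`, its parameters and the use of Python's `random` module are assumptions.

```python
import random


def offer(reservoir, item, n_seen, capacity, rng=random):
    """Offer the n_seen-th item (1-based) to a reservoir of fixed capacity.

    Returns True if the item ends up stored in the reservoir.
    """
    if len(reservoir) < capacity:
        # Below quota: store with probability 1 (items 1..100 in the example).
        reservoir.append(item)
        return True
    if rng.random() < capacity / n_seen:
        # Kept with probability capacity / n_seen, e.g. 100/101; each stored
        # element is equally likely (1/len(reservoir)) to be the one replaced.
        reservoir[rng.randrange(len(reservoir))] = item
        return True
    return False
```

With `capacity=100`, the first 100 calls always store the item, and the 101st call stores it only with probability 100/101 while the reservoir size stays at 100.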
Further, at 15:06:16, if the 1st piece of data sent by application interface B is received, the initial data total of application interface B can be determined as 1/102 * 100 = 1. Note that in the actual calculation the final quantity can be determined by rounding up. The first interface data total of application interface B (1 piece) is not greater than the corresponding initial data total (1 piece), so the 1st piece of data of application interface B is returned to the corresponding array with the first probability, 1/1 = 1.
At 15:06:17, if the 2nd piece of data sent by application interface B is received, the initial data total of application interface B can be determined as 2/103 * 100 = 2, again rounding up. The first interface data total of application interface B (2 pieces) is not greater than the corresponding initial data total (2 pieces), so the 2nd piece of data of application interface B is returned to the corresponding array with the first probability, 2/2 = 1.
At 15:06:19, if the 11th piece of data sent by application interface B is received, the initial data total of application interface B can be determined as 11/112 * 100 = 10. The first interface data total of application interface B (11 pieces) is greater than the corresponding initial data total (10 pieces), so the 11th piece of data is first kept for the array with probability 10/11, and one of the 10 pieces already in the array is chosen with probability 1/10 to be replaced.
At 15:06:20, if the 12th piece of data sent by application interface B is received, the initial data total of application interface B can be determined as 12/113 * 100 = 11 (rounded up). The array's sample data quota therefore becomes 11 pieces, while the 11th piece of data received from application interface B replaced one of the 10 pieces in the original array, so the number of pieces in the array for application interface B is smaller than its initial data total. The 12th piece of data of application interface B can therefore be returned to the corresponding array with the first probability, 1.
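The quota arithmetic used throughout this example (interface share of the traffic, times the maximum processing total, rounded up) can be checked with a short sketch. The function name is hypothetical, and integer ceiling division is used instead of floating point to avoid rounding surprises.

```python
def initial_quota(n_interface, n_total, k_max):
    """Ceiling of n_interface / n_total * k_max, computed with integers."""
    return (n_interface * k_max + n_total - 1) // n_total
```

For the figures quoted in the text, `initial_quota(1, 102, 100)` gives 1, `initial_quota(11, 112, 100)` gives 10 and `initial_quota(12, 113, 100)` gives 11.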
At 15:06:35, if the 102nd piece of data sent by application interface A is received, the initial data total of application interface A becomes 102/114 * 100 = 90 (rounded up), while the array for application interface A already holds 100 pieces of data, that is, the historical sample data exceeds this total. Therefore, the array is first thinned, keeping each element with probability 90/100 so that 90 remain; the 102nd piece of data is then kept with probability 90/102, and one of the elements in the array is chosen with probability 1/90 to be replaced.
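The thinning step can be sketched as drawing a uniform random subset of the stored elements, which gives every element the same 90/100 chance of surviving. The function name is hypothetical, and using `random.sample` is one possible way to realise the step, not necessarily the one intended by the text.

```python
import random


def shrink_reservoir(reservoir, new_quota, rng=random):
    """Reduce the reservoir to new_quota elements, each equally likely to stay."""
    if len(reservoir) <= new_quota:
        # Already within quota: nothing to drop.
        return list(reservoir)
    # Uniform subset without replacement: each element survives with
    # probability new_quota / len(reservoir).
    return rng.sample(reservoir, new_quota)
```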
It can be seen that the foregoing method, which is based on an improved reservoir sampling method, samples streaming data with strong randomness, so that the sample data has broader coverage and adapts better to changes in data sources, improving the effectiveness and stability of identifying and sorting out sensitive data.
Step 203: Determine the converted data corresponding to each piece of data in each sample data, where the converted data includes a field name and the value corresponding to that field name.
In the embodiment of the present invention, after the sample data corresponding to each of the multiple interface identifiers is determined, the computer device can parse each piece of data in each sample data. Specifically, message formats such as JSON and XML can be converted into KEY-VALUE pairs, that is, converted data comprising field names and corresponding values.
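For the JSON case, the conversion into field name and value pairs might look like the sketch below. The dotted naming of nested fields, the treatment of list elements and the function name are assumptions; the text only states that messages are turned into KEY-VALUE pairs.

```python
import json


def to_key_values(message):
    """Parse a JSON message into a flat list of (field name, value) pairs."""
    pairs = []

    def walk(node, prefix):
        if isinstance(node, dict):
            for key, value in node.items():
                # Assumption: nested field names are joined with a dot.
                walk(value, f"{prefix}.{key}" if prefix else key)
        elif isinstance(node, list):
            # Every element of a list is reported under the list's field name.
            for element in node:
                walk(element, prefix)
        else:
            pairs.append((prefix, node))

    walk(json.loads(message), "")
    return pairs
```

Each resulting pair can then be passed to sensitive-type identification on its own, independent of how deeply the value was nested in the original message.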
Step 204: Perform sensitive-type identification on the corresponding values in each piece of converted data, obtaining the sensitive types of all corresponding values in each piece of converted data.
In the embodiment of the present invention, the computer device may perform initial identification processing on the corresponding values in all the converted data, obtaining the total number of times all corresponding values have been identified and the number of times they have been identified as each sensitive type.
Further, the computer device may identify the sensitive type of each piece of data based on an identification strategy. Specifically, the identification strategy is either metadata keyword matching followed by verification with a preset algorithm, or preset regular expression matching followed by verification with a preset algorithm.
In the embodiment of the present invention, the computer device may identify and match any corresponding value in each piece of converted data based on a preset regular expression or preset metadata keywords. The preset regular expression may be a VALUE regular expression, and the preset metadata keywords may be determined according to the actual implementation; neither is limited in this embodiment of the present invention. After the match succeeds, the corresponding value is verified using a preset algorithm. When the verification passes, the total count and the count for the sensitive type to which the value belongs are each incremented, giving the first total count and the first count. The preset algorithm may be a VALUE algorithm or, of course, another algorithm, which is not limited in this embodiment of the present invention.
Further, a first recognition rate can be obtained based on the first total count and the first count, where the recognition rate characterizes the probability that the type of the corresponding value is a specific sensitive type. When the first recognition rate is determined to be not less than the corresponding preset threshold, a label is added to the corresponding value, and the label indicates that the type of that value is the specific sensitive type.
In the embodiment of the present invention, suppose the total number of times all corresponding values have been identified is denoted N_APP_ID_FIELD, and the number of times a corresponding value has been identified as a given sensitive type is denoted N_X, where X is the label of a sensitive type such as ID number (ID), mobile phone number (PHONE) or bank card number (BANK).
In the embodiment of the present invention, suppose the preset threshold for any corresponding value is denoted R_ERROR and the field name of that value is denoted F. When the value is matched as a bank card number by the preset regular expression and passes the algorithm check, N_APP_ID_FIELD(F) and N_BANK(F) are each incremented by one, giving the first total count and the first count, from which the first recognition rate is obtained: R_S(BANK) = N′_APP_ID_FIELD(F) / N′_BANK(F).
Specifically, if R_S(BANK) is not less than R_ERROR, the BANK label is added to field F, and the application interface corresponding to that value is determined to be a sensitive interface "involving bank card numbers". If R_S(BANK) is less than R_ERROR, the BANK label is not added to field F, and if the field already has a BANK label, it is cleared.
Note that, in the embodiment of the present invention, the preset algorithm for verifying a bank card number may be the modulo 10 algorithm or, of course, another algorithm, which is not limited in the embodiment of the present invention. Different preset algorithms can thus be used for different specific sensitive types.
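Assuming that the modulo 10 algorithm refers to the Luhn check commonly applied to payment card numbers, the verification step could be sketched as follows. The function name is hypothetical, and the text does not say whether card length or issuer prefix is also validated, so only the check digit is tested here.

```python
def passes_mod10(number):
    """Return True if a string of ASCII digits satisfies the Luhn (mod 10) check."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, ch in enumerate(reversed(number)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                # Equivalent to summing the two digits of the doubled value.
                digit -= 9
        total += digit
    return total % 10 == 0
```

A value that matches the bank card regular expression but fails this check would be treated as a failed verification rather than counted as a bank card number.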
In a possible implementation, when the computer device determines that any corresponding value in a piece of converted data fails identification matching and/or verification, the total count is incremented to obtain a second total count, and a second recognition rate is then obtained based on the second total count and the count for the sensitive type to which the value belongs. Further, when the second recognition rate is determined to be not less than the preset threshold, the label corresponding to that value is kept unchanged.
In the embodiment of the present invention, the processing of field F above is continued as an example. Specifically, when field F does not match any regular expression, or fails the algorithm check, the value of N_APP_ID_FIELD(F) is incremented by one, giving the second total count, from which the second recognition rate can be determined: R′_S(BANK) = N′_APP_ID_FIELD(F) / N_BANK(F). If the second recognition rate R′_S(BANK) is not less than R_ERROR, the label of field F is unchanged; if the second recognition rate R′_S(BANK) is less than R_ERROR, the label corresponding to field F is cleared.
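The counting and labelling rules for a single field can be sketched as below. Because the text describes the recognition rate as a probability, this sketch computes it as the per-type count divided by the total count, which is an assumption: the printed formulas place the total count in the numerator. The class name, function name and the 0.8 threshold used in the usage note are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FieldStats:
    total: int = 0                              # N_APP_ID_FIELD(F)
    hits: dict = field(default_factory=dict)    # N_X(F) per sensitive type
    labels: set = field(default_factory=set)    # labels currently on field F


def record(stats, sensitive_type, passed, threshold):
    """Update the counters for one value of the field and adjust its label."""
    stats.total += 1
    if passed:
        stats.hits[sensitive_type] = stats.hits.get(sensitive_type, 0) + 1
    # Assumed orientation of the rate: type count over total count.
    rate = stats.hits.get(sensitive_type, 0) / stats.total
    if rate >= threshold:
        # A pass adds the label; a failure leaves the existing label untouched.
        if passed:
            stats.labels.add(sensitive_type)
    else:
        stats.labels.discard(sensitive_type)
    return rate
```

With a threshold of 0.8, one pass followed by one failure drops the rate to 0.5 and clears the BANK label, while nine passes out of eleven observations keep it.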
Note that, in the embodiment of the present invention, if any field has multiple meanings, that is, it cannot pass verification and has no previous label, a prompt is output, and the label of the field is determined by the user of the computer device marking it manually.
It can be seen that, in the embodiment of the present invention: first, using samples rather than the full data set to sort out sensitive assets across the overall service interfaces greatly reduces the amount of data to be processed, speeds up data processing, and lowers labor and machine resource costs. Second, computing a unique identifier for each application interface and classifying data by it distinguishes different system services, lessens the effect of differing request volumes across system services, mitigates data skew to some extent, and makes the sample data better match the characteristics of the overall service interfaces. Third, the practical application scenario is real-time data processing: based on the improved reservoir sampling method, streaming data can be sampled with strong randomness, and multiplying by a weight coefficient lowers the probability that already-sampled data is sampled again (in effect raising the sampling probability of low-volume data), so that the sample data has broader coverage and adapts better to changes in data sources, improving the effectiveness and stability of sorting out sensitive assets.
As shown in FIG. 3, the present invention provides a device for processing sensitive data. The device includes a first processing unit 301, a determination unit 302, a second processing unit 303 and an obtaining unit 304, wherein:
The first processing unit 301 is configured to receive sampled data sent by multiple application interfaces, hash the feature information corresponding to each application interface, and determine the interface identifier of each application interface;
The determination unit 302 is configured to determine the sample data corresponding to each interface identifier based on the maximum total amount of sampled data that can be processed within a preset duration, the total amount of data at the current moment within the current cycle duration, and a preset processing condition;
The second processing unit 303 is configured to determine the converted data corresponding to each piece of data in each sample data, where the converted data includes a field name and the value corresponding to that field name;
The obtaining unit 304 is configured to perform sensitive-type identification on the corresponding values in each piece of converted data, obtaining the sensitive types of all corresponding values in each piece of converted data.
In a possible implementation, the preset processing condition is expressed as follows:
Figure PCTCN2022099611-appb-000018
Here, I denotes the number of application interface types, K_APP_ID denotes the amount of sample data for each type of application interface within the preset duration, and K_MAX denotes the maximum total amount of sampled data that can be processed within the preset duration.
In a possible implementation, the determination unit 302 is specifically configured to: determine whether the current cycle duration is the cycle duration in which the sample data of each interface identifier is determined for the first time; when it is, determine the ratio of the total amount of interface data corresponding to any interface identifier at the current moment within the current cycle duration to the total amount of data at the current moment within the current cycle duration; multiply the ratio by the maximum total amount of processed data to obtain the initial data total of the initial sample data of that interface identifier, and store the initial sample data in the corresponding array; determine, at any moment after the current moment within the current cycle duration, the first interface data total of the first interface data corresponding to that interface identifier; when any first interface data total is determined to be not greater than the corresponding initial data total, determine the first probability that each piece of data in the first interface data is returned to the corresponding array, and, based on the first probability and the data in the first interface data, obtain the first data in the corresponding array; and use the first data as the sample data of that interface identifier, thereby determining the sample data of each interface identifier.
In a possible implementation, the determination unit 302 is further configured to: when the first interface data total of any interface identifier is determined to be greater than the data total of the initial sample data, determine the second probability that each piece of data in the first interface data is returned to the corresponding array; based on the second probability and the data in the first interface data, obtain the second data in the corresponding array, where the second probability differs from the first probability; and use the second data as the sample data of that interface identifier, thereby determining the sample data of each interface identifier.
In a possible implementation, the determination unit 302 is specifically configured to: when the current cycle duration is determined not to be the cycle duration in which the sample data of each interface identifier is determined for the first time, and the historical sample data is determined to be stored in the array corresponding to any interface identifier, process the historical sample data to obtain the sample identifier of each piece of historical sample data; determine the total amount of historical sample data corresponding to that interface identifier and the total amount of data corresponding to any sample identifier, and, based on these two totals, determine the weight coefficient corresponding to any sample identifier; when the first interface data total of any interface identifier is determined to be greater than the data total of the initial sample data, determine the third probability that each piece of data in the first interface data is returned to the corresponding array; and, based on the third probability and the data in the first interface data, obtain the third data in the corresponding array and use the third data as the sample data of that interface identifier, thereby determining the sample data of each interface identifier, where the third probability is the product of the second probability and the weight coefficient.
In a possible implementation, the obtaining unit 304 is specifically configured to: perform initial identification processing on the corresponding values in all the converted data, obtaining the total number of times all corresponding values have been identified and the number of times they have been identified as each sensitive type; based on a preset regular expression or preset metadata keywords, identify and match any corresponding value in each piece of converted data; after the match succeeds, verify that value using a preset algorithm, and when the verification passes, increment the total count and the count for the sensitive type to which the value belongs, obtaining a first total count and a first count; obtain a first recognition rate based on the first total count and the first count, where the recognition rate characterizes the probability that the type of the corresponding value is a specific sensitive type; and when the first recognition rate is determined to be not less than the corresponding preset threshold, add a label to the corresponding value, where the label indicates that the type of that value is the specific sensitive type.
In a possible implementation, the obtaining unit 304 is further configured to: when identification matching and/or verification fails for any corresponding value in each piece of converted data, increment the total number of times to obtain a second total count; obtain a second recognition rate based on the second total count and the number of times for the sensitive type to which that corresponding value belongs; and, when it is determined that the second recognition rate is not less than the preset threshold, keep the label corresponding to that corresponding value unchanged.
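As a non-limiting illustration, the two counters and the recognition rate might be tracked per field as sketched below. The class and attribute names are hypothetical. This passage does not say what happens when a failed check pushes the rate below the threshold; clearing the label in that case is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldRecognition:
    """Recognition counters for one field and one candidate sensitive type."""
    sensitive_type: str
    threshold: float
    total: int = 0   # total number of identifications so far
    hits: int = 0    # identifications that matched and passed verification
    label: Optional[str] = None

    def record(self, passed: bool) -> float:
        """Record one identification result and return the recognition rate."""
        self.total += 1
        if passed:
            self.hits += 1
        rate = self.hits / self.total
        if passed and rate >= self.threshold:
            # First recognition rate reaches the threshold: add the label.
            self.label = self.sensitive_type
        elif not passed and rate < self.threshold:
            # Assumption: below the threshold the label is cleared.
            self.label = None
        # Failed check with rate >= threshold: label stays unchanged.
        return rate
```

For example, with a threshold of 0.6, two passes followed by one failure give a rate of 2/3 and the label is kept; a second failure lowers the rate to 0.5 and, under the stated assumption, the label is cleared.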
An embodiment of the present invention provides a computer device comprising a program or instructions which, when executed, perform the method for processing sensitive data provided by the embodiments of the present invention and any of its optional methods.
An embodiment of the present invention provides a storage medium comprising a program or instructions which, when executed, perform the method for processing sensitive data provided by the embodiments of the present invention and any of its optional methods.
Finally, it should be noted that those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (16)

  1. A method for processing sensitive data, characterized in that the method comprises:
    receiving sampled data sent by a plurality of application interfaces, performing hash processing on the feature information corresponding to each of the application interfaces, and determining an interface identifier of each of the application interfaces;
    determining the sample data corresponding to each of the interface identifiers based on the maximum total amount of sampled data processed within a preset duration, the total amount of data at the current moment within the current cycle duration, and a preset processing condition;
    determining converted data corresponding to each piece of data in each of the sample data, the converted data comprising a field name and a corresponding value for the field name;
    performing sensitive-type identification on the corresponding values in each piece of the converted data, to obtain the sensitive types corresponding to all corresponding values in each piece of the converted data.
  2. The method according to claim 1, characterized in that the preset processing condition is expressed as follows:
    Figure PCTCN2022099611-appb-100001
    where X denotes the number of types of the application interfaces, K_App_ID denotes the amount of sample data of each type of application interface within the preset duration, and K_MAX denotes the maximum total amount of sampled data processed within the preset duration.
  3. The method according to claim 1 or 2, characterized in that determining the sample data of each of the interface identifiers based on the maximum total amount of sampled data processed within the preset duration, the total amount of data at the current moment within the current cycle duration, and the preset processing condition comprises:
    determining whether the current cycle duration is the first cycle duration in which the sample data of each of the interface identifiers is determined;
    when it is determined that the current cycle duration is the first cycle duration in which the sample data of each of the interface identifiers is determined, determining the ratio of the total amount of interface data corresponding to any one of the interface identifiers at the current moment within the current cycle duration to the total amount of data at the current moment within the current cycle duration;
    multiplying the ratio by the maximum total amount of processed data to obtain the initial data total of the initial sample data of that interface identifier;
    determining, at any moment after the current moment within the current cycle duration, the first interface data total of the first interface data corresponding to that interface identifier;
    when it is determined that the first interface data total is not greater than the corresponding initial data total, determining a first probability that each piece of data in the first interface data is returned to the corresponding array, and obtaining first data in the corresponding array based on the first probability and the data in the first interface data;
    using the first data as the sample data of that interface identifier, so as to determine the sample data of any one of the interface identifiers.
  4. The method according to claim 3, characterized in that the method further comprises:
    when it is determined that the first interface data total of any one of the interface identifiers is greater than the data total of the initial sample data, determining a second probability that each piece of data in the first interface data is returned to the corresponding array;
    obtaining second data in the corresponding array based on the second probability and the data in the first interface data, the second probability being different from the first probability;
    using the second data as the sample data of that interface identifier, so as to determine the sample data of each of the interface identifiers.
  5. The method according to claim 3, characterized in that determining the sample data of each of the interface identifiers based on the maximum total amount of sampled data processed within the preset duration, the total amount of data at the current moment within the current cycle duration, and the preset processing condition comprises:
    when it is determined that the current cycle duration is not the first cycle duration in which the sample data of each of the interface identifiers is determined, and that historical sample data is stored in the array corresponding to any one of the interface identifiers, processing the historical sample data to obtain a sample identifier of each piece of historical sample data;
    determining the total amount of historical sample data corresponding to any one of the interface identifiers and the total amount of data corresponding to any one of the sample identifiers, and determining, based on the total amount of the historical sample data and the total amount of data corresponding to the sample identifier, the weight coefficient corresponding to any sample identifier;
    when it is determined that the first interface data total of any one of the interface identifiers is greater than the data total of the initial sample data, determining a third probability that each piece of data in the first interface data is returned to the corresponding array;
    obtaining third data in the corresponding array based on the third probability and the data in the first interface data, and using the third data as the sample data of that interface identifier, so as to determine the sample data of each of the interface identifiers, the third probability being the product of the second probability and the weight coefficient.
  6. The method according to claim 1, characterized in that performing sensitive-type identification on the corresponding values in each piece of the converted data, to obtain the sensitive types corresponding to all corresponding values in each piece of the converted data, comprises:
    performing initial identification processing on the corresponding values in all of the converted data, to obtain the total number of times all corresponding values have been identified and the number of times the corresponding values have been identified as each sensitive type;
    performing identification matching on any corresponding value in each piece of the converted data based on a preset regular expression or a preset metadata keyword; when the match passes, verifying that corresponding value based on a preset algorithm; and, when the verification passes, incrementing the total number of times and the number of times for the sensitive type to which that corresponding value belongs, to obtain a first total count and a first count;
    obtaining a first recognition rate based on the first total count and the first count, the recognition rate characterizing the probability that the type of that corresponding value is a specific sensitive type;
    when it is determined that the first recognition rate is not less than the corresponding preset threshold, adding a label to that corresponding value, the label indicating that the type of that corresponding value is the specific sensitive type.
  7. The method according to claim 6, characterized in that the method further comprises:
    when identification matching and/or verification fails for any corresponding value in each piece of the converted data, incrementing the total number of times to obtain a second total count;
    obtaining a second recognition rate based on the second total count and the number of times for the sensitive type to which that corresponding value belongs;
    when it is determined that the second recognition rate is not less than the preset threshold, keeping the label corresponding to that corresponding value unchanged.
  8. A device for processing sensitive data, characterized in that the device comprises:
    a first processing unit, configured to receive sampled data sent by a plurality of application interfaces, perform hash processing on the feature information corresponding to each of the application interfaces, and determine an interface identifier of each of the application interfaces;
    a determining unit, configured to determine the sample data corresponding to each of the interface identifiers based on the maximum total amount of sampled data processed within a preset duration, the total amount of data at the current moment within the current cycle duration, and a preset processing condition;
    a second processing unit, configured to determine converted data corresponding to each piece of data in each of the sample data, the converted data comprising a field name and a corresponding value for the field name;
    an obtaining unit, configured to perform sensitive-type identification on the corresponding values in each piece of the converted data, to obtain the sensitive types corresponding to all corresponding values in each piece of the converted data.
  9. The device according to claim 8, characterized in that the preset processing condition is expressed as follows:
    Figure PCTCN2022099611-appb-100002
    where X denotes the number of types of the application interfaces, K_App_ID denotes the amount of sample data of each type of application interface within the preset duration, and K_MAX denotes the maximum total amount of sampled data processed within the preset duration.
  10. The device according to claim 8 or 9, characterized in that the determining unit is specifically configured to:
    determine whether the current cycle duration is the first cycle duration in which the sample data of each of the interface identifiers is determined;
    when it is determined that the current cycle duration is the first cycle duration in which the sample data of each of the interface identifiers is determined, determine the ratio of the total amount of interface data corresponding to any one of the interface identifiers at the current moment within the current cycle duration to the total amount of data at the current moment within the current cycle duration;
    multiply the ratio by the maximum total amount of processed data to obtain the initial data total of the initial sample data of that interface identifier;
    determine, at any moment after the current moment within the current cycle duration, the first interface data total of the first interface data corresponding to that interface identifier;
    when it is determined that the first interface data total is not greater than the corresponding initial data total, determine a first probability that each piece of data in the first interface data is returned to the corresponding array, and obtain first data in the corresponding array based on the first probability and the data in the first interface data;
    use the first data as the sample data of that interface identifier, so as to determine the sample data of any one of the interface identifiers.
  11. The device according to claim 10, characterized in that the determining unit is further configured to:
    when it is determined that the first interface data total of any one of the interface identifiers is greater than the data total of the initial sample data, determine a second probability that each piece of data in the first interface data is returned to the corresponding array;
    obtain second data in the corresponding array based on the second probability and the data in the first interface data, the second probability being different from the first probability;
    use the second data as the sample data of that interface identifier, so as to determine the sample data of each of the interface identifiers.
  12. The device according to claim 10, characterized in that the determining unit is specifically configured to:
    when it is determined that the current cycle duration is not the first cycle duration in which the sample data of each of the interface identifiers is determined, and that historical sample data is stored in the array corresponding to any one of the interface identifiers, process the historical sample data to obtain a sample identifier of each piece of historical sample data;
    determine the total amount of historical sample data corresponding to any one of the interface identifiers and the total amount of data corresponding to any one of the sample identifiers, and determine, based on the total amount of the historical sample data and the total amount of data corresponding to the sample identifier, the weight coefficient corresponding to any sample identifier;
    when it is determined that the first interface data total of any one of the interface identifiers is greater than the data total of the initial sample data, determine a third probability that each piece of data in the first interface data is returned to the corresponding array;
    obtain third data in the corresponding array based on the third probability and the data in the first interface data, and use the third data as the sample data of that interface identifier, so as to determine the sample data of each of the interface identifiers, the third probability being the product of the second probability and the weight coefficient.
  13. The device according to claim 8, characterized in that the obtaining unit is specifically configured to:
    perform initial identification processing on the corresponding values in all of the converted data, to obtain the total number of times all corresponding values have been identified and the number of times the corresponding values have been identified as each sensitive type;
    perform identification matching on any corresponding value in each piece of the converted data based on a preset regular expression or a preset metadata keyword; when the match passes, verify that corresponding value based on a preset algorithm; and, when the verification passes, increment the total number of times and the number of times for the sensitive type to which that corresponding value belongs, to obtain a first total count and a first count;
    obtain a first recognition rate based on the first total count and the first count, the recognition rate characterizing the probability that the type of that corresponding value is a specific sensitive type;
    when it is determined that the first recognition rate is not less than the corresponding preset threshold, add a label to that corresponding value, the label indicating that the type of that corresponding value is the specific sensitive type.
  14. The device according to claim 13, characterized in that the obtaining unit is further configured to:
    when identification matching and/or verification fails for any corresponding value in each piece of the converted data, increment the total number of times to obtain a second total count;
    obtain a second recognition rate based on the second total count and the number of times for the sensitive type to which that corresponding value belongs;
    when it is determined that the second recognition rate is not less than the preset threshold, keep the label corresponding to that corresponding value unchanged.
  15. A computer device, characterized by comprising a program or instructions which, when executed, cause the method according to any one of claims 1 to 7 to be performed.
  16. A storage medium, characterized by comprising a program or instructions which, when executed, cause the method according to any one of claims 1 to 7 to be performed.
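
As a non-limiting illustration of the first steps recited in claims 1 and 3, the sketch below derives an interface identifier by hashing feature information and allocates an initial per-interface sample quota as (interface share of current traffic) × (maximum total amount of processed data). The choice of SHA-256, the feature fields `service`, `path` and `method`, and the rounding down of quotas are assumptions, not taken from the claims.

```python
import hashlib


def interface_id(service: str, path: str, method: str) -> str:
    """Hash assumed feature information into a stable interface identifier."""
    feature = "|".join((service, path, method))
    return hashlib.sha256(feature.encode("utf-8")).hexdigest()


def initial_quotas(counts: dict, k_max: int) -> dict:
    """Initial data total per interface: ratio of traffic times k_max.

    counts maps interface identifier -> amount of data seen so far in the
    current cycle. Quotas are rounded down (an assumption), so their sum
    never exceeds k_max.
    """
    total = sum(counts.values())
    if total == 0:
        return {key: 0 for key in counts}
    return {key: k_max * c // total for key, c in counts.items()}
```

Rounding down keeps the sum of all quotas within the overall budget, at the cost of leaving a few slots unused when the shares do not divide evenly.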
PCT/CN2022/099611 2021-11-03 2022-06-17 Method and device for processing sensitive data WO2023077815A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111294701.7 2021-11-03
CN202111294701.7A CN114048512A (en) 2021-11-03 2021-11-03 Method and device for processing sensitive data

Publications (1)

Publication Number Publication Date
WO2023077815A1 true WO2023077815A1 (en) 2023-05-11

Family

ID=80207059

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099611 WO2023077815A1 (en) 2021-11-03 2022-06-17 Method and device for processing sensitive data

Country Status (2)

Country Link
CN (1) CN114048512A (en)
WO (1) WO2023077815A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114048512A (en) * 2021-11-03 2022-02-15 深圳前海微众银行股份有限公司 Method and device for processing sensitive data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10275396B1 (en) * 2014-09-23 2019-04-30 Symantec Corporation Techniques for data classification based on sensitive data
CN110222170A (en) * 2019-04-25 2019-09-10 平安科技(深圳)有限公司 A kind of method, apparatus, storage medium and computer equipment identifying sensitive data
US20200380160A1 (en) * 2019-05-29 2020-12-03 Microsoft Technology Licensing, Llc Data security classification sampling and labeling
CN112487447A (en) * 2020-11-25 2021-03-12 平安信托有限责任公司 Data security processing method, device, equipment and storage medium
CN113489704A (en) * 2021-06-29 2021-10-08 平安信托有限责任公司 Sensitive data identification method and device based on flow, electronic equipment and medium
CN114048512A (en) * 2021-11-03 2022-02-15 深圳前海微众银行股份有限公司 Method and device for processing sensitive data

Also Published As

Publication number Publication date
CN114048512A (en) 2022-02-15

Similar Documents

Publication Publication Date Title
US20220391763A1 (en) Machine learning service
US9886670B2 (en) Feature processing recipes for machine learning
WO2018166113A1 (en) Random forest model training method, electronic apparatus and storage medium
EP2715565B1 (en) Dynamic rule reordering for message classification
WO2016069065A1 (en) Similarity search and malware prioritization
US20140365827A1 (en) Architecture for end-to-end testing of long-running, multi-stage asynchronous data processing services
CN112527649A (en) Test case generation method and device
WO2021068563A1 (en) Sample date processing method, device and computer equipment, and storage medium
US11570078B2 (en) Collecting route-based traffic metrics in a service-oriented system
WO2023077815A1 (en) Method and device for processing sensitive data
CN113282630A (en) Data query method and device based on interface switching
CN111581258A (en) Safety data analysis method, device, system, equipment and storage medium
CN111865576B (en) Method and device for synchronizing URL classification data
US20210056586A1 (en) Optimizing large scale data analysis
CN113760484A (en) Data processing method and device
CN113157911A (en) Service verification method and device
CN116108132B (en) Method and device for auditing text of short message
CN117272970B (en) Document generation method, device, equipment and storage medium
US11907658B2 (en) User-agent anomaly detection using sentence embedding
CN115529271A (en) Service request distribution method, device, equipment and medium
CN112819018A (en) Method and device for generating sample, electronic equipment and storage medium
CN113761182A (en) Method and device for determining service problem
CN117574186A (en) Session feature recognition method, device, equipment and storage medium
CN116841505A (en) Index generation method, device, computer equipment and storage medium
CN113362097A (en) User determination method and device

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22888851
Country of ref document: EP
Kind code of ref document: A1