WO2023077815A1 - Method and device for processing sensitive data - Google Patents

Method and device for processing sensitive data

Info

Publication number
WO2023077815A1
WO2023077815A1 PCT/CN2022/099611 CN2022099611W WO2023077815A1 WO 2023077815 A1 WO2023077815 A1 WO 2023077815A1 CN 2022099611 W CN2022099611 W CN 2022099611W WO 2023077815 A1 WO2023077815 A1 WO 2023077815A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
interface
total amount
sample
probability
Prior art date
Application number
PCT/CN2022/099611
Other languages
English (en)
Chinese (zh)
Inventor
彭永杰
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2023077815A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Definitions

  • Embodiments of the present invention relate to the field of financial technology (Fintech), and in particular, to a method and device for processing sensitive data.
  • Fintech: financial technology
  • Internet services have brought great convenience to people's lives, but at the same time, they have also brought many security problems.
  • Internet services provide various functional interfaces both internally and externally. If some interfaces involving sensitive data are compromised or their own problems lead to sensitive data leakage, it may cause huge security risks to users and enterprises. Therefore, in order to strengthen the governance, operation and protection of sensitive data, the risk identification and distribution flow of interface sensitive data assets become particularly important.
  • the present invention provides a method and device for processing sensitive data, which effectively reduce the impact that sudden increases or changes in data volume and application interface categories have on the identification of sensitive data types, and which quickly and simply sort out the sensitive types of the sampled data corresponding to each application interface.
  • the present invention provides a method for processing sensitive data.
  • the method includes: receiving sampled data sent by multiple application interfaces, performing hash processing on the feature information corresponding to each of the application interfaces, and determining the interface identifier of each application interface; based on the maximum total amount of data processed for the sampled data within the preset time period, the total amount of data at the current moment within the current cycle time period, and the preset processing conditions, determining the sample data corresponding to each of the interface identifiers; determining conversion data corresponding to each piece of data in each of the sample data, where the conversion data includes a field name and a corresponding value corresponding to the field name; and performing sensitive type identification on the corresponding values in each piece of conversion data, to obtain the sensitive types corresponding to all corresponding values in each piece of conversion data.
  • the interface identifier corresponding to each application interface that sends sampled data is calculated, and the system services corresponding to different application interfaces are distinguished based on the interface identifier. This reduces the influence of the different data volumes of different system services and, to a certain extent, the impact of data skew on the subsequent identification of sensitive types. Samples can also be used instead of the full data to sort out the sampled data sent by multiple application interfaces, which greatly reduces the amount of data to be processed, improves the speed of data processing, reduces the cost of manpower and machine resources, and thus improves the identification efficiency of sensitive data.
  • the preset processing condition is expressed in the following manner: $\sum_{x=1}^{X} K_{APP\_ID(x)} \le K_{MAX}$
  • $X$ is used to represent the number of types of the application interface
  • $K_{APP\_ID(x)}$ is used to represent the sample data volume of each type of application interface within the preset duration
  • $K_{MAX}$ is used to represent the maximum total amount of data processed for the sampled data within the preset duration.
  • the amount of sample data of each type of application interface and the number of application interface types are constrained, so that the sample data corresponding to each type of application interface is covered as much as possible, which effectively ensures a stable basis for the subsequent identification of sensitive data types.
  • the sample data identified by each interface is determined based on the maximum total amount of data processed for the sampled data within the preset time period, the total amount of data at the current moment in the current cycle time period, and the preset processing conditions , including: determining whether the current cycle duration is the cycle duration of the first determination of the sample data of each of the interface identifiers; when determining that the current cycle duration is the first determination of the cycle duration of the sample data of each of the interface identifiers, Determine the ratio of the total amount of interface data corresponding to any of the interface identifiers at the current moment in the current cycle duration to the total amount of data at the current moment in the current cycle duration; compare the ratio with the maximum total amount of processed data Multiply to obtain the total amount of initial data of the initial sample data of any one of the interface identifiers; determine the first interface of the first interface data corresponding to any one of the interface identifiers at any time after the current moment within the duration of the current cycle The total amount of data; when it is determined that the total
  • the sample data covers the data more comprehensively; that is, a relatively small amount of sample data with relatively comprehensive coverage is provided for the subsequent identification of the sensitive types of the sampling data sent by each application interface, which reduces the amount of data to be processed and thereby improves the processing speed of the sensitive data.
  • the method further includes: when it is determined that the total amount of data of the first interface identified by any of the interfaces is greater than the total amount of data of the initial sample data, determining that the first interface Each piece of data in the data is returned to the second probability in the corresponding array; based on the second probability and the data in the first interface data, obtain the second data in the corresponding array, the The second probability is different from the first probability; the second data is used as the sample data of any one of the interface identifiers to determine the sample data of each of the interface identifiers.
  • the sample data covers the data more comprehensively; that is, a relatively small amount of sample data with relatively comprehensive coverage is provided for the subsequent identification of the sensitive types of the sampling data sent by the application interface, which reduces the amount of data to be processed and thereby improves the processing speed of the sensitive data.
  • the sample data identified by each interface is determined based on the maximum total amount of data processed for the sampled data within the preset time period, the total amount of data at the current moment in the current cycle time period, and the preset processing conditions , including when it is determined that the current cycle duration is not the first time to determine the cycle duration of the sample data of each of the interface identifiers, and when it is determined that the historical sample data is stored in the array corresponding to any of the interface identifiers, for all Process the historical sample data to obtain the sample identification of each piece of historical sample data; determine the total amount of data of the historical sample data corresponding to any of the interface identifications, and the total amount of data corresponding to any of the sample identifications, and based on the The total amount of data of the historical sample data and the total amount of data corresponding to the sample identification, determine the weight coefficient corresponding to any sample identification; When the total amount of sample data is used, determine the third probability that each piece of data in the first interface data is returned to the corresponding array; based on the
  • performing sensitive type identification on corresponding values in each piece of converted data, and obtaining sensitive types corresponding to all corresponding values in each piece of converted data, includes: performing initial identification processing on all corresponding values in the converted data, and obtaining the total number of times that all corresponding values are identified and the number of times that they are identified as corresponding to each sensitive type; based on preset regular expressions or preset metadata keywords, performing identification matching on any corresponding value in each piece of converted data; and, when the match passes, verifying that corresponding value based on a preset algorithm.
  • the sensitivity type of the field corresponding to the corresponding value can be accurately determined.
  • the method further includes: when any corresponding value in each piece of converted data identifies a match and/or fails the check, accumulating the total times to obtain the second The total number of times; based on the second total number of times and the number of times of the sensitive type to which any corresponding value belongs, a second recognition rate is obtained; when it is determined that the second recognition rate is not less than the preset threshold value, then keep the The label corresponding to any corresponding value remains unchanged.
  • the present invention provides a device for processing sensitive data, the device comprising:
  • the first processing unit is configured to receive sampled data sent by multiple application interfaces, perform hash processing on feature information corresponding to each of the application interfaces, and determine an interface identifier of each of the application interfaces;
  • a determining unit configured to determine the sample data corresponding to each of the interface identifiers based on the maximum total amount of processed data for the sampled data within a preset time period, the total amount of data at the current moment within the current cycle time period, and preset processing conditions;
  • a second processing unit configured to determine conversion data corresponding to each piece of data in each of the sample data, where the conversion data includes a field name and a corresponding value corresponding to the field name;
  • the obtaining unit is configured to perform sensitivity type identification on corresponding values in each piece of converted data, and obtain sensitive types corresponding to all corresponding values in each piece of converted data.
  • the preset processing condition is expressed in the following manner: $\sum_{x=1}^{X} K_{APP\_ID(x)} \le K_{MAX}$
  • $X$ is used to represent the number of types of the application interface
  • $K_{APP\_ID(x)}$ is used to represent the sample data volume of each type of application interface within the preset duration
  • $K_{MAX}$ is used to represent the maximum total amount of data processed for the sampled data within the preset duration.
  • the determining unit is specifically configured to: determine whether the current cycle duration is the cycle duration for determining the sample data of each of the interface identifiers for the first time; When determining the cycle duration of the sample data of each of the interface identifiers, determine the total amount of interface data corresponding to any of the interface identifiers at the current moment within the current cycle duration, and the ratio of the total amount of data at the current moment within the current cycle duration Ratio; multiply the ratio by the total amount of maximum processed data to obtain the total amount of initial data of any initial sample data identified by the interface; determine any moment after the current moment within the duration of the current cycle, any A first interface data total amount of the first interface data corresponding to the interface identifier; when it is determined that the first interface data total amount is not greater than the corresponding initial data total amount, determine that each of the first interface data The piece of data is returned to the first probability in the corresponding array, and based on the first probability and the data in the first interface data, the first data in the corresponding array is obtained;
  • the determining unit is further configured to: when it is determined that the total amount of the first interface data of any one of the interface identifiers is greater than the total amount of data of the initial sample data, determine that the Each piece of data in the first interface data is returned to the second probability in the corresponding array; based on the second probability and the data in the first interface data, obtain the second data in the corresponding array , the second probability is different from the first probability; the second data is used as the sample data of any one of the interface identifiers to determine the sample data of each of the interface identifiers.
  • the determining unit is specifically configured to: determine that any interface When the historical sample data is stored in the array corresponding to the identifier, the historical sample data is processed to obtain a sample identifier of each piece of historical sample data; determine the total amount of historical sample data corresponding to any one of the interface identifiers, and the total amount of data corresponding to any of the sample identifiers, and based on the total amount of data of the historical sample data and the total amount of data corresponding to the sample identifier, determine the weight coefficient corresponding to any sample identifier; When the total amount of the first interface data identified by the interface is greater than the total amount of data of the initial sample data, determine the third probability that each piece of data in the first interface data is returned to the corresponding array; based on The third probability and the data in the first interface data, obtain the third data in the corresponding array, and use the third data as the sample data of any of the interface identifiers to determine each of the The sample data identified by the interface, the third probability is the product of the second
  • the obtaining unit is specifically configured to: perform initial identification processing on all corresponding values in the converted data, and obtain the total number of times that all corresponding values are identified and the number of times that they are identified as corresponding to each sensitive type; based on preset regular expressions or preset metadata keywords, identify and match any corresponding value in each piece of converted data, and when the match is passed, verify the corresponding value based on a preset algorithm; when the verification is passed, accumulate the total number of times and the number of times of the sensitive type to which the corresponding value belongs, to obtain the first total number and the first number; based on the first total number and the first number, obtain the first recognition rate, where the recognition rate is used to characterize the probability that the type of the corresponding value is a specific sensitive type; when it is determined that the first recognition rate is not less than the corresponding preset threshold value, add a label to the corresponding value, where the label is used to indicate that the type corresponding to the
  • the obtaining unit is further configured to: when any corresponding value in each piece of the converted data identifies a match and/or fails the verification, accumulate the total times to obtain The second total number of times; based on the second total number of times and the number of sensitive types to which any corresponding value belongs, a second recognition rate is obtained; when it is determined that the second recognition rate is not less than the preset threshold, then keep The label corresponding to any corresponding value remains unchanged.
  • the present invention provides a computer device, including a program or an instruction, and when the program or instruction is executed, is used to execute the above-mentioned first aspect and each optional method of the first aspect.
  • the present invention provides a storage medium, including a program or an instruction, and when the program or instruction is executed, is used to execute the above-mentioned first aspect and each optional method of the first aspect.
  • FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of steps of a method for processing sensitive data provided by an embodiment of the present invention
  • Fig. 3 is a schematic structural diagram of an apparatus for processing sensitive data provided by an embodiment of the present invention.
  • the sensitive types of the acquired data are generally identified directly, that is, the entire amount of data is identified and processed. In this way, not only the identification efficiency is low, but also more memory resources are consumed. And as the source and data volume of sensitive data increase, it is impossible to process the newly added sensitive data in an accurate and timely manner, that is, the overall processing efficiency of sensitive data is low.
  • the embodiment of the present invention provides a method for processing sensitive data.
  • the interface identifier corresponding to each application interface can be calculated, and the system services corresponding to different application interfaces can be distinguished based on the interface identifier. This reduces the impact of the different data volumes of different system services and, to a certain extent, the impact of data skew on the subsequent identification of sensitive types. Samples can also be used instead of the full data to sort out the sampled data sent by multiple application interfaces, which greatly reduces the amount of data to be processed, increases the speed of data processing, reduces the cost of manpower and machine resources, and thus improves the identification efficiency of sensitive data.
  • Refer to the schematic diagram of an application scenario shown in FIG. 1, which includes a computer device 101 and an application server 102; the computer device 101 can communicate with the application server 102.
  • the application server 102 includes an application server 102-1, an application server 102-2, . . . , and an application server 102-n, where n is a positive integer greater than 2.
  • the application server 102 can send data containing sensitive data to the computer device 101, so that the computer device 101 can process the received data, thereby obtaining the data type of the sensitive data in the received data, and realizing sorting out the sensitive data .
  • the computer device 101 may store the processing result of the received data in a corresponding database, and may also send the processing result of the received data to a data security platform deployed on other computer devices.
  • the computer device 101 can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms, but is not limited to these.
  • the application server 102 may be a server deployed in a distributed system.
  • Step 201 Receive sampling data sent by multiple application interfaces, perform hash processing on feature information corresponding to each application interface, and determine an interface identifier of each application interface.
  • the computer device may receive sampling data sent by multiple application interfaces.
  • the multiple application interfaces may all be of different types, or some may be of the same type and others of different types; this is not limited in the embodiments of the present invention.
  • the number of application interfaces may also change over time. For example, at 9:31 am on June 17, 2021, 4 application interfaces send sampling data to the computer device, and at 9:32 am on June 17, 2021, 8 application interfaces send sampling data to the computer device.
  • the computer device may determine the characteristic information of each application interface in the plurality of application interfaces, so as to determine the characteristic value corresponding to the characteristic information.
  • the method of determining the characteristic information may be determined based on the fact that multiple application interfaces carry their corresponding characteristic information when sending sampled data, or the computer device may send a request for acquiring characteristic information to the application server corresponding to the multiple application interfaces, Therefore, the characteristic information is obtained based on the feedback information of the corresponding application server, which is not limited in this embodiment of the present invention.
  • the feature information may include at least: the service ID corresponding to the application interface; the scene ID (for example, the ID of the update scene); the packet type of the data, such as synchronous or asynchronous; the system number of the requester; and the system number of the responder.
  • a hash operation may be performed on the feature value corresponding to each application interface, so as to obtain the interface identifier of each application interface. It should be noted that each interface identifier is unique, that is, the corresponding application interface can be determined based on the interface identifier.
  • the system services corresponding to different application interfaces are distinguished based on the interface identifier, so that the impact of different data volumes corresponding to different system services can be reduced, and the impact of data skew on subsequent data can be reduced to a certain extent.
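  • The hashing in Step 201 can be sketched as follows. This is only a minimal illustration: the text does not name a hash function, so SHA-256 is an assumption, as are the function name `interface_id`, the parameter names, and the choice of separator between the five feature fields.

```python
import hashlib


def interface_id(service_id, scene_id, packet_type, requester_sys, responder_sys):
    """Derive one interface identifier from the five feature fields.

    Assumption: SHA-256 over the joined fields; the source only says
    that a hash operation is applied to the feature value.
    """
    # Join with a unit separator that is assumed not to occur inside the
    # fields, so ("ab", "c") and ("a", "bc") give different hash inputs.
    features = "\x1f".join(
        [service_id, scene_id, packet_type, requester_sys, responder_sys]
    )
    return hashlib.sha256(features.encode("utf-8")).hexdigest()


# Hypothetical feature values, for illustration only.
id_a = interface_id("svc-001", "update", "sync", "1001", "2001")
id_b = interface_id("svc-001", "update", "async", "1001", "2001")
```

  • The same feature fields always yield the same identifier, so data from one application interface lands in one array, while changing any field (here the packet type) yields a different identifier.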
  • Step 202 Determine the sample data corresponding to each interface identifier based on the maximum total amount of data processed for the sampled data within the preset time period, the total amount of data at the current moment in the current cycle time period, and the preset processing conditions.
  • the computer device may periodically process the received sampling data based on a preset duration. For example, assuming that the preset duration is 1 minute, the received sampling data may be processed at a period of 1 minute. It should be noted that the preset duration may be determined based on actual implementation, which is not limited in this embodiment of the present invention.
  • the current cycle duration is the cycle duration for which the sample data identified by each interface is determined for the first time.
  • the following steps may be used, but not limited to, to determine the initial corresponding sample data of any interface identifier:
  • Step a determine the total amount of interface data corresponding to any interface identifier at the current moment in the current cycle duration, and the ratio of the total amount of data at the current moment in the current cycle duration;
  • Step b Multiply the ratio by the maximum total amount of data processed for the sampled data within a preset period of time to obtain the total amount of initial data of the initial sample data identified by any interface;
  • the maximum total amount of data processed for the sampled data within the preset duration is $K_{MAX}$, the total amount of data at the current moment in the current cycle duration is $N$, and the total amount of interface data of any application interface at the current moment in the current cycle duration is $N_{APP\_ID}$, so it can be determined that the total amount of initial data for each interface identifier is: $K_{APP\_ID} = K_{MAX} \times \dfrac{N_{APP\_ID}}{N}$
  • Step c determine the first interface data total amount of the first interface data corresponding to any interface identifier at any moment after the current moment in the current cycle duration;
  • Step d When it is determined that the total amount of the first interface data is not greater than the corresponding initial data amount, determine the first probability that each piece of data in the first interface data is returned to the corresponding array, and based on the first probability and the first For the data in the interface data, obtain the first data in the corresponding array;
  • Step e use the first data as the sample data of the interface identifier to determine the sample data of any interface identifier.
  • the first probability can be determined as:
  • x may represent the sequence identifier of the application interface, for example, if the sequence identifier of the first application interface is 1, then the interface identifier of the first application interface is APP_ID(1).
  • the computer device may obtain the first data in the corresponding array based on the first probability and the data in the first interface data. Then use the first data as the sample data of the interface identifier to determine the sample data of any interface identifier.
  • Step f When it is determined that the total amount of first interface data identified by any interface is greater than the total amount of data of the initial sample data, determine the second probability that each piece of data in the first interface data is returned to the corresponding array.
  • Step g based on the second probability and the data in the first interface data, obtain the second data in the corresponding array, the second probability is different from the first probability;
  • Step h use the second data as the sample data of any interface identifier to determine the sample data of each interface identifier.
  • the second probability can be determined as: Specifically, if the current data starts with The probability is taken out, then continue to The probability of replacing the existing data in the corresponding array, otherwise the array data remains unchanged. Therefore, the probability of retaining the current data is
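  • Steps c to h resemble per-interface reservoir sampling, and can be sketched as below. The formulas for the first and second probabilities are missing from this text, so the sketch assumes the standard reservoir values, which agree with the worked figures of 100/101 and 1/100 given later in the description: a record is always stored while the array is not yet full, and afterwards it is accepted with probability capacity/seen and overwrites a uniformly chosen slot. The class and attribute names are hypothetical.

```python
import random


class InterfaceReservoir:
    """Fixed-size sample array for one interface identifier (sketch)."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity          # initial total amount for this interface
        self.seen = 0                     # records received so far in the cycle
        self.samples = []                 # the "corresponding array"
        self.rng = rng or random.Random()

    def offer(self, record):
        """Offer one record; return True if it was stored in the array."""
        self.seen += 1
        if self.seen <= self.capacity:
            # Assumed first probability: 1 while the array is not full.
            self.samples.append(record)
            return True
        # Assumed second probability: capacity / seen.
        if self.rng.random() < self.capacity / self.seen:
            # Each existing slot is replaced with probability 1 / capacity.
            self.samples[self.rng.randrange(self.capacity)] = record
            return True
        return False


reservoir = InterfaceReservoir(capacity=10, rng=random.Random(0))
for i in range(1000):
    reservoir.offer(i)
```

  • Under these assumptions the array never grows beyond its capacity, and every record received in the cycle has the same chance of being in the final sample.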
  • the aforementioned solution for determining the sample data corresponding to the interface identifier needs to meet the preset processing conditions.
  • the preset processing conditions can be expressed in the following manner: $\sum_{x=1}^{X} K_{APP\_ID(x)} \le K_{MAX}$
  • $X$ is used to represent the number of types of application interfaces
  • $K_{APP\_ID(x)}$ is used to represent the sample data volume of each type of application interface within a preset time period
  • $K_{MAX}$ is used to represent the maximum total amount of data processed for the sampled data within a preset time period.
  • For example, the sum of $K_{APP\_ID(1)}$, $K_{APP\_ID(2)}$ and $K_{APP\_ID(3)}$ is not greater than $K_{MAX}$.
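  • Steps a and b, together with the preset processing condition, can be illustrated with a short sketch. It assumes that each interface's initial quota is its share of the current traffic multiplied by the maximum total, rounded up as the worked example in the description suggests; the function names `initial_quotas` and `satisfies_condition` and the sample counts are hypothetical.

```python
def initial_quotas(k_max, counts):
    """Split k_max across interfaces in proportion to their record counts.

    counts maps an interface identifier to the number of records it has
    sent so far in the current cycle. Quotas are rounded up.
    """
    total = sum(counts.values())
    if total == 0:
        return {app_id: 0 for app_id in counts}
    # -(-a // b) is integer ceiling division, avoiding float rounding.
    return {app_id: -(-k_max * n // total) for app_id, n in counts.items()}


def satisfies_condition(quotas, k_max):
    """Check that the per-interface sample volumes sum to at most k_max."""
    return sum(quotas.values()) <= k_max


quotas = initial_quotas(100, {"A": 50, "B": 30, "C": 20})
# quotas == {"A": 50, "B": 30, "C": 20}
```

  • Rounding up can push the sum above the maximum when the shares do not divide evenly, so checking the condition after allocation, as `satisfies_condition` does, seems prudent in a sketch like this.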
  • application interface x is an application interface with confirmed sample data
  • the data probability of the subsequent feedback of the application interface x is: when the total amount of initial data corresponding to the application interface x is not greater than the total amount of interface data after the reduction, then based on Return data to the array; when the total amount of initial data corresponding to the application interface x is greater than the total amount of interface data after reduction, based on to return data to an array.
  • the solution for determining the sample data of each interface identifier may include but not limited to the following steps:
  • Step A When historical sample data is stored in the array corresponding to any interface identifier, process the historical sample data to obtain the sample identifier of each piece of historical sample data;
  • Step B Determine the total amount of historical sample data corresponding to any interface identifier, and the total amount of data corresponding to any sample identifier, and based on the total amount of historical sample data and the total amount of data corresponding to the sample identifier, determine each The weight coefficient corresponding to the sample ID;
  • Step C When it is determined that the total amount of first interface data identified by any interface is greater than the total amount of data in the initial sample data, determine the third probability that each piece of data in the first interface data is returned to the corresponding array;
  • Step D Obtain the third data in the corresponding array based on the third probability and the data in the first interface data, and use the third data as the sample data of any interface identifier to determine the sample data of each interface identifier, where the third probability is the product of the second probability and the weight coefficient.
  • the APP_ID is calculated according to the characteristic information or attribute value of the application interface, whereas the identification of the sensitive type of data must be performed on each piece of data sent by the application interface. Therefore, when determining the sample data corresponding to the current interface, the probability that data of a type already selected as a sample is selected again should be reduced, so as to minimize data skew and improve the coverage of the sample data.
  • the data content of each piece of sample data in the historical sample data can be analyzed to obtain attributes such as the parameter list P and message length L of the message content, and the unique identifier of the message content can be calculated through a hash algorithm.
  • the third probability can be determined as: Specifically, if the current data starts with The probability is taken out, then continue to The probability of replacing the existing elements in the corresponding array, otherwise the array elements remain unchanged. Therefore, the probability of retaining the current data is
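  • Steps A to D can be sketched as follows. The text states only that the sample identifier is a hash over attributes such as the parameter list P and message length L, and that the third probability is the product of the second probability and the weight coefficient. The weight formula itself is not given here, so the one below (one minus the share of historical samples with the same identifier) is purely an assumption chosen to lower the chance of re-sampling an already common message shape. All function names are hypothetical.

```python
import hashlib
from collections import Counter


def sample_id(param_names, message_length):
    """Hash the parameter list P and message length L into a sample identifier.

    Assumption: parameter order is ignored and SHA-256 is the hash.
    """
    key = ",".join(sorted(param_names)) + "|" + str(message_length)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def weight_coefficient(sid, history_ids):
    """ASSUMED weight: 1 minus the share of history with the same sample id."""
    if not history_ids:
        return 1.0
    counts = Counter(history_ids)
    return 1.0 - counts[sid] / len(history_ids)


def third_probability(second_probability, weight):
    """Third probability = second probability x weight coefficient (per Step D)."""
    return second_probability * weight


a = sample_id(["name", "card_no"], 128)
b = sample_id(["phone"], 64)
c = sample_id(["email"], 32)
history = [a, a, b, c]
```

  • With this assumed weight, a message shape that already makes up half of the historical samples has its acceptance probability halved, while a shape not yet seen keeps the unmodified second probability.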
  • the unit time such as 1 minute
  • the current cycle duration is the cycle duration in which the sample data of application interface A is determined for the first time; it is assumed that the maximum total amount of processed data is 100 pieces of data and that the total amount of data corresponding to application interface A is 0 pieces of data.
  • If the total amount of interface data (101 pieces) is greater than the initial total amount of data of application interface A (100 pieces), then the 101st piece of data is kept in the array with a probability of 100/101, and each of the original 100 pieces of data in the array is selected for replacement with a probability of 1/100.
  • the final quantity can be determined by rounding up. It can be seen that the total amount of the first interface data of application interface B, i.e. 1 piece, is not greater than the corresponding initial data amount of 1 piece, so the first piece of data of application interface B can be returned to the corresponding array.
  • the above-mentioned method, which is based on an improved reservoir sampling method, can perform strongly random sampling of streaming data, making the sample data coverage more comprehensive and more adaptable to changes in data sources, and improving the effectiveness and stability of sensitive data identification and sorting.
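The sampling step described above can be sketched in Python as a single reservoir update. The function `reservoir_offer`, its parameters, and the `weight` callback are hypothetical names introduced for illustration. The text does not give the formula for the weight coefficient, so the caller supplies it; the default of 1.0 reduces the sketch to plain reservoir sampling.

```python
import random
from typing import Callable, List, Optional


def reservoir_offer(
    reservoir: List[str],
    capacity: int,
    seen_count: int,
    item: str,
    weight: Callable[[str], float] = lambda _item: 1.0,
    rng: Optional[random.Random] = None,
) -> bool:
    """Offer the seen_count-th item (1-based) to a reservoir of fixed capacity.

    Returns True if the item ends up stored in the reservoir.
    """
    rng = rng or random.Random()
    if len(reservoir) < capacity:
        # Free slots remain, so the item is always stored.
        reservoir.append(item)
        return True
    # Keep the item with probability capacity / seen_count, scaled by the
    # weight coefficient (assumed to lie in [0, 1]; its formula is not given).
    keep_probability = (capacity / seen_count) * weight(item)
    if rng.random() < keep_probability:
        # The replaced slot is chosen uniformly, i.e. 1 / capacity per element.
        reservoir[rng.randrange(len(reservoir))] = item
        return True
    return False
```

With a capacity of 100 and a weight of 1.0, the 101st item is stored with probability 100/101, which matches the example given for application interface A. A weight below 1.0 for message shapes that are already well represented lowers that probability, which is the stated purpose of the weight coefficient.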
  • Step 203 Determine the conversion data corresponding to each piece of data in each sample data, the conversion data includes field names and corresponding values corresponding to the field names.
  • the computer device can analyze and process each piece of data in each sample data.
  • message formats such as JSON and XML can be converted into KEY-VALUE pairs, that is, into conversion data comprising field names and their corresponding values.
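A minimal Python sketch of this conversion for JSON messages is shown below. The function name `to_key_values` and the dotted or indexed naming of nested fields (for example `user.phone` or `cards[0]`) are assumptions; the text does not specify how nested structures are named.

```python
import json
from typing import Any, Dict


def to_key_values(message: str) -> Dict[str, Any]:
    """Flatten a JSON message into field-name -> value pairs."""
    flat: Dict[str, Any] = {}

    def walk(node: Any, prefix: str) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{prefix}.{key}" if prefix else key)
        elif isinstance(node, list):
            for index, value in enumerate(node):
                walk(value, f"{prefix}[{index}]")
        else:
            # Leaf value: record it under the accumulated field name.
            flat[prefix] = node

    walk(json.loads(message), "")
    return flat
```

Each resulting pair can then be passed to sensitive type identification independently of the original message structure.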
  • Step 204 Perform sensitive type identification on corresponding values in each piece of converted data, and obtain sensitive types corresponding to all corresponding values in each piece of converted data.
  • the computer device may perform initial identification processing on the corresponding values in all the converted data, and obtain the total number of times that all corresponding values are identified and the number of times that all corresponding values are identified as corresponding to each sensitive type.
  • the computer device can identify the sensitive type of each piece of data based on the identification strategy.
  • the identification strategy is based either on metadata keyword matching followed by preset algorithm verification, or on preset regular expression matching followed by preset algorithm verification.
  • the computer device can identify and match any corresponding value in each piece of converted data based on a preset regular expression or a preset metadata keyword.
  • the preset regular expression may be a VALUE regular expression
  • the preset metadata keyword may be correspondingly determined based on an actual implementation situation, which is not limited in this embodiment of the present invention.
  • When the matching is passed, the corresponding value is verified based on the preset algorithm.
  • When the verification is passed, the total number of times and the number of times of the sensitive type to which the corresponding value belongs are accumulated to obtain the first total number and the first number.
  • the preset algorithm may be the VALUE algorithm, and of course, other algorithms may also be used, which is not limited in this embodiment of the present invention.
  • the first recognition rate can be obtained based on the first total number and the first number, where the recognition rate characterizes the probability that the type of any corresponding value is a specific sensitive type. When the first recognition rate is determined to be not less than the corresponding preset threshold, a label is added to the corresponding value, and the label indicates that the type of that value is the specific sensitive type.
  • $N_{APP\_ID\_FIELD}$ denotes the total number of times that all corresponding values are identified.
  • $N_X$ denotes the number of times that the corresponding value is identified as sensitive type $X$,
  • where $X$ is a sensitive label such as document number (ID), mobile phone number (PHONE), or bank card number (BANK).
  • the preset threshold corresponding to any corresponding value is denoted $R_{ERROR}$.
  • the field name corresponding to any corresponding value is denoted $F$.
  • When the algorithm check is passed, one is added to the values of $N_{APP\_ID\_FIELD(F)}$ and $N_{BANK(F)}$ to obtain the first total number $N'_{APP\_ID\_FIELD(F)}$ and the first number $N'_{BANK(F)}$, from which the first recognition rate, the fraction of identifications of field $F$ recognized as BANK, can be determined:
  • $R_{S(BANK)} = N'_{BANK(F)} / N'_{APP\_ID\_FIELD(F)}$.
  • If $R_{S(BANK)}$ is not less than $R_{ERROR}$, the BANK label is added to field $F$, and the application interface corresponding to the value is determined to be a sensitive interface "involving bank card numbers". If $R_{S(BANK)}$ is less than $R_{ERROR}$, no BANK label is added to field $F$, and if the field already has a BANK label, it is cleared.
  • the preset algorithm for verifying the bank card number may be a modulo 10 algorithm, of course, it may also be other algorithms, which are not limited in the embodiment of the present invention. It can be seen that different preset algorithms can be used for different specific sensitive types.
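The two-stage check for bank card numbers can be sketched in Python as follows, assuming that the modulo 10 algorithm refers to the widely used Luhn checksum; the text does not name it, so this is an assumption. The pattern `BANK_VALUE_PATTERN` and the function names are likewise hypothetical, and the 13 to 19 digit range is only a common convention for card number lengths.

```python
import re

# Hypothetical VALUE pattern: 13 to 19 digits, a common length range for card numbers.
BANK_VALUE_PATTERN = re.compile(r"\d{13,19}")


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn (mod 10) checksum."""
    total = 0
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:
            # Double every second digit from the right; subtract 9 if above 9.
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def looks_like_bank_card(value: str) -> bool:
    """Regular expression match first, then checksum verification."""
    return bool(BANK_VALUE_PATTERN.fullmatch(value)) and luhn_valid(value)
```

Running the cheap pattern match before the checksum mirrors the order in the text, where a value is verified by the preset algorithm only after matching has passed.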
  • When the computer device determines that any corresponding value in a piece of converted data fails the identification matching and/or fails the verification, the total number of times is accumulated to obtain the second total number of times, and a second recognition rate can then be obtained based on the second total number of times and the number of times of the sensitive type to which the corresponding value belongs.
  • If any field has multiple meanings, that is, it cannot pass the verification and has no previous label, a prompt is output so that the user of the computer device can manually mark the field and thereby determine its label.
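The counting and labelling rules described above can be collected into one hedged Python sketch. The class `FieldStats`, the function `update_labels`, and the returned status strings are hypothetical. The recognition rate is computed as the share of identifications that matched the sensitive type, in line with its description as a probability. Clearing a label when a failed check drops the rate below the threshold is an assumption, since the text only states that the label is kept when the second recognition rate is not less than the threshold.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class FieldStats:
    """Per-field counters: total identifications and hits per sensitive type."""
    total: int = 0
    hits: Dict[str, int] = field(default_factory=dict)
    labels: Set[str] = field(default_factory=set)


def update_labels(stats: FieldStats, sensitive_type: str, passed: bool, threshold: float) -> str:
    """Record one identification of a field and update its label.

    Returns "labeled", "kept", "cleared", or "needs_manual_label".
    """
    stats.total += 1
    if passed:
        stats.hits[sensitive_type] = stats.hits.get(sensitive_type, 0) + 1
    rate = stats.hits.get(sensitive_type, 0) / stats.total
    if not passed and sensitive_type not in stats.labels:
        # Failed verification and no previous label: ask the user to mark the field.
        return "needs_manual_label"
    if rate >= threshold:
        if passed:
            stats.labels.add(sensitive_type)
            return "labeled"
        return "kept"
    # Rate below threshold: no label is added and any existing label is cleared.
    stats.labels.discard(sensitive_type)
    return "cleared"
```

Because the counters persist across calls, an occasional failed check on a field that usually holds card numbers leaves its label in place, while a sustained run of failures eventually removes it.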
  • the present invention provides a device for processing sensitive data, the device includes a first processing unit 301, a determination unit 302, a second processing unit 303 and an obtaining unit 304, wherein:
  • the first processing unit 301 is configured to receive sampled data sent by multiple application interfaces, perform hash processing on feature information corresponding to each of the application interfaces, and determine an interface identifier of each of the application interfaces;
  • the determination unit 302 is configured to determine the sample data corresponding to each of the interface identifiers based on the maximum total amount of processed data for the sampled data within a preset time period, the total amount of data at the current moment within the current cycle time period, and preset processing conditions;
  • the second processing unit 303 is configured to determine the conversion data corresponding to each piece of data in each of the sample data, the conversion data includes a field name and a corresponding value corresponding to the field name;
  • the obtaining unit 304 is configured to perform sensitive type identification on corresponding values in each piece of converted data, and obtain sensitive types corresponding to all corresponding values in each piece of converted data.
  • the preset processing condition is expressed based on the following manner:
  • $I$ denotes the number of types of application interface,
  • $K_{App\_ID}$ denotes the sample data volume of each type of application interface within the preset time length, and
  • $K_{MAX}$ denotes the maximum total amount of data processed for the sampled data within the preset time length.
  • the determining unit 302 is specifically configured to: determine whether the current cycle duration is the cycle duration in which the sample data of each of the interface identifiers is determined for the first time; when it is, determine the ratio of the total amount of interface data corresponding to any of the interface identifiers at the current moment in the current cycle duration to the total amount of data at the current moment in the current cycle duration; multiply the ratio by the maximum total amount of processed data to obtain the total amount of initial data of the initial sample data of that interface identifier, and store the initial sample data in a corresponding array; determine, at any time after the current moment within the current cycle duration, the total amount of first interface data corresponding to any one of the interface identifiers; and when it is determined that the total amount of any of the first interface data is not greater than the corresponding total amount of initial data, determine the first probability that each piece of data in the first interface data is returned to the corresponding array
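The initial allocation performed by the determining unit can be illustrated with a short Python sketch: each interface receives a share of the maximum total amount of processed data in proportion to its traffic at the current moment, rounded up. The function name `initial_sample_sizes` and the dictionary-based interface are assumptions made for the example.

```python
from typing import Dict


def initial_sample_sizes(interface_counts: Dict[str, int], k_max: int) -> Dict[str, int]:
    """Split k_max sample slots across interfaces in proportion to their traffic."""
    total = sum(interface_counts.values())
    if total == 0:
        # No traffic observed yet: every interface starts with zero slots.
        return {app_id: 0 for app_id in interface_counts}
    return {
        # Integer ceiling of count / total * k_max, avoiding float rounding.
        app_id: (count * k_max + total - 1) // total
        for app_id, count in interface_counts.items()
    }
```

For example, `initial_sample_sizes({"A": 990, "B": 10}, 100)` gives 99 slots to A and 1 slot to B. Rounding up guarantees that a low-traffic interface still receives at least one slot, but it can make the sum slightly exceed `k_max`, so a real implementation may need an additional trimming step.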
  • the determining unit 302 is further configured to: when it is determined that the total amount of the first interface data of any one of the interface identifiers is greater than the total amount of data of the initial sample data, determine the second probability that each piece of data in the first interface data is returned to the corresponding array, the second probability being different from the first probability; based on the second probability and the data in the first interface data, obtain the second data in the corresponding array; and use the second data as sample data of that interface identifier to determine the sample data of each of the interface identifiers.
  • the determining unit 302 is specifically configured to: when it is determined that historical sample data is stored in the array corresponding to any one of the interface identifiers, process the historical sample data to obtain the sample identifier of each piece of historical sample data; determine the total amount of historical sample data corresponding to that interface identifier and the total amount of data corresponding to any of the sample identifiers, and, based on these two totals, determine the weight coefficient corresponding to each sample identifier; when it is determined that the total amount of the first interface data of that interface identifier is greater than the total amount of data of the initial sample data, determine a third probability that each piece of data in the first interface data is returned to the corresponding array; and, based on the third probability and the data in the first interface data, obtain the third data in the corresponding array and use the third data as the sample data of that interface identifier to determine the sample data of each of the interface identifiers, where the third probability is the product of the second probability and the weight coefficient.
  • the obtaining unit 304 is specifically configured to: perform initial identification processing on all corresponding values in the converted data, and obtain the total number of times that all corresponding values are identified and the number of times that all corresponding values are identified as corresponding to each sensitive type; based on preset regular expressions or preset metadata keywords, identify and match any corresponding value in each piece of converted data; when the match is passed, verify the corresponding value based on the preset algorithm; when the verification is passed, accumulate the total number of times and the number of times of the sensitive type to which the corresponding value belongs to obtain the first total number and the first number; based on the first total number and the first number, obtain a first recognition rate, where the recognition rate characterizes the probability that the type of any corresponding value is a specific sensitive type; and when it is determined that the first recognition rate is not less than a corresponding preset threshold, add a label to the corresponding value, the label indicating that the type of the corresponding value is the specific sensitive type.
  • the obtaining unit 304 is further configured to: when any corresponding value in a piece of converted data fails the identification matching and/or fails the verification, accumulate the total number of times to obtain a second total number of times; obtain a second recognition rate based on the second total number of times and the number of times of the sensitive type to which the corresponding value belongs; and when it is determined that the second recognition rate is not less than the preset threshold, keep the label corresponding to the corresponding value unchanged.
  • An embodiment of the present invention provides a computer device, including a program or an instruction.
  • When the program or instruction is executed, it is used to execute the method for processing sensitive data and any optional method provided in the embodiments of the present invention.
  • An embodiment of the present invention provides a storage medium, including a program or an instruction.
  • When the program or instruction is executed, it is used to execute the method for processing sensitive data and any optional method provided in the embodiments of the present invention.
  • the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instruction means implementing the function specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Disclosed in the present invention are a method and device for processing sensitive data. The method comprises: receiving sampled data sent by a plurality of application interfaces, performing hash processing on feature information corresponding to each application interface, and determining an interface identifier of each application interface; determining, based on the maximum total amount of processed data for the sampled data within a preset duration, the total amount of data at the current moment within the current cycle duration, and a preset processing condition, sample data corresponding to each interface identifier; determining conversion data corresponding to each piece of data in each piece of sample data, the conversion data comprising field names and corresponding values for the field names; and performing sensitive type identification on the corresponding values in each piece of conversion data to obtain the sensitive types corresponding to all corresponding values in each piece of conversion data. The method can effectively reduce the impact of a sudden increase or change in data volume and in application interface types on the sensitive type identification of data, and can quickly and simply complete the sensitive type sorting of the sampled data corresponding to each application interface.
PCT/CN2022/099611 2021-11-03 2022-06-17 Method and device for processing sensitive data WO2023077815A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111294701.7 2021-11-03
CN202111294701.7A CN114048512B (zh) 2021-11-03 2021-11-03 Method and device for processing sensitive data

Publications (1)

Publication Number Publication Date
WO2023077815A1 true WO2023077815A1 (fr) 2023-05-11

Family

ID=80207059

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099611 WO2023077815A1 (fr) 2021-11-03 2022-06-17 Method and device for processing sensitive data

Country Status (2)

Country Link
CN (1) CN114048512B (fr)
WO (1) WO2023077815A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114048512B (zh) * 2021-11-03 2024-06-21 深圳前海微众银行股份有限公司 Method and device for processing sensitive data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10275396B1 (en) * 2014-09-23 2019-04-30 Symantec Corporation Techniques for data classification based on sensitive data
CN110222170A (zh) * 2019-04-25 2019-09-10 平安科技(深圳)有限公司 Method, apparatus, storage medium and computer device for identifying sensitive data
US20200380160A1 (en) * 2019-05-29 2020-12-03 Microsoft Technology Licensing, Llc Data security classification sampling and labeling
CN112487447A (zh) * 2020-11-25 2021-03-12 平安信托有限责任公司 Data security processing method, apparatus, device and storage medium
CN113489704A (zh) * 2021-06-29 2021-10-08 平安信托有限责任公司 Traffic-based sensitive data identification method, apparatus, electronic device and medium
CN114048512A (zh) * 2021-11-03 2022-02-15 深圳前海微众银行股份有限公司 Method and device for processing sensitive data

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11449635B2 (en) * 2018-05-16 2022-09-20 Microsoft Technology Licensing, Llc. Rule-based document scrubbing of sensitive data
US11941135B2 (en) * 2019-08-23 2024-03-26 International Business Machines Corporation Automated sensitive data classification in computerized databases
CN112528315A (zh) * 2019-09-19 2021-03-19 华为技术有限公司 Method and apparatus for identifying sensitive data

Also Published As

Publication number Publication date
CN114048512B (zh) 2024-06-21
CN114048512A (zh) 2022-02-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22888851

Country of ref document: EP

Kind code of ref document: A1