CN113642326A - Sensitive data recognition model training method, sensitive data recognition method and system - Google Patents


Publication number
CN113642326A
Authority
CN
China
Prior art keywords
sensitive data
identification
data
model
sensitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110935771.XA
Other languages
Chinese (zh)
Inventor
吕丹
洪俊鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Hongshu Technology Co ltd
Original Assignee
Guangdong Hongshu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Hongshu Technology Co ltd
Priority to CN202110935771.XA
Publication of CN113642326A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention relates to a sensitive data identification model training method, a sensitive data identification method, and a sensitive data identification system. With the corpus data or sample data stored in a database, a server node performs sensitive data identification according to the sensitive data identification model, while an AI sensitive data discovery server dispatches corpus data, stop words, or sample data between the sensitive data identification model and the server node, obtains the identification result of the server node, and trains the sensitive data identification model. Because identification is performed by a model obtained through multiple rounds of training, the identification rate for non-standard sensitive data improves continuously, which addresses the low identification rate of traditional techniques on such data. In addition, distributed deployment of the server nodes scales sensitive data identification horizontally into a multi-node cluster, which accommodates sudden increases in identification requests, achieves load balancing, and keeps sensitive data identification stable.

Description

Sensitive data recognition model training method, sensitive data recognition method and system
Technical Field
The invention relates to the technical field of data security, in particular to a sensitive data recognition model training method, a sensitive data recognition method and a sensitive data recognition system.
Background
Sensitive data discovery is a data security management technology that grew out of privacy protection requirements and industry regulations. It finds sensitive data comprehensively, quickly, and accurately from the characteristics of an enterprise's business data, builds a continuously updated catalogue of enterprise data assets, and provides a foundation for data security work.
Traditional sensitive data discovery identifies and locates sensitive data through technical means such as regular expression matching, keyword code table mapping, data type definition discrimination, and data feature calculation. These means discover sensitive data accurately only when data quality is high. In some enterprises the data acquisition process is not standardized, so data quality is poor: for example, a customer address field may contain special characters, lack key identifying information such as province, city, and district, or hold non-address data. On such data the traditional means have low identification accuracy and cannot meet the enterprise's accuracy requirements for sensitive data discovery. Heavy manual intervention raises the enterprise's production cost, and omissions in visual inspection can indirectly leak users' private data.
Meanwhile, because traditional sensitive data discovery services are deployed on a single node, a sudden increase in user requests easily causes a single point of failure from which automatic recovery is difficult, affecting the enterprise's daily business.
In summary, traditional sensitive data discovery services have the disadvantages described above.
Disclosure of Invention
Therefore, it is necessary to provide a sensitive data recognition model training method, a sensitive data recognition method and a sensitive data recognition system for overcoming the defects of the conventional sensitive data discovery service.
A sensitive data recognition model training method comprises the following steps:
obtaining corpus data and stop words;
preprocessing the corpus data and stop words to obtain a preprocessing result;
and performing multiple rounds of model training according to the preprocessing result to obtain a sensitive data recognition model.
According to the sensitive data recognition model training method, after the corpus data and the stop words are obtained, the corpus data and the stop words are preprocessed to obtain a preprocessing result, and multiple times of model training are carried out according to the preprocessing result to obtain the sensitive data recognition model. Based on the method, the recognition rate of the non-standard sensitive data is continuously improved through multiple times of model training, so that the problem that the recognition rate of the traditional technology to the non-standard sensitive data is low is solved.
In one embodiment, the process of preprocessing the corpus data and stop words to obtain the preprocessing result includes the following steps:
and encapsulating the corpus data and the stop words as parameters to obtain the preprocessing result.
In one embodiment, the process of encapsulating the corpus data and the stop words as parameters to obtain the preprocessing result includes the steps of:
performing word segmentation processing on the corpus data to obtain a word segmentation list;
removing stop words of the word segmentation list to obtain a targeted word segmentation list;
and packaging the targeted word segmentation list into vectorized parameters as a preprocessing result.
In one embodiment, the sensitive data recognition model is a Doc2Vec model.
In one embodiment, the process of performing multiple model training according to the preprocessing result includes the steps of:
and performing more than 10 times of model training according to the preprocessing result.
A sensitive data recognition model training apparatus, comprising:
the first acquisition module is used for acquiring corpus data and stop words;
the first preprocessing module is used for preprocessing the corpus data and stop words to obtain a preprocessing result;
and the data training module is used for carrying out model training for multiple times according to the preprocessing result so as to obtain a sensitive data recognition model.
After the corpus data and the stop words are obtained, the sensitive data recognition model training device preprocesses the corpus data and the stop words to obtain a preprocessing result, and performs multiple times of model training according to the preprocessing result to obtain the sensitive data recognition model. Based on the method, the recognition rate of the non-standard sensitive data is continuously improved through multiple times of model training, so that the problem that the recognition rate of the traditional technology to the non-standard sensitive data is low is solved.
A computer storage medium having stored thereon computer instructions which, when executed by a processor, implement the sensitive data recognition model training method of any of the above embodiments.
After the corpus data and the stop words are obtained, the computer storage medium preprocesses the corpus data and the stop words to obtain a preprocessing result, and performs multiple times of model training according to the preprocessing result to obtain a sensitive data recognition model. Based on the method, the recognition rate of the non-standard sensitive data is continuously improved through multiple times of model training, so that the problem that the recognition rate of the traditional technology to the non-standard sensitive data is low is solved.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the sensitive data recognition model training method of any of the above embodiments when executing the program.
After the corpus data and the stop words are obtained, the computer device preprocesses the corpus data and the stop words to obtain a preprocessing result, and performs multiple times of model training according to the preprocessing result to obtain a sensitive data recognition model. Based on the method, the recognition rate of the non-standard sensitive data is continuously improved through multiple times of model training, so that the problem that the recognition rate of the traditional technology to the non-standard sensitive data is low is solved.
A sensitive data identification method, comprising the steps of:
acquiring sample data to be identified;
preprocessing sample data to obtain a sample processing result;
and loading the sample processing result into the sensitive data identification model to obtain the identification rate which is output by the sensitive data identification model and is used as the identification result.
According to the sensitive data identification method, after the sample data to be identified is obtained, the sample data is preprocessed to obtain a sample processing result, and finally the sample processing result is loaded into the sensitive data identification model to obtain the identification rate which is output by the sensitive data identification model and serves as the identification result. Based on the method, the sensitive data identification is carried out on the sensitive data identification model obtained through multiple times of training, the identification rate of the non-standard sensitive data is continuously improved, and the problem that the identification rate of the traditional technology to the non-standard sensitive data is low is solved.
In one embodiment, the process of preprocessing sample data to obtain a sample processing result includes the steps of:
and encapsulating the sample data as parameters to obtain the sample processing result.
In one embodiment, the process of loading the sample processing result into the sensitive data recognition model and obtaining the recognition rate output by the sensitive data recognition model as the recognition result includes the following steps:
and loading the sample processing result into the sensitive data identification model, outputting the identification rate as an identification result when the identification rate output by the sensitive data identification model is greater than the sensitive type threshold, and otherwise, controlling the sensitive data identification model to repeat model operation.
A sensitive data identification device comprising:
the second acquisition module is used for acquiring sample data to be identified;
the second preprocessing module is used for preprocessing the sample data to obtain a sample processing result;
and the model identification module is used for loading the sample processing result into the sensitive data identification model and acquiring the identification rate which is output by the sensitive data identification model and is used as the identification result.
According to the sensitive data identification device, after the sample data to be identified is obtained, the sample data is preprocessed to obtain a sample processing result, and finally the sample processing result is loaded into the sensitive data identification model to obtain the identification rate which is output by the sensitive data identification model and serves as the identification result. Based on the method, the sensitive data identification is carried out on the sensitive data identification model obtained through multiple times of training, the identification rate of the non-standard sensitive data is continuously improved, and the problem that the identification rate of the traditional technology to the non-standard sensitive data is low is solved.
A computer storage medium having stored thereon computer instructions which, when executed by a processor, implement the sensitive data identification method of any of the above embodiments.
After the computer storage medium obtains the sample data to be identified, the sample data is preprocessed to obtain a sample processing result, and finally the sample processing result is loaded into the sensitive data identification model to obtain the identification rate which is output by the sensitive data identification model and serves as the identification result. Based on the method, the sensitive data identification is carried out on the sensitive data identification model obtained through multiple times of training, the identification rate of the non-standard sensitive data is continuously improved, and the problem that the identification rate of the traditional technology to the non-standard sensitive data is low is solved.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the sensitive data identification method of any of the above embodiments when executing the program.
After the computer equipment obtains the sample data to be identified, the sample data is preprocessed to obtain a sample processing result, and finally the sample processing result is loaded into the sensitive data identification model to obtain the identification rate which is output by the sensitive data identification model and serves as the identification result. Based on the method, the sensitive data identification is carried out on the sensitive data identification model obtained through multiple times of training, the identification rate of the non-standard sensitive data is continuously improved, and the problem that the identification rate of the traditional technology to the non-standard sensitive data is low is solved.
A sensitive data identification system comprising:
the server node is used for completing sensitive data identification according to the sensitive data identification model;
and the AI sensitive data discovery server is used for dispatching corpus data, stop words, or sample data between the sensitive data recognition model and the server node, obtaining the recognition result of the server node, and training the sensitive data recognition model.
The sensitive data identification system comprises a server node and an AI sensitive data discovery server. The server node completes sensitive data identification according to the sensitive data identification model. The AI sensitive data discovery server dispatches corpus data, stop words, or sample data among the database, the sensitive data identification model, and the server node, obtains the identification result of the server node, and trains the sensitive data identification model. Because identification is performed by a model obtained through multiple rounds of training, the identification rate for non-standard sensitive data improves continuously, which addresses the low identification rate of traditional techniques on such data. In addition, distributed deployment of the server nodes scales sensitive data identification horizontally into a multi-node cluster, which accommodates sudden increases in identification requests, achieves load balancing, and keeps sensitive data identification stable.
In one embodiment, the AI-sensitive data discovery server comprises:
the model training module is used for training the sensitive data recognition model;
and an AI sensitive data discovery service platform, used for dispatching corpus data, stop words, or sample data between the sensitive data recognition model and the server node and for obtaining the recognition result of the server node.
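The division of work between the discovery server and the server nodes can be illustrated with a minimal sketch. The text states only that the nodes are deployed as a cluster and that load balancing is achieved; the round-robin policy, the `DiscoveryServer` class, and the `make_node` helper below are hypothetical names and assumptions introduced for illustration, not the dispatching scheme of the invention.

```python
from itertools import cycle


def make_node(name):
    """Build a hypothetical server node: a callable that 'identifies' a batch."""
    def node(batch):
        # A real node would run the sensitive data identification model here.
        return name, [f"identified:{item}" for item in batch]
    return node


class DiscoveryServer:
    """Hands sample batches to server nodes in round-robin order (assumed policy)."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one server node is required")
        self._nodes = cycle(nodes)

    def dispatch(self, batch):
        # Pick the next node in the rotation and collect its result.
        node = next(self._nodes)
        return node(batch)


server = DiscoveryServer([make_node("node-a"), make_node("node-b")])
for batch in (["addr-1"], ["addr-2"], ["addr-3"]):
    print(server.dispatch(batch))
```

Adding a node to the list passed to `DiscoveryServer` spreads requests over one more worker, which is the horizontal scaling the paragraph describes.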
Drawings
FIG. 1 is a flow diagram of a sensitive data recognition model training method according to an embodiment;
FIG. 2 is a flow chart of a sensitive data recognition model training method according to another embodiment;
FIG. 3 is a flow chart of a sensitive data recognition model training method according to yet another embodiment;
FIG. 4 is a block diagram of an embodiment of a sensitive data recognition model training apparatus;
FIG. 5 is a schematic diagram of the internal structure of a computer according to an embodiment;
FIG. 6 is a flow diagram of a sensitive data identification method according to an embodiment;
FIG. 7 is a flow chart of a sensitive data identification method according to another embodiment;
FIG. 8 is a block diagram of an embodiment of a sensitive data identification device;
FIG. 9 is a schematic diagram of the internal structure of a computer according to another embodiment;
FIG. 10 is a block diagram of a sensitive data identification system module of an embodiment;
FIG. 11 is a block diagram of another embodiment of a sensitive data identification system;
FIG. 12 is a block diagram of a sensitive data identification system module according to a specific application example.
Detailed Description
For better understanding of the objects, technical solutions and effects of the present invention, the present invention will be further explained with reference to the accompanying drawings and examples. Meanwhile, the following described examples are only for explaining the present invention, and are not intended to limit the present invention.
The embodiment of the invention provides a sensitive data recognition model training method.
Fig. 1 is a flowchart of a sensitive data recognition model training method according to an embodiment, and as shown in fig. 1, the sensitive data recognition model training method according to an embodiment includes steps S100 to S102:
S100, obtaining corpus data and stop words;
The corpus data and the stop words are obtained from a database and provide the basic data for model training. In one embodiment, the corpus data includes extracted table field data.
In one embodiment, the stop words include special characters and other character strings that are irrelevant to the corpus data.
S101, preprocessing the corpus data and stop words to obtain a preprocessing result;
Preprocessing converts the corpus data and stop words into parameterized data, e.g., vectorized data, suited to the sensitive data recognition model.
Based on this, in one embodiment, fig. 2 is a flowchart of a sensitive data recognition model training method according to another embodiment. As shown in fig. 2, the process of preprocessing the corpus data and stop words in step S101 to obtain a preprocessing result includes step S200:
S200, encapsulating the corpus data and the stop words as parameters to obtain the preprocessing result.
The corpus data and stop words are encapsulated into parameters that the sensitive data recognition model can accept, and these parameters serve as the preprocessing result.
In one embodiment, fig. 3 is a flowchart of a sensitive data recognition model training method according to yet another embodiment. As shown in fig. 3, the process of encapsulating the corpus data and the stop words as parameters in step S200 to obtain the preprocessing result includes steps S300 to S302:
S300, performing word segmentation processing on the corpus data to obtain a word segmentation list;
S301, removing stop words from the word segmentation list to obtain a targeted word segmentation list;
Removing the negligible stop words from the word segmentation list makes the subsequent model training more targeted.
S302, packaging the targeted word segmentation list into vectorized parameters as a preprocessing result.
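Steps S300 to S302 can be sketched as follows. This is a minimal illustration under stated assumptions: the regular-expression tokenizer stands in for a real word segmenter (the text does not name one), and `STOP_WORDS`, `TaggedTokens`, `segment`, and `preprocess` are hypothetical names. The packaged form is a token list with a label, the kind of input a document-embedding model typically accepts.

```python
import re
from typing import List, NamedTuple

# Hypothetical stop words; per the text they would come from the database.
STOP_WORDS = {"#", "*", "/", "null"}


class TaggedTokens(NamedTuple):
    """Packaged parameter: a targeted token list plus its field label."""
    words: List[str]
    tag: str


def segment(text):
    """S300: split field text into tokens (stand-in for a word segmenter)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def preprocess(corpus, stop_words):
    """Run S300 to S302 over an iterable of (text, label) pairs."""
    result = []
    for text, label in corpus:
        tokens = segment(text)                                   # S300
        targeted = [t for t in tokens if t not in stop_words]    # S301
        result.append(TaggedTokens(targeted, label))             # S302
    return result


packaged = preprocess([("Guangdong # Shenzhen * Nanshan null", "address")], STOP_WORDS)
print(packaged[0].words)  # ['guangdong', 'shenzhen', 'nanshan']
```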
S102, performing multiple rounds of model training according to the preprocessing result to obtain a sensitive data recognition model.
The model training is performed in generations, with each round of training producing one generation of the model. In one embodiment, the sensitive data recognition model comprises a text classification model, such as a Doc2vec model or a word2vec model. As a preferred embodiment, the sensitive data recognition model is a Doc2vec model.
In one embodiment, performing multiple rounds of model training on the preprocessing result in step S102 includes performing 10 to 20 or more rounds of model training to generate 10 to 20 generations of the sensitive data recognition model.
In one embodiment, as shown in fig. 2, the process of performing multiple rounds of model training according to the preprocessing result in step S102 includes step S201:
S201, performing more than 10 rounds of model training according to the preprocessing result.
Performing more than 10 rounds of model training on the preprocessing result yields a sensitive data recognition model trained through 10 generations.
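The idea of training in generations can be sketched with a toy example. To keep the sketch dependency-free, a simple perceptron-style token classifier is swapped in for the Doc2vec model named in the text; it illustrates only the loop that repeats training and keeps one snapshot per generation. The names `predict` and `train_generations` and the sample data are assumptions.

```python
from collections import defaultdict


def predict(weights, words):
    """Return the label whose summed token weights are highest."""
    scores = {label: sum(w.get(t, 0.0) for t in words)
              for label, w in weights.items()}
    return max(scores, key=scores.get)


def train_generations(samples, labels, generations=10):
    """Train for a fixed number of generations; return one snapshot per generation."""
    weights = {label: defaultdict(float) for label in labels}
    snapshots = []
    for _ in range(generations):
        for words, label in samples:
            guess = predict(weights, words)
            if guess != label:
                # Perceptron update: reward the true label, penalize the guess.
                for t in words:
                    weights[label][t] += 1.0
                    weights[guess][t] -= 1.0
        snapshots.append({l: dict(w) for l, w in weights.items()})
    return snapshots


samples = [(["guangdong", "shenzhen"], "address"), (["alice", "zhang"], "name")]
generations = train_generations(samples, ["address", "name"], generations=10)
print(len(generations), predict(generations[-1], ["alice", "zhang"]))
```

Keeping every snapshot makes it possible to compare generations and pick the one with the best recognition rate, which is one plausible reading of generating 10 to 20 generations of the model.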
In the sensitive data recognition model training method according to any one of the embodiments, after the corpus data and the stop word are obtained, the corpus data and the stop word are preprocessed to obtain a preprocessing result, and multiple times of model training are performed according to the preprocessing result to obtain the sensitive data recognition model. Based on the method, the recognition rate of the non-standard sensitive data is continuously improved through multiple times of model training, so that the problem that the recognition rate of the traditional technology to the non-standard sensitive data is low is solved.
The embodiment of the invention also provides a sensitive data recognition model training device.
Fig. 4 is a block diagram of a sensitive data recognition model training apparatus according to an embodiment, and as shown in fig. 4, the sensitive data recognition model training apparatus according to an embodiment includes a module 100, a module 101, and a module 102:
a first obtaining module 100, configured to obtain corpus data and stop words;
the first preprocessing module 101 is configured to preprocess the corpus data and stop words to obtain a preprocessing result;
and the data training module 102 is configured to perform multiple model training according to the preprocessing result to obtain a sensitive data recognition model.
After the corpus data and the stop words are obtained, the sensitive data recognition model training device preprocesses the corpus data and the stop words to obtain a preprocessing result, and performs multiple times of model training according to the preprocessing result to obtain the sensitive data recognition model. Based on the method, the recognition rate of the non-standard sensitive data is continuously improved through multiple times of model training, so that the problem that the recognition rate of the traditional technology to the non-standard sensitive data is low is solved.
The embodiment of the invention also provides a computer storage medium, on which computer instructions are stored, and when the instructions are executed by a processor, the method for training the sensitive data recognition model of any one of the above embodiments is realized.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the methods of the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a RAM, a ROM, a magnetic or optical disk, or various other media that can store program code.
Corresponding to the computer storage medium, in one embodiment, a computer device is further provided, where the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the computer program to implement any one of the sensitive data recognition model training methods in the embodiments.
The computer device may be a terminal, and its internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a sensitive data recognition model training method. The display screen of the computer device can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device can be a touch layer covering the display screen, a key, a trackball, or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, or mouse.
After the corpus data and the stop words are obtained, the computer device preprocesses the corpus data and the stop words to obtain preprocessing results, and performs multiple times of model training according to the preprocessing results to obtain a sensitive data recognition model. Based on the method, the recognition rate of the non-standard sensitive data is continuously improved through multiple times of model training, so that the problem that the recognition rate of the traditional technology to the non-standard sensitive data is low is solved.
The embodiment of the invention also provides a sensitive data identification method.
Fig. 6 is a flowchart illustrating a sensitive data identification method according to an embodiment, and as shown in fig. 6, the sensitive data identification method according to an embodiment includes steps S400 to S402:
S400, obtaining sample data to be identified;
The sample data comprises a fixed quantity of data extracted from a given table field.
S401, preprocessing sample data to obtain a sample processing result;
The sample data is encapsulated into parameters that the sensitive data identification model can accept, and these parameters serve as the sample processing result.
In one embodiment, fig. 7 is a flowchart of a sensitive data identification method according to another embodiment, as shown in fig. 7, a process of preprocessing sample data to obtain a sample processing result in step S401 includes step S500:
S500, encapsulating the sample data as parameters to obtain the sample processing result.
Specifically, word segmentation processing is carried out on the sample data to generate a word segmentation list, and the word segmentation list is packaged into vector parameters acceptable by the sensitive data identification model.
S402, loading the sample processing result into the sensitive data identification model, and obtaining the identification rate which is output by the sensitive data identification model and is used as the identification result.
In one embodiment, as shown in fig. 7, the process of loading the sample processing result into the sensitive data recognition model in step S402 to obtain the recognition rate output by the sensitive data recognition model as the recognition result includes step S501:
S501, loading the sample processing result into the sensitive data identification model, outputting the identification rate as the identification result when the identification rate output by the sensitive data identification model is greater than a sensitive type threshold, and otherwise controlling the sensitive data identification model to repeat the model operation.
The sensitive type threshold is expressed as the ratio of the number of identified samples to the total number of samples, and ranges from 70% to 90%. As a preferred embodiment, the sensitive type threshold is 80%: when the identification rate output by the sensitive data identification model is greater than 80%, the identification rate is output as the identification result; otherwise, the sensitive data identification model is controlled to repeat the model operation.
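The threshold check in step S501 can be sketched as below. The sketch assumes the identification rate is the fraction of samples that the model assigns to a given sensitive type, and it bounds the repetition with a hypothetical `max_attempts` parameter, since the text does not say when repetition stops. `recognition_rate`, `identify`, and the `run_model` callable are illustrative names.

```python
def recognition_rate(predictions, sensitive_type):
    """Identified samples / total samples for one sensitive type."""
    if not predictions:
        return 0.0
    hits = sum(1 for p in predictions if p == sensitive_type)
    return hits / len(predictions)


def identify(run_model, samples, sensitive_type, threshold=0.8, max_attempts=3):
    """Return the rate once it exceeds the threshold; otherwise repeat, then give up."""
    for _ in range(max_attempts):
        rate = recognition_rate(run_model(samples), sensitive_type)
        if rate > threshold:
            return rate
    return None  # assumption: no result is reported if the threshold is never exceeded


# Hypothetical model run that labels 9 of 10 samples as addresses.
result = identify(lambda s: ["address"] * 9 + ["other"], list(range(10)), "address")
print(result)  # 0.9
```

The comparison is strict, matching the wording "greater than" the threshold: a rate of exactly 80% does not pass with the preferred threshold of 80%.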
In the sensitive data identification method in any embodiment, after the sample data to be identified is obtained, the sample data is preprocessed to obtain a sample processing result, and finally, the sample processing result is loaded into the sensitive data identification model to obtain the identification rate output by the sensitive data identification model and used as the identification result. Based on the method, the sensitive data identification is carried out on the sensitive data identification model obtained through multiple times of training, the identification rate of the non-standard sensitive data is continuously improved, and the problem that the identification rate of the traditional technology to the non-standard sensitive data is low is solved.
The embodiment of the invention also provides a sensitive data identification device.
Fig. 8 is a block diagram of a sensitive data recognition apparatus according to an embodiment. As shown in fig. 8, the sensitive data recognition apparatus according to an embodiment includes a second obtaining module 200, a second preprocessing module 201, and a model identification module 202:
the second obtaining module 200 is configured to obtain sample data to be identified;
the second preprocessing module 201 is configured to preprocess the sample data to obtain a sample processing result;
the model identification module 202 is configured to load the sample processing result into the sensitive data identification model, and obtain the identification rate output by the sensitive data identification model as the identification result.
With the sensitive data identification device, after the sample data to be identified is obtained, the sample data is preprocessed to obtain a sample processing result, and the sample processing result is then loaded into the sensitive data identification model to obtain the identification rate output by the model as the identification result. On this basis, sensitive data identification is performed with a sensitive data identification model obtained through multiple rounds of training, so that the identification rate for non-standard sensitive data is continuously improved and the low identification rate of conventional techniques for non-standard sensitive data is addressed.
The embodiment of the invention also provides a computer storage medium on which computer instructions are stored; when executed by a processor, the instructions implement the sensitive data identification method of any one of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the methods of the embodiments of the present invention. The aforementioned storage medium includes: a removable storage device, a RAM, a ROM, a magnetic or optical disk, or various other media that can store program code.
Corresponding to the computer storage medium, in one embodiment, a computer device is further provided, where the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the computer program to implement any one of the sensitive data identification methods in the embodiments.
The computer device may be a terminal, and its internal structure may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a sensitive data recognition method. The display screen of the computer device can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device can be a touch layer covering the display screen, a key, a track ball, or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, or mouse.
After the computer device obtains the sample data to be identified, the sample data is preprocessed to obtain a sample processing result, and the sample processing result is then loaded into the sensitive data identification model to obtain the identification rate output by the model as the identification result. On this basis, sensitive data identification is performed with a sensitive data identification model obtained through multiple rounds of training, so that the identification rate for non-standard sensitive data is continuously improved and the low identification rate of conventional techniques for non-standard sensitive data is addressed.
The embodiment of the invention also provides a sensitive data identification system.
Fig. 10 is a block diagram of a sensitive data recognition system according to an embodiment. As shown in fig. 10, the sensitive data recognition system according to an embodiment includes a server node 1000 and an AI sensitive data discovery server 1001:
the server node 1000 is configured to complete sensitive data identification according to the sensitive data identification model;
the AI sensitive data discovery server 1001 is configured to allocate corpus data, stop words, or sample data between the sensitive data recognition model and the server node, acquire the recognition result of the server node, and train the sensitive data recognition model.
The server nodes 1000 are deployed in a distributed manner. By combining advanced mobile internet technologies such as distributed deployment, load balancing, containers, and multithreading, the traditional single-point mode is horizontally scaled into a multi-node cluster, and each server node 1000 can be deployed on a cloud host, a virtual machine, or an ordinary PC. Sensitive data identification is performed by each server node 1000.
The AI sensitive data discovery server 1001 serves as a relay that completes the data interaction between the sensitive data recognition model and the server nodes 1000. A dedicated directory of the AI sensitive data discovery server 1001 stores the stop words in the form of text files.
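As an illustrative sketch only, stop words kept as text files in a directory could be loaded as shown below. The function name `load_stop_words`, the `.txt` extension, the UTF-8 encoding, and the one-word-per-line layout are all assumptions made for the example; the embodiments state only that stop words are stored as text files.

```python
from pathlib import Path


def load_stop_words(directory: str) -> set[str]:
    """Collect stop words from every .txt file in `directory` (one per line)."""
    stop_words: set[str] = set()
    for path in sorted(Path(directory).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            word = line.strip()
            if word:  # skip blank lines
                stop_words.add(word)
    return stop_words
```

Returning a set keeps later membership checks cheap when the stop words are filtered out of a word segmentation list.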
In one embodiment, fig. 11 is a block diagram of a sensitive data identification system according to another embodiment, and as shown in fig. 11, the AI sensitive data discovery server 1001 includes a module 2000 and a module 2001:
the model training module 2000 is used for training the sensitive data recognition model;
the AI sensitive data discovery service platform 2001 is used for allocating corpus data, stop words, or sample data between the sensitive data recognition model and the server node, and for obtaining the recognition result of the server node.
In one embodiment, fig. 12 is a block diagram of the modules of a sensitive data recognition system according to a specific application example. As shown in fig. 12, the database is implemented by a client production backup library; the AI sensitive data discovery service platform 2001 is implemented by a data asset management platform; and the model training module 2000 is implemented by an AI sensitive data discovery service, which includes a model training service, a sensitive data discovery service, model distribution, and Tengine load balancing (request polling and forwarding). The service cluster includes a plurality of server nodes 1000, each implemented as a uWSGI server that includes a Flask framework interface service and an AI sensitive data discovery service. Model distribution issues the sensitive data identification model, and the AI sensitive data discovery service performs sensitive data identification according to that model.
The model training service can be deployed on a cloud or a common terminal, such as a PC.
In one embodiment, to ensure the data security of the production backup library, the service is located in the production network environment together with the data asset management platform and the production backup library. The sensitive data recognition model training service flow is as follows:
(1) the data asset management platform connects to the production backup library via JDBC (Java Database Connectivity) and extracts table field data as the corpus data for model training; the data volume should be as large and complete as possible;
(2) the data asset management platform calls the external interface of the model training service, the interface being based on the REST API standard;
(3) after receiving the request, the model training service extracts the relevant corpus data and loads the stop words (including irrelevant special characters, other character strings, and the like);
(4) the corpus data and the stop words are preprocessed, which mainly comprises the following aspects:
(a) performing word segmentation on the corpus data to generate a word segmentation list;
(b) removing the stop words from the word segmentation list, so that model training is more targeted;
(c) encapsulating the word segmentation list into parameters acceptable to the Doc2Vec document vector model;
(5) the parameters of the Doc2Vec model are adjusted, and the model is then trained over successive generations (epochs);
the optimized Doc2Vec model parameter configuration is as follows:
[Figure DEST_PATH_IMAGE001: Doc2Vec model parameter configuration; image not reproduced]
(6) the model is trained for E epochs (E is suggested to be 10 or more) to generate an E-th generation Doc2Vec model;
(7) the trained model is uploaded to the relevant directory of each node through SFTP (SSH File Transfer Protocol).
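The preprocessing portion of the training flow above (word segmentation, stop word removal, and encapsulation into model parameters) may be sketched as follows, under stated assumptions. The embodiments do not name a word segmenter, so `segment` is a hypothetical regular-expression stand-in; `TaggedDoc` is likewise a hypothetical container that mirrors the (words, tags) shape commonly accepted by Doc2Vec implementations, such as the `TaggedDocument` type of the gensim library. Model training itself is omitted from the sketch.

```python
import re
from typing import Iterable, NamedTuple


class TaggedDoc(NamedTuple):
    """One document packaged for a Doc2Vec-style model: tokens plus tags."""
    words: list[str]
    tags: list[int]


def segment(text: str) -> list[str]:
    """Stand-in segmenter: ASCII alphanumeric runs are single tokens;
    every other non-space character (e.g. CJK, punctuation) is its own token."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def build_corpus(rows: Iterable[str], stop_words: set[str]) -> list[TaggedDoc]:
    """Segment each row, drop stop words, and tag the result with its row index."""
    corpus: list[TaggedDoc] = []
    for index, row in enumerate(rows):
        words = [w for w in segment(row) if w not in stop_words]
        if words:  # rows that become empty carry no training signal
            corpus.append(TaggedDoc(words, [index]))
    return corpus


if __name__ == "__main__":
    docs = build_corpus(["No. 12 Main Road", "Room 3, Block B"], {".", ","})
    for doc in docs:
        print(doc.tags, doc.words)
```

A production deployment would presumably replace `segment` with a segmenter suited to the language of the table field data, and would pass the resulting corpus to the Doc2Vec model for the E epochs of training described above.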
The sensitive data discovery service and the model training service can be deployed on the same machine, with Tengine or Nginx used as the web server and a polling (round-robin) mechanism configured to achieve load balancing. The AI sensitive data discovery service is installed on each node; the node interface service uses the Flask framework or the aiohttp framework, and the service is managed by a uWSGI container. Each node can be deployed on a cloud host, a virtual machine, or an ordinary PC.
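In the embodiments, the polling mechanism is provided by the web server (Tengine or Nginx) rather than by application code. Purely to illustrate the order in which a polling scheme hands requests to nodes, a minimal sketch is given below; the class name `RoundRobinBalancer` and the node names are hypothetical.

```python
from itertools import cycle
from typing import Sequence


class RoundRobinBalancer:
    """Hand out nodes in a fixed rotating order, one per request."""

    def __init__(self, nodes: Sequence[str]) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        self._nodes = cycle(list(nodes))

    def next_node(self) -> str:
        """Return the node that should receive the next request."""
        return next(self._nodes)


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
    for request_id in range(5):
        print(request_id, balancer.next_node())
```

Because each node receives every n-th request, a sudden increase in identification requests is spread evenly across the cluster, which is the load balancing effect the embodiments rely on.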
The sensitive data discovery service flow is as follows:
(1) the data asset management platform connects to the production backup library via JDBC and extracts a fixed quantity of data from a given table field as the sample data for sensitive identification (500 or more samples are suggested);
(2) the data asset management platform calls the external interface of the sensitive data discovery service, the interface being based on the REST API standard;
(3) after receiving the request, the sensitive data discovery service extracts the sample data of the table field and forwards the request to a cluster node for identification processing through the polling mechanism of Tengine;
(4) after receiving the forwarded request, the node service performs word segmentation on the sample data to generate a word segmentation list, and encapsulates the word segmentation list into vector parameters acceptable to the Doc2Vec model;
(5) the AI sensitive data discovery thread loads a trained E-th generation sensitive model (such as an address model, a company name model, and the like);
(6) the Doc2Vec model performs the identification operation on the vector parameters and then outputs an identification rate Q;
(7) the sensitive type is judged: if the identification rate Q is greater than or equal to a preset sensitive type threshold P (where P = number of identified samples / total number of samples, for example 80%), the sample field belongs to that sensitive class; otherwise, step (5) is repeated to load another sensitive class model and continue the identification operation;
(8) the output sensitive result is fed back to the data asset management platform through the interface.
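The judgment loop in the last steps of this flow (load a sensitive class model, obtain the identification rate Q, compare Q with the threshold P, and otherwise try another model) may be sketched as follows. This is a sketch under stated assumptions rather than the implementation of the embodiments: `classify_field` and `keyword_model` are hypothetical names, and the keyword matchers merely stand in for trained Doc2Vec models so that the example is self-contained.

```python
from typing import Callable, Optional, Sequence

# A recognizer takes the samples of one table field and returns a rate Q in [0, 1].
Recognizer = Callable[[Sequence[str]], float]


def keyword_model(keywords: Sequence[str]) -> Recognizer:
    """Hypothetical stand-in for a trained model: a sample counts as
    identified when it contains any of the given keywords."""
    def recognize(samples: Sequence[str]) -> float:
        if not samples:
            return 0.0
        hits = sum(any(k in s for k in keywords) for s in samples)
        return hits / len(samples)
    return recognize


def classify_field(
    samples: Sequence[str],
    models: dict[str, Recognizer],
    threshold: float = 0.80,
) -> Optional[tuple[str, float]]:
    """Try each sensitive class model in turn; return the first class whose
    identification rate Q satisfies Q >= threshold, or None if none does."""
    for category, recognize in models.items():
        q = recognize(samples)
        if q >= threshold:
            return category, q
    return None


if __name__ == "__main__":
    models = {
        "address": keyword_model(["Road", "Street"]),
        "company": keyword_model(["Ltd", "Inc"]),
    }
    print(classify_field(["Acme Ltd", "Foo Inc", "Bar Ltd"], models))
```

Note that this flow compares Q with P using "greater than or equal to", whereas step S501 is worded as "greater than"; the sketch follows the wording of this flow.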
The sensitive data identification system of any of the above embodiments comprises a server node and an AI sensitive data discovery server. The AI sensitive data discovery server allocates corpus data, stop words, or sample data among the database, the sensitive data identification model, and the server node, acquires the identification result of the server node, and trains the sensitive data identification model. On this basis, sensitive data identification is performed with a sensitive data identification model obtained through multiple rounds of training, so that the identification rate for non-standard sensitive data is continuously improved and the low identification rate of conventional techniques for non-standard sensitive data is addressed. In addition, through the distributed deployment of the server nodes, sensitive data identification is horizontally scaled into a multi-node cluster, which accommodates sudden increases in the volume of identification requests, achieves load balancing, and ensures the stability of sensitive data identification.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features involves no contradiction, it should be considered as falling within the scope of the present specification.
The above examples show only some embodiments of the present invention, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A sensitive data recognition model training method is characterized by comprising the following steps:
obtaining corpus data and stop words;
preprocessing the corpus data and the stop words to obtain a preprocessing result;
and performing multiple times of model training according to the preprocessing result to obtain a sensitive data recognition model.
2. The sensitive data recognition model training method according to claim 1, wherein the process of preprocessing the corpus data and the stop words to obtain a preprocessing result comprises the steps of:
and performing parameter encapsulation on the corpus data and the stop words to obtain parameters serving as the preprocessing result.
3. The sensitive data recognition model training method according to claim 2, wherein the process of performing parameter encapsulation on the corpus data and the stop words to obtain parameters serving as the preprocessing result comprises the steps of:
performing word segmentation processing on the corpus data to obtain a word segmentation list;
removing the stop words from the word segmentation list to obtain a targeted word segmentation list;
and encapsulating the targeted word segmentation list into vectorized parameters serving as the preprocessing result.
4. The sensitive data recognition model training method of any one of claims 1 to 3, wherein the sensitive data recognition model is a Doc2Vec model.
5. The sensitive data recognition model training method according to any one of claims 1 to 3, wherein the process of performing multiple times of model training according to the preprocessing result comprises the steps of:
and performing more than 10 times of model training according to the preprocessing result.
6. A method for identifying sensitive data, comprising the steps of:
acquiring sample data to be identified;
preprocessing the sample data to obtain a sample processing result;
and loading the sample processing result into a sensitive data identification model to obtain the identification rate which is output by the sensitive data identification model and is used as an identification result.
7. The method according to claim 6, wherein the step of preprocessing the sample data to obtain a sample processing result comprises:
and performing parameter encapsulation on the sample data to obtain parameters serving as the sample processing result.
8. The sensitive data identification method according to claim 6 or 7, wherein the process of loading the sample processing result into a sensitive data identification model and obtaining the identification rate output by the sensitive data identification model as the identification result comprises the steps of:
and loading the sample processing result into a sensitive data identification model, outputting the identification rate as an identification result when the identification rate output by the sensitive data identification model is greater than a sensitive type threshold, and otherwise, controlling the sensitive data identification model to repeat model operation.
9. A sensitive data identification system, comprising:
the server node is used for completing sensitive data identification according to the sensitive data identification model;
and the AI sensitive data discovery server is used for allocating corpus data, stop words or sample data between the sensitive data recognition model and the server node, acquiring the recognition result of the server node and training the sensitive data recognition model.
10. The sensitive data recognition system of claim 9, wherein the AI sensitive data discovery server comprises:
the model training module is used for training the sensitive data recognition model;
and the AI sensitive data discovery service platform is used for allocating the corpus data, the stop words or the sample data between the sensitive data recognition model and the server node, and obtaining the recognition result of the server node.
CN202110935771.XA 2021-08-16 2021-08-16 Sensitive data recognition model training method, sensitive data recognition method and system Pending CN113642326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110935771.XA CN113642326A (en) 2021-08-16 2021-08-16 Sensitive data recognition model training method, sensitive data recognition method and system

Publications (1)

Publication Number Publication Date
CN113642326A true CN113642326A (en) 2021-11-12

Family

ID=78421976

Country Status (1)

Country Link
CN (1) CN113642326A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110727880A (en) * 2019-10-18 2020-01-24 西安电子科技大学 Sensitive corpus detection method based on word bank and word vector model
CN111310205A (en) * 2020-02-11 2020-06-19 平安科技(深圳)有限公司 Sensitive information detection method and device, computer equipment and storage medium
CN111966875A (en) * 2020-08-18 2020-11-20 中国银行股份有限公司 Sensitive information identification method and device
US20210012237A1 (en) * 2019-07-11 2021-01-14 International Business Machines Corporation De-identifying machine learning models trained on sensitive data
CN112417887A (en) * 2020-11-20 2021-02-26 平安普惠企业管理有限公司 Sensitive word and sentence recognition model processing method and related equipment thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination