CN114595689A - Data processing method, data processing device, storage medium and computer equipment - Google Patents

Publication number
CN114595689A
Authority
CN
China
Prior art keywords
data
category
level
processed
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210212791.9A
Other languages
Chinese (zh)
Inventor
曾壮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yishi Huolala Technology Co Ltd
Original Assignee
Shenzhen Yishi Huolala Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yishi Huolala Technology Co Ltd
Priority to CN202210212791.9A
Publication of CN114595689A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data processing method comprising the following steps: acquiring Chinese remark information corresponding to the field name of the data to be processed; performing word segmentation and standardization on the Chinese remark information to obtain a feature vector of the Chinese remark information; inputting the feature vector into a data category identification model to determine the data category of the data to be processed, wherein the data category identification model is pre-trained based on a Chinese text classification deep learning algorithm and the data category is a lowest-level subcategory under a pre-constructed category directory of at least two levels; and determining the data sensitivity level of the data to be processed according to the data category. The method can be applied to data classification and grading and to sensitive data management. The data category identification model built on a Chinese text classification deep learning algorithm identifies data categories accurately, and the data sensitivity level is then determined from the data category, which provides strong technical support for security control, safe use, and sharing of data.

Description

Data processing method, data processing device, storage medium and computer equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method, an apparatus, a computer-readable storage medium, and a computer device.
Background
With the development of big data technology, enterprises gather various data into a unified data resource pool. As an asset, the data is opened up to different users within the company and to society, which also increases the risk that sensitive data is leaked or used illegally. In data sharing it is therefore important to secure sensitive data and prevent leakage. Enterprises now hold massive amounts of data, and the traditional uniform management and control approach struggles to apply fine-grained security control to billions of enterprise data fields, so it cannot meet current data security compliance requirements.
In the prior art, sensitive data is identified either by defining keywords over data field names and matching against them, or by writing regular expressions for regularly structured data values and matching against them. For example, when identifying sensitive data such as identification numbers, names, mobile phone numbers, and addresses, monitored data is judged sensitive when it satisfies the agreed matching conditions. This approach works well when there are few data types or a single kind of data value. With massive data spanning complex and varied types such as business operation data, user behavior data, and financial statement data, however, it suffers from rules that are difficult to construct, low accuracy, small identification coverage, and weak extensibility.
Disclosure of Invention
In order to solve at least one of the above technical drawbacks, the present invention provides a data processing method, a corresponding apparatus, a computer-readable storage medium, and a computer device according to the following technical solutions.
According to an aspect, an embodiment of the present invention provides a data processing method, including the steps of:
acquiring Chinese remark information corresponding to the field name of the data to be processed;
performing word segmentation processing and standardization processing on the Chinese remark information to obtain a feature vector of the Chinese remark information;
inputting the feature vector into a data category identification model, and determining the data category of the data to be processed; the data category identification model is generated by pre-training based on a Chinese text classification deep learning algorithm; the data category is a lowest level subcategory under at least two levels of pre-constructed category catalogues;
and determining the data sensitivity level of the data to be processed according to the data category.
Preferably, the data category identification model is generated by pre-training through the following steps:
acquiring sample data and data category labels corresponding to the lowest level subcategory under the at least two levels of category directories;
performing word segmentation processing and standardization processing on the Chinese remark information corresponding to the field name of the sample data to obtain a feature vector for training;
and training an initial model based on a Chinese text classification deep learning algorithm according to the training feature vectors and the corresponding data category labels to obtain the data category identification model.
Preferably, the data category identification model is a FastText model generated based on FastText algorithm pre-training.
Preferably, the at least two levels of category directories are pre-constructed by:
acquiring a data set, and performing word segmentation and word frequency analysis on field names in the data set to obtain a word frequency analysis result;
determining a first-level category according to the word frequency analysis result;
and performing clustering analysis processing on the data under the first-level category based on a k-means clustering algorithm to generate at least two-level category directories.
Preferably, the clustering processing of the data under the first-level category based on the k-means clustering algorithm to generate at least two-level category directories includes:
performing clustering analysis processing on the data under the primary category based on a k-means clustering algorithm, and determining a secondary sub-category corresponding to the primary category;
and performing clustering analysis processing on the categories of which the data types reach the preset number in the second-level subcategories based on a k-means clustering algorithm, determining the third-level subcategories corresponding to the first-level categories, and generating a third-level category catalog.
Preferably, after determining the data sensitivity level of the data to be processed according to the data category, the method further includes:
and according to the data sensitivity level, performing security control processing matched with the data sensitivity level on the data to be processed.
Preferably, the data sensitivity levels comprise a first level, a second level, a third level and a fourth level with sensitivity degrees from low to high;
the performing, according to the data sensitivity level, security management and control processing matched with the data sensitivity level on the data to be processed includes:
if the data sensitivity level is a first level, configuring the data to be processed to be open to the outside and storing the data on an externally used medium;
if the data sensitivity level is a second level, configuring the data to be processed to be only open to the interior and storing the data in an internal system;
if the data sensitivity level is a third level, configuring the data to be processed to be open only to internal related personnel, storing it encrypted in an internal system, transmitting it encrypted, and limiting its output;
and if the data sensitivity level is a fourth level, configuring the data to be processed to be open only to internal specific personnel, storing it encrypted in an internal system, transmitting it encrypted, and limiting its use to specific service scenarios.
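The four-level control scheme above can be pictured as a lookup table. The following is a minimal sketch only; the `ControlPolicy` type, its field names, and the string values are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlPolicy:
    audience: str             # who may access the data
    storage: str              # where the data is stored
    encrypt_at_rest: bool     # encrypted storage
    encrypt_in_transit: bool  # encrypted transmission
    restriction: str          # extra limit on output or usage


# Levels 1-4 run from least to most sensitive, as in the text.
# All field values are illustrative assumptions.
POLICIES = {
    1: ControlPolicy("external", "externally used medium", False, False, "none"),
    2: ControlPolicy("internal", "internal system", False, False, "none"),
    3: ControlPolicy("internal related personnel", "internal system",
                     True, True, "limited output"),
    4: ControlPolicy("internal specific personnel", "internal system",
                     True, True, "specific service scenarios only"),
}


def policy_for(level: int) -> ControlPolicy:
    """Return the control policy matched to a data sensitivity level."""
    if level not in POLICIES:
        raise ValueError(f"unknown sensitivity level: {level}")
    return POLICIES[level]
```

Keeping the policies in one table means a change to a level's handling is made in a single place rather than in every consumer of the data.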
Further, an embodiment of the present invention provides, according to another aspect, a data processing apparatus including:
the information acquisition module is used for acquiring Chinese remark information corresponding to the field name of the data to be processed;
the word segmentation module is used for carrying out word segmentation processing and standardization processing on the Chinese remark information to obtain a feature vector of the Chinese remark information;
the class identification module is used for inputting the feature vector into a data class identification model and determining the data class of the data to be processed; the data category identification model is generated by training in advance based on a Chinese text classification deep learning algorithm; the data category is a lowest level subcategory under at least two levels of pre-constructed category catalogues;
and the sensitivity level determining module is used for determining the data sensitivity level of the data to be processed according to the data category.
According to yet another aspect, an embodiment of the present invention provides a computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the above-described data processing method.
According to yet another aspect, embodiments of the present invention provide a computer device, the computer device comprising one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs being configured to perform the above-described data processing method.
Compared with the prior art, the invention has the following beneficial effects:
The data processing method, apparatus, computer-readable storage medium, and computer device provided by the invention can be applied to data classification and grading and to sensitive data management. A data category identification model built on a Chinese text classification deep learning algorithm identifies data categories accurately, and the data sensitivity level is then determined from the data category. Compared with the traditional approach of building keyword matching rules on field names and judging data values with regular expressions, this avoids the problems of difficult rule construction, low accuracy, and small identification coverage in sensitive data identification. Using the model to extract features from the Chinese remark information enables high-dimensional feature analysis and mining of the data, which yields higher identification accuracy. A model based on Chinese remark information also generalizes well, so it can accurately identify data categories across the full data set with high identification coverage, providing strong technical support for security control of data and for its safe use and sharing.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a method of data processing according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for constructing at least two levels of category directories according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for training a data class identification model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The embodiment of the invention provides a data processing method that can be applied to data classification and grading and to sensitive data management. Specifically, the method performs efficient classification and grading management of massive data in a big data system, and provides a service based on that system that combines permission control, data desensitization, and security auditing. Combining classification and grading of the full data set with data desensitization and permission control services provides effective technical support for data security services and has worked well in production use.
As shown in fig. 1, a data processing method provided in an embodiment of the present invention includes the following steps:
step S110: acquiring Chinese remark information corresponding to the field name of the data to be processed.
For the embodiment, the data to be processed is data to be classified and graded, where classifying and grading data means assigning the data to the data category it belongs to and then determining its data sensitivity level from that category.
When data in a platform or system requires classification and grading and/or sensitive data management, the Chinese remark information corresponding to the field name of the data to be processed is acquired. The data to be processed is stored in a database organized into tables. Each column in a table is a field, and the field name is the identifier of a column in a table whose data structure follows the relational model. Field names can be defined freely by the user who builds the database and usually consist of English letters, digits, and underscores. In the embodiment of the invention, to make the meaning of each field name clearer and more accurate, every field name is given Chinese remark information in advance that explains what the field name means.
For example, the Chinese remark information corresponding to the field name mobile_telephone is "mobile phone number".
Step S120: and performing word segmentation processing and standardization processing on the Chinese remark information to obtain a feature vector of the Chinese remark information.
For the embodiment, a Chinese word segmentation technique segments the Chinese remark information obtained in step S110 into one or more words, and stop words are removed so that they do not interfere with the subsequent model classification result.
And then, carrying out standardization processing on one or more words obtained by segmentation to obtain a feature vector of the Chinese remark information, and using the feature vector of the Chinese remark information as input data of a data category identification model.
In other embodiments, after the Chinese remark information is segmented, high-frequency domain-specific nouns associated with the business may be added to the segmented words according to the specific service scenario, so that business-related keywords are not missed during segmentation. The segmented words and the added high-frequency domain-specific nouns are then standardized to obtain the feature vector of the Chinese remark information. Generating the feature vector with these added nouns and using it as input data of the data category identification model provides strong technical support for improving identification accuracy.
Step S130: inputting the feature vector into a data category identification model, and determining the data category of the data to be processed; the data category identification model is generated by pre-training based on a Chinese text classification deep learning algorithm; the data category is the lowest level subcategory under at least two levels of pre-constructed category catalogues.
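Step S120 can be sketched as follows. The patent names neither the segmentation technique nor the exact standardization, so this sketch assumes forward maximum matching against a small vocabulary and treats standardization as stripping punctuation and lowercasing; `VOCAB`, `STOP_WORDS`, and `DOMAIN_TERMS` are hypothetical word lists, and the resulting token list stands in for the input from which a feature vector would be built.

```python
import re

# Hypothetical word lists, for illustration only.
VOCAB = {"用户", "手机", "号码", "手机号码", "注册", "时间"}
STOP_WORDS = {"的", "了", "和"}
DOMAIN_TERMS = {"订单"}  # high-frequency business nouns added to the vocabulary


def segment(text, vocab, max_len=4):
    """Forward maximum matching: take the longest vocabulary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no vocabulary word matches.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def preprocess(remark):
    """Standardize a remark, segment it, and drop stop words."""
    # Assumed standardization: keep CJK characters, letters, and digits; lowercase.
    cleaned = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9]", "", remark).lower()
    words = segment(cleaned, VOCAB | DOMAIN_TERMS)
    return [w for w in words if w not in STOP_WORDS]
```

With these word lists, `preprocess("用户的手机号码")` segments the remark into 用户, 的, and 手机号码 and then discards the stop word 的. A production system would more likely rely on an established segmentation library with a full dictionary.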
For the present embodiment, the data category identification model is generated by pre-training a deep learning algorithm for Chinese text classification.
Deep learning algorithms for Chinese text classification include FastText (an open-source word vector computation and text classification tool), TextCNN (text convolutional neural network), TextRNN (text recurrent neural network), RCNN (recurrent convolutional neural network), HAN (Hierarchical Attention Network), BERT (Bidirectional Encoder Representations from Transformers), and the like; the embodiments of the present invention are not limited in this respect.
The feature vector of the Chinese remark information is input into the data category identification model, whose input layer, middle layer, and output layer extract features and map them to the corresponding data category label, thereby determining the data category of the data to be processed.
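The flow through input, middle, and output layers can be illustrated with a heavily simplified FastText-style classifier: token embeddings are averaged into a hidden vector, a linear layer scores each label, and softmax turns the scores into probabilities. This is a sketch under stated assumptions, not the patent's model; the embeddings, weights, and label names are hypothetical toy values, and the real fastText tool additionally uses n-gram features and learned parameters.

```python
import math

# Hypothetical toy parameters; a trained model would learn these values.
EMBEDDINGS = {
    "手机号码": [1.0, 0.0],
    "用户": [0.5, 0.5],
    "设备": [0.0, 1.0],
}
LABELS = ["user_contact_info", "device_info"]
OUTPUT_WEIGHTS = [[2.0, 0.0], [0.0, 2.0]]  # one row of weights per label


def predict(tokens):
    """Return (label, probability) for a list of segmented words."""
    # Input layer: look up an embedding for every known token.
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known tokens")
    # Middle layer: average the token embeddings.
    dim = len(vectors[0])
    hidden = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    # Output layer: linear scores followed by a numerically stable softmax.
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in OUTPUT_WEIGHTS]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best], probs[best]
```

For the tokens 用户 and 手机号码 the averaged vector is `[0.75, 0.25]`, so the first label scores higher and is returned together with its softmax probability.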
Compared with field names, the Chinese remark information has higher category discrimination, and high-dimensional feature analysis and mining of data are realized by using a data category identification model to extract features of the Chinese remark information, so that the identification precision is higher.
In addition, the data type identification model based on the Chinese remark information has high generalization capability, can accurately identify the data types of the full data, and has high data identification coverage rate.
For this embodiment, the at least two-level category directory may have two, three, four, five, or even ten levels; the depth of the directory tree may be set as required by the business scenario of the enterprise and the national classification standard. Constructing the category directory in advance according to those criteria yields a data category directory of sufficiently fine granularity, which copes effectively with scenarios that require classifying and grading massive data of many data types.
For the embodiment, the determined data category of the data to be processed is a lowest-level subcategory under the pre-constructed category directory of at least two levels. A lowest-level subcategory is the lowest level of its branch of the directory tree, meaning that no further branch exists under that category.
For example, suppose a two-level category directory is constructed in advance that contains first-level categories and, under those first-level categories that branch, second-level subcategories. The determined data category of the data to be processed is then either a first-level category without branches or a second-level subcategory under a first-level category with branches.
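The notion of a lowest-level subcategory corresponds to a leaf of the directory tree. A minimal sketch, assuming the directory is held as a nested dictionary in which an empty dictionary marks a category without branches; the subcategory names are hypothetical.

```python
# Hypothetical two-level directory: each category maps to its subcategories,
# and {} marks a category that has no branches.
DIRECTORY = {
    "user basic information": {
        "contact information": {},
        "identity information": {},
    },
    "device information": {},
}


def lowest_level_categories(tree, path=()):
    """Collect every category that has no subcategory, as a slash-joined path."""
    leaves = []
    for name, children in tree.items():
        current = path + (name,)
        if children:
            leaves.extend(lowest_level_categories(children, current))
        else:
            leaves.append("/".join(current))
    return leaves
```

Here "device information" is a first-level category without branches, so it is itself a valid classification target alongside the two second-level subcategories.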
In other embodiments, after determining the data category of the data to be processed based on the data category identification model, a manual verification may be further performed to verify the accuracy of the data category identification model.
Step S140: and determining the data sensitivity level of the data to be processed according to the data category.
For the embodiment, according to comprehensive analysis of aspects such as data asset importance degree, utilization value, influence range and the like in a service range of a platform or a system, a plurality of data sensitivity levels can be divided in advance according to the data sensitivity degree, wherein the number of the data sensitivity levels can be two, three, five, eight and the like.
For the embodiment, a corresponding data sensitivity level is preset for each lowest-level subcategory under the at least two-level category directory. Specifically, in an actual application scenario, sensitivity levels may be assigned to the lowest-level subcategories of the directory by considering the scope each data category belongs to and the related business scenario, dividing the subcategories into several levels by degree of sensitivity.
For the embodiment, after the data category of the data to be processed is determined based on the data category identification model, the data sensitivity level of the data to be processed is determined according to the corresponding relationship between the data category and the data sensitivity level, which is set in advance.
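Because the correspondence between data categories and sensitivity levels is set in advance, step S140 reduces to a lookup. The sketch below assumes a plain dictionary; the category paths and level numbers are hypothetical examples rather than values from the patent.

```python
# Hypothetical preset correspondence between lowest-level categories
# and data sensitivity levels.
SENSITIVITY_BY_CATEGORY = {
    "user basic information/contact information": 3,
    "user basic information/identity information": 4,
    "device information": 2,
}


def sensitivity_level(category, default=None):
    """Look up the preset sensitivity level for an identified data category."""
    level = SENSITIVITY_BY_CATEGORY.get(category, default)
    if level is None:
        raise KeyError(f"no sensitivity level configured for {category!r}")
    return level
```

Raising an error for an unmapped category, rather than silently assigning a low level, is a design choice made for this sketch so that gaps in the mapping surface early.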
The data processing method provided by the invention can be applied to data classification and grading and to sensitive data management. A data category identification model built on a Chinese text classification deep learning algorithm identifies data categories accurately, and the data sensitivity level is then determined from the data category. Compared with the traditional approach of building keyword matching rules on field names and judging data values with regular expressions, this avoids the problems of difficult rule construction, low accuracy, and small identification coverage in sensitive data identification. Using the model to extract features from the Chinese remark information enables high-dimensional feature analysis and mining of the data. The model also generalizes well, so it can accurately identify data categories across the full data set with high identification coverage, providing strong technical support for security control of data and for its safe use and sharing.
In some embodiments, as shown in FIG. 2, the at least two levels of category directories are pre-constructed by:
step S210: and acquiring a data set, and performing word segmentation and word frequency analysis on the field names in the data set to obtain a word frequency analysis result.
For this embodiment, when constructing at least two levels of category directories related to the platform's business, a data set is obtained by randomly sampling part of the full data stored in the platform or system, extracting field information such as table names, database names, and field remarks. The field names in the data set are segmented into words, and word frequency analysis counts the frequency of each word to obtain the word frequency analysis result.
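Word segmentation and word frequency analysis of field names can be sketched with the standard library. The example assumes field names made of letters, digits, and underscores, as described earlier; the sample field names are hypothetical.

```python
import re
from collections import Counter


def field_name_word_frequencies(field_names):
    """Split field names into words and count how often each word occurs."""
    counts = Counter()
    for name in field_names:
        # Split on underscores and any other non-alphanumeric characters.
        words = [w for w in re.split(r"[^a-z0-9]+", name.lower()) if w]
        counts.update(words)
    return counts


# Hypothetical sample of field names drawn from the data set.
sample = ["user_name", "user_phone", "device_id", "user_id", "Device_Model"]
frequencies = field_name_word_frequencies(sample)
```

In this sample, "user" and "device" are the most frequent words apart from the generic "id", which hints at candidate primary categories such as user information and device information.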
Step S220: and determining a primary class according to the word frequency analysis result.
For this embodiment, the word frequency analysis result, which reflects the frequency of each word, is used to find the important high-frequency words and to divide them into the data categories they belong to. The primary categories are then determined according to the business scenario of the enterprise and the national classification standard, forming the primary category directory.
For example, the determined primary categories are a user basic information category, a device information category, an account information category, a corporate financial category, and the like.
Step S230: and performing clustering analysis processing on the data under the first-level category based on a k-means clustering algorithm to generate at least two-level category directories.
For the embodiment, the data under each primary category in the primary category directory is divided at a finer granularity. Specifically, the field names of the data acquired under each primary category are first vectorized; the vectorized data is then clustered with the unsupervised k-means algorithm to obtain several sub-clusters; and the data under each sub-cluster is labeled to obtain the second-level subcategories, generating a two-level category directory.
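The vectorize-then-cluster step can be sketched in plain Python. The patent does not say how field names are vectorized, so bag-of-words counts are assumed here, and the initial centroids are passed in explicitly to keep the run deterministic (k-means normally starts from random or k-means++ centroids). The field names are hypothetical.

```python
import re


def vectorize(field_names):
    """Bag-of-words count vectors over the words found in the field names."""
    tokenized = [[w for w in re.split(r"[^a-z0-9]+", n.lower()) if w]
                 for n in field_names]
    vocab = sorted({w for words in tokenized for w in words})
    return [[float(words.count(v)) for v in vocab] for words in tokenized]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centroids, iterations=20):
    """Return the cluster index of each point after Lloyd-style iterations."""
    centroids = [list(c) for c in initial_centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


names = ["user_phone", "user_phone_backup", "user_email", "user_email_backup"]
vectors = vectorize(names)
clusters = k_means(vectors, [vectors[0], vectors[2]])
```

The two phone fields end up in one cluster and the two email fields in the other; labeling each cluster (for example "phone" and "email") would then give the second-level subcategories.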
In addition, clusters that still contain many data types under a subcategory may be subjected to one or more further rounds of cluster analysis based on the k-means clustering algorithm, dividing them further and generating a multi-level category directory.
In this embodiment, the category directory of at least two levels may be a two-level, three-level, four-level, or five-level category directory, or even a ten-level category directory; the depth of the directory tree may be customized according to the business scenario of the enterprise and the national classification standard.
In this embodiment, for scenarios involving the full volume of data, the category directory of at least two levels is constructed through word frequency analysis of field names and the k-means clustering algorithm, with reference to the business scenario of the enterprise and the national classification standard. This yields a data category directory of sufficiently fine granularity and effectively handles application scenarios in which massive data of many data types must be classified and graded.
In addition, the data of the lowest-level subcategories in the category directory of at least two levels is cleaned to remove noise data and interfering data, and is labeled with data category labels, finally forming a complete sample data set for the lowest-level subcategories of the category directory of at least two levels, which serves as the data asset category directory.
In some embodiments, three is the preferred number of directory levels, and the category directory of at least two levels is specifically a three-level category directory.
In this embodiment, for the three-level category directory, step S230 of performing cluster analysis on the data under the primary categories based on a k-means clustering algorithm to generate a category directory of at least two levels specifically comprises: performing cluster analysis on the data under the primary categories based on the k-means clustering algorithm to determine the second-level subcategories corresponding to each primary category; and performing cluster analysis, based on the k-means clustering algorithm, on those second-level subcategories whose number of data types reaches a preset number, to determine the third-level subcategories corresponding to each primary category and generate the three-level category directory.
In this embodiment, the field names of the data under each primary category are first vectorized, the resulting vectors are clustered with the unsupervised k-means clustering algorithm into a number of sub-clusters, and the data in each sub-cluster is labeled to obtain the second-level subcategories, generating a two-level category directory. Cluster analysis based on the k-means clustering algorithm is then performed again on those second-level subcategories whose number of data types reaches the preset number, dividing them further to obtain the third-level subcategories and generating the three-level category directory.
The preset number may be set to any value greater than 1 according to actual application requirements; this embodiment does not limit its specific value. For example, the second round of clustering may be performed on every second-level subcategory that contains at least two data types.
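The preset-number rule amounts to a simple threshold check. The sketch below assumes that "reaches the preset number" means "greater than or equal to"; the function name and the sample data are hypothetical.

```python
def select_for_second_pass(secondary, preset_number=2):
    """Return the second-level subcategories whose number of distinct data types
    reaches the preset number, i.e. the ones that get clustered again."""
    if preset_number <= 1:
        raise ValueError("the preset number must be greater than 1")
    return [name for name, data_types in secondary.items()
            if len(set(data_types)) >= preset_number]

# Hypothetical second-level subcategories and the data types found under each.
SECONDARY = {
    "contact information": ["mobile phone number", "landline number", "mailbox number"],
    "gender": ["gender"],
    "age": ["age"],
}
to_recluster = select_for_second_pass(SECONDARY, preset_number=2)
```

Only "contact information" holds more than one data type here, so only it would be split into third-level subcategories.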
For example, in a three-level category directory, the primary category of user basic information includes second-level subcategories such as contact information, contact address, gender, age, name, and residential address, and the second-level subcategory of contact information includes third-level subcategories such as mobile phone number, landline number, and mailbox number.
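A three-level category directory of this kind can be represented as a nested mapping, with the lowest-level subcategories being its leaves. The sketch below is illustrative: the nested-dictionary representation and the function name are assumptions, and the tree holds only the example categories named above.

```python
# Nested dictionaries: each key is a category, each value holds its subcategories.
# An empty dictionary marks a lowest-level subcategory.
CATALOG = {
    "user basic information": {
        "contact information": {
            "mobile phone number": {},
            "landline number": {},
            "mailbox number": {},
        },
        "gender": {},
        "age": {},
        "name": {},
        "residential address": {},
    },
}

def lowest_level_subcategories(tree, path=()):
    """Return the full path of every leaf category in the directory tree."""
    leaves = []
    for name, children in tree.items():
        current = path + (name,)
        if children:
            leaves.extend(lowest_level_subcategories(children, current))
        else:
            leaves.append(current)
    return leaves

leaves = lowest_level_subcategories(CATALOG)
```

Note that leaves may sit at different depths: "gender" is a lowest-level subcategory at the second level, while "mobile phone number" is one at the third level. These leaves are the labels that the data category identification model predicts.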
In some embodiments, as shown in FIG. 3, the data category identification model is generated by pre-training through the following steps:
Step S310: acquiring sample data and data category labels corresponding to the lowest-level subcategories in the category directory of at least two levels.
In this embodiment, a sample data set corresponding to the lowest-level subcategories is formed when the category directory of at least two levels is constructed. In this step, the sample data and data category labels in that sample data set are acquired as the basis for constructing the data category identification model.
Step S320: performing word segmentation processing and standardization processing on the Chinese remark information corresponding to the field names of the sample data to obtain feature vectors for training.
In this embodiment, after the sample data corresponding to the lowest-level subcategories is obtained, a Chinese word segmentation technique is applied to the Chinese remark information of the field names of the sample data, splitting the Chinese remark information into one or more words. Stop words are removed so that they do not interfere with the model training result.
The one or more words obtained by segmentation are then standardized to obtain the feature vectors for training, which serve as the training data of the data category identification model.
In other embodiments, after word segmentation of the Chinese remark information of the field names of the sample data, high-frequency business-specific terms may be added to the segmented words according to the specific business scenario, so that business-related keywords are not missed during segmentation. The segmented words together with the added high-frequency terms are then standardized to obtain the feature vectors for training. Using feature vectors generated with these added terms as training data provides strong technical support for training a data category identification model with high identification accuracy.
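The segmentation, stop-word removal, business-term, and standardization steps can be sketched as follows. Everything concrete here is an assumption made for illustration: the embodiment does not name a segmentation algorithm, so forward maximum matching against a tiny hypothetical dictionary stands in for it, and "standardization" is interpreted as mapping words to a term-frequency vector over a fixed vocabulary, normalized to sum to 1.

```python
BASE_DICTIONARY = {"用户", "手机", "号码", "手机号码"}  # hypothetical segmentation dictionary
STOP_WORDS = {"的"}                                      # hypothetical stop-word list
DOMAIN_TERMS = {"货运订单"}                              # hypothetical business term ("freight order")
VOCABULARY = ["用户", "手机号码", "号码", "货运订单"]    # hypothetical fixed feature vocabulary

def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

def tokenize(remark, domain_terms=frozenset()):
    """Segment a Chinese remark, optionally with added business terms, and drop stop words."""
    dictionary = BASE_DICTIONARY | set(domain_terms)
    return [w for w in segment(remark, dictionary) if w not in STOP_WORDS]

def to_feature_vector(words, vocabulary):
    """Term-frequency vector over the vocabulary, normalized to sum to 1 (assumed standardization)."""
    counts = [float(words.count(term)) for term in vocabulary]
    total = sum(counts)
    return [c / total for c in counts] if total else counts

plain = tokenize("货运订单号码")                      # business term is split into single characters
with_terms = tokenize("货运订单号码", DOMAIN_TERMS)   # business term is kept whole
features = to_feature_vector(tokenize("用户的手机号码"), VOCABULARY)
```

Without the added business term, "货运订单" breaks into four single characters and carries no usable signal; with it, the term survives as one word, which is the effect the embodiment attributes to adding high-frequency business-specific terms.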
Step S330: training an initial model based on a Chinese text classification deep learning algorithm according to the feature vectors for training and the corresponding data category labels to obtain the data category identification model.
The initial model of the chosen Chinese text classification deep learning algorithm is trained on the feature vectors for training and the corresponding data category labels, and the optimal model parameters are solved for, yielding the data category identification model.
The Chinese text classification deep learning algorithm may be FastText (an open-source word vector computation and text classification tool), TextCNN (text convolutional neural network), TextRNN (text recurrent neural network), RCNN (recurrent convolutional neural network), HAN (Hierarchical Attention Network), BERT (Bidirectional Encoder Representations from Transformers), or the like, which is not limited in the embodiments of the present invention.
In some embodiments, the data set of a newly added data category is added to the sample data set, and the data category identification model is retrained according to steps S310 to S320, which meets the application requirement of iteratively updating category data and significantly enhances the extensibility of data category identification.
In this embodiment, in practical applications the set of data categories to be identified grows as the business of the platform or system expands or as categories are refined, so newly added category data must be identified iteratively; this can be achieved by retraining the model on the newly added data. Specifically, the category directory of at least two levels, which serves as the data asset directory, is updated; sample data is acquired from the data set of the newly added category and merged into the sample data set; and the data category identification model is retrained to obtain a new data category identification model.
In some embodiments, the Chinese text classification deep learning algorithm is specifically the FastText algorithm, and the data category identification model is a FastText model generated by pre-training based on the FastText algorithm. The FastText model has a simple structure, is suited to large-scale data, trains quickly, and can effectively handle application scenarios in which massive data of many data types must be classified and graded.
In some embodiments, step S130 of inputting the feature vector into the data category identification model and determining the data category of the data to be processed is specifically: inputting the feature vector into the FastText model and determining the data category of the data to be processed.
In this embodiment, the FastText model comprises an input layer, a hidden layer, and a SoftMax layer. The input layer receives the feature vectors of the words and phrases in the Chinese remark information and maps them to the hidden layer units through a linear transformation; the SoftMax layer then maps the result to the data category labels, thereby identifying the data category of the data to be processed. Using the FastText model to extract features from the Chinese remark information and identify the data category is fast and accurate, and can effectively handle application scenarios in which massive data of many types must be classified and graded.
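The three-layer structure can be illustrated with a forward pass written in plain Python. This is a toy sketch, not the FastText library and not a trained model: the class name, dimensions, vocabulary, and labels are hypothetical, the weights are random placeholders, and the hidden layer is assumed to be the average of word embeddings (a linear transformation), which is how FastText-style classifiers are commonly described.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

class TinyFastText:
    """Input layer (word lookup) -> hidden layer (averaged embeddings) -> SoftMax layer."""

    def __init__(self, vocabulary, labels, dim=8, seed=0):
        rng = random.Random(seed)  # untrained placeholder weights
        self.index = {w: i for i, w in enumerate(vocabulary)}
        self.labels = list(labels)
        self.dim = dim
        self.embeddings = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocabulary]
        self.output = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in self.labels]

    def hidden(self, words):
        """Average the embeddings of the known words; unknown words are ignored."""
        rows = [self.embeddings[self.index[w]] for w in words if w in self.index]
        if not rows:
            return [0.0] * self.dim
        return [sum(col) / len(rows) for col in zip(*rows)]

    def predict_proba(self, words):
        """Map the hidden representation to one probability per data category label."""
        h = self.hidden(words)
        logits = [sum(w * x for w, x in zip(row, h)) for row in self.output]
        return softmax(logits)

    def predict(self, words):
        probs = self.predict_proba(words)
        return self.labels[probs.index(max(probs))]

model = TinyFastText(["用户", "手机号码", "邮箱"], ["mobile phone number", "mailbox number"])
probs = model.predict_proba(["用户", "手机号码"])
category = model.predict(["用户", "手机号码"])
```

Because the weights are random, the predicted category is meaningless until the parameters are trained as in step S330; the sketch only shows how segmented words flow through the layers to a probability per data category label.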
In some embodiments, after the step S140 determines the data sensitivity level of the data to be processed according to the data category, the method further includes: and according to the data sensitivity level, performing security control processing matched with the data sensitivity level on the data to be processed.
For this embodiment, according to the application requirement of the platform or the system on data security control, the corresponding security control processing is set for the plurality of divided data sensitivity levels in advance, and then the platform or the system may adopt different security control processing measures for data with different data sensitivity degrees.
The management and control items for performing security management and control processing on the data include but are not limited to: open permissions, storage location, whether to encrypt storage, whether to encrypt transmission, application permissions.
For this embodiment, after the data sensitivity level of the data to be processed is determined, the security control processing measure suitable for the current data to be processed may be determined according to the preset corresponding relationship between the data sensitivity level and the security control processing measure, and the security control processing matched with the data sensitivity level may be performed on the data to be processed.
In this embodiment, the method can be applied to data classification and grading and to sensitive data management: the data category is accurately identified by a data category identification model constructed on a Chinese text classification deep learning algorithm, the data sensitivity level is then determined from the data category, and security control processing matching that data sensitivity level is applied to the data.
In some embodiments, the data sensitivity levels include a first level, a second level, a third level, and a fourth level from low to high sensitivity.
For the embodiment, according to the comprehensive analysis of the data asset importance degree, the utilization value, the influence range and the like in the service range of the platform or the system, four data sensitivity levels are preferably divided according to the data sensitivity degree, wherein the four data sensitivity levels are respectively a first level, a second level, a third level and a fourth level with the sensitivity degree from low to high.
An example of the correspondence between the data sensitivity level and the security management and control processing measure is shown below:
Data to be processed whose data sensitivity level is the first level is public data with no exploitable value and of ordinary importance; it is configured to be open to the outside and stored on an externally usable medium.
Data to be processed whose data sensitivity level is the second level is restricted data with low exploitable value; it is relatively sensitive and should be limited to internal use within the enterprise, so it is configured to be open only internally and stored in an internal system of the enterprise business system.
Data to be processed whose data sensitivity level is the third level is commercially confidential data with medium, indirectly exploitable value; it is relatively important and is used only by relevant personnel within the enterprise, so it is configured to be open only to relevant internal personnel, stored encrypted in an internal system, transmitted encrypted, and subject to restricted output.
Data to be processed whose data sensitivity level is the fourth level is core confidential data with high, directly exploitable value; it is extremely critical and is used only by specific personnel in key departments of the enterprise, so it is configured to be open only to specific internal personnel, stored encrypted in an internal system, transmitted encrypted, and restricted to use in specific business scenarios.
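The correspondence between data sensitivity levels and security control measures lends itself to a lookup table. The sketch below encodes the four levels described above; the class and field names are hypothetical, the category-to-level mapping is invented for illustration (the embodiment does not assign specific categories to levels), and the absence of encryption at the first and second levels is an assumption based on encryption not being mentioned for them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ControlPolicy:
    open_to: str               # who the data is open to
    storage: str               # where the data is stored
    encrypt_storage: bool      # whether storage is encrypted
    encrypt_transmission: bool # whether transmission is encrypted
    usage_limit: str           # additional usage restriction

# Sensitivity level -> control measures, following the four-level example.
POLICIES = {
    1: ControlPolicy("external", "externally usable medium", False, False, "none"),
    2: ControlPolicy("internal", "internal system", False, False, "internal use only"),
    3: ControlPolicy("relevant internal personnel", "internal system", True, True,
                     "restricted output"),
    4: ControlPolicy("specific internal personnel", "internal system", True, True,
                     "specific business scenarios only"),
}

# Hypothetical mapping from lowest-level data category to sensitivity level.
CATEGORY_LEVEL = {
    "gender": 2,
    "mobile phone number": 3,
    "account password": 4,
}

def controls_for(category):
    """Look up the sensitivity level of a data category and the matching control policy."""
    level = CATEGORY_LEVEL[category]
    return level, POLICIES[level]

level, policy = controls_for("mobile phone number")
```

Keeping the two mappings separate means a category can be regraded by changing one entry in `CATEGORY_LEVEL` without touching the control measures themselves.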
Furthermore, an embodiment of the present invention provides a data processing apparatus. As shown in FIG. 4, the apparatus includes:
the information acquisition module 41 is configured to acquire Chinese remark information corresponding to a field name of the data to be processed;
a word segmentation module 42, configured to perform word segmentation processing and standardization processing on the Chinese remark information to obtain a feature vector of the Chinese remark information;
a category identification module 43, configured to input the feature vector into a data category identification model, and determine a data category of the to-be-processed data; the data category identification model is generated by pre-training based on a Chinese text classification deep learning algorithm; the data category is a lowest level subcategory under at least two levels of pre-constructed category catalogues;
and the sensitivity level determining module 44 is configured to determine a data sensitivity level of the data to be processed according to the data category.
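The four modules form a straight pipeline, which can be sketched as a class that chains four injected callables. The class name, parameter names, and the stub implementations used in the usage example are all hypothetical; in particular, the stubs stand in for the real segmentation and model inference and the level assigned to the example category is invented.

```python
class DataProcessingApparatus:
    """Chains the four modules: acquire remark -> segment -> classify -> sensitivity level."""

    def __init__(self, acquire, segment, classify, level_of):
        self.acquire = acquire    # information acquisition module 41
        self.segment = segment    # word segmentation module 42
        self.classify = classify  # category identification module 43
        self.level_of = level_of  # sensitivity level determining module 44

    def process(self, field_name):
        remark = self.acquire(field_name)
        features = self.segment(remark)
        category = self.classify(features)
        return category, self.level_of(category)

# Usage with stub modules (illustrative only).
REMARKS = {"user_mobile": "用户手机号码"}
LEVELS = {"mobile phone number": 3}

apparatus = DataProcessingApparatus(
    acquire=REMARKS.__getitem__,
    segment=lambda remark: [remark[:2], remark[2:]],
    classify=lambda words: "mobile phone number" if "手机号码" in words else "unknown",
    level_of=LEVELS.__getitem__,
)
result = apparatus.process("user_mobile")
```

Injecting the modules as callables keeps each one replaceable, for example swapping the stub classifier for a trained data category identification model.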
In some embodiments, the data class identification model is generated by pre-training by:
acquiring sample data and data category labels corresponding to the lowest level subcategory under the at least two levels of category directories;
performing word segmentation processing and standardization processing on the Chinese remark information corresponding to the field name of the sample data to obtain a feature vector for training;
and training an initial model based on a Chinese text classification deep learning algorithm according to the training feature vectors and the corresponding data category labels to obtain the data category identification model.
In some embodiments, the data class identification model is a FastText model generated based on FastText algorithm pre-training.
In some embodiments, the at least two levels of category directories are pre-constructed by:
acquiring a data set, and performing word segmentation and word frequency analysis on field names in the data set to obtain a word frequency analysis result;
determining a primary category according to the word frequency analysis result;
and performing clustering analysis processing on the data under the first-level category based on a k-means clustering algorithm to generate at least two-level category directories.
In some embodiments, the clustering process on the data under the first-level category based on the k-means clustering algorithm to generate at least two-level category directories includes:
performing clustering analysis processing on the data under the primary category based on a k-means clustering algorithm, and determining a secondary sub-category corresponding to the primary category;
and performing clustering analysis processing on the categories of which the data types reach the preset number in the second-level subcategories based on a k-means clustering algorithm, determining the third-level subcategories corresponding to the first-level categories, and generating a third-level category catalog.
In some embodiments, the data processing apparatus further comprises a security management module configured to:
after the data sensitivity level of the data to be processed is determined according to the data category, security management and control processing matched with the data sensitivity level is carried out on the data to be processed according to the data sensitivity level.
In some embodiments, the data sensitivity levels include a first level, a second level, a third level, and a fourth level from low to high sensitivity; the security management module is specifically configured to:
if the data sensitivity level is a first level, configuring the data to be processed to be open to the outside and storing the data on an externally used medium;
if the data sensitivity level is a second level, configuring the data to be processed to be only open to the interior and storing the data in an internal system;
if the data sensitivity level is a third level, configuring the data to be processed to be only open to internal related personnel, encrypting and storing the data in an internal system, and encrypting, transmitting and limiting output;
and if the data sensitivity level is a fourth level, configuring the data to be processed to be open only for internal specific personnel, encrypting and storing the data in an internal system, and encrypting, transmitting and limiting the data to be processed to be used in a specific service scene.
The contents of the method embodiments of the present invention are all applicable to the apparatus embodiments, the functions specifically implemented by the apparatus embodiments are the same as those of the method embodiments, and the beneficial effects achieved by the apparatus embodiments are also the same as those achieved by the method described above, and for details, refer to the description of the method embodiments, and are not described herein again.
Furthermore, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the data processing method described in any one of the above embodiments. The computer-readable storage medium includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROMs (Read-Only Memories), RAMs (Random Access Memories), EPROMs (Erasable Programmable Read-Only Memories), EEPROMs (Electrically Erasable Programmable Read-Only Memories), flash memories, magnetic cards, or optical cards. That is, a storage device includes any medium that stores or transmits information in a form readable by a device (e.g., a computer, a cellular phone), and may be a read-only memory, a magnetic or optical disk, or the like.
The contents of the method embodiment of the present invention are all applicable to the embodiment of the storage medium, the functions specifically implemented by the embodiment of the storage medium are the same as those of the method embodiment described above, and the beneficial effects achieved by the embodiment of the storage medium are also the same as those achieved by the method described above.
In addition, an embodiment of the present invention further provides a computer device, where the computer device described in this embodiment may be a server, a personal computer, a network device, and other devices. The computer device includes: one or more processors, memory, one or more computer programs stored in the memory and configured to be executed by the one or more processors, the one or more computer programs configured to perform the data processing method of any of the above embodiments.
The contents of the method embodiment of the present invention are all applicable to the computer device embodiment, the functions specifically implemented by the computer device embodiment are the same as those of the method embodiment, and the beneficial effects achieved by the computer device embodiment are also the same as those achieved by the method described above.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A data processing method, comprising the steps of:
acquiring Chinese remark information corresponding to the field name of the data to be processed;
performing word segmentation processing and standardization processing on the Chinese remark information to obtain a feature vector of the Chinese remark information;
inputting the feature vector into a data category identification model, and determining the data category of the data to be processed; the data category identification model is generated by pre-training based on a Chinese text classification deep learning algorithm; the data category is a lowest level subcategory under at least two levels of pre-constructed category catalogues;
and determining the data sensitivity level of the data to be processed according to the data category.
2. The data processing method of claim 1, wherein the data class recognition model is generated by pre-training by:
acquiring sample data and data category labels corresponding to the lowest level subcategory under the at least two levels of category directories;
performing word segmentation processing and standardization processing on the Chinese remark information corresponding to the field name of the sample data to obtain a feature vector for training;
and training an initial model based on a Chinese text classification deep learning algorithm according to the training feature vectors and the corresponding data category labels to obtain the data category identification model.
3. The data processing method of claim 1, wherein the data class identification model is a FastText model generated based on FastText algorithm pre-training.
4. The data processing method according to claim 1, characterized in that said at least two levels of category directories are pre-constructed by the steps of:
acquiring a data set, and performing word segmentation and word frequency analysis on field names in the data set to obtain a word frequency analysis result;
determining a primary category according to the word frequency analysis result;
and performing clustering analysis processing on the data under the first-level category based on a k-means clustering algorithm to generate at least two-level category directories.
5. The data processing method of claim 4, wherein the clustering process is performed on the data under the first-level category based on the k-means clustering algorithm to generate at least two-level category directories, comprising:
performing clustering analysis processing on the data under the primary category based on a k-means clustering algorithm, and determining a secondary sub-category corresponding to the primary category;
and performing clustering analysis processing on the categories of which the data types reach the preset number in the second-level subcategories based on a k-means clustering algorithm, determining the third-level subcategories corresponding to the first-level categories, and generating a third-level category catalog.
6. The data processing method according to claim 1, wherein after determining the data sensitivity level of the data to be processed according to the data category, the method further comprises:
and according to the data sensitivity level, performing security control processing matched with the data sensitivity level on the data to be processed.
7. The data processing method of claim 6, wherein the data sensitivity levels comprise a first level, a second level, a third level and a fourth level with sensitivity levels from low to high;
the performing, according to the data sensitivity level, security management and control processing matched with the data sensitivity level on the data to be processed includes:
if the data sensitivity level is a first level, configuring the data to be processed to be open to the outside and storing the data on an externally used medium;
if the data sensitivity level is a second level, configuring the data to be processed to be only open to the interior and storing the data in an internal system;
if the data sensitivity level is a third level, configuring the data to be processed to be only open to internal related personnel, encrypting and storing the data in an internal system, and encrypting, transmitting and limiting output;
and if the data sensitivity level is a fourth level, configuring the data to be processed to be open only for internal specific personnel, encrypting and storing the data in an internal system, and encrypting, transmitting and limiting the data to be processed to be used in a specific service scene.
8. A data processing apparatus, comprising:
the information acquisition module is used for acquiring Chinese remark information corresponding to the field name of the data to be processed;
the word segmentation module is used for carrying out word segmentation processing and standardization processing on the Chinese remark information to obtain a feature vector of the Chinese remark information;
the class identification module is used for inputting the feature vector into a data class identification model and determining the data class of the data to be processed; the data category identification model is generated by pre-training based on a Chinese text classification deep learning algorithm; the data category is a lowest level subcategory under at least two levels of pre-constructed category catalogues;
and the sensitivity level determining module is used for determining the data sensitivity level of the data to be processed according to the data category.
9. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the data processing method of any one of claims 1 to 7.
10. A computer device, comprising:
one or more processors;
a memory;
one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs configured to: performing the data processing method according to any one of claims 1 to 7.
CN202210212791.9A 2022-02-28 2022-02-28 Data processing method, data processing device, storage medium and computer equipment Pending CN114595689A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210212791.9A CN114595689A (en) 2022-02-28 2022-02-28 Data processing method, data processing device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210212791.9A CN114595689A (en) 2022-02-28 2022-02-28 Data processing method, data processing device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN114595689A true CN114595689A (en) 2022-06-07

Family

ID=81807845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210212791.9A Pending CN114595689A (en) 2022-02-28 2022-02-28 Data processing method, data processing device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN114595689A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168345A (en) * 2022-06-27 2022-10-11 天翼爱音乐文化科技有限公司 Database classification method, system, device and storage medium
CN115168345B (en) * 2022-06-27 2023-04-18 天翼爱音乐文化科技有限公司 Database classification method, system, device and storage medium
CN115859372A (en) * 2023-03-04 2023-03-28 成都安哲斯生物医药科技有限公司 Medical data desensitization method and system
CN115859372B (en) * 2023-03-04 2023-04-25 成都安哲斯生物医药科技有限公司 Medical data desensitization method and system
CN117391076A (en) * 2023-12-11 2024-01-12 东亚银行(中国)有限公司 Acquisition method and device of identification model of sensitive data, electronic equipment and medium
CN117391076B (en) * 2023-12-11 2024-02-27 东亚银行(中国)有限公司 Acquisition method and device of identification model of sensitive data, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination