CN116910650A - Data identification method, device, storage medium and computer equipment - Google Patents

Publication number
CN116910650A
Authority
CN
China
Prior art keywords
data
identified
identification
index
index item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310855260.6A
Other languages
Chinese (zh)
Inventor
衡相忠
汪争起
胡绍勇
王亭景
胡理兵
陆彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Application filed by Information and Data Security Solutions Co Ltd
Priority to CN202310855260.6A
Publication of CN116910650A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2455: Query execution
    • G06F16/24564: Applying rules; Deductive queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218: Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245: Protecting personal data, e.g. for financial or medical purposes

Abstract

The invention relates to the technical field of data processing, and discloses a data identification method, a device, a storage medium and computer equipment, wherein the method comprises the following steps: responding to a data identification request, acquiring a data screening rule in the data identification request, extracting data to be identified from a source data table based on the data screening rule, determining a plurality of index items to be identified of the data to be identified and corresponding identification algorithms, identifying field contents of the index items to be identified based on the identification algorithms to obtain identification results, substituting the identification results into a data matching calculation formula to calculate, obtaining a calculation result, and determining the data category of the data to be identified according to the calculation result. The method improves the data identification efficiency, combines a plurality of dimensions to identify the data, improves the identification precision of the data, accurately obtains the sensitivity identification result of the data, and realizes accurate classification and management and control of the content of the sensitive data.

Description

Data identification method, device, storage medium and computer equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data identification method, a data identification device, a storage medium, and a computer device.
Background
As industries progressively standardize the classification and grading of data, and as data security protection laws come into force, enterprises are paying increasing attention to the data stored in their databases, particularly data involving sensitive content. To classify and grade stored data properly, the sensitivity of the data content must first be identified accurately so that the data can be managed and controlled appropriately.
In the prior art, sensitivity identification of stored data goes no further than a preliminary judgment based on the data content, or a simple judgment using a two-dimensional compound rule on the data content. Both approaches are inefficient when applied to large volumes of data, and the resulting sensitivity identification results are error-prone, so sensitive data cannot be identified accurately and stored data cannot be accurately identified and classified.
Disclosure of Invention
In view of this, the present application provides a data identification method, device, storage medium and computer equipment, whose main purpose is to solve the technical problems of low identification efficiency and low accuracy of identification results in prior-art methods for identifying sensitive data.
According to a first aspect of the present invention, there is provided a data identification method comprising:
responding to a data identification request, acquiring a data screening rule carried in the data identification request, and extracting data to be identified from a source data table based on the data screening rule;
determining a plurality of index items to be identified of the data to be identified, and acquiring an identification algorithm corresponding to each index item to be identified;
identifying field content of each index item to be identified of the data to be identified based on the identification algorithm to obtain an identification result corresponding to each index item to be identified of the data to be identified;
inputting a plurality of identification results of the data to be identified into a preset rule matching calculation expression for calculation, obtaining a calculation result, and determining a sensitivity identification result of the data to be identified according to the calculation result.
According to a second aspect of the present invention, there is provided a data identification apparatus comprising:
the data extraction module is used for responding to a data identification request, acquiring a data screening rule carried in the data identification request and extracting data to be identified from a source data table based on the data screening rule;
the algorithm confirmation module is used for determining a plurality of index items to be identified of the data to be identified and acquiring an identification algorithm corresponding to each index item to be identified;
the data identification module is used for identifying the field content of each index item to be identified of the data to be identified based on the identification algorithm to obtain an identification result corresponding to each index item to be identified of the data to be identified;
the result output module is used for inputting a plurality of identification results of the data to be identified into a preset rule matching calculation expression for calculation, obtaining a calculation result, and determining a sensitivity identification result of the data to be identified according to the calculation result.
According to a third aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described data identification method.
According to a fourth aspect of the present application there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the data identification method as described above when executing the program.
The application provides a data identification method, a device, a storage medium and computer equipment, wherein the method comprises the steps of firstly responding to a data identification request, obtaining a data screening rule carried in the data identification request, extracting data to be identified from a source data table based on the data screening rule, then determining a plurality of index items to be identified of the data to be identified, obtaining an identification algorithm corresponding to each index item to be identified, then identifying field content of each index item to be identified of the data to be identified based on the identification algorithm to obtain an identification result corresponding to each index item to be identified of the data to be identified, finally inputting a plurality of identification results of the data to be identified into a preset rule matching calculation expression to calculate, obtaining a calculation result, and determining a sensitivity identification result of the data to be identified according to the calculation result.
Before identification, the data in the source data table is pre-screened based on the data screening rule in the data identification request, so that only the data that actually needs to be identified is retained; when the data volume is large, this preliminary screening is fast and effectively improves data processing efficiency. Next, a plurality of index items to be identified are determined, and the field content of each is identified individually with the identification algorithm for that index item. Identifying the data from multiple index dimensions gives a more comprehensive view of the data and an accurate identification result in each dimension. Finally, the identification results of the index items are evaluated through the rule matching calculation expression, and the sensitivity identification result of the data to be identified is determined from the calculation result across the index dimensions, which improves the identification rate, yields the data classification result, and identifies sensitive data accurately. The method thereby improves the efficiency of data identification, in particular for large batches of complex data, and improves identification accuracy by combining multiple dimensions, so that the sensitivity identification result is obtained accurately and sensitive data content can be accurately classified, managed and controlled.
The foregoing is only an overview of the technical solution of the present application. Specific embodiments are set out below so that the technical means of the application can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the application become more readily apparent.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic flow chart of a data identification method in one embodiment of the present application;
FIG. 2 is a flow chart of a method for identifying data in one embodiment of the application;
FIG. 3 is a schematic flow chart of a data identification method in one embodiment provided by the present application;
FIG. 4 is a schematic diagram showing a structure of a data recognition device according to an embodiment of the present application;
FIG. 5 is a schematic diagram showing a structure of a data recognition device according to an embodiment of the present application;
FIG. 6 is a schematic diagram showing the structure of a computer device according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the application to those skilled in the art.
The embodiment of the application provides a data identification method, as shown in fig. 1, which comprises the following steps:
101. In response to a data identification request, acquire the data screening rule carried in the data identification request, and extract data to be identified from the source data table based on the data screening rule.
Specifically, a data screening rule selects the data to be identified or analyzed according to specific conditions, which may be a numerical range, a date range, or a specific value in a given column. Because the rule determines the target data before identification begins, it effectively improves the efficiency of data identification.
In the data identification method provided by the application, the method first responds to a data identification request sent by a user, and the request carries a data screening rule. The rule may be edited and set by the user, or selected from pre-stored rules; in either form, it helps the user quickly screen out the data that needs to be identified from a large amount of data, which improves data identification efficiency. The scope of the rule can be set by the user: the stricter the rule, the smaller the range of screened data, which facilitates precise analysis. In the application, the source data table contains a large amount of data, but not all of it needs to be identified. For example, for a source data table holding a large amount of personnel information, if only persons over 30 years old need to be identified, a data screening rule can be set requiring the field content of the age column to be greater than 30. Records whose age index item is 30 or below are then excluded, avoiding meaningless identification work, and the pieces of data to be identified can be quickly screened out of the source data table for subsequent processing.
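The age example above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the `extract_to_identify` helper and the `older_than_30` rule are hypothetical names introduced here, not part of the application.

```python
from typing import Any, Callable

Record = dict[str, Any]


def extract_to_identify(source_table: list[Record],
                        rule: Callable[[Record], bool]) -> list[Record]:
    """Return the records of the source table that satisfy the screening rule."""
    return [record for record in source_table if rule(record)]


def older_than_30(record: Record) -> bool:
    """Hypothetical screening rule: the age field must be greater than 30."""
    return record["age"] > 30


# Hypothetical source data table holding personnel information.
source_table = [
    {"name": "A", "age": 28},
    {"name": "B", "age": 35},
    {"name": "C", "age": 30},
    {"name": "D", "age": 41},
]

to_identify = extract_to_identify(source_table, older_than_30)
print([record["name"] for record in to_identify])  # ['B', 'D']
```

Only the screened records are passed on to the later identification steps, which is where the efficiency gain described above comes from.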
102. Determine a plurality of index items to be identified of the data to be identified, and acquire the identification algorithm corresponding to each index item to be identified.
Specifically, the data to be identified includes a plurality of data index items. Each can serve as an index item to be identified, that is, a data characteristic index used in identification and analysis and a dimension along which the data is identified or judged. In practice, data identification requires in-depth mining and analysis of the data content, and selecting representative index items makes the data easier to understand and analyze. Each index item to be identified needs a specific identification method to ensure the accuracy and credibility of the result. Here, an identification algorithm is a mathematical method or algorithm for analyzing and identifying a data index item; each data index item has an identification algorithm suited to its field content, and choosing a suitable algorithm is essential for correct data identification and analysis.
In the embodiment of the application, after the data to be identified is extracted, the index items to be identified are determined from its data index items. They may be all of the data index items, or a subset selected by the user when the user wants to examine the data from certain specific dimensions. Once the index items to be identified are determined, the identification algorithm corresponding to each is acquired in preparation for the subsequent identification work. In the application, each index item to be identified represents one dimension for identifying the data, so obtaining the index items to be identified amounts to determining the dimensions of identification. Database name, resource name, table comment, field name, field comment, data type, data length, data proportion, deduplication proportion, any field, and any comment can each serve as an identification dimension. The identification algorithm corresponding to each index item then processes the data along that dimension, yielding an accurate identification result for the data in that dimension.
103. Identify the field content of each index item to be identified of the data to be identified based on the identification algorithm, to obtain an identification result corresponding to each index item to be identified.
In the embodiment of the application, after the index items to be identified and their identification algorithms are determined, the field content of each index item is extracted and identified with the corresponding algorithm to obtain an identification result. For example, when the data to be identified is a set of sales data, a specific identification algorithm such as a decision tree algorithm can be used to identify whether a terminal product in the index item to be identified belongs to a specific product category, so as to determine the sales of that product. Because the field content of each index item is identified with the algorithm chosen for that index item, the identification result is accurate and supports a more intuitive and precise understanding and analysis of the data to be identified.
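Steps 102 and 103 can be pictured as a lookup from index item to identification algorithm, followed by one call per index item. The sketch below is a minimal illustration, assuming each identification result is a boolean; the three algorithms and the index item names are hypothetical and chosen only for the example.

```python
import re
from typing import Callable


def name_suggests_phone(content: str) -> bool:
    """Hypothetical algorithm for the field-name index item."""
    return re.search(r"phone|mobile|tel", content, re.IGNORECASE) is not None


def comment_mentions_user(content: str) -> bool:
    """Hypothetical algorithm for the field-comment index item."""
    return "user" in content.lower()


def length_is_eleven(content: str) -> bool:
    """Hypothetical algorithm for the data-length index item."""
    return content.strip() == "11"


# One identification algorithm per index item to be identified.
ALGORITHMS: dict[str, Callable[[str], bool]] = {
    "field_name": name_suggests_phone,
    "field_comment": comment_mentions_user,
    "data_length": length_is_eleven,
}


def identify(data: dict[str, str],
             algorithms: dict[str, Callable[[str], bool]]) -> dict[str, bool]:
    """Apply each index item's algorithm to that index item's field content."""
    return {item: algorithm(data[item])
            for item, algorithm in algorithms.items() if item in data}


data_to_identify = {
    "field_name": "mobile_no",
    "field_comment": "contact number",
    "data_length": "11",
}
results = identify(data_to_identify, ALGORITHMS)
print(results)  # {'field_name': True, 'field_comment': False, 'data_length': True}
```

Keeping the mapping in a dictionary means a user-selected subset of index items can be handled by passing a smaller dictionary, without changing `identify`.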
104. Input the plurality of identification results of the data to be identified into a preset rule matching calculation expression for calculation to obtain a calculation result, and determine the sensitivity identification result of the data to be identified according to the calculation result.
The rule matching calculation expression computes a rule matching result from the identification results of the different index dimensions and a set of operators: the identification result of each index dimension is entered into an expression built from operators, and the expression is evaluated according to predefined operation rules to obtain the final rule matching result. In this process, the identification result of each index dimension is mapped to a numerical or logical value and filled into its corresponding position in the calculation expression. The expression can be used to calculate the relationship between two index dimensions, or to judge whether a certain condition is satisfied.
In the embodiment of the application, substituting the identification results corresponding to the plurality of index items into the rule matching calculation expression means combining and calculating those identification results according to a predefined calculation formula to obtain a calculation result. The calculation result may be a specific numerical value that evaluates the data to be identified, such as a score for the data in some respect; a classification label, such as excellent, good, or poor; or an index classification, such as normal or risk. The data category of the data to be identified can be determined accurately from these different types of calculation results, giving an accurate classification result and a clear understanding of the value and characteristics of the data.
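One way to picture a rule matching calculation expression is as a logical formula over the per-index identification results. The sketch below is an assumption, not the application's own expression syntax: it uses Python's `and`/`or`/`not` as the operators, treats every identification result as a logical value, and introduces the hypothetical helper `evaluate_rule`, which walks the parsed expression instead of calling `eval`.

```python
import ast


def evaluate_rule(expression: str, results: dict[str, bool]) -> bool:
    """Evaluate a logical expression whose names are index items."""

    def walk(node: ast.AST) -> bool:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BoolOp):
            values = [walk(value) for value in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not walk(node.operand)
        if isinstance(node, ast.Name):
            # Fill the identification result into its position in the expression.
            return bool(results[node.id])
        raise ValueError(f"unsupported element in rule expression: {ast.dump(node)}")

    return walk(ast.parse(expression, mode="eval"))


# Hypothetical identification results for three index dimensions.
results = {"field_name": True, "field_comment": False, "data_length": True}

# Hypothetical preset rule matching calculation expression.
expression = "field_name and (field_comment or data_length)"

matched = evaluate_rule(expression, results)
sensitivity = "sensitive" if matched else "non-sensitive"
print(sensitivity)  # sensitive
```

A numeric variant would map each result to a weight and compare a sum with a threshold; the logical form is used here only because it is the shortest to show.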
The overall flow is shown in FIG. 3. First, in response to the data identification request, the data screening rule carried in the request is acquired and the data to be identified is extracted from the source data table based on the rule. Next, a plurality of index items to be identified of the data to be identified are determined, and the identification algorithm corresponding to each index item is acquired. The field content of each index item is then identified based on its identification algorithm to obtain the corresponding identification result. Finally, the plurality of identification results are input into the preset rule matching calculation expression to obtain a calculation result, and the sensitivity identification result of the data to be identified is determined according to the calculation result.
The embodiment of the application also provides a data identification method, as shown in fig. 2, comprising the following steps:
201. In response to a data identification request, acquire the data screening rule carried in the data identification request.
In the embodiment of the application, a data identification request is a request that a user sends to the system through some channel, such as a network interface, an application program, or a human-machine interaction interface, asking that certain specific data be identified and the relevant identification results returned. In practical application scenarios, such a request is usually triggered by a particular business requirement, problem, or scenario. The request may take different forms and carry different content in different application scenarios. In the application, the request carries the data screening rule, which is used to preliminarily screen the data of the source data table, so the method provided by the application is suitable for scenarios that process large amounts of data.
202. Extract the data to be identified from the source data table based on the data screening rule.
Specifically, the source data table includes a plurality of data records, each including a plurality of data index items. First, the data screening rule is acquired and the data screening conditions in it are determined, each comprising a judgment index item and a judgment condition. Next, the judgment index item is matched one by one against the data index items, and the data index item identical to the judgment index item is extracted. Finally, the field content of that data index item is judged against the judgment condition corresponding to the judgment index item; if the field content satisfies the judgment condition, the data record containing the data index item is determined to satisfy the data screening condition and is marked as data to be identified.
In the embodiment of the application, the data screening rule is acquired and the data screening conditions in it are determined. Each data screening condition comprises a judgment index item and a judgment condition. In a source data table, the judgment index item may be any of a database name, resource name, table comment, field name, field comment, data type, data length, data proportion, deduplication proportion, any field, or any comment. Several judgment modes can relate the judgment index item to the judgment condition: contains, does not contain, equals, does not equal, matches a regular expression, starts with a specified character, ends with a specified character, lies between two values, or matches a dictionary. Data screening conditions can thus be built on, for example, the database name, the column name, the data content (for instance with an address identification algorithm), or the table comment (for instance with the value "user"). According to the judgment index item, the data index items in the existing data records are matched and the data index item exactly identical to the judgment index item is extracted; its field content is then judged against the judgment condition, and if the field content satisfies the judgment condition, the data record is data to be identified, which completes the preliminary screening of the data.
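The judgment modes listed above map naturally onto a small table of predicates. The sketch below is illustrative only: the mode names, the `(index item, mode, argument)` tuple layout and the sample record are assumptions introduced for the example.

```python
import re
from typing import Any, Callable

# Hypothetical judgment modes; each takes the field content and the
# argument of the judgment condition.
JUDGMENT_MODES: dict[str, Callable[[Any, Any], bool]] = {
    "contains": lambda content, arg: arg in content,
    "not_contains": lambda content, arg: arg not in content,
    "equals": lambda content, arg: content == arg,
    "not_equals": lambda content, arg: content != arg,
    "regex": lambda content, arg: re.search(arg, content) is not None,
    "starts_with": lambda content, arg: content.startswith(arg),
    "ends_with": lambda content, arg: content.endswith(arg),
    "between": lambda content, arg: arg[0] <= content <= arg[1],
    "in_dictionary": lambda content, arg: content in arg,
}


def record_matches(record: dict[str, Any], condition: tuple[str, str, Any]) -> bool:
    """Judge one data record against one data screening condition."""
    index_item, mode, argument = condition
    if index_item not in record:
        # No data index item is identical to the judgment index item.
        return False
    return JUDGMENT_MODES[mode](record[index_item], argument)


record = {"table_comment": "user account table",
          "field_name": "id_card_no",
          "data_length": 18}

print(record_matches(record, ("table_comment", "contains", "user")))   # True
print(record_matches(record, ("field_name", "ends_with", "_no")))      # True
print(record_matches(record, ("data_length", "between", (15, 18))))    # True
print(record_matches(record, ("field_name", "regex", r"^phone")))      # False
```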
Further, when there are multiple data screening conditions, the data records in the source data table are first screened one by one against each of the conditions. A data record that satisfies all of the data screening conditions simultaneously is extracted and subjected to redundancy processing. Finally, the data records that remain after redundancy processing are integrated, and the integrated records are marked as data to be identified.
In the embodiment of the application, the data screening rule normally contains multiple data screening conditions, which narrow the screening range so that the data to be identified can be found accurately and quickly. The data records are therefore screened against each data screening condition in turn, and the records that satisfy all of the conditions are the data to be identified in the next step. All of the records obtained are integrated and assembled, redundancy processing removes repeated and invalid data, and the resulting set of records as a whole becomes the data to be identified.
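A compact sketch of screening with several conditions followed by redundancy processing is given below. It assumes that redundancy processing means dropping exact duplicate records, which is only one reading of the text; the `screen` helper, the conditions and the sample table are hypothetical.

```python
from typing import Any, Callable

Record = dict[str, Any]
Condition = Callable[[Record], bool]


def screen(source_table: list[Record], conditions: list[Condition]) -> list[Record]:
    """Keep records that satisfy every condition, then drop exact duplicates."""
    kept = [record for record in source_table
            if all(condition(record) for condition in conditions)]
    seen: set[tuple] = set()
    unique: list[Record] = []
    for record in kept:
        key = tuple(sorted(record.items()))  # order-independent fingerprint
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical source data table; the second row repeats the first.
source_table = [
    {"db": "crm", "table_comment": "user profile", "field_name": "phone"},
    {"db": "crm", "table_comment": "user profile", "field_name": "phone"},
    {"db": "crm", "table_comment": "order log", "field_name": "amount"},
    {"db": "erp", "table_comment": "user address", "field_name": "addr"},
]

# Hypothetical data screening conditions; all must hold at once.
conditions: list[Condition] = [
    lambda record: record["db"] == "crm",
    lambda record: "user" in record["table_comment"],
]

to_identify = screen(source_table, conditions)
print(to_identify)
```

Here only the first record survives: the second is a duplicate, the third fails the table comment condition, and the fourth fails the database condition.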
203. Determine a plurality of index items to be identified of the data to be identified, where the index items to be identified include qualitative index items and quantitative index items.
Specifically, the source data table information of the data to be identified is first acquired, and metadata index items are generated from it. The data index items of the data to be identified are then acquired, and the metadata index items and data index items are integrated to obtain the qualitative index items of the data to be identified. Next, the field content of each data index item in the data to be identified is acquired, and a statistical calculation is performed on the field content with a preset statistical algorithm to obtain a statistical result. Finally, the quantitative index items of the data to be identified are generated from the statistical result.
In the embodiment of the application, the index items to be identified are divided into qualitative index items and quantitative index items, and the qualitative index items are further divided into metadata index items and data index items. Specifically, after the data to be identified is obtained, the source data table information is acquired from the source of the data, namely the source data table to which the data belongs. This information includes the name of the database in which the source data table is located, the table name of the source data table, and the table comment of the source data table; these are the metadata index items of the data to be identified. The data index items of the data to be identified correspond to information such as the column comments and column names in the data to be identified. The metadata index items and the data index items together form the qualitative index items. Processing the field content of the qualitative index items yields the specific quantitative index items of the data to be identified; for example, a statistical algorithm applied to the field content gives indexes such as the data length, data proportion, and deduplication proportion, which serve as the quantitative index items of the data to be identified. Finally, the qualitative index items and the quantitative index items are integrated to obtain the index items to be identified of the data to be identified, and the subsequent identification process is carried out on the basis of these index items.
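The split between qualitative and quantitative index items can be sketched as follows. The text does not define how data length, data proportion and deduplication proportion are computed, so the definitions used here are assumptions made only for illustration: data length is the longest non-empty value, data proportion is the share of non-empty values, and deduplication proportion is the share of distinct values among the non-empty ones. All names are hypothetical.

```python
from typing import Optional


def quantitative_index_items(values: list[Optional[str]]) -> dict[str, float]:
    """Compute assumed statistics over the field content of one column."""
    non_empty = [value for value in values if value not in (None, "")]
    total = len(values)
    return {
        # Assumed definition: length of the longest non-empty value.
        "data_length": max((len(value) for value in non_empty), default=0),
        # Assumed definition: share of values that are non-empty.
        "data_proportion": len(non_empty) / total if total else 0.0,
        # Assumed definition: distinct non-empty values over non-empty values.
        "dedup_proportion": len(set(non_empty)) / len(non_empty) if non_empty else 0.0,
    }


# Qualitative index items: metadata index items plus data index items.
metadata_items = {"database_name": "crm", "table_name": "t_user",
                  "table_comment": "user profile"}
data_items = {"column_name": "mobile_no", "column_comment": "contact number"}
qualitative_items = {**metadata_items, **data_items}

# Hypothetical field content of the column.
column_values = ["13800000000", "13900000000", "13800000000", None, ""]
quantitative_items = quantitative_index_items(column_values)

# Index items to be identified: qualitative and quantitative together.
index_items = {**qualitative_items, **quantitative_items}
print(quantitative_items)
```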
204. And acquiring an identification algorithm corresponding to each index item to be identified of the data to be identified, and identifying the field content of each index item to be identified of the data to be identified based on the identification algorithm to obtain an identification result corresponding to each index item to be identified of the data to be identified.
Specifically, firstly, a recognition algorithm corresponding to an index item to be recognized is obtained, wherein the recognition algorithm comprises at least one of a character string matching algorithm, a feature extraction algorithm, a statistical learning algorithm and a machine learning algorithm, then the field content of the index item to be recognized is extracted, the field to be recognized in the field content is determined according to the recognition algorithm corresponding to the index item to be recognized, and finally the field to be recognized is recognized based on the recognition algorithm, so that the recognition result of the index item to be recognized is obtained.
In the embodiment of the application, each index item to be identified of the data to be identified corresponds to an identification algorithm. Commonly used identification algorithms include: a character string matching algorithm, which identifies and filters sensitive content in the data by matching specified character strings, keywords, regular expressions and the like; a feature extraction algorithm, which identifies sensitive content by extracting, analyzing, modeling and mining the characteristic information in the data using different methods; a statistical learning algorithm, which identifies and classifies sensitive content by performing statistical analysis and modeling on the data to establish a statistical model; and a machine learning algorithm, which predicts and classifies sensitive content by training on the data with machine learning algorithms and models. After the identification algorithm of each index item to be identified is obtained, the field to be identified is extracted from the field content of the index item according to the data type that the identification algorithm handles, the field is identified, and the identification result corresponding to each index item to be identified is finally obtained.
Specifically, as shown in fig. 3, each index item to be identified represents one dimension of data identification, and the data to be identified is processed and identified based on each index item to be identified, so that the data to be identified can be accurately understood and identified from that dimension.
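Of the identification algorithms listed for step 204, character string matching is the simplest to illustrate. The following Python sketch assumes a hypothetical keyword list for column names and a hypothetical regular expression for 11-digit mobile numbers; neither is taken from the application, and a real deployment would configure its own patterns.

```python
import re

# Hypothetical pattern: 11 digits starting with 1 and a second digit 3-9.
PHONE_PATTERN = re.compile(r"1[3-9]\d{9}")

# Hypothetical keywords that suggest a column holds phone numbers.
COLUMN_KEYWORDS = ("phone", "mobile", "tel")


def match_column_name(column_name):
    """Keyword matching on a qualitative index item (the column name)."""
    name = column_name.lower()
    return any(keyword in name for keyword in COLUMN_KEYWORDS)


def match_ratio(values, pattern=PHONE_PATTERN):
    """Share of non-empty field values that fully match the pattern."""
    checked = [str(v) for v in values if v not in (None, "")]
    if not checked:
        return 0.0
    hits = sum(1 for v in checked if pattern.fullmatch(v))
    return hits / len(checked)


name_hit = match_column_name("user_mobile")
ratio = match_ratio(["13800000000", "abc", "13912345678", ""])
```

Each function yields one identification result for one index item: `name_hit` for the column name and `ratio` for the field content. These are the kind of per-dimension results that a later step can combine.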
205. Editing the rule matching calculation expression, and substituting a plurality of recognition results into the rule matching calculation expression for calculation.
Specifically, in response to an expression editing instruction, an editing rule of a rule matching calculation expression is acquired, firstly, operators associated with a plurality of index items to be identified are acquired based on the editing rule, wherein the number of the operators is at least one, the operators comprise at least one of arithmetic operators, relational operators and logical operators, and then the plurality of index items to be identified and the at least one operator are synthesized to obtain the rule matching calculation expression.
In the embodiment of the application, the data is identified from a plurality of dimensions, and the final identification result is determined by integrating the identification results of all dimensions. Therefore, after the identification result of each index item to be identified is obtained, a rule matching calculation expression needs to be synthesized to comprehensively calculate the results and obtain a data identification result that combines all dimensions. To edit the rule matching calculation expression, the operators associated with each of the determined index items to be identified are acquired; the operators include common arithmetic operators, relational operators and logical operators, so that various operation modes are supported. After the operators associated with all index items to be identified are determined, the index items to be identified and the operators are synthesized to obtain the rule matching calculation expression. For example, when the operators are logical operators, the rule matching calculation expression may be (1||2)&&3, that is, condition 1 or condition 2 is satisfied and condition 3 is satisfied; or it may be 1&&2&&3&&4, that is, condition 1, condition 2, condition 3 and condition 4 are satisfied simultaneously, where 1, 2, 3 and 4 in the expressions denote the identification results of the respective index items to be identified.
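A rule matching calculation expression built only from logical operators could be evaluated as in the following Python sketch. It is a minimal illustration, not the implementation of the application: the `||` and `&&` spellings, the precedence (`&&` binds tighter than `||`), and the `evaluate` and `tokenize` names are assumptions, and arithmetic and relational operators are left out for brevity. Each number in the expression refers to the identification result of one index item.

```python
import re

# Tokens: logical OR, logical AND, parentheses, and numeric result identifiers.
TOKEN = re.compile(r"\s*(\|\||&&|\(|\)|\d+)")


def tokenize(expr):
    """Split an expression such as "(1 || 2) && 3" into tokens."""
    expr = expr.strip()
    tokens, pos = [], 0
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expr, results):
    """Evaluate expr, looking up each number in the results mapping."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "||":
            take()
            rhs = parse_and()  # always parse the right side to consume its tokens
            value = value or rhs
        return value

    def parse_and():
        value = parse_primary()
        while peek() == "&&":
            take()
            rhs = parse_primary()
            value = value and rhs
        return value

    def parse_primary():
        token = take()
        if token == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if token is not None and token.isdigit():
            return bool(results[int(token)])
        raise ValueError(f"unexpected token {token!r}")

    value = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return value


outcome = evaluate("(1 || 2) && 3", {1: False, 2: True, 3: True})
```

A small recursive-descent parser is used here instead of Python's `eval` so that a user-edited expression cannot execute arbitrary code.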
206. And obtaining a calculation result, comparing the calculation result with a preset sensitivity threshold, and determining a sensitivity recognition result of the data to be recognized.
Specifically, a preset sensitivity threshold is obtained, and the calculation result is compared based on the sensitivity threshold; when the calculation result is larger than the sensitivity threshold, marking the calculation result as a first data category, and adding a data tag of sensitive content for the first data category; when the calculation result is less than or equal to the sensitivity threshold, marking the calculation result as a second data category, and adding a data tag of non-sensitive content to the second data category.
In the embodiment of the application, the calculation result obtained from the rule matching calculation expression is judged against the sensitivity threshold used to distinguish data categories, which completes the identification and classification of the data to be identified and determines whether it is sensitive content. For example, when the calculation result is a numerical value and the preset sensitivity threshold is also a numerical value, the two values are compared: when the calculation result is larger than the preset sensitivity threshold, the data to be identified is assigned to the first data category and determined to be sensitive content; when the calculation result is smaller than or equal to the preset sensitivity threshold, the data to be identified is assigned to the second data category and determined to be non-sensitive content. Setting the preset sensitivity threshold allows the user to analyze the calculation result quickly and obtain the data category of the data to be identified, which completes the data analysis process and facilitates subsequent processing.
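The threshold comparison of step 206 is small enough to state directly. In the following Python sketch the function name, the dictionary layout, and the example threshold of 0.5 are assumptions made for illustration; only the strict "larger than" comparison and the two categories come from the description.

```python
def classify(calculation_result, sensitivity_threshold):
    """Assign a data category and tag by comparing against the threshold."""
    if calculation_result > sensitivity_threshold:
        # First data category: tagged as sensitive content.
        return {"category": "first", "tag": "sensitive"}
    # Second data category (result <= threshold): tagged as non-sensitive content.
    return {"category": "second", "tag": "non-sensitive"}


label = classify(0.8, 0.5)
```

Note that a result exactly equal to the threshold falls into the second category, matching the "less than or equal to" condition in the description.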
The application provides a data identification method, a device, a storage medium and computer equipment, which are characterized in that firstly, a data screening rule carried in a data identification request is obtained in response to the data identification request, data to be identified is extracted from a source data table based on the data screening rule, then a plurality of index items to be identified of the data to be identified are determined, the index items to be identified comprise qualitative index items and quantitative index items, then an identification algorithm corresponding to each index item to be identified of the data to be identified is obtained, field content of each index item to be identified of the data to be identified is identified based on the identification algorithm, an identification result corresponding to each index item to be identified of the data to be identified is obtained, a rule matching calculation expression is synthesized, a plurality of identification results are substituted into the rule matching calculation expression for calculation, finally, a calculation result is obtained, and the sensitivity identification result of the data to be identified is determined by comparing a preset sensitivity threshold with the calculation result.
The method determines the data to be identified based on a plurality of data screening conditions in the data screening rule, and completes the accurate screening of large-batch data; determining a plurality of index items to be identified of the data to be identified, wherein the index items specifically comprise qualitative index items and quantitative index items, and then carrying out targeted identification on field contents by utilizing an identification algorithm corresponding to the index items to be identified, so that an accurate identification result can be obtained; and finally, synthesizing rule matching calculation expressions based on the index items to be identified, calculating each identification result through the rule matching calculation expressions, and comparing the obtained calculation result with a preset sensitivity threshold to determine whether the data to be identified is sensitive content. The method effectively improves the efficiency of processing a large amount of data, can combine multiple dimensions to identify the data, improves the identification accuracy of the data, and finally accurately judges whether the data to be identified is sensitive content or not by utilizing the preset sensitivity threshold.
Further, as a specific implementation of the method of fig. 1, an embodiment of the present application provides a data identifying apparatus, as shown in fig. 4, where the apparatus includes: a data extraction module 301, an algorithm confirmation module 302, a data identification module 303 and a result output module 304.
The data extraction module 301 may be configured to respond to the data identification request, obtain a data filtering rule carried in the data identification request, and extract data to be identified from the source data table based on the data filtering rule;
the algorithm confirmation module 302 may be configured to determine a plurality of index items to be identified of the data to be identified, and obtain an identification algorithm corresponding to each index item to be identified;
the data identifying module 303 may be configured to identify field content of each index item to be identified of the data to be identified based on an identifying algorithm, so as to obtain an identifying result corresponding to each index item to be identified of the data to be identified;
the result output module 304 may be configured to input a plurality of recognition results of the data to be recognized into a preset rule matching calculation expression for calculation, obtain a calculation result, and determine a sensitivity recognition result of the data to be recognized according to the calculation result.
In a specific application scenario, the data extraction module 301 may be configured to obtain a data filtering rule, and determine a data filtering condition in the data filtering rule, where each data filtering condition includes a decision index item and a decision condition; matching the judging index item with a plurality of data index items one by one, and extracting the data index items which are the same as the judging index item; judging the field content of the data index item according to the judging condition corresponding to the judging index item, if the field content of the data index item meets the judging condition, determining that the data record corresponding to the data index item meets the data screening condition, and marking the data record as the data to be identified.
In a specific application scenario, the data extraction module 301 may be further configured to screen a plurality of data records in the source data table one by one based on a plurality of data screening conditions; when the data records exist and all data screening conditions are met, extracting the data records, and performing redundancy processing on the data records; integrating the redundant data records, and marking the integrated data records as data to be identified.
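The screening behaviour described for the data extraction module 301 could look roughly like the following Python sketch. The record layout, the representation of a data screening condition as a (judging index item, predicate) pair, and the reading of "redundancy processing" as removal of exact duplicate records are all assumptions made for this illustration.

```python
def screen_records(records, conditions):
    """Keep records that meet every screening condition, without duplicates.

    records    -- list of dicts mapping data index item names to field content
    conditions -- list of (judging_index_item, predicate) pairs
    """
    selected, seen = [], set()
    for record in records:
        meets_all = all(
            item in record and predicate(record[item])
            for item, predicate in conditions
        )
        if not meets_all:
            continue
        # Assumed redundancy processing: drop exact duplicate records.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            selected.append(record)
    return selected


source_table = [
    {"id": 1, "dept": "sales", "age": 30},
    {"id": 1, "dept": "sales", "age": 30},
    {"id": 2, "dept": "hr", "age": 45},
    {"id": 3, "dept": "sales", "age": 17},
]
screening_conditions = [
    ("dept", lambda value: value == "sales"),
    ("age", lambda value: value >= 18),
]
data_to_identify = screen_records(source_table, screening_conditions)
```

Only the first record survives: the second is a duplicate, the third fails the `dept` condition, and the fourth fails the `age` condition.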
In a specific application scenario, the algorithm confirmation module 302 may be specifically configured to obtain source data table information of data to be identified, and generate a metadata index item according to the source data table information; acquiring data index items of the data to be identified, and integrating the metadata index items with the data index items to obtain qualitative index items of the data to be identified; acquiring field content of each data index item in the data to be identified, and carrying out statistical calculation on the field content based on a preset statistical algorithm to obtain a calculation result; based on the calculation result, a quantitative index item of the data to be identified is generated.
In a specific application scenario, the data recognition module 303 may be further configured to obtain a recognition algorithm corresponding to the index item to be recognized, where the recognition algorithm includes at least one of a character string matching algorithm, a feature extraction algorithm, a statistical learning algorithm, and a machine learning algorithm; extracting field content of the index item to be identified, and determining a field to be identified in the field content according to an identification algorithm corresponding to the index item to be identified; and identifying the field to be identified based on an identification algorithm to obtain an identification result of the index item to be identified.
In a specific application scenario, as shown in fig. 5, the present application further includes a formula editing module 305, where the formula editing module 305 is specifically further configured to obtain an editing rule of a rule matching calculation expression in response to an expression editing instruction; acquiring operators associated with a plurality of index items to be identified based on an editing rule, wherein the number of operators is at least one, and the operators comprise at least one of arithmetic operators, relational operators and logical operators; and synthesizing the plurality of index items to be identified and at least one operator to obtain a rule matching calculation expression.
In a specific application scenario, the result output module 304 may be specifically configured to obtain a preset sensitivity threshold, and compare the calculation result based on the sensitivity threshold; when the calculation result is larger than the sensitivity threshold, marking the calculation result as a first data category, and adding a data tag of sensitive content for the first data category; when the calculation result is less than or equal to the sensitivity threshold, marking the calculation result as a second data category, and adding a data tag of non-sensitive content to the second data category.
It should be noted that, for other corresponding descriptions of each functional unit related to the data identifying apparatus provided in this embodiment, reference may be made to corresponding descriptions in fig. 1 and fig. 2, and no further description is given here.
Based on the above method as shown in fig. 1, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, which when executed by a processor, implements the above data identification method.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, where the software product may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for causing a computer device (such as a personal computer, a server, or a network device) to execute the data identification method according to each embodiment of the present application.
Based on the method shown in fig. 1 and fig. 2 and the embodiment of the data identifying apparatus shown in fig. 4 and fig. 5, in order to achieve the above object, as shown in fig. 6, the embodiment further provides a data identifying entity device, where the device includes a communication bus, a processor, a memory, and a communication interface, and may further include an input/output interface and a display device, where the functional units communicate with each other through the bus. The memory stores a computer program, and the processor executes the program stored in the memory to perform the data identification method in the above embodiments.
Optionally, the physical device may further include a user interface, a network interface, a camera, radio Frequency (RF) circuitry, sensors, audio circuitry, WI-FI modules, and the like. The user interface may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), etc.
It will be appreciated by those skilled in the art that the structure of the data identifying entity device provided in this embodiment does not limit the entity device, which may include more or fewer components, combine certain components, or use a different arrangement of components.
The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the entity device and supports the operation of the information processing program and other software and/or programs. The network communication module is used for realizing communication among the components in the storage medium and communication with other hardware and software in the information processing entity device.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus necessary general hardware platforms, or may be implemented by hardware. By applying the technical scheme of the application, firstly, a data screening rule carried in a data identification request is obtained in response to the data identification request, the data to be identified is extracted from a source data table based on the data screening rule, then a plurality of index items to be identified of the data to be identified are determined, an identification algorithm corresponding to each index item to be identified is obtained, then the field content of each index item to be identified of the data to be identified is identified based on the identification algorithm, an identification result corresponding to each index item to be identified of the data to be identified is obtained, finally, a plurality of identification results of the data to be identified are input into a preset rule matching calculation expression for calculation, a calculation result is obtained, and a sensitivity identification result of the data to be identified is determined according to the calculation result.
Before the data is identified, the data in the source data table is screened in advance based on the data screening rule in the data identification request, so that the data which is really needed to be identified is obtained, and when the data size of the data to be identified is large, the method can be used for quickly and preliminarily screening a large amount of data, so that the data processing efficiency is effectively improved; then, a plurality of index items to be identified of the data to be identified are determined, the field content of each index item to be identified is identified in a targeted mode one by one based on an identification algorithm of each index item to be identified, the data is identified from a plurality of index dimensions, the data can be more comprehensively known, and an identification result of the data in each index dimension can be accurately obtained; and finally, calculating the identification result of each index item to be identified through a rule matching calculation expression, determining the sensitivity identification result of the data to be identified based on the calculation results among a plurality of index dimensions, improving the data identification rate, finally obtaining the data classification result of the data to be identified, and accurately identifying the sensitive data. The method can improve the efficiency of data identification, particularly can accurately identify large quantities of complex data, and can identify the data by combining multiple dimensions, so that the identification accuracy of the data is effectively improved, the sensitivity identification result of the data is accurately obtained, and accurate classification and management and control of sensitive data content are realized.
Those skilled in the art will appreciate that the drawings are merely schematic illustrations of a preferred implementation scenario and that the modules or flows in the drawings are not necessarily required to practice the application. Those skilled in the art will also appreciate that the modules of an apparatus in an implementation scenario may be distributed in the apparatus as described for that scenario, or may be located, with corresponding changes, in one or more apparatuses different from that of the implementation scenario. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above serial numbers of the application are merely for description and do not represent the advantages or disadvantages of the implementation scenarios. The foregoing disclosure is merely illustrative of some embodiments of the application, and the application is not limited thereto, as modifications may be made by those skilled in the art without departing from the scope of the application.

Claims (10)

1. A method of data identification, the method comprising:
responding to a data identification request, acquiring a data screening rule carried in the data identification request, and extracting data to be identified from a source data table based on the data screening rule;
determining a plurality of index items to be identified of the data to be identified, and acquiring an identification algorithm corresponding to each index item to be identified;
identifying field content of each index item to be identified of the data to be identified based on the identification algorithm to obtain an identification result corresponding to each index item to be identified of the data to be identified;
inputting a plurality of identification results of the data to be identified into a preset rule matching calculation expression for calculation, obtaining a calculation result, and determining a sensitivity identification result of the data to be identified according to the calculation result.
2. The method of claim 1, wherein the source data table comprises a plurality of data records, each data record comprising a plurality of data index entries; the extracting the data to be identified from the source data table based on the data screening rule comprises the following steps:
acquiring the data screening rule and determining a data screening condition in the data screening rule, wherein the data screening condition comprises a judging index item and a judging condition;
matching the judging index item with a plurality of data index items one by one, and extracting the data index items identical to the judging index item;
judging the field content of the data index item according to the judging condition corresponding to the judging index item, if the field content of the data index item meets the judging condition, determining that the data record corresponding to the data index item meets the data screening condition, and marking the data record as data to be identified.
3. The method of claim 2, wherein the number of data screening conditions is a plurality; and if the field content of the data index item meets the judging condition, determining that the data record corresponding to the data index item meets the data screening condition, marking the data record as data to be identified, and including:
screening a plurality of data records in the source data table one by one based on a plurality of data screening conditions;
when the data records exist and all the data screening conditions are met, extracting the data records, and carrying out redundancy processing on the data records;
integrating the data records after redundancy processing, and marking the integrated data records as data to be identified.
4. The method according to claim 1, wherein the index items to be identified include qualitative index items and quantitative index items; the determining a plurality of index items to be identified of the data to be identified comprises:
acquiring source data table information of the data to be identified, and generating metadata index items according to the source data table information;
acquiring a data index item of the data to be identified, and integrating the metadata index item and the data index item to obtain a qualitative index item of the data to be identified;
acquiring field content of each data index item in the data to be identified, and carrying out statistical calculation on the field content based on a preset statistical algorithm to obtain a statistical result;
and generating quantitative index items of the data to be identified based on the statistical result.
5. The method according to claim 1, wherein the identifying, based on the identifying algorithm, the field content of each of the index items to be identified of the data to be identified, to obtain an identification result corresponding to each of the index items to be identified of the data to be identified, includes:
acquiring an identification algorithm corresponding to the index item to be identified, wherein the identification algorithm comprises at least one of a character string matching algorithm, a feature extraction algorithm, a statistical learning algorithm and a machine learning algorithm;
extracting field content of the index item to be identified, and determining a field to be identified in the field content according to the identification algorithm corresponding to the index item to be identified;
and identifying the field to be identified based on the identification algorithm to obtain an identification result of the index item to be identified.
6. The method according to claim 1, wherein before the inputting of the plurality of the recognition results of the data to be recognized into a preset rule matching calculation expression for calculation, the method further comprises:
responding to an expression editing instruction, and acquiring an editing rule of the rule matching calculation expression;
acquiring operators associated with a plurality of index items to be identified based on the editing rule, wherein the number of the operators is at least one, and the operators comprise at least one of arithmetic operators, relational operators and logical operators;
and synthesizing the plurality of index items to be identified and at least one operator to obtain the rule matching calculation expression.
7. The method according to claim 1, wherein the determining the sensitivity recognition result of the data to be recognized according to the calculation result includes:
acquiring a preset sensitivity threshold, and comparing the calculation result based on the sensitivity threshold;
when the calculation result is larger than the sensitivity threshold, marking the calculation result as a first data category, and adding a data tag of sensitive content for the first data category;
and when the calculation result is smaller than or equal to the sensitivity threshold, marking the calculation result as a second data category, and adding a data tag of non-sensitive content for the second data category.
8. A data recognition device, the device comprising:
the data extraction module is used for responding to a data identification request, acquiring a data screening rule carried in the data identification request and extracting data to be identified from a source data table based on the data screening rule;
the algorithm confirmation module is used for determining a plurality of index items to be identified of the data to be identified and acquiring an identification algorithm corresponding to each index item to be identified;
the data identification module is used for identifying the field content of each index item to be identified of the data to be identified based on the identification algorithm to obtain an identification result corresponding to each index item to be identified of the data to be identified;
the result output module is used for inputting a plurality of identification results of the data to be identified into a preset rule matching calculation expression for calculation, obtaining a calculation result, and determining a sensitivity identification result of the data to be identified according to the calculation result.
9. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program when executed by the processor implements the steps of the method according to any one of claims 1 to 7.
CN202310855260.6A 2023-07-12 2023-07-12 Data identification method, device, storage medium and computer equipment Pending CN116910650A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310855260.6A CN116910650A (en) 2023-07-12 2023-07-12 Data identification method, device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310855260.6A CN116910650A (en) 2023-07-12 2023-07-12 Data identification method, device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN116910650A true CN116910650A (en) 2023-10-20

Family

ID=88366093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310855260.6A Pending CN116910650A (en) 2023-07-12 2023-07-12 Data identification method, device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN116910650A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117648635B (en) * 2024-01-30 2024-05-03 深圳昂楷科技有限公司 Sensitive information classification and classification method and system and electronic equipment

JP2016189036A (en) Document fractionation system, document fractionation method and document fractionation program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination