CN113569005A - Large-scale data feature intelligent extraction method based on data content - Google Patents
Large-scale data feature intelligent extraction method based on data content
- Publication number
- CN113569005A
- Authority
- CN
- China
- Prior art keywords
- data
- type
- field
- field type
- identification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Biophysics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Library & Information Science (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a large-scale data feature intelligent extraction method based on data content, which comprises the following steps: performing preliminary identification of the field types of the data and eliminating invalid data; comparing the Chinese description of the data with its field type, sampling the data whose description and type do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions; and extracting features according to field type. The substantial effects of the invention include: the method takes into account the generality of table processing and detection and the relationships between the meanings carried by data of different field types; it can extract the corresponding features for each field from the header information and the data content alone; it makes feature extraction automatic and scalable; it provides accurate identification and location of detection objects for investigating data quality problems; and it lays a foundation for improving the efficiency of subsequent quality detection work.
Description
Technical Field
The invention relates to the technical field of data feature extraction, in particular to a large-scale data feature intelligent extraction method based on data content.
Background
With the development of the digital economy, industries no longer pursue data volume alone, and the requirements on data quality in data applications keep rising. Faced with massive data resources, finding and locating data quality problems more quickly, accurately and intelligently, and carrying out the corresponding remediation work, is the core of current enterprise-level data asset management.
In the prior art, publication No. CN105554152A discloses a method and apparatus for extracting data features. In more detail, publication No. CN108256074A discloses a verification processing method that includes: obtaining the models of a data warehouse to be verified, each model including a plurality of pieces of field information, and each piece of field information including a field definition and a field type; checking the field information against a pre-stored data dictionary, the data dictionary including a plurality of standard expressions, each with a standard definition and a standard type; and, if a field definition matches the standard definition but the field type does not match the standard type, modifying the field type to be consistent with the standard type. That method verifies the data warehouse models against the standard expressions and, where a field definition matches but the field type does not, modifies the field type accordingly, so that a model consistent with the standard is obtained.
The prior art addresses these problems in different ways. In the traditional data quality management approach, service experts must select the objects of problem detection by designating specific data tables and fields according to service specifications and experience, and must also specify the characteristics of each field and the rules that apply to it. This places very high demands on the experts' experience and professional skill, keeps the range of detection objects relatively narrow, and makes the work highly dependent on those experts. For large-scale data, the experts must designate the detection objects and ranges one by one; the resulting data characteristics generalize poorly and are time-consuming and laborious to maintain. Detection objects cannot be defined, and the corresponding data features cannot be extracted, automatically and at scale, so data quality auditing is inefficient and heavily influenced by manual experience.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a large-scale data feature intelligent extraction method based on data content. The method takes into account the generality of table processing and detection and needs no additional knowledge about the tables: it can extract the corresponding features for each field from the header information and the data content alone. It makes feature extraction automatic and scalable, removes the need to designate the data objects and feature conditions for data quality detection one by one, reduces the dependence on the knowledge and experience of service personnel, provides accurate identification and location of detection objects for investigating data quality problems, and lays a foundation for improving the efficiency of subsequent quality detection.
The technical scheme of the invention is as follows.
A large-scale data feature intelligent extraction method based on data content comprises the following steps:
performing preliminary identification of the field types of the data, and eliminating invalid data;
comparing the Chinese description of the data with its field type, sampling the data whose description and type do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions;
extracting features according to field type.
During extraction, the characteristics of the different field types are considered together with the Chinese description of the corresponding data, and the field type is determined in two steps, preliminary identification and revision, which improves identification accuracy.
Preferably, the process of preliminary identification includes: identifying the data against an existing field type database, or using a recognition model trained with a neural network, to obtain a preliminary identification result for the field type. Different field types have their own characteristics, and the prior art generally identifies them by comparison against a database or with a trained model. Here these techniques are used only for the preliminary identification, which reduces implementation cost while guaranteeing a basic level of accuracy.
Preferably, the process of eliminating invalid data includes: defining invalid tables and invalid fields; judging from the metadata information and data content of each table, and uniformly classifying empty tables, zombie tables, log tables, backup tables, temporary tables, single-character field tables and low-heat tables as invalid tables; uniformly classifying null fields and single-value fields as invalid fields; and identifying and eliminating the invalid tables and fields. These definitions cover the common kinds of invalid data, and removing them reduces the processing load of subsequent data extraction and analysis.
Preferably, the process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data with an NLP (natural language processing) module; after this analysis, performing path recognition on similar words or similar characters through a type decision tree; and, when the semantics of the Chinese description do not match the field type, marking the field as a suspected-revision field type. The data content of fields whose Chinese descriptions have the same or similar semantics is then sampled multiple times, the proportions of the different field types in the sampled data are counted, and the type whose proportion exceeds a threshold is taken as the recommended revised field type. Finally, the field type is revised to the recommended type, that is, the type to which the data actually stored belongs. Natural language processing performs the word segmentation and semantic recognition of the Chinese description; the decision tree recognizes paths with similar meanings and so helps decide whether a field is a suspected-revision field type; and the final result is decided by a threshold on the proportion. The revision step supplements the preliminary identification and further improves identification accuracy.
Preferably, the field type includes at least one of a numeric type, a text type, and a date type.
Preferably, extracting features according to field type includes: for numeric fields, extracting features and feature values using the mean, maximum, minimum, median, variance, interquartile range, numerical clustering and length clustering; for text fields, computing attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date fields, analyzing the structure and extracting features of the date format and the date length.
Preferably, after the field type has been revised, the method further includes the following verification step: converting the date data into text data and copying it into a verification group and an interference group; in the verification group, inserting year, month and day descriptions according to the original date format; in the interference group, adding counting-unit descriptions according to the number of digits of the original date data; inserting the verification-group text and the interference-group text into their own adjacent text data; performing semantic recognition on the spliced text with the NLP (natural language processing) module; and recording the recognition speed of each pair of interference group and verification group. If the recognition speed of the verification group is higher than that of the interference group by more than an amplitude threshold, the verification passes; otherwise the corresponding original date data is listed as a suspected error type. Both date data and numeric data are usually connected with the adjacent text data. When the original identification is correct, the text spliced from the verification group is easier to recognize, so its recognition speed is high. When the original identification is wrong, the text spliced from the verification group is itself wrong, so it has no speed advantage over the interference group and may even be slower; the data is therefore listed as a suspected error type.
The substantial effects of the invention include: the method takes into account the generality of table processing and detection and the relationships between the meanings carried by data of different field types; it can extract the corresponding features for each field from the header information and the data content alone; it makes feature extraction automatic and scalable; it provides accurate identification and location of detection objects for investigating data quality problems; and it lays a foundation for improving the efficiency of subsequent quality detection work.
Detailed Description
The technical solution of the present application will be described with reference to the following examples. In addition, numerous specific details are set forth below in order to provide a better understanding of the present invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present invention.
Example 1:
A large-scale data feature intelligent extraction method based on data content, mainly directed at field types including the numeric type, the text type and the date type.
The method comprises the following steps:
S01: Perform preliminary identification of the field types of the data, and eliminate invalid data.
The process of preliminary identification includes: identifying the data against an existing field type database, or using a recognition model trained with a neural network, to obtain a preliminary identification result for the field type. Different field types have their own characteristics, and the prior art generally identifies them by comparison against a database or with a trained model. In this embodiment these techniques are used only for the preliminary identification, which reduces implementation cost while guaranteeing a basic level of accuracy.
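As a rough illustration of the preliminary identification, the sketch below classifies each value with simple rules and takes a majority vote per field. The rule table is only a stand-in for the "existing field type database" (the patent does not specify its contents), and the names `DATE_FORMATS`, `classify_value` and `preliminary_field_type` are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed date layouts; a real field type database would hold many more.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d")
NUMERIC_RE = re.compile(r"[+-]?\d+(\.\d+)?")


def classify_value(value):
    """Return 'date', 'numeric' or 'text' for one raw string value."""
    value = value.strip()
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            pass  # not this layout, try the next one
    if NUMERIC_RE.fullmatch(value):
        return "numeric"
    return "text"


def preliminary_field_type(values):
    """Majority vote over the non-empty values of one field."""
    counts = Counter(classify_value(v) for v in values if v and v.strip())
    if not counts:
        return "unknown"
    return counts.most_common(1)[0][0]
```

Because this is only a first pass, ambiguous fields are expected; the revision in S02 is what corrects them.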
The process of eliminating invalid data includes: defining invalid tables and invalid fields; judging from the metadata information and data content of each table, and uniformly classifying empty tables, zombie tables, log tables, backup tables, temporary tables, single-character field tables and low-heat tables as invalid tables; uniformly classifying null fields and single-value fields as invalid fields; and identifying and eliminating the invalid tables and fields. These definitions cover the common kinds of invalid data, and removing them reduces the processing load of subsequent data extraction and analysis.
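The invalid-data rules can be sketched as two predicates, one over table metadata and one over field content. Everything specific here is an assumption made for illustration: the `TableMeta` fields, the name markers for log, backup and temporary tables, and the thresholds for zombie and low-heat tables are not given in the source.

```python
from dataclasses import dataclass


@dataclass
class TableMeta:
    name: str
    row_count: int
    days_since_last_write: int  # assumed metadata, used for "zombie" tables
    reads_last_30_days: int     # assumed metadata, used for "low-heat" tables


# Assumed naming convention for log, backup and temporary tables.
INVALID_NAME_MARKERS = {"log", "bak", "backup", "tmp", "temp"}


def is_invalid_table(meta, zombie_days=365, min_reads=1):
    """Apply the invalid-table rules; thresholds are illustrative only."""
    if meta.row_count == 0:                      # empty table
        return True
    tokens = set(meta.name.lower().split("_"))
    if tokens & INVALID_NAME_MARKERS:            # log / backup / temporary
        return True
    if meta.days_since_last_write > zombie_days: # zombie table
        return True
    return meta.reads_last_30_days < min_reads   # low-heat table


def is_invalid_field(values):
    """A field is invalid if it is entirely null or holds one distinct value."""
    distinct = {v for v in values if v is not None and str(v).strip() != ""}
    return len(distinct) <= 1
```

Splitting the table name on underscores avoids false positives such as a table called `catalog` matching the `log` marker.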
S02: Compare the Chinese description of the data with its field type, sample the data whose description and type do not match, calculate the proportion of each field type in the sample, and revise the field type according to the proportions.
The process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data with an NLP (natural language processing) module; after this analysis, performing path recognition on similar words or similar characters through a type decision tree; and, when the semantics of the Chinese description do not match the field type, marking the field as a suspected-revision field type. The data content of fields whose Chinese descriptions have the same or similar semantics is then sampled multiple times, the proportions of the different field types in the sampled data are counted, and the type whose proportion exceeds a threshold is taken as the recommended revised field type. Finally, the field type is revised to the recommended type, that is, the type to which the data actually stored belongs. Natural language processing performs the word segmentation and semantic recognition of the Chinese description; the decision tree recognizes paths with similar meanings and so helps decide whether a field is a suspected-revision field type; and the final result is decided by a threshold on the proportion. The revision step supplements the preliminary identification and further improves identification accuracy.
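The sampling and threshold part of the revision can be sketched as follows. The NLP and decision-tree stage is not modelled; the sketch assumes a field has already been marked as a suspected-revision field type. The classifier `simple_type`, the number of rounds, the sample size and the 0.8 threshold are illustrative assumptions, not values from the source.

```python
import random
from collections import Counter


def simple_type(value):
    """Stand-in per-value classifier (an assumption for this sketch only)."""
    try:
        float(value)
        return "numeric"
    except ValueError:
        return "text"


def recommend_type(values, classify=simple_type, rounds=5,
                   sample_size=100, threshold=0.8, seed=0):
    """Sample several times and return the type whose share exceeds threshold."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(rounds):
        k = min(sample_size, len(values))
        counts.update(classify(v) for v in rng.sample(values, k))
    total = sum(counts.values())
    if total == 0:
        return None
    best, n = counts.most_common(1)[0]
    return best if n / total > threshold else None


def revise_field_type(declared, recommended):
    """Keep the declared type unless sampling produced a recommendation."""
    return recommended if recommended is not None else declared
```

When no type clears the threshold, `recommend_type` returns `None` and the declared type is left unchanged, which keeps the revision conservative.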
S03: Extract features according to field type.
The process of extracting features according to field type includes: for numeric fields, extracting features and feature values using the mean, maximum, minimum, median, variance, interquartile range, numerical clustering and length clustering; for text fields, computing attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date fields, analyzing the structure and extracting features of the date format and the date length.
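A minimal sketch of the per-type feature extraction, using only the Python standard library. It covers the descriptive statistics, length clustering (approximated by a count of value lengths) and structure distribution; numerical clustering and the NLP-based content features are omitted. The function names and the character-shape encoding (`9` for digits, `A` for letters) are assumptions for illustration.

```python
import re
import statistics
from collections import Counter


def shape(value):
    """Encode structure: every digit becomes '9', every letter becomes 'A'."""
    digits_masked = re.sub(r"\d", "9", value)
    return re.sub(r"[^\W\d_]", "A", digits_masked)


def numeric_features(values):
    """Descriptive statistics for a numeric field (needs at least 2 values)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "max": max(values),
        "min": min(values),
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),
        "iqr": q3 - q1,
        "length_clusters": Counter(len(str(v)) for v in values),
    }


def text_features(values):
    """Attribute features of a text field: lengths and structure shapes."""
    return {
        "length_clusters": Counter(len(v) for v in values),
        "structure": Counter(shape(v) for v in values),
    }


def date_features(values):
    """Date format and date length, taken from the raw date strings."""
    return {
        "formats": Counter(shape(v) for v in values),
        "lengths": Counter(len(v) for v in values),
    }
```

Population variance (`pvariance`) is used here; a sample variance would be an equally reasonable reading of "variance" in the source.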
More specifically, the data features applicable to a field type and their extraction methods can be looked up in a data feature library, and all applicable extraction methods for that field type are traversed according to the network of dependency and mutual-exclusion relationships among the corresponding data features. For example, once a data field has been determined to be numeric, the feature extraction algorithm loads the methods for extracting attribute features such as length, integer, positive number, negative number and decimal, as well as the methods for extracting service features such as mobile phone number and zip code.
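One way to picture the feature library and its dependency and mutual-exclusion network is a small registry that is traversed in order, skipping a feature when a prerequisite is missing or a conflicting feature has already been detected. The registry contents, the `requires` and `excludes` keys, and the regular expressions (an 11-digit number starting with 1 for a mobile number, 6 digits for a zip code) are assumptions for this sketch, not the patent's library.

```python
import re


def all_match(pattern):
    """Build a test that passes when every value fully matches the pattern."""
    compiled = re.compile(pattern)
    return lambda values: all(compiled.fullmatch(v) for v in values)


# Hypothetical feature library for numeric fields, in traversal order.
FEATURE_LIBRARY = {
    "integer": {"requires": [], "excludes": [],
                "test": all_match(r"[+-]?\d+")},
    "decimal": {"requires": [], "excludes": ["integer"],
                "test": all_match(r"[+-]?\d+\.\d+")},
    "mobile_phone": {"requires": ["integer"], "excludes": [],
                     "test": all_match(r"1\d{10}")},
    "zip_code": {"requires": ["integer"], "excludes": ["mobile_phone"],
                 "test": all_match(r"\d{6}")},
}


def detect_features(values, library=FEATURE_LIBRARY):
    """Traverse the library, honouring dependencies and mutual exclusions."""
    detected = []
    for name, spec in library.items():
        if any(req not in detected for req in spec["requires"]):
            continue  # a prerequisite feature was not detected
        if any(exc in detected for exc in spec["excludes"]):
            continue  # a mutually exclusive feature already holds
        if spec["test"](values):
            detected.append(name)
    return detected
```

For example, `detect_features(["310000", "100080"])` yields `["integer", "zip_code"]`, while a column of decimals never reaches the phone or zip code tests because their `integer` prerequisite is missing.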
Example 2:
This embodiment is generally consistent with the previous embodiment. The difference is that, after the field type has been revised and before the features are extracted, the method further includes a verification step: converting the date data into text data and copying it into a verification group and an interference group; in the verification group, inserting year, month and day descriptions according to the original date format; in the interference group, adding counting-unit descriptions according to the number of digits of the original date data; inserting the verification-group text and the interference-group text into their own adjacent text data; performing semantic recognition on the spliced text with the NLP (natural language processing) module; and recording the recognition speed of each pair of interference group and verification group. If the recognition speed of the verification group is higher than that of the interference group by more than an amplitude threshold, the verification passes; otherwise the corresponding original date data is listed as a suspected error type. Both date data and numeric data are usually connected with the adjacent text data. When the original identification is correct, the text spliced from the verification group is easier to recognize, so its recognition speed is high. When the original identification is wrong, the text spliced from the verification group is itself wrong, so it has no speed advantage over the interference group and may even be slower; the data is therefore listed as a suspected error type.
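The verification step can be sketched as building the two text variants, timing a recognizer on each, and comparing the timings against the amplitude threshold. Several details are assumptions: the input is taken to be an eight-digit year-month-day value, the interference rule (inserting the Chinese counting unit 万, "ten thousand", before the last four digits) is only one possible reading of "adding a counting-unit description according to the number of digits", the threshold is treated as a relative time saving, and `recognize` stands for whatever NLP module is in use.

```python
import re
import time


def verification_text(date_text):
    """Insert year/month/day markers (年, 月, 日) into an 8-digit date."""
    d = re.sub(r"\D", "", date_text)
    return f"{d[:4]}年{d[4:6]}月{d[6:8]}日"


def interference_text(date_text):
    """Read the same digits as a plain number with a counting unit (万)."""
    d = re.sub(r"\D", "", date_text)
    head, tail = d[:-4], d[-4:]
    return f"{head}万{tail}" if head else d


def splice(before, inserted, after):
    """Insert the generated text between its adjacent text data."""
    return before + inserted + after


def measure(recognize, text):
    """Time one call of the (hypothetical) semantic recognizer, in seconds."""
    start = time.perf_counter()
    recognize(text)
    return time.perf_counter() - start


def passes_verification(t_verify, t_interfere, amplitude=0.1):
    """Pass if the verification group is faster by more than the amplitude."""
    if t_interfere <= 0:
        return False
    return (t_interfere - t_verify) / t_interfere > amplitude
```

A failed check does not change the field type by itself; following the text above, it only lists the original date data as a suspected error type.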
The substantial effects of the above embodiments include: the method takes into account the generality of table processing and detection and the relationships between the meanings carried by data of different field types; it can extract the corresponding features for each field from the header information and the data content alone; it makes feature extraction automatic and scalable; it provides accurate identification and location of detection objects for investigating data quality problems; and it lays a foundation for improving the efficiency of subsequent quality detection work.
Through the description of the above embodiments, those skilled in the art can understand that in practical applications, the above function distribution may be performed by different functional modules as needed, that is, the internal structure of a specific device is divided into different functional modules, so as to perform all or part of the above described functions.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence, or the part that contributes over the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above describes only specific embodiments of the present application, but the scope of protection of the present application is not limited to them. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.
Claims (7)
1. A large-scale data feature intelligent extraction method based on data content is characterized by comprising the following steps:
performing preliminary identification of the field types of the data, and eliminating invalid data;
comparing the Chinese description of the data with its field type, sampling the data whose description and type do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions;
extracting features according to field type.
2. The method for intelligently extracting large-scale data features based on data content according to claim 1, wherein the process of preliminary identification comprises: identifying the data against an existing field type database, or using a recognition model trained with a neural network, to obtain a preliminary identification result for the field type.
3. The method for intelligently extracting large-scale data features based on data content according to claim 1, wherein the process of eliminating invalid data comprises: defining invalid tables and invalid fields; judging from the metadata information and data content of each table, and uniformly classifying empty tables, zombie tables, log tables, backup tables, temporary tables, single-character field tables and low-heat tables as invalid tables; uniformly classifying null fields and single-value fields as invalid fields; and identifying and eliminating the invalid tables and fields.
4. The method for intelligently extracting large-scale data features based on data content according to claim 1 or 2, wherein the process of revising the field type comprises: performing word segmentation and semantic recognition on the Chinese description of the data with an NLP (natural language processing) module; after this analysis, performing path recognition on similar words or similar characters through a type decision tree; when the semantics of the Chinese description do not match the field type, marking the field as a suspected-revision field type; then sampling, multiple times, the data content of fields whose Chinese descriptions have the same or similar semantics, counting the proportions of the different field types in the sampled data, and taking the type whose proportion exceeds a threshold as the recommended revised field type; and finally revising the field type to the recommended type, that is, the type to which the data actually stored belongs.
5. The method according to claim 1, wherein the field type includes at least one of a numeric type, a text type and a date type.
6. The method for intelligently extracting large-scale data features based on data content according to claim 5, wherein the process of extracting features according to field type comprises: for numeric fields, extracting features and feature values using the mean, maximum, minimum, median, variance, interquartile range, numerical clustering and length clustering; for text fields, computing attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date fields, analyzing the structure and extracting features of the date format and the date length.
7. The method for intelligently extracting large-scale data features based on data content according to claim 4, wherein after the field type has been revised, the method further comprises a verification step of: converting the date data into text data and copying it into a verification group and an interference group; in the verification group, inserting year, month and day descriptions according to the original date format; in the interference group, adding counting-unit descriptions according to the number of digits of the original date data; inserting the verification-group text and the interference-group text into their own adjacent text data; performing semantic recognition on the spliced text with the NLP (natural language processing) module; and recording the recognition speed of each pair of interference group and verification group, wherein if the recognition speed of the verification group is higher than that of the interference group by more than an amplitude threshold, the verification passes, and otherwise the corresponding original date data is listed as a suspected error type.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110670587.7A CN113569005B (en) | 2021-06-17 | 2021-06-17 | Large-scale data characteristic intelligent extraction method based on data content |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110670587.7A CN113569005B (en) | 2021-06-17 | 2021-06-17 | Large-scale data characteristic intelligent extraction method based on data content |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113569005A (en) | 2021-10-29 |
CN113569005B (en) | 2024-02-20 |
Family
ID=78162177
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110670587.7A Active CN113569005B (en) | 2021-06-17 | 2021-06-17 | Large-scale data characteristic intelligent extraction method based on data content |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113569005B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115221893A (en) * | 2022-09-21 | 2022-10-21 | 中国电子信息产业集团有限公司 | Quality inspection rule automatic configuration method and device based on rule and semantic analysis |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060036637A1 (en) * | 2004-08-13 | 2006-02-16 | Mehmet Sayal | System and method for developing a star schema |
CN104731976A (en) * | 2015-04-14 | 2015-06-24 | 海量云图(北京)数据技术有限公司 | Method for finding and sorting private data in data table |
CN109271377A (en) * | 2018-08-10 | 2019-01-25 | 蜜小蜂智慧(北京)科技有限公司 | A kind of data quality checking method and device |
CN109948164A (en) * | 2019-04-02 | 2019-06-28 | 北京三快在线科技有限公司 | Processing method, device, computer equipment and the storage medium of statistical demand information |
KR20200070775A (en) * | 2018-12-10 | 2020-06-18 | 한국전자통신연구원 | Apparatus and method for normalizing security information of heterogeneous systems |
CN111506731A (en) * | 2020-04-17 | 2020-08-07 | 支付宝(杭州)信息技术有限公司 | Method, device and equipment for training field classification model |
CN111539021A (en) * | 2020-04-26 | 2020-08-14 | 支付宝(杭州)信息技术有限公司 | Data privacy type identification method, device and equipment |
CN112181936A (en) * | 2019-07-03 | 2021-01-05 | 北京京东尚科信息技术有限公司 | Database detection method and device |
WO2021012382A1 (en) * | 2019-07-25 | 2021-01-28 | 深圳壹账通智能科技有限公司 | Method and apparatus for configuring chat robot, computer device and storage medium |
CN112434032A (en) * | 2020-11-17 | 2021-03-02 | 北京融七牛信息技术有限公司 | Automatic feature generation system and method |
Also Published As
Publication number | Publication date |
---|---|
CN113569005B (en) | 2024-02-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110826320B (en) | | Sensitive data discovery method and system based on text recognition |
CN110852856B (en) | | Invoice false invoice identification method based on dynamic network representation |
WO2012034733A2 (en) | | Method and arrangement for handling data sets, data processing program and computer program product |
CN112163553B (en) | | Material price accounting method, device, storage medium and computer equipment |
CN112199512B (en) | | Scientific and technological service-oriented case map construction method, device, equipment and storage medium |
CN111191051B (en) | | Method and system for constructing emergency knowledge map based on Chinese word segmentation technology |
CN112307820A (en) | | Text recognition method, device, equipment and computer readable medium |
CN113569005B (en) | | Large-scale data characteristic intelligent extraction method based on data content |
CN110795607A (en) | | Equipment guarantee data matching method and system based on multi-stage similarity calculation |
CN113409001A (en) | | Method for controlling automatic pricing of construction engineering quantity list |
CN116136955B (en) | | Text transcription method, text transcription device, electronic equipment and storage medium |
CN111815162A (en) | | Digital auditing tool and method |
CN113591485A (en) | | Intelligent data quality auditing system and method based on data science |
CN111723182A (en) | | Key information extraction method and device for vulnerability text |
CN112257425A (en) | | Power data analysis method and system based on data classification model |
CN111754352A (en) | | Method, device, equipment and storage medium for judging correctness of viewpoint statement |
CN114611515B (en) | | Method and system for identifying enterprise actual control person based on enterprise public opinion information |
CN112395854B (en) | | Standard element consistency inspection method |
CN113129057A (en) | | Software cost information processing method and device, computer equipment and storage medium |
CN111859896B (en) | | Formula document detection method and device, computer readable medium and electronic equipment |
CN117827991B (en) | | Method and system for identifying personal identification information in semi-structured data |
CN115221013B (en) | | Method, device and equipment for determining log mode |
CN117272123B (en) | | Sensitive data processing method and device based on large model and storage medium |
CN113158645A (en) | | Message analysis method and device, electronic equipment and computer storage medium |
CN115455054A (en) | | Method and device for auditing pre-tax deduction materials and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||