CN113569005B - Large-scale data characteristic intelligent extraction method based on data content - Google Patents
- Publication number
- CN113569005B (application number CN202110670587.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- field
- type
- groups
- invalid
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Biophysics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Library & Information Science (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an intelligent extraction method for large-scale data features based on data content, comprising the following steps: performing preliminary identification of field types on the data and eliminating invalid data; comparing the Chinese description of the data with its field type, sampling the data for which the two do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions; and extracting features according to the field types. The substantial effects of the invention include: the method takes into account both the generality of table processing and detection and the relation between field types and the meaning the data represents; corresponding features can be extracted for each field from the table header information and data content alone; feature extraction is automated and performed at scale; accurate identification and positioning of detection objects is provided for investigating data quality problems; and a basis is provided for improving the efficiency of subsequent quality detection work.
Description
Technical Field
The invention relates to the technical field of data feature extraction, and in particular to an intelligent extraction method for large-scale data features based on data content.
Background
With the development of the digital economy, industries no longer pursue data volume alone, and the requirements on data quality in data applications keep rising. How to find and locate data quality problems faster, more accurately and more intelligently, and to carry out the corresponding governance work, is the key and core of current enterprise-level data asset management.
In the prior art, the invention with publication number CN105554152A discloses a method and a device for extracting data features. In more detail, another invention, publication number CN108256074A, discloses a verification processing method comprising: obtaining models of a data warehouse to be verified, each model comprising a plurality of pieces of field information, the field information comprising field definitions and field types; verifying the field information against a pre-stored data dictionary, the data dictionary comprising a plurality of standard terms, each standard term comprising a standard definition and a standard type; and, if a field definition matches the standard definition but the field type does not match the standard type, modifying the field type to be consistent with the standard type. That method verifies the data warehouse model against standard terms and modifies mismatched field types in a targeted way, thereby obtaining a model consistent with the standard.
In the prior art, approaches to these problems follow much the same pattern. In the traditional data quality management mode, a business expert must select the problem detection objects by designating specific data tables and fields according to business specifications and experience, and must also specify which features each field has and which rules apply to it. This mode places extremely high demands on the experience and expertise of the business expert, limits the range of objects checked for data quality problems, and depends heavily on the expert. For large-scale, massive data, the expert has to designate the detection objects and ranges one by one; the resulting data features generalize poorly and are time-consuming and laborious to maintain. Large-scale, automatic definition of data quality detection objects and extraction of the corresponding data features cannot be achieved, so data quality inspection is inefficient and strongly affected by manual experience.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide an intelligent extraction method for large-scale data features based on data content. The method takes the generality of table processing and detection into account and can extract corresponding features for each field from the table header information and data content alone, without any additional knowledge being supplied. It thereby automates feature extraction and performs it at scale, removes the need to designate the data objects and feature conditions of data quality detection one by one, reduces dependence on the knowledge and experience of business personnel, provides accurate identification and positioning of detection objects for investigating data quality problems, and provides a basis for improving the efficiency of subsequent quality detection work.
The technical scheme of the invention is as follows.
An intelligent extraction method for large-scale data features based on data content comprises the following steps:
performing preliminary identification of field types on the data, and eliminating invalid data;
comparing the Chinese description of the data with its field type, sampling the data for which the two do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions;
and extracting features according to the field types.
In the extraction process of the invention, the characteristics of the different field types and the Chinese descriptions of the corresponding data are considered together, and the field type is determined in two steps, preliminary identification and revision, which improves identification accuracy.
Preferably, the preliminary identification process includes: performing preliminary identification on the data to be identified against an existing field type database, or introducing an identification model trained with a neural network, to obtain a preliminary identification result for the field type. Different field types have their own characteristics, and the prior art in this field generally identifies them by comparison against a database, a trained model or the like. Here these techniques are used only for preliminary identification, which reduces implementation cost while ensuring a basic level of accuracy.
Preferably, the process of eliminating invalid data includes: defining invalid tables and invalid fields; judging, from the metadata information and data content of each table, empty tables, zombie tables, log tables, backup tables, temporary tables, single-field tables and rarely accessed (low-heat) tables uniformly as invalid tables; judging null fields and single-value fields uniformly as invalid fields; and identifying and eliminating the invalid tables and fields. These definitions cover the common kinds of invalid data, and removing them reduces the processing load of subsequent data extraction and analysis.
Preferably, the process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data with an NLP (natural language processing) module; after this analysis, performing path recognition of synonyms or near-synonyms through a type decision tree; and, where the semantics of the Chinese description do not match the field type, marking the field as a suspected field type to be revised. The data content whose Chinese descriptions have the same or similar semantics is then sampled multiple times, the proportions of the different field types in the sampled data are counted, and the type whose proportion exceeds a threshold is taken as the recommended revised field type. Finally, the field type is revised to the recommended type, that is, the type to which the actually stored data belongs. Natural language processing performs the word segmentation and semantic recognition of the Chinese description, and the decision tree performs path recognition of similar meanings to help judge whether a field is a suspected field type to be revised. The result is then determined by comparing the proportions against a set threshold. The revision process complements the preliminary identification and further improves identification accuracy.
Preferably, the field type includes at least one of a numeric type, a text type, and a date type.
Preferably, extracting features according to the field type includes: for numerical type fields, extracting features and feature values by means of the mean, maximum, minimum, median, variance, interquartile range, value clustering and length clustering; for text type fields, extracting statistical attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date type fields, performing structural analysis and extracting features from the date format and length.
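The structural analysis of date type fields can be illustrated with a minimal sketch. The patent does not specify how the date format is represented; the function names `date_structure` and `date_features`, and the idea of masking every digit with `D` to expose the layout, are assumptions made only for this example.

```python
import re
from collections import Counter


def date_structure(value: str) -> str:
    """Mask each digit with 'D' so that only the layout of the value remains."""
    return re.sub(r"\d", "D", value)


def date_features(values: list[str]) -> dict:
    """Summarise the formats and lengths observed in a date type field."""
    formats = Counter(date_structure(v) for v in values)
    lengths = Counter(len(v) for v in values)
    dominant = formats.most_common(1)[0][0] if formats else None
    return {"formats": formats, "lengths": lengths, "dominant_format": dominant}


features = date_features(["2021-06-17", "2021-10-29", "2024/02/20"])
print(features["dominant_format"])
```

Values whose masked layout differs from the dominant one are natural candidates for a later data quality check.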
Preferably, after the field type revision is finished, the method further comprises a verification step: the date data are converted into text data and copied into a verification group and an interference group; in the verification group, year, month and day descriptions are inserted according to the original date format; in the interference group, counting-unit descriptions are added according to the number of digits of the original date data; the verification group and the interference group are each inserted into the adjacent text data; semantic recognition is performed on the spliced text data by the NLP natural language processing module, and the recognition speed of each pair of interference group and verification group is recorded. If the verification group is recognized faster than the interference group by more than an amplitude threshold, the verification passes; otherwise the corresponding original date data are classified as a suspected error type. Date type or numerical type data are often associated with the adjacent text type data, so when the original identification is correct, the text spliced from the verification group is easier to recognize and is recognized faster. If the original identification is wrong, the text spliced from the verification group is itself erroneous, so it has no speed advantage over the interference group and may even be slower, and the data are classified as a suspected error type.
The substantial effects of the invention include: the method takes into account both the generality of table processing and detection and the relation between field types and the meaning the data represents; corresponding features can be extracted for each field from the table header information and data content alone; feature extraction is automated and performed at scale; accurate identification and positioning of detection objects is provided for investigating data quality problems; and a basis is provided for improving the efficiency of subsequent quality detection work.
Detailed Description
The technical scheme of the present application will be described below with reference to examples. In addition, numerous specific details are set forth in the following description in order to provide a better understanding of the present invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present invention.
Example 1:
An intelligent extraction method for large-scale data features based on data content, mainly directed at the numerical, text and date field types.
The method comprises the following steps:
S01: Perform preliminary identification of field types on the data, and eliminate invalid data.
The preliminary identification process includes: performing preliminary identification on the data to be identified against an existing field type database, or introducing an identification model trained with a neural network, to obtain a preliminary identification result for the field type. Different field types have their own characteristics, and the prior art in this field generally identifies them by comparison against a database, a trained model or the like. In this embodiment these techniques are used only for preliminary identification, which reduces implementation cost while ensuring a basic level of accuracy.
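A minimal sketch of preliminary identification follows. It is not the field type database or the neural network model described above: a few hand-written rules stand in for them, and the names `DATE_FORMATS`, `classify_value` and `preliminary_type` are assumptions for illustration only.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical date layouts; a real field type database would hold many more.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d")
NUMERIC_RE = re.compile(r"^[+-]?\d+(\.\d+)?$")


def classify_value(value: str) -> str:
    """Classify one raw value as 'date', 'numeric' or 'text'."""
    value = value.strip()
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            continue
    if NUMERIC_RE.match(value):
        return "numeric"
    return "text"


def preliminary_type(values: list[str]) -> str:
    """Take the most common per-value type as the preliminary field type."""
    counts = Counter(classify_value(v) for v in values if v and v.strip())
    if not counts:
        return "text"  # assumption: fall back to text when nothing can be judged
    return counts.most_common(1)[0][0]


print(preliminary_type(["2021-06-17", "2021-10-29", "n/a"]))
```

Because the result is only preliminary, simple rules like these can be tolerated here; mistakes are meant to be caught by the revision step in S02.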
The process of eliminating invalid data includes: defining invalid tables and invalid fields; judging, from the metadata information and data content of each table, empty tables, zombie tables, log tables, backup tables, temporary tables, single-field tables and rarely accessed (low-heat) tables uniformly as invalid tables; judging null fields and single-value fields uniformly as invalid fields; and identifying and eliminating the invalid tables and fields. These definitions cover the common kinds of invalid data, and removing them reduces the processing load of subsequent data extraction and analysis.
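The elimination of invalid tables and fields can be sketched as below. The patent does not say how each kind of invalid table is detected from metadata; the `Table` structure, the name markers, and the `STALE_DAYS` cutoff are all hypothetical choices for this example.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    columns: dict  # column name -> list of values
    days_since_last_access: int = 0  # assumed to come from table metadata


# Hypothetical markers for log, backup and temporary tables.
INVALID_NAME_MARKERS = ("log", "bak", "backup", "tmp", "temp")
STALE_DAYS = 180  # assumed cutoff for zombie and rarely accessed tables


def is_invalid_field(values: list) -> bool:
    """A field is invalid when it is entirely null or holds a single value."""
    non_null = [v for v in values if v is not None and v != ""]
    return len(set(non_null)) <= 1


def is_invalid_table(table: Table) -> bool:
    row_count = max((len(v) for v in table.columns.values()), default=0)
    if row_count == 0:  # empty table
        return True
    if len(table.columns) <= 1:  # single-field table
        return True
    if table.days_since_last_access > STALE_DAYS:  # zombie or low-heat table
        return True
    parts = table.name.lower().split("_")
    return any(marker in parts for marker in INVALID_NAME_MARKERS)


def valid_fields(tables: list) -> dict:
    """Return the fields that survive elimination, grouped by table name."""
    result = {}
    for table in tables:
        if is_invalid_table(table):
            continue
        kept = [c for c, vals in table.columns.items() if not is_invalid_field(vals)]
        if kept:
            result[table.name] = kept
    return result
```

Only the surviving fields need to be passed on to type revision and feature extraction, which is where the reduction in processing load comes from.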
S02: Compare the Chinese description of the data with its field type, sample the data for which the two do not match, calculate the proportion of each field type in the sample, and revise the field type according to the proportions.
The process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data with an NLP (natural language processing) module; after this analysis, performing path recognition of synonyms or near-synonyms through a type decision tree; and, where the semantics of the Chinese description do not match the field type, marking the field as a suspected field type to be revised. The data content whose Chinese descriptions have the same or similar semantics is then sampled multiple times, the proportions of the different field types in the sampled data are counted, and the type whose proportion exceeds a threshold is taken as the recommended revised field type. Finally, the field type is revised to the recommended type, that is, the type to which the actually stored data belongs. Natural language processing performs the word segmentation and semantic recognition of the Chinese description, and the decision tree performs path recognition of similar meanings to help judge whether a field is a suspected field type to be revised. The result is then determined by comparing the proportions against a set threshold. The revision process complements the preliminary identification and further improves identification accuracy.
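The sampling and threshold part of the revision can be sketched as follows. The NLP module and the type decision tree are not reproduced: their outcome is passed in as `described_type`, and the per-value classifier is passed in as `classify`. The function name and the default values for `rounds`, `sample_size` and `threshold` are assumptions, since the patent gives no concrete numbers.

```python
import random
from collections import Counter


def revise_field_type(values, declared_type, described_type, classify,
                      rounds=5, sample_size=100, threshold=0.8, seed=0):
    """Revise a field type when the description and the declared type disagree.

    described_type stands for the type implied by the Chinese description;
    classify maps a single value to a field type name.
    """
    if declared_type == described_type:
        return declared_type  # no mismatch, so the field is not suspected

    rng = random.Random(seed)
    counts = Counter()
    for _ in range(rounds):  # sample the data content multiple times
        k = min(sample_size, len(values))
        for value in rng.sample(values, k):
            counts[classify(value)] += 1

    total = sum(counts.values())
    if total == 0:
        return declared_type
    best_type, best_count = counts.most_common(1)[0]
    # Revise only when one type's proportion exceeds the threshold.
    if best_count / total > threshold:
        return best_type
    return declared_type


def digits_or_text(value):
    """Toy classifier used only for this illustration."""
    return "numeric" if value.isdigit() else "text"


column = ["1"] * 95 + ["x"] * 5
print(revise_field_type(column, "text", "numeric", digits_or_text))
```

Keeping the declared type when no proportion clears the threshold is a conservative choice made for this sketch; the source only states that the type exceeding the threshold becomes the recommended revision.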
S03: Extract features according to the field types.
The process of extracting features according to the field type includes: for numerical type fields, extracting features and feature values by means of the mean, maximum, minimum, median, variance, interquartile range, value clustering and length clustering; for text type fields, extracting statistical attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date type fields, performing structural analysis and extracting features from the date format and length.
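For numerical type fields, the statistical features listed above can be computed with the Python standard library, as in this minimal sketch. The function names are assumptions, the population variance is used because the patent does not say which variance is meant, and value clustering is left out.

```python
import statistics
from collections import Counter


def numeric_features(values: list) -> dict:
    """Compute basic statistical features of a numerical field."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.fmean(values),
        "max": max(values),
        "min": min(values),
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),  # assumption: population variance
        "iqr": q3 - q1,
    }


def length_clusters(raw_values: list) -> Counter:
    """Count how many raw values share each string length."""
    return Counter(len(v) for v in raw_values)


print(numeric_features([1, 2, 3, 4, 5, 6, 7, 8]))
print(length_clusters(["13800000000", "13912345678", "123"]))
```

A field whose lengths concentrate on a single value, such as 11 for mobile phone numbers, shows up directly in the length counts.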
More specifically, the data features and feature extraction methods applicable to a field type can be looked up in a data feature library, and all applicable extraction methods for that field type are traversed according to the network of dependency and mutual exclusion relations among the corresponding data features. For example, once a data field is determined to be of numerical type, the feature extraction algorithm loads generic extraction methods such as length, integer, positive number, negative number and decimal, together with business feature extraction methods such as mobile phone number and postal code. By successively identifying and extracting from the data content, features such as concentrated length, integer and mobile phone number can be obtained, while mutually exclusive features such as positive and negative are distinguished, so that multi-angle features and feature values of the field are obtained.
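The traversal over dependency and mutual exclusion relations might look like the sketch below. The feature library contents, the individual checks, and the relations in `DEPENDS_ON` and `MUTUALLY_EXCLUSIVE` are invented for illustration; the patent names the features but does not define how they are tested or related.

```python
# Hypothetical feature checks, listed so that dependencies come first.
FEATURES = {
    "integer": lambda vals: all(v.lstrip("+-").isdigit() for v in vals),
    "decimal": lambda vals: any("." in v for v in vals),
    "positive": lambda vals: all(not v.startswith("-") for v in vals),
    "negative": lambda vals: all(v.startswith("-") for v in vals),
    "mobile_number": lambda vals: all(len(v) == 11 and v.startswith("1") for v in vals),
    "postal_code": lambda vals: all(len(v) == 6 for v in vals),
}

# Assumed relations: business features require generic ones to hold first.
DEPENDS_ON = {
    "mobile_number": ["integer", "positive"],
    "postal_code": ["integer", "positive"],
}
MUTUALLY_EXCLUSIVE = [
    ("positive", "negative"),
    ("integer", "decimal"),
    ("mobile_number", "postal_code"),
]


def is_excluded(name: str, found: set) -> bool:
    """True when a feature already found rules this one out."""
    for a, b in MUTUALLY_EXCLUSIVE:
        if (name == a and b in found) or (name == b and a in found):
            return True
    return False


def extract_features(values: list) -> set:
    """Traverse the feature checks, honouring dependencies and exclusions."""
    found = set()
    if not values:
        return found
    for name, check in FEATURES.items():
        if any(dep not in found for dep in DEPENDS_ON.get(name, [])):
            continue  # a prerequisite feature is missing
        if is_excluded(name, found):
            continue  # a conflicting feature was already established
        if check(values):
            found.add(name)
    return found


print(sorted(extract_features(["13800000000", "13912345678"])))
```

Skipping excluded and unsupported checks means that each field runs only the extraction methods that can still apply to it.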
Example 2:
This embodiment is largely identical to the previous one, except that after the field type revision is finished and before the features are extracted, a verification step is further included: the date data are converted into text data and copied into a verification group and an interference group; in the verification group, year, month and day descriptions are inserted according to the original date format; in the interference group, counting-unit descriptions are added according to the number of digits of the original date data; the verification group and the interference group are each inserted into the adjacent text data; semantic recognition is performed on the spliced text data by the NLP natural language processing module, and the recognition speed of each pair of interference group and verification group is recorded. If the verification group is recognized faster than the interference group by more than an amplitude threshold, the verification passes; otherwise the corresponding original date data are classified as a suspected error type. Date type or numerical type data are often associated with the adjacent text type data, so when the original identification is correct, the text spliced from the verification group is easier to recognize and is recognized faster. If the original identification is wrong, the text spliced from the verification group is itself erroneous, so it has no speed advantage over the interference group and may even be slower, and the data are classified as a suspected error type.
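The verification step can be sketched as below, under several assumptions that the patent leaves open: the date is an eight-digit string, the year, month and day descriptions are the markers 年, 月 and 日, the counting unit is a single fixed character, the NLP module is any callable passed in as `recognize`, and the amplitude threshold is a relative speed-up. All function names are hypothetical.

```python
import time


def verification_text(date_digits: str) -> str:
    """Insert year, month and day markers into an eight-digit date string."""
    return f"{date_digits[:4]}年{date_digits[4:6]}月{date_digits[6:8]}日"


def interference_text(date_digits: str, unit: str = "个") -> str:
    """Append a counting unit so the digits read as a plain quantity."""
    return f"{date_digits}{unit}"


def timed(recognize, text: str, repeats: int = 5) -> float:
    """Best wall-clock time of several recognition runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        recognize(text)
        best = min(best, time.perf_counter() - start)
    return best


def passes_verification(t_verify: float, t_interfere: float,
                        amplitude_threshold: float = 0.1) -> bool:
    """Pass when the verification group is faster by more than the threshold."""
    if t_interfere <= 0:
        return False
    return (t_interfere - t_verify) / t_interfere > amplitude_threshold


def verify_date_value(date_digits: str, adjacent_text: str, recognize,
                      amplitude_threshold: float = 0.1) -> bool:
    """Splice both groups into the adjacent text and compare recognition time."""
    t_verify = timed(recognize, adjacent_text + verification_text(date_digits))
    t_interfere = timed(recognize, adjacent_text + interference_text(date_digits))
    return passes_verification(t_verify, t_interfere, amplitude_threshold)
```

A value for which `verify_date_value` returns `False` would be listed as a suspected error type rather than passed on to feature extraction. Timing measurements are noisy, so a practical version would need many more repetitions than this sketch uses.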
The substantial effects of the above embodiments include: the method takes into account both the generality of table processing and detection and the relation between field types and the meaning the data represents; corresponding features can be extracted for each field from the table header information and data content alone; feature extraction is automated and performed at scale; accurate identification and positioning of detection objects is provided for investigating data quality problems; and a basis is provided for improving the efficiency of subsequent quality detection work.
From the description of the above embodiments, those skilled in the art will appreciate that, in practical applications, the above-mentioned functions may be distributed by different functional modules according to needs, that is, the internal structure of a specific apparatus is divided into different functional modules, so as to complete all or part of the functions described above.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on this understanding, the technical solution of the embodiments of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions that cause a device (which may be a single-chip microcomputer, a chip or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The foregoing describes merely specific embodiments of the present application, and the scope of protection of the present application is not limited to them. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope of the present application are intended to be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.
Claims (5)
1. An intelligent extraction method for large-scale data features based on data content, characterized by comprising the following steps:
performing preliminary identification of field types on the data, and eliminating invalid data;
comparing the Chinese description of the data with its field type, sampling the data for which the two do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions;
extracting features according to the field types;
wherein the process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data with an NLP natural language processing module; after this analysis, performing path recognition of synonyms or near-synonyms through a type decision tree; where the semantics of the Chinese description do not match the field type, marking the field as a suspected field type to be revised; then sampling, multiple times, the data content whose Chinese descriptions have the same or similar semantics, counting the proportions of the different field types in the sampled data, taking the type whose proportion exceeds a threshold as the recommended revised field type, and finally revising the field type to the recommended type, that is, the type to which the actually stored data belongs;
and wherein, after the field type revision is finished, the method further comprises a verification step: the date data are converted into text data and copied into a verification group and an interference group; in the verification group, year, month and day descriptions are inserted according to the original date format; in the interference group, counting-unit descriptions are added according to the number of digits of the original date data; the verification group and the interference group are each inserted into the adjacent text data; semantic recognition is performed on the spliced text data by the NLP natural language processing module, and the recognition speed of each pair of interference group and verification group is recorded; if the verification group is recognized faster than the interference group by more than an amplitude threshold, the verification passes, and otherwise the corresponding original date data are classified as a suspected error type.
2. The intelligent extraction method for large-scale data features based on data content according to claim 1, wherein the preliminary identification process includes: performing preliminary identification on the data to be identified against an existing field type database, or introducing an identification model trained with a neural network, to obtain a preliminary identification result for the field type.
3. The intelligent extraction method for large-scale data features based on data content according to claim 1, wherein the process of eliminating invalid data includes: defining invalid tables and invalid fields; judging, from the metadata information and data content of each table, empty tables, zombie tables, log tables, backup tables, temporary tables, single-field tables and rarely accessed (low-heat) tables uniformly as invalid tables; judging null fields and single-value fields uniformly as invalid fields; and identifying and eliminating the invalid tables and fields.
4. The method for intelligent extraction of large-scale data features based on data content according to claim 1, wherein the field type includes at least one of a numeric type, a text type, and a date type.
5. The method for intelligently extracting large-scale data features based on data content according to claim 4, wherein the process of extracting features according to field types comprises the following steps: for numeric type fields, extracting features and feature values by means of the average value, maximum value, minimum value, median, variance, interquartile range, numerical clustering and length clustering; for text type fields, extracting statistical attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date type fields, carrying out structural analysis and extracting features of the date format and the length.
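The invalid-field rule of claim 3 and part of the per-type feature extraction of claim 5 can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names `is_invalid_field`, `numeric_features` and `date_format_feature` are hypothetical, the use of population variance and of the standard library's default quartile method are assumptions, and numerical clustering, length clustering and the text-type features are omitted.

```python
import re
import statistics
from typing import Optional, Sequence


def is_invalid_field(values: Sequence[Optional[str]]) -> bool:
    """Return True for a null field (no non-empty value) or a single-value field."""
    present = {v for v in values if v is not None and str(v).strip() != ""}
    return len(present) <= 1


def numeric_features(values: Sequence[float]) -> dict:
    """Summary statistics for a numeric field (requires at least two values).

    Assumption: population variance and the default 'exclusive' quartile
    method are used; the claims do not specify either choice.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "max": max(values),
        "min": min(values),
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),
        "iqr": q3 - q1,
    }


def date_format_feature(value: str) -> tuple:
    """Structural signature of a date string: each digit becomes 'd',
    separators are kept, and the string length is returned alongside."""
    return re.sub(r"\d", "d", value), len(value)
```

For example, a column holding only `"a"` repeated would be flagged by `is_invalid_field`, while `date_format_feature("2021-06-17")` yields a format signature and length that can be compared across all rows of a date field.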
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110670587.7A CN113569005B (en) | 2021-06-17 | 2021-06-17 | Large-scale data characteristic intelligent extraction method based on data content |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110670587.7A CN113569005B (en) | 2021-06-17 | 2021-06-17 | Large-scale data characteristic intelligent extraction method based on data content |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113569005A (en) | 2021-10-29 |
CN113569005B (en) | 2024-02-20 |
Family
ID=78162177
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110670587.7A Active CN113569005B (en) | 2021-06-17 | 2021-06-17 | Large-scale data characteristic intelligent extraction method based on data content |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113569005B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115221893B (en) * | 2022-09-21 | 2023-01-13 | 中国电子信息产业集团有限公司 | Quality inspection rule automatic configuration method and device based on rule and semantic analysis |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104731976A (en) * | 2015-04-14 | 2015-06-24 | 海量云图(北京)数据技术有限公司 | Method for finding and sorting private data in data table |
CN109271377A (en) * | 2018-08-10 | 2019-01-25 | 蜜小蜂智慧(北京)科技有限公司 | A kind of data quality checking method and device |
CN109948164A (en) * | 2019-04-02 | 2019-06-28 | 北京三快在线科技有限公司 | Processing method, device, computer equipment and the storage medium of statistical demand information |
KR20200070775A (en) * | 2018-12-10 | 2020-06-18 | 한국전자통신연구원 | Apparatus and method for normalizing security information of heterogeneous systems |
CN111506731A (en) * | 2020-04-17 | 2020-08-07 | 支付宝(杭州)信息技术有限公司 | Method, device and equipment for training field classification model |
CN111539021A (en) * | 2020-04-26 | 2020-08-14 | 支付宝(杭州)信息技术有限公司 | Data privacy type identification method, device and equipment |
CN112181936A (en) * | 2019-07-03 | 2021-01-05 | 北京京东尚科信息技术有限公司 | Database detection method and device |
WO2021012382A1 (en) * | 2019-07-25 | 2021-01-28 | 深圳壹账通智能科技有限公司 | Method and apparatus for configuring chat robot, computer device and storage medium |
CN112434032A (en) * | 2020-11-17 | 2021-03-02 | 北京融七牛信息技术有限公司 | Automatic feature generation system and method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8412671B2 (en) * | 2004-08-13 | 2013-04-02 | Hewlett-Packard Development Company, L.P. | System and method for developing a star schema |
Also Published As
Publication number | Publication date |
---|---|
CN113569005A (en) | 2021-10-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113591485B (en) | Intelligent data quality auditing system and method based on data science | |
CN110826320B (en) | Sensitive data discovery method and system based on text recognition | |
CN112800113B (en) | Bidding auditing method and system based on data mining analysis technology | |
CN112651296A (en) | Method and system for automatically detecting data quality problem without prior knowledge | |
CN111191051B (en) | Method and system for constructing emergency knowledge map based on Chinese word segmentation technology | |
CN112163553B (en) | Material price accounting method, device, storage medium and computer equipment | |
CN112257425A (en) | Power data analysis method and system based on data classification model | |
CN113129057A (en) | Software cost information processing method and device, computer equipment and storage medium | |
CN113569005B (en) | Large-scale data characteristic intelligent extraction method based on data content | |
CN114398891B (en) | Method for generating KPI curve and marking wave band characteristics based on log keywords | |
CN116401343A (en) | Data compliance analysis method | |
CN118133221A (en) | Classification and classification method for private data | |
CN116842021B (en) | Data dictionary standardization method, equipment and medium based on AI generation technology | |
CN112182225A (en) | Knowledge management method for multi-modal scene target based on semi-supervised deep learning | |
CN115221013B (en) | Method, device and equipment for determining log mode | |
CN105573984B (en) | The recognition methods of socio-economic indicator and device | |
CN114611515B (en) | Method and system for identifying enterprise actual control person based on enterprise public opinion information | |
CN109918638B (en) | Network data monitoring method | |
CN115774840A (en) | Data identification method based on industrial internet | |
CN113657443B (en) | On-line Internet of things equipment identification method based on SOINN network | |
CN116226371A (en) | Digital economic patent classification method | |
CN114610882A (en) | Abnormal equipment code detection method and system based on electric power short text classification | |
CN109299456B (en) | Geographical name recognition method | |
CN111680986B (en) | Method and device for identifying serial case | |
CN118193664B (en) | Unified social credit code administrative division data complement method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||