CN113569005B - Large-scale data characteristic intelligent extraction method based on data content - Google Patents

Publication number
CN113569005B
CN113569005B CN202110670587.7A CN202110670587A CN113569005B CN 113569005 B CN113569005 B CN 113569005B CN 202110670587 A CN202110670587 A CN 202110670587A CN 113569005 B CN113569005 B CN 113569005B
Authority
CN
China
Prior art keywords
data
field
type
groups
invalid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110670587.7A
Other languages
Chinese (zh)
Other versions
CN113569005A (en)
Inventor
葛俊
梁云丹
黄建平
张旭东
张建松
陈浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Zhejiang Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Zhejiang Electric Power Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligent large-scale data feature extraction method based on data content, comprising the following steps: performing preliminary identification of field types on the data and eliminating invalid data; comparing the Chinese description of the data against its field type, sampling the data whose description and type do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions; and extracting features according to the field type. The substantive effects of the invention include: it accounts for both the generality of table processing and detection and the relationship between field types and the meaning of the data they hold; features can be extracted for each field from the table header information and the data content alone; feature extraction is automated and scales to large data volumes; detection objects are accurately identified and located for data quality investigations; and a foundation is laid for improving the efficiency of subsequent quality inspection work.

Description

Large-scale data characteristic intelligent extraction method based on data content
Technical Field
The invention relates to the technical field of data feature extraction, and in particular to an intelligent large-scale data feature extraction method based on data content.
Background
With the development of the digital economy, industries no longer merely pursue the scale of data volume; requirements on data quality keep rising as data is put to use. How to find and locate data quality problems faster, more accurately, and more intelligently, and then carry out the corresponding remediation work, is the key and core of current enterprise-level data asset management.
In the prior art, publication No. CN105554152A discloses a method and a device for extracting data features. In more detail, another invention, publication No. CN108256074A, discloses a verification processing method comprising: obtaining the models of a data warehouse to be verified, each model comprising multiple pieces of field information, the field information comprising a field definition and a field type; verifying the field information against a pre-stored data dictionary, the data dictionary comprising multiple standard terms, each comprising a standard definition and a standard type; and, if the field definition matches the standard definition but the field type does not match the standard type, modifying the field type to be consistent with the standard type. That method verifies the data warehouse model against standard terms and, where the field definition matches but the field type does not, modifies the field type to agree with the standard type, so that a model consistent with the standard is obtained.
Prior-art approaches to these problems vary widely. In the traditional data quality management mode, the objects of problem detection are selected by business experts, who must designate specific data tables and fields based on business specifications and experience, and must also specify what features each field has and which rules apply to it. This mode places extremely high demands on the experts' experience and expertise, and it limits the range of objects checked for data quality problems. Because it depends heavily on business experts, who must designate detection objects and ranges one by one for large-scale, massive data, the resulting data features generalize poorly and maintenance is time-consuming and labor-intensive. Large-scale, automated definition of data quality detection objects and extraction of the corresponding data features cannot be achieved, so data quality inspection is inefficient and heavily influenced by human experience.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide an intelligent large-scale data feature extraction method based on data content. The method takes into account the generality of table processing and detection and can extract features for each field from the table header information and the data content alone, without any additional domain knowledge. It automates feature extraction at scale, removes the need to designate data objects and feature conditions for data quality detection one by one, reduces dependence on the knowledge and experience of business personnel, provides accurate identification and location of detection objects for data quality investigations, and provides a basis for improving the efficiency of subsequent quality inspection work.
The technical solution of the invention is as follows.
An intelligent large-scale data feature extraction method based on data content comprises the following steps:
performing preliminary identification of field types on the data, and eliminating invalid data;
comparing the Chinese description of the data against its field type, sampling the data whose description and type do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions;
and extracting features according to the field type.
In the extraction process of the invention, the characteristics of the different field types and the Chinese descriptions of the corresponding data are considered together, and the field type is judged comprehensively in two steps, preliminary identification and revision, which improves identification accuracy.
Preferably, the preliminary identification process includes: preliminarily identifying the data to be identified against an existing field type database, or using a recognition model trained with a neural network, to obtain a preliminary identification result for the field type. Different field types have their own characteristics, and the prior art in this field generally identifies them by comparison against a database, a trained model, or the like. Here this technique is used only for preliminary identification, which reduces implementation cost while ensuring a baseline level of accuracy.
Preferably, the process of eliminating invalid data includes: defining invalid tables and invalid fields; judging, from each table's metadata and data content, empty tables, zombie tables, log tables, backup tables, temporary tables, single-field tables, and low-heat (rarely accessed) tables to be invalid tables; judging null fields and single-value fields to be invalid fields; and identifying and eliminating the invalid tables and fields. The invalid tables and fields so defined cover the common kinds of invalid data, and removing them reduces the processing load of subsequent data extraction and analysis.
Preferably, the process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data with an NLP (natural language processing) module; after this analysis, performing path recognition of synonyms or near-synonyms through a type decision tree; and, where the semantics of the Chinese description do not match the field type, marking the field as a suspected-revision field type. Data content whose Chinese descriptions have the same or similar semantics is then sampled multiple times, the proportions of the different field types in the sampled data are counted, the type whose proportion exceeds a threshold is taken as the recommended revised field type, and the field type is finally revised to the recommended type, that is, the type to which the actually stored data belongs. Natural language processing can segment and semantically recognize the Chinese description, and the decision tree can recognize paths with similar meanings, which together help judge whether a field is a suspected-revision field type. The result is then determined by comparing the proportion against a set threshold. The revision process complements the preliminary identification and further improves identification accuracy.
Preferably, the field type includes at least one of a numeric type, a text type, and a date type.
Preferably, extracting features according to the field type includes: for numeric fields, extracting features and feature values by means of the mean, maximum, minimum, median, variance, interquartile range, numeric-value clustering, and length clustering; for text fields, extracting statistical attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date fields, performing structural analysis and extracting features from the date format and length.
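The structural analysis of date fields (and the structure distribution of text fields) can be illustrated with a minimal sketch. It assumes that "structure" means the layout of digits and separators in each stored value; the function names `value_shape` and `structure_features` are hypothetical and do not come from the patent.

```python
import re
from collections import Counter

def value_shape(value: str) -> str:
    """Replace every digit with 'd' so that only the layout remains,
    e.g. '2021-06-17' becomes 'dddd-dd-dd'."""
    return re.sub(r"\d", "d", value.strip())

def structure_features(values):
    """Summarize the dominant layout and the length distribution of a field."""
    if not values:
        return {}
    shapes = Counter(value_shape(v) for v in values)
    lengths = Counter(len(v.strip()) for v in values)
    dominant, hits = shapes.most_common(1)[0]
    return {
        "dominant_format": dominant,
        "format_share": hits / len(values),
        "length_distribution": dict(lengths),
    }
```

A field whose `format_share` is close to 1 has a single consistent layout; a low share suggests mixed formats that a quality check might flag.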
Preferably, after the field type revision is finished, the method further comprises a verification step: the date data are converted into text data and copied into a verification group and an interference group; year, month, and day descriptions are inserted into the verification group according to the original date format, and counting-unit descriptions are added to the interference group according to the number of digits of the original date data; the verification group and the interference group are each inserted into the text data adjacent to them; semantic recognition is performed on the spliced text data by the NLP module, and the recognition speed of each pair of interference group and verification group is recorded; if the verification group is recognized faster than the interference group by more than an amplitude threshold, verification is passed; otherwise, the corresponding original date data are classified as a suspected error type. Because date or numeric data are often associated with the adjacent text data, when the original identification is correct the text spliced from the verification group is easier to recognize and is therefore recognized faster; if the original identification is wrong, the text spliced from the verification group is itself wrong, so it has no speed advantage over the interference group and may even be slower, and the data are classified as a suspected error type.
The substantive effects of the invention include: it accounts for both the generality of table processing and detection and the relationship between field types and the meaning of the data they hold; features can be extracted for each field from the table header information and the data content alone; feature extraction is automated and scales to large data volumes; detection objects are accurately identified and located for data quality investigations; and a foundation is laid for improving the efficiency of subsequent quality inspection work.
Detailed Description
The technical scheme of the present application will be described below with reference to examples. In addition, numerous specific details are set forth in the following description in order to provide a better understanding of the present invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present invention.
Example 1:
An intelligent large-scale data feature extraction method based on data content, directed mainly at the numeric, text, and date field types.
The method comprises the following steps:
S01: Perform preliminary identification of field types on the data, and eliminate invalid data.
The preliminary identification process comprises: preliminarily identifying the data to be identified against an existing field type database, or using a recognition model trained with a neural network, to obtain a preliminary identification result for the field type. Different field types have their own characteristics, and the prior art in this field generally identifies them by comparison against a database, a trained model, or the like. In this embodiment the technique is used only for preliminary identification, which reduces implementation cost while ensuring a baseline level of accuracy.
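As an illustration of preliminary identification, the following sketch uses a few hand-written rules in place of the field type database or trained model, neither of which the text specifies. The date layouts in `DATE_FORMATS`, the numeric pattern, and the majority vote are all assumptions made for the example.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical fixed-width date layouts (format -> expected length);
# a real field type database would hold many more.
DATE_FORMATS = {"%Y-%m-%d": 10, "%Y/%m/%d": 10, "%Y%m%d": 8}
NUMERIC_RE = re.compile(r"[+-]?\d+(\.\d+)?")

def classify_value(value: str) -> str:
    """Return 'date', 'numeric', or 'text' for a single stored value."""
    v = value.strip()
    for fmt, width in DATE_FORMATS.items():
        if len(v) != width:
            continue
        try:
            datetime.strptime(v, fmt)
            return "date"
        except ValueError:
            pass
    if NUMERIC_RE.fullmatch(v):
        return "numeric"
    return "text"

def preliminary_type(values):
    """Majority vote over the non-empty values; None if nothing is usable."""
    counts = Counter(classify_value(v) for v in values if v and v.strip())
    return counts.most_common(1)[0][0] if counts else None
```

The result is only a first guess: an eight-digit identifier that happens to parse as a date would be labeled `date` here, which is the kind of error the later revision step is meant to catch.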
The process of eliminating invalid data includes: defining invalid tables and invalid fields; judging, from each table's metadata and data content, empty tables, zombie tables, log tables, backup tables, temporary tables, single-field tables, and low-heat (rarely accessed) tables to be invalid tables; judging null fields and single-value fields to be invalid fields; and identifying and eliminating the invalid tables and fields. The invalid tables and fields so defined cover the common kinds of invalid data, and removing them reduces the processing load of subsequent data extraction and analysis.
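The invalid-table and invalid-field rules could be sketched as follows. The text names the categories but not the criteria, so the metadata fields of `TableMeta`, the name patterns, and the thresholds `zombie_days` and `min_reads` are all hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass
class TableMeta:
    name: str
    row_count: int
    column_count: int
    days_since_last_write: int  # hypothetical metadata field
    reads_last_90_days: int     # hypothetical metadata field

# Hypothetical naming conventions for log, backup, and temporary tables.
NAME_PATTERNS = {
    "log": re.compile(r"(^|_)logs?(_|$)", re.I),
    "backup": re.compile(r"(^|_)(bak|backup)(_|\d|$)", re.I),
    "temporary": re.compile(r"(^|_)(tmp|temp)(_|$)", re.I),
}

def invalid_table_reason(t: TableMeta, zombie_days=365, min_reads=1):
    """Return why a table is invalid, or None if it should be kept."""
    if t.row_count == 0:
        return "empty"
    if t.column_count == 1:
        return "single-field"
    for reason, pattern in NAME_PATTERNS.items():
        if pattern.search(t.name):
            return reason
    if t.days_since_last_write > zombie_days:
        return "zombie"
    if t.reads_last_90_days < min_reads:
        return "low-heat"
    return None

def invalid_field_reason(values):
    """Return 'null', 'single-value', or None for a column of values."""
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    if not non_null:
        return "null"
    if len(set(non_null)) == 1:
        return "single-value"
    return None
```

Returning the reason rather than a bare boolean keeps the rejection auditable, which matters when a table is excluded from quality inspection.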
S02: Compare the Chinese description of the data against its field type, sample the data whose description and type do not match, calculate the proportion of each field type in the sample, and revise the field type according to the proportions.
The process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data with an NLP (natural language processing) module; after this analysis, performing path recognition of synonyms or near-synonyms through a type decision tree; and, where the semantics of the Chinese description do not match the field type, marking the field as a suspected-revision field type. Data content whose Chinese descriptions have the same or similar semantics is then sampled multiple times, the proportions of the different field types in the sampled data are counted, the type whose proportion exceeds a threshold is taken as the recommended revised field type, and the field type is finally revised to the recommended type, that is, the type to which the actually stored data belongs. Natural language processing can segment and semantically recognize the Chinese description, and the decision tree can recognize paths with similar meanings, which together help judge whether a field is a suspected-revision field type. The result is then determined by comparing the proportion against a set threshold. The revision process complements the preliminary identification and further improves identification accuracy.
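The sampling and threshold part of the revision step can be sketched as below. The NLP segmentation and the type decision tree are out of scope here; the sketch assumes a field has already been flagged and that some per-value classifier is available. The sample size, number of rounds, threshold value, and the stand-in `simple_classify` are assumptions.

```python
import random
from collections import Counter

def type_shares(values, classify, sample_size=100, rounds=5, seed=0):
    """Sample the field several times and return the share of each type."""
    rng = random.Random(seed)
    pool = [v for v in values if v not in (None, "")]
    counts = Counter()
    for _ in range(rounds):
        sample = rng.sample(pool, min(sample_size, len(pool)))
        counts.update(classify(v) for v in sample)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}

def recommend_revision(declared_type, shares, threshold=0.9):
    """Return the recommended revised type, or None to keep the declared one."""
    if not shares:
        return None
    best, share = max(shares.items(), key=lambda kv: kv[1])
    if best != declared_type and share > threshold:
        return best
    return None

def simple_classify(value: str) -> str:
    """Stand-in classifier: unsigned digits with at most one dot are numeric."""
    return "numeric" if value.replace(".", "", 1).isdigit() else "text"
```

For example, a field declared as text in which 95% of sampled values are numeric would be recommended for revision to numeric at a 0.9 threshold, while the same field at a 0.99 threshold would be left unchanged.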
S03: Extract features according to the field type.
The process of extracting features according to the field type includes: for numeric fields, extracting features and feature values by means of the mean, maximum, minimum, median, variance, interquartile range, numeric-value clustering, and length clustering; for text fields, extracting statistical attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date fields, performing structural analysis and extracting features from the date format and length.
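For numeric fields, the listed statistics map directly onto Python's standard `statistics` module. The sketch below is one possible reading: it uses the population variance, takes quartiles with the module's default method, and approximates "length clustering" by a simple count of string lengths. Numeric-value clustering is omitted because the text does not say which algorithm is intended.

```python
import statistics
from collections import Counter

def numeric_features(raw_values):
    """Compute summary features for a numeric field stored as strings.
    Needs at least two values, because quartiles are undefined for one."""
    nums = [float(v) for v in raw_values]
    q1, _, q3 = statistics.quantiles(nums, n=4)
    return {
        "mean": statistics.fmean(nums),
        "max": max(nums),
        "min": min(nums),
        "median": statistics.median(nums),
        "variance": statistics.pvariance(nums),
        "iqr": q3 - q1,
        # Length "clustering" approximated by a histogram of string lengths.
        "length_distribution": dict(Counter(len(str(v)) for v in raw_values)),
    }
```

A field whose `length_distribution` has a single dominant length (for instance 11 for mobile numbers) exposes the "concentrated length" feature mentioned in the description.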
More specifically, the data features and feature extraction methods applicable to a field type can be looked up in a data feature library, and all applicable extraction methods for that field type are traversed according to the network of dependency and mutual-exclusion relationships among the corresponding data features. For example, once a data field is determined to be numeric, the feature extraction algorithm loads extraction methods for features such as length, integer, positive number, negative number, and decimal, as well as for business features such as mobile phone number and postal code. By successively identifying and extracting from the data content, features such as concentrated length, integer, and mobile phone number can be obtained, while mutually exclusive features such as positive and negative are distinguished, yielding multi-angle features and feature values for the field.
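The traversal over dependent and mutually exclusive features might look like the sketch below. The contents of `FEATURES` are a hypothetical miniature of the data feature library; the 11-digit mobile number and 6-digit postal code patterns are assumptions about Chinese formats, not rules taken from the patent.

```python
import re

# Hypothetical feature library: name -> (predicate over all values,
# features it depends on, features it is mutually exclusive with).
# Entries are ordered so that dependencies come before dependents.
FEATURES = {
    "integer": (lambda vs: all(re.fullmatch(r"[+-]?\d+", v) for v in vs), [], ["decimal"]),
    "decimal": (lambda vs: any("." in v for v in vs), [], ["integer"]),
    "positive": (lambda vs: all(float(v) > 0 for v in vs), [], ["negative"]),
    "negative": (lambda vs: all(float(v) < 0 for v in vs), [], ["positive"]),
    "fixed_length": (lambda vs: len({len(v) for v in vs}) == 1, [], []),
    "mobile_number": (lambda vs: all(re.fullmatch(r"1\d{10}", v) for v in vs),
                      ["integer", "positive", "fixed_length"], []),
    "postal_code": (lambda vs: all(re.fullmatch(r"\d{6}", v) for v in vs),
                    ["integer", "fixed_length"], []),
}

def extract_features(values):
    """Walk the library in order, skipping features whose dependencies are
    missing or whose mutually exclusive counterpart was already found."""
    found = []
    for name, (predicate, requires, excludes) in FEATURES.items():
        if any(r not in found for r in requires):
            continue
        if any(e in found for e in excludes):
            continue
        if predicate(values):
            found.append(name)
    return found
```

Checking dependencies first means the business-level tests (mobile number, postal code) only run on fields that have already passed the cheaper structural tests.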
Example 2:
This embodiment is substantially the same as the previous one, except that after the field type revision is finished and before the features are extracted, a verification step is further included: the date data are converted into text data and copied into a verification group and an interference group; year, month, and day descriptions are inserted into the verification group according to the original date format, and counting-unit descriptions are added to the interference group according to the number of digits of the original date data; the verification group and the interference group are each inserted into the text data adjacent to them; semantic recognition is performed on the spliced text data by the NLP module, and the recognition speed of each pair of interference group and verification group is recorded; if the verification group is recognized faster than the interference group by more than an amplitude threshold, verification is passed; otherwise, the corresponding original date data are classified as a suspected error type. Because date or numeric data are often associated with the adjacent text data, when the original identification is correct the text spliced from the verification group is easier to recognize and is therefore recognized faster; if the original identification is wrong, the text spliced from the verification group is itself wrong, so it has no speed advantage over the interference group and may even be slower, and the data are classified as a suspected error type.
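The verification step can be outlined in code, with several assumptions stated up front: dates are taken to be in YYYYMMDD layout, the interference group simply appends a generic counting unit, the amplitude threshold is interpreted as a relative speed-up, and the NLP module is any callable passed in as `recognize`. None of these details is fixed by the text.

```python
import time

def verification_text(date_str: str) -> str:
    """Insert year, month, and day markers, assuming a YYYYMMDD layout."""
    return f"{date_str[:4]}年{date_str[4:6]}月{date_str[6:8]}日"

def interference_text(date_str: str, unit: str = "个") -> str:
    """Treat the digits as a plain quantity by appending a counting unit."""
    return f"{date_str}{unit}"

def timed(recognize, text, repeats=5):
    """Best wall-clock time of recognize(text) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        recognize(text)
        best = min(best, time.perf_counter() - start)
    return best

def passes_verification(t_verify, t_interfere, amplitude=0.1):
    """Pass when the verification text is faster by more than `amplitude`,
    measured relative to the interference time."""
    if t_interfere <= 0:
        return False
    return (t_interfere - t_verify) / t_interfere > amplitude

def verify_date(date_str, neighbor_text, recognize, amplitude=0.1):
    """Splice both variants onto the adjacent text and compare timings."""
    t_v = timed(recognize, neighbor_text + verification_text(date_str))
    t_i = timed(recognize, neighbor_text + interference_text(date_str))
    return "verified" if passes_verification(t_v, t_i, amplitude) else "suspected error"
```

Wall-clock timings are noisy, so a practical version would likely need many repetitions per pair and a threshold calibrated on fields whose types are already known.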
The substantive effects of the above embodiments include: they account for both the generality of table processing and detection and the relationship between field types and the meaning of the data they hold; features can be extracted for each field from the table header information and the data content alone; feature extraction is automated and scales to large data volumes; detection objects are accurately identified and located for data quality investigations; and a foundation is laid for improving the efficiency of subsequent quality inspection work.
From the description of the above embodiments, those skilled in the art will appreciate that, in practical applications, the functions described above may be assigned to different functional modules as needed; that is, the internal structure of a specific apparatus may be divided into different functional modules to perform all or part of the functions described above.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on this understanding, the technical solution of the embodiments of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The foregoing describes only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application is intended to be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (5)

1. An intelligent large-scale data feature extraction method based on data content, characterized by comprising the following steps:
performing preliminary identification of field types on the data, and eliminating invalid data;
comparing the Chinese description of the data against its field type, sampling the data whose description and type do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions;
extracting features according to the field type;
wherein the process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data with an NLP (natural language processing) module; after this analysis, performing path recognition of synonyms or near-synonyms through a type decision tree; where the semantics of the Chinese description do not match the field type, marking the field as a suspected-revision field type; then sampling, multiple times, the data content whose Chinese descriptions have the same or similar semantics, counting the proportions of the different field types in the sampled data, taking the type whose proportion exceeds a threshold as the recommended revised field type, and finally revising the field type to the recommended type, that is, the type to which the actually stored data belongs;
and wherein, after the field type revision is finished, the method further comprises a verification step: the date data are converted into text data and copied into a verification group and an interference group; year, month, and day descriptions are inserted into the verification group according to the original date format, and counting-unit descriptions are added to the interference group according to the number of digits of the original date data; the verification group and the interference group are each inserted into the text data adjacent to them; semantic recognition is performed on the spliced text data by the NLP module, and the recognition speed of each pair of interference group and verification group is recorded; if the verification group is recognized faster than the interference group by more than an amplitude threshold, verification is passed; otherwise, the corresponding original date data are classified as a suspected error type.
2. The intelligent large-scale data feature extraction method based on data content according to claim 1, wherein the preliminary identification process comprises: preliminarily identifying the data to be identified against an existing field type database, or using a recognition model trained with a neural network, to obtain a preliminary identification result for the field type.
3. The intelligent large-scale data feature extraction method based on data content according to claim 1, wherein the process of eliminating invalid data comprises: defining invalid tables and invalid fields; judging, from each table's metadata and data content, empty tables, zombie tables, log tables, backup tables, temporary tables, single-field tables, and low-heat (rarely accessed) tables to be invalid tables; judging null fields and single-value fields to be invalid fields; and identifying and eliminating the invalid tables and fields.
4. The intelligent large-scale data feature extraction method based on data content according to claim 1, wherein the field type includes at least one of a numeric type, a text type, and a date type.
5. The intelligent large-scale data feature extraction method based on data content according to claim 4, wherein the process of extracting features according to the field type comprises: for numeric fields, extracting features and feature values by means of the mean, maximum, minimum, median, variance, interquartile range, numeric-value clustering, and length clustering; for text fields, extracting statistical attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date fields, performing structural analysis and extracting features from the date format and length.
CN202110670587.7A 2021-06-17 2021-06-17 Large-scale data characteristic intelligent extraction method based on data content Active CN113569005B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110670587.7A CN113569005B (en) 2021-06-17 2021-06-17 Large-scale data characteristic intelligent extraction method based on data content

Publications (2)

Publication Number Publication Date
CN113569005A (en) 2021-10-29
CN113569005B (en) 2024-02-20

Family

ID=78162177

Family Applications (1)

Application Number Status Publication Priority Date Filing Date Title
CN202110670587.7A Active CN113569005B (en) 2021-06-17 2021-06-17 Large-scale data characteristic intelligent extraction method based on data content

Country Status (1)

Country Link
CN (1) CN113569005B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115221893B (en) * 2022-09-21 2023-01-13 中国电子信息产业集团有限公司 Quality inspection rule automatic configuration method and device based on rule and semantic analysis

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104731976A (en) * 2015-04-14 2015-06-24 海量云图(北京)数据技术有限公司 Method for finding and sorting private data in data table
CN109271377A (en) * 2018-08-10 2019-01-25 蜜小蜂智慧(北京)科技有限公司 A kind of data quality checking method and device
CN109948164A (en) * 2019-04-02 2019-06-28 北京三快在线科技有限公司 Processing method, device, computer equipment and the storage medium of statistical demand information
KR20200070775A (en) * 2018-12-10 2020-06-18 한국전자통신연구원 Apparatus and method for normalizing security information of heterogeneous systems
CN111506731A (en) * 2020-04-17 2020-08-07 支付宝(杭州)信息技术有限公司 Method, device and equipment for training field classification model
CN111539021A (en) * 2020-04-26 2020-08-14 支付宝(杭州)信息技术有限公司 Data privacy type identification method, device and equipment
CN112181936A (en) * 2019-07-03 2021-01-05 北京京东尚科信息技术有限公司 Database detection method and device
WO2021012382A1 (en) * 2019-07-25 2021-01-28 深圳壹账通智能科技有限公司 Method and apparatus for configuring chat robot, computer device and storage medium
CN112434032A (en) * 2020-11-17 2021-03-02 北京融七牛信息技术有限公司 Automatic feature generation system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8412671B2 (en) * 2004-08-13 2013-04-02 Hewlett-Packard Development Company, L.P. System and method for developing a star schema

Also Published As

Publication number Publication date
CN113569005A (en) 2021-10-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant