CN113569005A - Large-scale data feature intelligent extraction method based on data content - Google Patents

Publication number
CN113569005A
Authority
CN
China
Prior art keywords
data
type
field
field type
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110670587.7A
Other languages
Chinese (zh)
Other versions
CN113569005B (en)
Inventor
葛俊
梁云丹
黄建平
张旭东
张建松
陈浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Zhejiang Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Zhejiang Electric Power Co Ltd
Priority to CN202110670587.7A
Publication of CN113569005A
Application granted
Publication of CN113569005B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for intelligently extracting features from large-scale data based on data content, comprising the following steps: performing preliminary identification of the field types of the data and eliminating invalid data; checking the Chinese description of the data against the field type, sampling the data whose description and type do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions; and extracting features according to field type. The substantial effects of the invention include: the method takes into account the generality of table processing and detection and the relationships between the meanings represented by data of different field types; it can extract the corresponding features for each field from the header information and data content alone; it makes feature extraction automatic and scalable; it provides accurate identification and location of detection objects for investigating data quality problems; and it lays a foundation for improving the efficiency of subsequent quality detection work.

Description

Large-scale data feature intelligent extraction method based on data content
Technical Field
The invention relates to the technical field of data feature extraction, in particular to a large-scale data feature intelligent extraction method based on data content.
Background
With the development of the digital economy, industries no longer pursue data volume alone, and the demands on data quality in data applications keep rising. Faced with massive data resources, finding and locating data quality problems more quickly, accurately, and intelligently, and then carrying out the corresponding remediation, is the key and core of current enterprise-level data asset management.
In the prior art, publication No. CN105554152A discloses a method and apparatus for extracting data features. In more detail, publication No. CN108256074A discloses a verification processing method, which includes: obtaining the models of a data warehouse to be verified, each model including a plurality of pieces of field information, the field information including a field definition and a field type; checking the field information against a pre-stored data dictionary, the data dictionary including a plurality of standard expressions, each including a standard definition and a standard type; and, if the field definition matches the standard definition but the field type does not match the standard type, modifying the field type to be consistent with the standard type. That method verifies the data warehouse model against the standard expressions and, when the definition matches but the type does not, modifies the field type accordingly, thereby obtaining a model consistent with the standard.
The prior art solves related problems in different ways. In the traditional data quality management mode, business experts must select the objects of problem detection by specifying particular data tables and fields according to business specifications and experience, and must also specify the characteristics of each field and the applicable rules. This places extremely high demands on the experts' experience and professional skill, keeps the range of detection objects relatively limited, and makes the process highly dependent on the experts. For large-scale data, the experts must designate the detection objects and ranges one by one; the resulting data features generalize poorly and are time-consuming and labor-intensive to maintain. Large-scale, automatic definition of data quality detection objects and extraction of the corresponding data features cannot be achieved, so data quality audits are inefficient and strongly affected by manual experience.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a method for intelligently extracting features from large-scale data based on data content. The method takes into account the generality of table processing and detection, requires no additional knowledge about the tables, and can extract the corresponding features for each field from the header information and data content alone. It makes feature extraction automatic and large-scale, removes the need to specify data objects and feature conditions for data quality detection one by one, reduces dependence on the knowledge and experience of business personnel, provides accurate identification and location of detection objects for investigating data quality problems, and lays a foundation for improving the efficiency of subsequent quality detection.
The technical scheme of the invention is as follows.
A large-scale data feature intelligent extraction method based on data content comprises the following steps:
performing preliminary identification of the field types of the data, and eliminating invalid data;
checking the Chinese description of the data against the field type, sampling the data whose description and type do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions;
extracting features according to field type.
In the extraction process, the characteristics of the different field types are considered together with the Chinese description of the corresponding data, and the field type is determined in two steps, preliminary identification and revision, which improves identification accuracy.
Preferably, the process of preliminary identification includes: identifying the data against an existing field type database, or using a recognition model trained with a neural network, to obtain a preliminary identification result for the field type. Different field types have their own characteristics, and the prior art in this field generally identifies them by comparison against a database, a trained model, or the like. Here such techniques are used only for preliminary identification, which reduces implementation cost while guaranteeing a baseline accuracy.
Preferably, the process of eliminating invalid data includes: defining invalid tables and invalid fields; judging, from the metadata information and data content of each table, that empty tables, zombie tables, log tables, backup tables, temporary tables, single-character field tables, and low-heat tables are invalid tables; judging that null fields and single-value fields are invalid fields; and identifying and eliminating the invalid tables and fields. These definitions cover the common kinds of invalid data, and removing them reduces the processing load of subsequent data extraction and analysis.
Preferably, the process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data with a natural language processing (NLP) module; after this analysis, performing path recognition on similar words or similar characters through a type decision tree; and, when the semantics of the Chinese description do not match the field type, marking the field as a suspected field type to be revised. The data content of fields whose Chinese descriptions have the same or similar semantics is then sampled multiple times, the proportions of the different field types in the sampled data are counted, the type whose proportion exceeds a threshold is taken as the recommended revised field type, and the field type is finally revised to the recommended type, that is, the type to which the actually stored data belongs. NLP performs the word segmentation and semantic recognition of the Chinese description, and the decision tree recognizes paths with similar meanings to help judge whether a field is a suspected field type to be revised. The result is decided by comparing the proportion against a set threshold. The revision process supplements the preliminary identification and further improves identification accuracy.
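The mismatch check between a Chinese description and an identified field type can be illustrated with a minimal sketch. The source does not specify the NLP module or the type decision tree, so plain substring matching against a small keyword table stands in for both here; `KEYWORD_TYPES`, `expected_type`, and `is_suspected` are hypothetical names, and the keywords are illustrative assumptions only.

```python
# Hypothetical keyword table: a stand-in for word segmentation plus a type
# decision tree, neither of which is specified in the source.
KEYWORD_TYPES = {
    "日期": "date",      # "date"
    "时间": "date",      # "time"
    "金额": "numeric",   # "amount"
    "数量": "numeric",   # "quantity"
    "名称": "text",      # "name"
    "地址": "text",      # "address"
}


def expected_type(description):
    """Return the field type suggested by the description, or None if unknown."""
    for keyword, field_type in KEYWORD_TYPES.items():
        if keyword in description:
            return field_type
    return None


def is_suspected(description, identified_type):
    """Flag a field whose description suggests a different type than identified."""
    expected = expected_type(description)
    return expected is not None and expected != identified_type
```

For example, a field described as "合同签订日期" (contract signing date) but identified as text would be flagged, whereas a description with no known keyword is never flagged.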
Preferably, the field type includes at least one of a numeric type, a text type, and a date type.
Preferably, extracting features according to field type includes: for numeric fields, extracting features and characteristic values using the mean, maximum, minimum, median, variance, interquartile range, numerical clustering, and length clustering; for text fields, deriving attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and, for date fields, performing structure analysis and extracting features from the date format and date length.
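One possible reading of "structure distribution" for text fields, and of "structure analysis" for date fields, is to reduce each value to a character-class pattern and count how often each pattern and each length occurs. The sketch below assumes that reading; the pattern alphabet (`9` for digits, `C` for CJK characters, `A` for other letters) and the function names are assumptions, and simple counting replaces the unspecified clustering. Content features from word segmentation are not covered.

```python
from collections import Counter


def structure_pattern(value):
    """Map a value to a character-class pattern, e.g. '2021-06-17' -> '9999-99-99'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif "\u4e00" <= ch <= "\u9fff":  # common CJK ideographs
            out.append("C")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # keep separators such as '-' or '/' as they are
    return "".join(out)


def structure_features(values):
    """Count value lengths and structure patterns across a field."""
    return {
        "length_distribution": Counter(len(v) for v in values),
        "structure_distribution": Counter(structure_pattern(v) for v in values),
    }
```

Applied to a date field, the dominant pattern and length expose the date format in use, and values that deviate from the dominant pattern become candidates for quality checks.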
Preferably, after the field type has been revised, the method further includes a verification step: converting the date data into text data and copying it into a verification group and an interference group; inserting year, month, and day descriptions into the verification group according to the original date format; adding counting-unit descriptions to the interference group according to the number of digits of the original date data; inserting the verification group and the interference group into their own adjacent text data; performing semantic recognition on the spliced text data with the NLP module; and recording the recognition speed of each pair of interference group and verification group. If the recognition speed of the verification group is higher than that of the interference group by more than an amplitude threshold, the verification passes; otherwise the corresponding original date data is listed as a suspected error type. Whether data is of date type or numeric type, it is often connected with the adjacent text data. When the original identification is correct, the text spliced from the verification group is easier to recognize, so its recognition speed is high. If the original identification is wrong, the text spliced from the verification group is itself wrong, so it has no speed advantage over the interference group and may even be slower, and the data is therefore classified as a suspected error type.
The substantial effects of the invention include: the method takes into account the generality of table processing and detection and the relationships between the meanings represented by data of different field types; it can extract the corresponding features for each field from the header information and data content alone; it makes feature extraction automatic and scalable; it provides accurate identification and location of detection objects for investigating data quality problems; and it lays a foundation for improving the efficiency of subsequent quality detection work.
Detailed Description
The technical solution of the present application will be described with reference to the following examples. In addition, numerous specific details are set forth below in order to provide a better understanding of the present invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present invention.
Example 1:
A method for intelligently extracting features from large-scale data based on data content, mainly targeting field types including the numeric type, the text type, and the date type.
The method comprises the following steps:
S01: Perform preliminary identification of the field types of the data, and eliminate invalid data.
The process of preliminary identification includes: identifying the data against an existing field type database, or using a recognition model trained with a neural network, to obtain a preliminary identification result for the field type. Different field types have their own characteristics, and the prior art in this field generally identifies them by comparison against a database, a trained model, or the like. In this embodiment such techniques are used only for preliminary identification, which reduces implementation cost while guaranteeing a baseline accuracy.
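As a rough illustration of preliminary identification, the sketch below classifies each value with a few fixed rules and takes the most common result as the field type. This is a rule-based stand-in, not the field type database or neural network model the text refers to; `DATE_FORMATS`, `classify_value`, and `identify_field` are hypothetical, and the three date formats are assumed examples.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed example formats; a real field type database would hold many more.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")
NUMERIC_RE = re.compile(r"[+-]?\d+(\.\d+)?")


def classify_value(value):
    """Classify a single string value as 'date', 'numeric', or 'text'."""
    value = value.strip()
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            pass  # not this format; try the next one
    if NUMERIC_RE.fullmatch(value):
        return "numeric"
    return "text"


def identify_field(values):
    """Return the most common value type in a field, ignoring blank values."""
    counts = Counter(classify_value(v) for v in values if v and v.strip())
    return counts.most_common(1)[0][0] if counts else "text"
```

Rules of this kind are inherently ambiguous: an eight-digit string can parse as a compact date yet really be a number. That ambiguity is why a later revision step is useful.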
The process of eliminating invalid data includes: defining invalid tables and invalid fields; judging, from the metadata information and data content of each table, that empty tables, zombie tables, log tables, backup tables, temporary tables, single-character field tables, and low-heat tables are invalid tables; judging that null fields and single-value fields are invalid fields; and identifying and eliminating the invalid tables and fields. These definitions cover the common kinds of invalid data, and removing them reduces the processing load of subsequent data extraction and analysis.
S02: Check the Chinese description of the data against the field type, sample the data whose description and type do not match, calculate the proportion of each field type in the sample, and revise the field type according to the proportions.
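A minimal sketch of the invalid-data screen follows, assuming a table is held in memory as a mapping from column name to a list of values. It covers only null fields, single-value fields, empty tables, and tables whose names look like backup, temporary, or log tables. The name suffixes in `INVALID_NAME_RE` are assumptions, since the source says only that metadata and data content are used. Zombie and low-heat tables would need access statistics and are not covered.

```python
import re

# Assumed naming convention for backup, temporary, and log tables.
INVALID_NAME_RE = re.compile(r"(_bak|_backup|_tmp|_temp|_log)$", re.IGNORECASE)


def is_null(value):
    """Treat None and blank strings as null."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def invalid_fields(table):
    """Return the names of null fields and single-value fields."""
    bad = []
    for name, values in table.items():
        distinct = {v for v in values if not is_null(v)}
        if len(distinct) <= 1:  # 0 distinct -> null field, 1 -> single-value field
            bad.append(name)
    return bad


def is_invalid_table(name, table):
    """Flag empty tables and tables whose name matches an assumed invalid pattern."""
    row_count = max((len(values) for values in table.values()), default=0)
    return row_count == 0 or bool(INVALID_NAME_RE.search(name))
```

Removing the flagged tables and fields before type revision and feature extraction keeps the later, more expensive steps focused on data that can actually carry features.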
The process of revising the field type includes: performing word segmentation and semantic recognition on the Chinese description of the data with a natural language processing (NLP) module; after this analysis, performing path recognition on similar words or similar characters through a type decision tree; and, when the semantics of the Chinese description do not match the field type, marking the field as a suspected field type to be revised. The data content of fields whose Chinese descriptions have the same or similar semantics is then sampled multiple times, the proportions of the different field types in the sampled data are counted, the type whose proportion exceeds a threshold is taken as the recommended revised field type, and the field type is finally revised to the recommended type, that is, the type to which the actually stored data belongs. NLP performs the word segmentation and semantic recognition of the Chinese description, and the decision tree recognizes paths with similar meanings to help judge whether a field is a suspected field type to be revised. The result is decided by comparing the proportion against a set threshold. The revision process supplements the preliminary identification and further improves identification accuracy.
S03: Extract features according to field type.
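The sampling and threshold part of the revision step might look like the following sketch. The threshold of 0.8, the number of sampling rounds, the sample size, and the toy classifier `simple_classify` are all assumptions chosen for illustration; the source gives no concrete values.

```python
import random
from collections import Counter


def simple_classify(value):
    """Toy classifier (assumption): unsigned numbers are 'numeric', the rest 'text'."""
    return "numeric" if value.replace(".", "", 1).isdigit() else "text"


def recommend_type(values, classify, threshold=0.8, rounds=5, sample_size=100, seed=0):
    """Sample the values several times and return the type whose share exceeds the threshold."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(rounds):
        sample = rng.sample(values, min(sample_size, len(values)))
        counts.update(classify(v) for v in sample)
    total = sum(counts.values())
    if total == 0:
        return None
    best, n = counts.most_common(1)[0]
    return best if n / total > threshold else None


def revise_field_type(current_type, values, classify, **kwargs):
    """Replace the current type only when sampling yields a recommendation."""
    return recommend_type(values, classify, **kwargs) or current_type
```

With nine numeric strings and one stray text value, the numeric share is 0.9 and the field is revised to numeric. With an even split no type clears the threshold and the current type is kept.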
The process of extracting features according to field type includes: for numeric fields, extracting features and characteristic values using the mean, maximum, minimum, median, variance, interquartile range, numerical clustering, and length clustering; for text fields, deriving attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and, for date fields, performing structure analysis and extracting features from the date format and date length.
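For numeric fields, the listed statistics can be computed with the Python standard library, as in the sketch below. Two choices are assumptions rather than anything stated in the source: population variance is used, and quartiles come from the default (exclusive) method of `statistics.quantiles`. A plain length count stands in for length clustering, and numerical clustering is omitted.

```python
import statistics
from collections import Counter


def numeric_features(values):
    """Compute summary statistics for a numeric field given as strings."""
    nums = [float(v) for v in values]
    q1, _, q3 = statistics.quantiles(nums, n=4)  # needs at least two values
    return {
        "mean": statistics.fmean(nums),
        "max": max(nums),
        "min": min(nums),
        "median": statistics.median(nums),
        "variance": statistics.pvariance(nums),  # population variance (assumption)
        "iqr": q3 - q1,                          # interquartile range
        "length_distribution": Counter(len(v) for v in values),
    }
```

The length distribution is computed on the original strings rather than on the parsed numbers, so fixed-width codes stored as numbers remain distinguishable from free-form quantities.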
More specifically, the data features and feature extraction methods applicable to a field type can be looked up in a data feature library, and all extraction methods applicable to that field type are traversed according to the dependency and mutual-exclusion relationship network among the corresponding data features. For example, after a data field is determined to be of numeric type, the feature extraction algorithm loads the methods for extracting attribute features such as length, integer, positive number, negative number, and decimal, as well as the methods for extracting business features such as mobile phone number and zip code.
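The feature library with dependency and mutual-exclusion relationships can be sketched as a small table of tests. The layout of `FEATURE_LIBRARY`, the `requires` and `excludes` keys, and the concrete rules are all hypothetical, including the assumed mainland-China mobile number shape of 11 digits starting with 1 and the assumed 6-digit zip code. Entries are assumed to be listed so that dependencies come before dependents.

```python
# Hypothetical feature library for numeric fields. A feature is tested only if
# everything in `requires` already holds and nothing in `excludes` holds.
FEATURE_LIBRARY = {
    "integer": {"test": lambda v: v.lstrip("+-").isdigit(),
                "requires": [], "excludes": []},
    "decimal": {"test": lambda v: "." in v and v.lstrip("+-").replace(".", "", 1).isdigit(),
                "requires": [], "excludes": ["integer"]},
    "negative": {"test": lambda v: float(v) < 0,
                 "requires": [], "excludes": []},
    "positive": {"test": lambda v: float(v) > 0,
                 "requires": [], "excludes": ["negative"]},
    "mobile_number": {"test": lambda v: len(v) == 11 and v.startswith("1"),
                      "requires": ["integer", "positive"], "excludes": []},
    "zip_code": {"test": lambda v: len(v) == 6,
                 "requires": ["integer"], "excludes": []},
}


def extract_features(values):
    """Return the features that hold for every value, honoring requires/excludes."""
    found = []
    for name, spec in FEATURE_LIBRARY.items():
        if any(dep not in found for dep in spec["requires"]):
            continue  # a dependency is missing, so this feature is not evaluated
        if any(ex in found for ex in spec["excludes"]):
            continue  # a mutually exclusive feature already holds
        if values and all(spec["test"](v) for v in values):
            found.append(name)
    return found
```

The relationship network prunes work: once a field is known to be an integer, the decimal test is skipped, and business features such as the mobile number test run only on fields that already satisfy their attribute prerequisites.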
Example 2:
This embodiment is generally consistent with the previous embodiment. The difference is that, after the field type has been revised and before features are extracted, the method further includes a verification step: converting the date data into text data and copying it into a verification group and an interference group; inserting year, month, and day descriptions into the verification group according to the original date format; adding counting-unit descriptions to the interference group according to the number of digits of the original date data; inserting the verification group and the interference group into their own adjacent text data; performing semantic recognition on the spliced text data with the natural language processing (NLP) module; and recording the recognition speed of each pair of interference group and verification group. If the recognition speed of the verification group is higher than that of the interference group by more than an amplitude threshold, the verification passes; otherwise the corresponding original date data is listed as a suspected error type. Whether data is of date type or numeric type, it is often connected with the adjacent text data. When the original identification is correct, the text spliced from the verification group is easier to recognize, so its recognition speed is high. If the original identification is wrong, the text spliced from the verification group is itself wrong, so it has no speed advantage over the interference group and may even be slower, and the data is therefore classified as a suspected error type.
The substantial effects of the above embodiments include: the method takes into account the generality of table processing and detection and the relationships between the meanings represented by data of different field types; it can extract the corresponding features for each field from the header information and data content alone; it makes feature extraction automatic and scalable; it provides accurate identification and location of detection objects for investigating data quality problems; and it lays a foundation for improving the efficiency of subsequent quality detection work.
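The verification step can be sketched as below, under several interpretive assumptions. The verification text inserts 年, 月, and 日 after the year, month, and day. The interference text inserts the counting unit 万 after the first four digits, which is only one possible reading of "counting-unit description". The relative margin in `passes_verification` is one possible reading of "amplitude threshold". The semantic recognizer is left as a caller-supplied function because the source does not specify it, and only 8-digit dates are handled.

```python
import re
import time


def build_groups(date_text, before="", after=""):
    """Build the verification and interference strings spliced into adjacent text."""
    digits = re.sub(r"\D", "", date_text)
    if len(digits) != 8:
        raise ValueError("this sketch only handles 8-digit YYYYMMDD dates")
    year, month, day = digits[:4], digits[4:6], digits[6:]
    verification = f"{before}{year}年{month}月{day}日{after}"  # date reading
    interference = f"{before}{year}万{month}{day}{after}"      # assumed numeric reading
    return verification, interference


def timed(recognize, text):
    """Return the seconds a caller-supplied recognizer takes on the text."""
    start = time.perf_counter()
    recognize(text)
    return time.perf_counter() - start


def passes_verification(t_verification, t_interference, amplitude=0.1):
    """Pass only if the verification text is faster by more than the relative margin."""
    if t_interference <= 0:
        return False
    return (t_interference - t_verification) / t_interference > amplitude
```

In use, `timed` would be called on both strings of each pair with the same recognizer, ideally averaged over repeated runs to reduce timing noise, and any field failing `passes_verification` would be listed as a suspected error type.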
Through the description of the above embodiments, those skilled in the art can understand that in practical applications, the above function distribution may be performed by different functional modules as needed, that is, the internal structure of a specific device is divided into different functional modules, so as to perform all or part of the above described functions.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above describes only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present application shall be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (7)

1. A method for intelligently extracting features from large-scale data based on data content, characterized by comprising the following steps:
performing preliminary identification of the field types of the data, and eliminating invalid data;
checking the Chinese description of the data against the field type, sampling the data whose description and type do not match, calculating the proportion of each field type in the sample, and revising the field type according to the proportions;
extracting features according to field type.
2. The method for intelligently extracting features from large-scale data based on data content according to claim 1, wherein the process of preliminary identification comprises: identifying the data against an existing field type database, or using a recognition model trained with a neural network, to obtain a preliminary identification result for the field type.
3. The method for intelligently extracting features from large-scale data based on data content according to claim 1, wherein the process of eliminating invalid data comprises: defining invalid tables and invalid fields; judging, from the metadata information and data content of each table, that empty tables, zombie tables, log tables, backup tables, temporary tables, single-character field tables, and low-heat tables are invalid tables; judging that null fields and single-value fields are invalid fields; and identifying and eliminating the invalid tables and fields.
4. The method for intelligently extracting features from large-scale data based on data content according to claim 1 or 2, wherein the process of revising the field type comprises: performing word segmentation and semantic recognition on the Chinese description of the data with a natural language processing (NLP) module; after this analysis, performing path recognition on similar words or similar characters through a type decision tree; when the semantics of the Chinese description do not match the field type, marking the field as a suspected field type to be revised; then sampling multiple times the data content of fields whose Chinese descriptions have the same or similar semantics, counting the proportions of the different field types in the sampled data, taking the type whose proportion exceeds a threshold as the recommended revised field type, and finally revising the field type to the recommended type, that is, the type to which the actually stored data belongs.
5. The method according to claim 1, wherein the field type includes at least one of a numeric type, a text type, and a date type.
6. The method for intelligently extracting features from large-scale data based on data content according to claim 5, wherein the process of extracting features according to field type comprises: for numeric fields, extracting features and characteristic values using the mean, maximum, minimum, median, variance, interquartile range, numerical clustering, and length clustering; for text fields, deriving attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and, for date fields, performing structure analysis and extracting features from the date format and date length.
7. The method for intelligently extracting features from large-scale data based on data content according to claim 4, wherein after the field type has been revised, the method further comprises a verification step of: converting the date data into text data and copying it into a verification group and an interference group; inserting year, month, and day descriptions into the verification group according to the original date format; adding counting-unit descriptions to the interference group according to the number of digits of the original date data; inserting the verification group and the interference group into their own adjacent text data; performing semantic recognition on the spliced text data with the NLP module; and recording the recognition speed of each pair of interference group and verification group, wherein if the recognition speed of the verification group is higher than that of the interference group by more than an amplitude threshold, the verification passes, and otherwise the corresponding original date data is listed as a suspected error type.
CN202110670587.7A 2021-06-17 2021-06-17 Large-scale data characteristic intelligent extraction method based on data content Active CN113569005B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110670587.7A CN113569005B (en) 2021-06-17 2021-06-17 Large-scale data characteristic intelligent extraction method based on data content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110670587.7A CN113569005B (en) 2021-06-17 2021-06-17 Large-scale data characteristic intelligent extraction method based on data content

Publications (2)

Publication Number Publication Date
CN113569005A (en) 2021-10-29
CN113569005B (en) 2024-02-20

Family

ID=78162177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110670587.7A Active CN113569005B (en) 2021-06-17 2021-06-17 Large-scale data characteristic intelligent extraction method based on data content

Country Status (1)

Country Link
CN (1) CN113569005B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115221893A (en) * 2022-09-21 2022-10-21 中国电子信息产业集团有限公司 Quality inspection rule automatic configuration method and device based on rule and semantic analysis

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060036637A1 (en) * 2004-08-13 2006-02-16 Mehmet Sayal System and method for developing a star schema
CN104731976A (en) * 2015-04-14 2015-06-24 海量云图(北京)数据技术有限公司 Method for finding and sorting private data in data table
CN109271377A (en) * 2018-08-10 2019-01-25 蜜小蜂智慧(北京)科技有限公司 A kind of data quality checking method and device
CN109948164A (en) * 2019-04-02 2019-06-28 北京三快在线科技有限公司 Processing method, device, computer equipment and the storage medium of statistical demand information
KR20200070775A (en) * 2018-12-10 2020-06-18 한국전자통신연구원 Apparatus and method for normalizing security information of heterogeneous systems
CN111506731A (en) * 2020-04-17 2020-08-07 支付宝(杭州)信息技术有限公司 Method, device and equipment for training field classification model
CN111539021A (en) * 2020-04-26 2020-08-14 支付宝(杭州)信息技术有限公司 Data privacy type identification method, device and equipment
CN112181936A (en) * 2019-07-03 2021-01-05 北京京东尚科信息技术有限公司 Database detection method and device
WO2021012382A1 (en) * 2019-07-25 2021-01-28 深圳壹账通智能科技有限公司 Method and apparatus for configuring chat robot, computer device and storage medium
CN112434032A (en) * 2020-11-17 2021-03-02 北京融七牛信息技术有限公司 Automatic feature generation system and method

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060036637A1 (en) * 2004-08-13 2006-02-16 Mehmet Sayal System and method for developing a star schema
CN104731976A (en) * 2015-04-14 2015-06-24 海量云图(北京)数据技术有限公司 Method for finding and sorting private data in data table
CN109271377A (en) * 2018-08-10 2019-01-25 蜜小蜂智慧(北京)科技有限公司 A kind of data quality checking method and device
KR20200070775A (en) * 2018-12-10 2020-06-18 한국전자통신연구원 Apparatus and method for normalizing security information of heterogeneous systems
CN109948164A (en) * 2019-04-02 2019-06-28 北京三快在线科技有限公司 Processing method, device, computer equipment and the storage medium of statistical demand information
CN112181936A (en) * 2019-07-03 2021-01-05 北京京东尚科信息技术有限公司 Database detection method and device
WO2021012382A1 (en) * 2019-07-25 2021-01-28 深圳壹账通智能科技有限公司 Method and apparatus for configuring chat robot, computer device and storage medium
CN111506731A (en) * 2020-04-17 2020-08-07 支付宝(杭州)信息技术有限公司 Method, device and equipment for training field classification model
CN111539021A (en) * 2020-04-26 2020-08-14 支付宝(杭州)信息技术有限公司 Data privacy type identification method, device and equipment
CN112434032A (en) * 2020-11-17 2021-03-02 北京融七牛信息技术有限公司 Automatic feature generation system and method

Also Published As

Publication number Publication date
CN113569005B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
CN110826320B (en) Sensitive data discovery method and system based on text recognition
CN110852856B (en) Invoice false invoice identification method based on dynamic network representation
WO2012034733A2 (en) Method and arrangement for handling data sets, data processing program and computer program product
CN112163553B (en) Material price accounting method, device, storage medium and computer equipment
CN112199512B (en) Scientific and technological service-oriented case map construction method, device, equipment and storage medium
CN111191051B (en) Method and system for constructing emergency knowledge map based on Chinese word segmentation technology
CN112307820A (en) Text recognition method, device, equipment and computer readable medium
CN113569005B (en) Large-scale data characteristic intelligent extraction method based on data content
CN110795607A (en) Equipment guarantee data matching method and system based on multi-stage similarity calculation
CN113409001A (en) Method for controlling automatic pricing of construction engineering quantity list
CN116136955B (en) Text transcription method, text transcription device, electronic equipment and storage medium
CN111815162A (en) Digital auditing tool and method
CN113591485A (en) Intelligent data quality auditing system and method based on data science
CN111723182A (en) Key information extraction method and device for vulnerability text
CN112257425A (en) Power data analysis method and system based on data classification model
CN111754352A (en) Method, device, equipment and storage medium for judging correctness of viewpoint statement
CN114611515B (en) Method and system for identifying enterprise actual control person based on enterprise public opinion information
CN112395854B (en) Standard element consistency inspection method
CN113129057A (en) Software cost information processing method and device, computer equipment and storage medium
CN111859896B (en) Formula document detection method and device, computer readable medium and electronic equipment
CN117827991B (en) Method and system for identifying personal identification information in semi-structured data
CN115221013B (en) Method, device and equipment for determining log mode
CN117272123B (en) Sensitive data processing method and device based on large model and storage medium
CN113158645A (en) Message analysis method and device, electronic equipment and computer storage medium
CN115455054A (en) Method and device for auditing pre-tax deduction materials and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant