CN112817952A - Data quality evaluation method and system - Google Patents


Info

Publication number
CN112817952A
Authority
CN
China
Prior art keywords
data
evaluation
recommended
quality
incidence relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110074905.3A
Other languages
Chinese (zh)
Inventor
黄山姗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd
Priority to CN202110074905.3A
Publication of CN112817952A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data quality evaluation method and system. The data quality evaluation method comprises: a data type evaluation step of performing quality evaluation on the initial data used to start a service and outputting a first evaluation result; a data association relation evaluation step of performing quality evaluation according to the association relations among the initial data used to start the service and outputting a second evaluation result; and a detection step of detecting the quality of the data according to the first evaluation result and the second evaluation result. The data includes recommended subject data, material data, and behavior data. The method gives an accurate and comprehensive view of the magnitude, record count, and field distribution of the initial data, so that it can be judged whether the quantity is too small to meet the training requirement, whether the quantity is so large that training consumes resources for too long, and whether the field distribution deviates from the actual situation and would lead to a poor model effect.

Description

Data quality evaluation method and system
Technical Field
The invention belongs to the field of data quality evaluation, and in particular relates to a data quality evaluation method, a data quality evaluation system, a computer device, and a storage medium.
Background
With the continuous and rapid expansion of information, user interests change constantly and new content iterates quickly. More and more scenarios require that information matching users' direct or potential needs be obtained quickly, and recommendation systems are therefore applied ever more widely.
When a recommendation service is launched for a new placement or scenario, the most suitable recommendation result is obtained by combining the data corresponding to the requirement with an algorithm. Initially, the data for starting the service must be provided first to obtain an initial recommendation result; material information, behavior data, and the like are then continuously supplemented for updating and iteration. To ensure that the initial recommendation result can be obtained and is accurate, the quality of the data provided for starting the service must be assured to some degree; otherwise the recommendation result is inaccurate or there is no recommendable content.
Prior art one related to the present invention
The technical scheme of prior art one is as follows:
Generally, the start-up data is provided entirely or partially offline and comprises recommended subject data, recommended material data, and behavior data. Usually, the total count and the deduplicated count are computed for each of the three datasets, for example: the number of recommended subject records and the number of subject primary IDs after deduplication; the number of recommended material records and the number of IDs after deduplication; and the number of behavior records and the number of records after deduplication by unique behavior ID. When a count is small, a prompt indicates that the recommendation result may be inaccurate or that no recommendable result may remain after screening.
Prior art one has the following defects:
This approach relies mainly on prompts. The counts of the start-up data are computed in isolation: the datasets are not examined together, and quality is not evaluated in combination with the model.
The technical scheme of prior art two is as follows:
Model training is carried out directly, without computing any statistics on the start-up data, and the model training result is used to reflect the quality of the training data.
Prior art two has the following defects:
1. Model training takes time. If the training result is unsatisfactory and the data is repaired, the data upload and training processes must be run again, which wastes a large amount of time.
2. The training result of the model cannot show which specific aspect of the data may have a quality problem, so the training data provided again may still have the same problem.
Disclosure of Invention
The embodiment of the application provides a data quality evaluation method, a data quality evaluation system, computer equipment and a storage medium, and aims to at least solve the problem of subjective factor influence in the related technology.
The invention provides a data quality evaluation method, which comprises the following steps:
a data type evaluation step: performing quality evaluation on the initial data used to start the service and outputting a first evaluation result;
a data association relation evaluation step: performing quality evaluation according to the association relations among the initial data used to start the service and outputting a second evaluation result;
a detection step: detecting the quality of the data according to the first evaluation result and the second evaluation result.
In the above data quality evaluation method, the data comprises recommended subject data, material data, and behavior data.
In the above data quality evaluation method, the evaluation criteria for the recommended subject data and the material data include:
the total number of records; the number of recommended subjects after deduplication by primary ID; the number of null values in each field; the distribution of values of enumerable fields; the distribution of values of non-enumerable fields; the size of the data volume; and, for fields that have enumerated values or a required data range, the number of records that conform to the data range/format.
In the above data quality evaluation method, the evaluation criteria for the behavior data include:
the total number of records; the number of records after deduplication by the ID that identifies a unique behavior; the classification by behavior category; the size of the data volume; and, for fields that have enumerated values or a required data range, the number of records that conform to the data range/format.
In the above data quality evaluation method, the data association relation evaluation step includes:
an integration step: integrating the data that reflects the association relations in the recommended subject data, the material data, and the behavior data to obtain association relation data;
an association relation data evaluation step: performing quality evaluation on the association relation data and outputting the second evaluation result.
In the above data quality evaluation method, the evaluation criteria for the association relation data include:
the number of recommended subjects in the integrated data after deduplication by subject ID; the number of materials in the integrated data after deduplication by subject ID; the total number of integrated records; the data classified by recommended subject category; the data classified by material category and label; and the number of feature values.
The invention also provides a data quality evaluation system, which comprises:
a data type evaluation unit, configured to perform quality evaluation on the initial data used to start the service and output a first evaluation result;
a data association relation evaluation unit, configured to perform quality evaluation according to the association relations among the initial data used to start the service and output a second evaluation result;
a detection unit, configured to detect the quality of the data according to the first evaluation result and the second evaluation result.
In the above data quality evaluation system, the data comprises recommended subject data, material data, and behavior data.
In the above data quality evaluation system, the data association relation evaluation unit includes:
an integration module, which integrates the data that reflects the association relations in the recommended subject data, the material data, and the behavior data to obtain association relation data;
an association relation data evaluation module, configured to perform quality evaluation on the association relation data and output the second evaluation result.
In the above data quality evaluation system, the evaluation criteria for the recommended subject data and the material data include:
the total number of records; the number of recommended subjects after deduplication by primary ID; the number of null values in each field; the distribution of values of enumerable fields; the distribution of values of non-enumerable fields; the size of the data volume; and, for fields that have enumerated values or a required data range, the number of records that conform to the data range/format;
the evaluation criteria for the behavior data include:
the total number of records; the number of records after deduplication by the ID that identifies a unique behavior; the classification by behavior category; the size of the data volume; and, for fields that have enumerated values or a required data range, the number of records that conform to the data range/format;
the evaluation criteria for the association relation data include:
the number of recommended subjects in the integrated data after deduplication by subject ID; the number of materials in the integrated data after deduplication by subject ID; the total number of integrated records; the data classified by recommended subject category; the data classified by material category and label; and the number of feature values.
The invention has the following beneficial effects:
(1) the magnitude, record count, and field distribution of the initial data can be understood accurately and comprehensively, so that it can be judged whether the quantity is too small to meet the training requirement, whether the quantity is so large that training consumes resources for too long, and whether the field distribution deviates from the actual situation and would cause a poor model effect;
(2) through data summarization, the number of usable recommended subjects and the condition of the usable materials are accurately calculated. For example, when the provided recommended subjects and materials do not appear in the behavior data, they cannot be used no matter how much recommended subject and material data is provided.
With this method, the condition of the initial data provided can be known before model training starts, and whether the next step of model training can proceed is evaluated in advance. By looking at the results, the user can also clearly see how to resolve any problems in the data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application.
In the drawings:
FIG. 1 is a flow chart of a data quality assessment method;
FIG. 2 is a flow chart illustrating the substeps of step S2 in FIG. 1;
FIG. 3 is a schematic diagram of data quality assessment conditions;
FIG. 4 is a schematic diagram of the architecture of the system for data quality assessment of the present invention;
FIG. 5 is a block diagram of a computer device according to an embodiment of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. References to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The terms "including," "comprising," "having," and any variations thereof, as used in this application, are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. References to "connected," "coupled," and the like in this application are not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "And/or" describes an association relationship of associated objects, meaning that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The terms "first," "second," "third," and the like herein merely distinguish similar objects and do not denote a particular ordering of the objects.
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
Before describing in detail the various embodiments of the present invention, the core inventive concepts of the present invention are summarized and described in detail by the following several embodiments.
Referring to FIG. 1, FIG. 1 is a flowchart of a data quality evaluation method. As shown in FIG. 1, the data quality evaluation method of the present invention includes:
data type evaluation step S1: performing quality evaluation on the initial data used to start the service and outputting a first evaluation result;
data association relation evaluation step S2: performing quality evaluation according to the association relations among the initial data used to start the service and outputting a second evaluation result;
detection step S3: detecting the quality of the data according to the first evaluation result and the second evaluation result.
The data comprises recommended subject data, material data, and behavior data.
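The source does not specify how the detection step S3 combines the two evaluation results. The following is a minimal sketch of one possible combination, assuming the first result is a mapping from dataset name to its statistics and the second result holds the statistics of the integrated data. The function name `detect_quality`, the key `total_rows`, and all three thresholds are hypothetical and chosen only for illustration.

```python
def detect_quality(first_result, second_result,
                   min_rows=1_000, max_rows=10_000_000, min_match_ratio=0.5):
    """Combine per-dataset and association results into one verdict.

    All thresholds are hypothetical defaults, not values from the source.
    """
    issues = []

    # Checks on the first evaluation result (one entry per dataset).
    for name, stats in first_result.items():
        rows = stats["total_rows"]
        if rows < min_rows:
            issues.append(f"{name}: only {rows} rows, may not meet the training requirement")
        elif rows > max_rows:
            issues.append(f"{name}: {rows} rows, training may take too long")

    # Check on the second evaluation result: how many behavior rows
    # survived integration with the subject and material data.
    behavior_rows = first_result.get("behavior", {}).get("total_rows", 0)
    matched_rows = second_result["total_rows"]
    if behavior_rows and matched_rows / behavior_rows < min_match_ratio:
        issues.append(
            f"association: only {matched_rows} of {behavior_rows} behavior rows "
            "match a known subject and material"
        )

    return {"passed": not issues, "issues": issues}


first = {
    "subject": {"total_rows": 5_000},
    "material": {"total_rows": 200},
    "behavior": {"total_rows": 40_000},
}
second = {"total_rows": 10_000}
verdict = detect_quality(first, second)
for line in verdict["issues"]:
    print(line)
```

Returning a list of named issues rather than a single score keeps the result actionable: the user can see which dataset or which association needs repair before any training run.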
Further, the evaluation criteria for the recommended subject data and the material data include:
the total number of records; the number of recommended subjects after deduplication by primary ID; the number of null values in each field; the distribution of values of enumerable fields; the distribution of values of non-enumerable fields; the size of the data volume; and, for fields that have enumerated values or a required data range, the number of records that conform to the data range/format.
Still further, the evaluation criteria for the behavior data include:
the total number of records; the number of records after deduplication by the ID that identifies a unique behavior; the classification by behavior category; the size of the data volume; and, for fields that have enumerated values or a required data range, the number of records that conform to the data range/format.
Referring to FIG. 2, FIG. 2 is a flowchart illustrating the substeps of step S2 in FIG. 1. As shown in FIG. 2, the data association relation evaluation step S2 includes:
integration step S21: integrating the data that reflects the association relations in the recommended subject data, the material data, and the behavior data to obtain association relation data;
association relation data evaluation step S22: performing quality evaluation on the association relation data and outputting the second evaluation result.
Still further, the evaluation criteria for the association relation data include:
the number of recommended subjects in the integrated data after deduplication by subject ID; the number of materials in the integrated data after deduplication by subject ID; the total number of integrated records; the data classified by recommended subject category; the data classified by material category and label; and the number of feature values.
Hereinafter, the data quality evaluation method according to the present invention will be described in detail with reference to the following embodiments.
Embodiment one:
The method mainly performs quality evaluation on the initial data used to start the service. It comprises evaluating each data category separately and evaluating the data as a whole according to the association relations among the datasets.
The separate evaluation by data category comprises:
Recommended subject data evaluation, performed according to the following criteria:
the total number of records;
the number of recommended subjects after deduplication by primary ID;
the number of null values in each field;
the distribution of values of enumerable fields;
the distribution of values of non-enumerable fields;
the size of the data volume;
for fields that have required enumerated values, data ranges, and the like, the number of records that conform to the data range/format.
Material data evaluation, performed according to the following criteria:
the total number of records;
the number of records after deduplication by primary ID;
the number of null values in each field;
the distribution of values of enumerable fields;
the distribution of values of non-enumerable fields;
the size of the data volume;
for fields that have required enumerated values, data ranges, and the like, the number of records that conform to the data range/format.
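The criteria listed above for recommended subject data and material data can be sketched as a small profiling function. This is a minimal illustration under stated assumptions, not the patent's implementation: records are assumed to be plain dictionaries, `None` and the empty string are both treated as null, and the names `profile_records`, `subject_id`, `gender`, and `age` are hypothetical. The data volume size and the distribution of non-enumerable fields are omitted for brevity.

```python
from collections import Counter

NULLS = (None, "")  # assumption: both count as a null value


def profile_records(records, id_field, enumerable_fields=(), validators=None):
    """Summarise one dataset (recommended subject or material records)."""
    validators = validators or {}
    fields = sorted({name for row in records for name in row})
    return {
        # total number of records
        "total_rows": len(records),
        # number of records after deduplication by primary ID
        "distinct_ids": len({row.get(id_field) for row in records
                             if row.get(id_field) not in NULLS}),
        # number of null values in each field
        "null_counts": {f: sum(1 for row in records if row.get(f) in NULLS)
                        for f in fields},
        # distribution of values of enumerable fields
        "value_distribution": {
            f: dict(Counter(row.get(f) for row in records
                            if row.get(f) not in NULLS))
            for f in enumerable_fields
        },
        # number of records whose value conforms to the required range/format
        "conforming_counts": {
            f: sum(1 for row in records
                   if row.get(f) not in NULLS and check(row[f]))
            for f, check in validators.items()
        },
    }


subjects = [
    {"subject_id": "u1", "gender": "F", "age": 31},
    {"subject_id": "u2", "gender": "M", "age": None},
    {"subject_id": "u2", "gender": "M", "age": 230},  # duplicate ID, age out of range
    {"subject_id": "u3", "gender": "", "age": 45},
]
report = profile_records(
    subjects,
    id_field="subject_id",
    enumerable_fields=["gender"],
    validators={"age": lambda v: 0 <= v <= 120},  # hypothetical range rule
)
print(report)
```

The same function can be applied to material records by passing the material ID field and the material's own enumerable fields.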
Behavior data evaluation, performed according to the following criteria:
the total number of records;
the number of records after deduplication by the ID that identifies a unique behavior;
the classification by behavior category;
the size of the data volume;
for fields that have required enumerated values, data ranges, and the like, the number of records that conform to the data range/format.
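A comparable sketch for the behavior data criteria follows. It assumes each behavior record is a dictionary carrying a unique behavior ID and a behavior category; the field names `behavior_id` and `behavior_type`, the function name `profile_behaviors`, and the set of allowed categories are hypothetical. The data volume size is again omitted.

```python
from collections import Counter


def profile_behaviors(events, id_field="behavior_id", type_field="behavior_type",
                      allowed_types=None):
    """Summarise behavior records according to the criteria listed above."""
    distinct_ids = {e[id_field] for e in events if e.get(id_field) is not None}
    by_type = Counter(e.get(type_field) for e in events)
    report = {
        "total_rows": len(events),                      # total number of records
        "distinct_behaviors": len(distinct_ids),        # after deduplication by behavior ID
        "duplicate_rows": len(events) - len(distinct_ids),
        "rows_by_type": dict(by_type),                  # classification by behavior category
    }
    if allowed_types is not None:
        # number of records whose category is one of the enumerated values
        report["conforming_rows"] = sum(
            1 for e in events if e.get(type_field) in allowed_types
        )
    return report


events = [
    {"behavior_id": "b1", "subject_id": "u1", "material_id": "m1", "behavior_type": "click"},
    {"behavior_id": "b2", "subject_id": "u1", "material_id": "m2", "behavior_type": "view"},
    {"behavior_id": "b2", "subject_id": "u1", "material_id": "m2", "behavior_type": "view"},
    {"behavior_id": "b3", "subject_id": "u9", "material_id": "m1", "behavior_type": "swipe"},
]
behavior_report = profile_behaviors(events, allowed_types={"click", "view", "purchase"})
print(behavior_report)
```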
Data in the recommended subject data, material data, and behavior data that reflects the association relations:
The recommended subject data generally comprises the subject ID, the subject's labels, and statistics on the number of the subject's behavior operations over a period of time.
The material data typically includes the material ID, material category, material label, material level, and so on.
The behavior data generally does not include information such as the labels of the recommended subject or of the material. To check how well the recommended subject data, material data, and behavior data match, the behavior records whose subject and material can be found in the recommended subject data and the material data are extracted and supplemented with the corresponding subject labels and the material category and label information, so that the original three datasets are integrated into one, and statistics are then computed on the result. For the initial training data, model training may be affected if a recommended subject does not appear in the behavior data or if a material pushed in the behavior data does not exist in the material data.
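The integration described above amounts to joining each behavior record to its subject and its material and keeping only the records where both are found. The sketch below illustrates this with plain dictionaries. The field names (`subject_id`, `material_id`, `tags`, `category`) and the function names are hypothetical, and counting distinct materials by their own material ID is an assumption.

```python
from collections import Counter


def integrate(subjects, materials, behaviors):
    """Join behavior rows to subject and material rows.

    Returns (merged, unmatched): merged rows carry the subject tags and
    material category; unmatched rows reference an unknown subject or material.
    """
    subject_by_id = {s["subject_id"]: s for s in subjects}
    material_by_id = {m["material_id"]: m for m in materials}
    merged, unmatched = [], []
    for b in behaviors:
        s = subject_by_id.get(b["subject_id"])
        m = material_by_id.get(b["material_id"])
        if s is None or m is None:
            unmatched.append(b)
            continue
        merged.append({**b, "subject_tags": s["tags"], "material_category": m["category"]})
    return merged, unmatched


def association_stats(merged):
    """Statistics on the integrated data."""
    return {
        "total_rows": len(merged),
        "distinct_subjects": len({r["subject_id"] for r in merged}),
        "distinct_materials": len({r["material_id"] for r in merged}),
        "rows_by_material_category": dict(Counter(r["material_category"] for r in merged)),
    }


subjects = [{"subject_id": "u1", "tags": ["sports"]},
            {"subject_id": "u2", "tags": ["news"]}]
materials = [{"material_id": "m1", "category": "video"},
             {"material_id": "m2", "category": "article"},
             {"material_id": "m3", "category": "video"}]
behaviors = [
    {"behavior_id": "b1", "subject_id": "u1", "material_id": "m1"},
    {"behavior_id": "b2", "subject_id": "u1", "material_id": "m2"},
    {"behavior_id": "b3", "subject_id": "u9", "material_id": "m1"},  # unknown subject
    {"behavior_id": "b4", "subject_id": "u1", "material_id": "m7"},  # unknown material
]
merged, unmatched = integrate(subjects, materials, behaviors)
stats = association_stats(merged)
print(stats, [b["behavior_id"] for b in unmatched])
```

In this sample, subject `u2` and material `m3` never appear in the integrated data, which illustrates why the raw subject and material counts alone can overstate how much data is usable.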
The integrated data is evaluated according to the following criteria:
the number of recommended subjects in the integrated data after deduplication by subject ID;
the number of materials in the integrated data after deduplication by subject ID;
the total number of integrated records;
the data classified by recommended subject category;
the data classified by material category, label, and the like;
the number of feature values, computed as follows: each column is deduplicated, its unique values are accumulated in descending order of frequency, and the number of unique values needed to cover 95% of the rows is taken as that column's feature count; the counts of all columns are then summed. For example, if column 1 has 10,000,000 unique values but the 100 most frequent ones cover 95% of the rows, column 1 is counted as having 100 features. Similarly, if column 2 is counted as having 50 features and the model has two columns in total, there are 150 feature values.
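The feature-count rule can be sketched as follows, reading it as "the smallest number of most-frequent unique values whose rows together cover 95% of the column". That reading, the function names, and the sample columns are assumptions made for illustration. Integer arithmetic is used for the coverage test to avoid floating-point rounding at the threshold.

```python
from collections import Counter


def feature_count(values, coverage_percent=95):
    """Number of most-frequent unique values needed to cover the given share of rows."""
    total = len(values)
    covered = 0
    for n, (_, freq) in enumerate(Counter(values).most_common(), start=1):
        covered += freq
        if covered * 100 >= coverage_percent * total:
            return n
    return 0  # only reached for an empty column


def total_feature_count(columns, coverage_percent=95):
    """Sum the per-column feature counts over all columns."""
    return sum(feature_count(col, coverage_percent) for col in columns.values())


columns = {
    # 4 unique values, but "a" and "b" together cover 95 of 100 rows
    "category": ["a"] * 90 + ["b"] * 5 + ["c"] * 3 + ["d"] * 2,
    # 2 unique values, both needed to reach 95%
    "channel": ["x"] * 50 + ["y"] * 50,
}
print(feature_count(columns["category"]))
print(feature_count(columns["channel"]))
print(total_feature_count(columns))
```

Counting only the values that cover most rows keeps a long tail of rare values from inflating the estimated feature space.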
Embodiment two:
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a data quality evaluation system according to the present invention. As shown in FIG. 4, the data quality evaluation system of the present invention includes:
a data type evaluation unit, configured to perform quality evaluation on the initial data used to start the service and output a first evaluation result;
a data association relation evaluation unit, configured to perform quality evaluation according to the association relations among the initial data used to start the service and output a second evaluation result;
a detection unit, configured to detect the quality of the data according to the first evaluation result and the second evaluation result.
The data comprises recommended subject data, material data, and behavior data.
The data association relation evaluation unit includes:
an integration module, which integrates the data that reflects the association relations in the recommended subject data, the material data, and the behavior data to obtain association relation data;
an association relation data evaluation module, configured to perform quality evaluation on the association relation data and output the second evaluation result.
The evaluation criteria for the recommended subject data and the material data include:
the total number of records; the number of recommended subjects after deduplication by primary ID; the number of null values in each field; the distribution of values of enumerable fields; the distribution of values of non-enumerable fields; the size of the data volume; and, for fields that have enumerated values or a required data range, the number of records that conform to the data range/format;
the evaluation criteria for the behavior data include:
the total number of records; the number of records after deduplication by the ID that identifies a unique behavior; the classification by behavior category; the size of the data volume; and, for fields that have enumerated values or a required data range, the number of records that conform to the data range/format;
the evaluation criteria for the association relation data include:
the number of recommended subjects in the integrated data after deduplication by subject ID; the number of materials in the integrated data after deduplication by subject ID; the total number of integrated records; the data classified by recommended subject category; the data classified by material category and label; and the number of feature values.
Embodiment three:
Referring to FIG. 5, this embodiment discloses a specific implementation of a computer device. The computer device may comprise a processor 81 and a memory 82 in which computer program instructions are stored.
Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
Memory 82 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 82 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 82 may include removable or non-removable (or fixed) media, where appropriate. The memory 82 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 82 is a non-volatile memory. In particular embodiments, memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode Dynamic Random-Access Memory (FPMDRAM), an Extended Data Output Dynamic Random-Access Memory (EDODRAM), a Synchronous Dynamic Random-Access Memory (SDRAM), and the like.
The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.
The processor 81 implements any of the data quality assessment methods in the above embodiments by reading and executing computer program instructions stored in the memory 82.
In some of these embodiments, the computer device may also include a communication interface 83 and a bus 80. As shown in fig. 5, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.
The communication interface 83 is used for implementing communication between modules, devices, units and/or equipment in the embodiments of the present application. The communication interface 83 may also carry out data communication with other components such as external equipment, image/data acquisition equipment, a database, external storage, and an image/data processing workstation.
Bus 80 includes hardware, software, or both to couple the components of the computer device to each other. Bus 80 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example, and not limitation, bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable bus, or a combination of two or more of these. Bus 80 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
The computer device may execute the data quality evaluation method, thereby implementing the method described in connection with Figs. 1 to 3.
In addition, in combination with the data quality evaluation method of the foregoing embodiments, embodiments of the present application may provide a computer-readable storage medium for its implementation. The computer-readable storage medium has computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement any of the data quality evaluation methods of the above embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
In summary, the data quality evaluation method provided by the present invention has the following beneficial effects. It gives an accurate and comprehensive view of the magnitude, record count, and field distribution of the initial data, making it possible to determine whether the quantity is too small to meet the training requirement, whether it is so large that training consumes excessive resources and time, and whether the field distribution deviates from the actual situation and would lead to a poor model. Through data summarization, the number of usable recommended subjects and usable materials is accurately calculated (a recommended subject or material that does not appear in the behavior data cannot be used, however much data is provided for it). With this method, the condition of the provided initial data is known before model training starts, and whether the next step of model training can proceed is evaluated in advance. By viewing the data, the user can also see clearly how to resolve any problem found in it.
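As a rough illustration of the per-dataset checks summarized above, the following Python sketch profiles a list of records: total count, count after deduplication by ID, null count per field, and the value distribution of one enumerable field. The function name `profile_records`, the field names, and the `min_rows`/`max_rows` thresholds are hypothetical and are not taken from the application.

```python
from collections import Counter

def profile_records(records, id_field, enum_field, min_rows=1000, max_rows=10_000_000):
    """Profile a list of dict records. All names and thresholds are illustrative assumptions."""
    total = len(records)
    # Count distinct IDs, ignoring records whose ID is missing.
    unique_ids = {r.get(id_field) for r in records if r.get(id_field) is not None}
    # Null count per field, over every field seen in any record.
    fields = sorted({k for r in records for k in r})
    null_counts = {f: sum(1 for r in records if r.get(f) in (None, "")) for f in fields}
    # Value distribution of one enumerable field, skipping empty values.
    distribution = Counter(
        r.get(enum_field) for r in records if r.get(enum_field) not in (None, "")
    )
    return {
        "total": total,
        "unique_ids": len(unique_ids),
        "null_counts": null_counts,
        "distribution": dict(distribution),
        "too_few": total < min_rows,    # may not meet the training requirement
        "too_many": total > max_rows,   # training may take too long
    }

subjects = [
    {"id": "u1", "gender": "F", "city": "Beijing"},
    {"id": "u2", "gender": "M", "city": None},
    {"id": "u2", "gender": "M", "city": "Shanghai"},
]
report = profile_records(subjects, id_field="id", enum_field="gender", min_rows=2)
print(report)
```

A gap between `total` and `unique_ids` signals duplicate IDs, and a skewed `distribution` is the kind of field deviation the summary warns about; what counts as too skewed is left to the user.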
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. A data quality evaluation method, comprising:
a data type evaluation step: performing quality evaluation on initial data for starting a service to output a first evaluation result;
a data association relation evaluation step: performing quality evaluation according to the association relation among the initial data for starting the service to output a second evaluation result; and
a detection step: detecting the quality of the data according to the first evaluation result and the second evaluation result.
2. The data quality evaluation method of claim 1, wherein the data comprises recommended subject data, material data, and behavior data.
3. The data quality evaluation method of claim 2, wherein the evaluation criteria of the recommended subject data and the material data include:
evaluating the total number of records, the number of recommended subjects after deduplication by subject ID, the number of null values in each field, the value distribution of enumerable fields, the value distribution of non-enumerable fields, the data volume, and the number of data items that conform to the data range/format in fields having enumerated values and a data range.
4. The data quality evaluation method of claim 2, wherein the evaluation criteria of the behavior data include:
evaluating the total number of records, the number of records after deduplication by the ID identifying a unique behavior, the classification by behavior category, the data volume, and the number of data items that conform to the data range/format in fields having enumerated values and a data range.
5. The data quality evaluation method of claim 2, wherein the data association relation evaluation step comprises:
an integration step: integrating the data reflecting the association relation in the recommended subject data, the material data, and the behavior data to obtain association relation data; and
an association relation data evaluation step: performing quality evaluation on the association relation data to output the second evaluation result.
6. The data quality evaluation method of claim 5, wherein the evaluation criteria of the association relation data include:
the number of recommended subjects after deduplication by subject ID; the total number of records in the integrated data; the number of materials in the integrated data after deduplication by ID; and the data broken down by recommended subject category, material category, label, and feature value.
7. A data quality evaluation system, comprising:
a data type evaluation unit, configured to perform quality evaluation on initial data for starting a service and to output a first evaluation result;
a data association relation evaluation unit, configured to perform quality evaluation according to the association relation among the initial data for starting the service and to output a second evaluation result; and
a detection unit, configured to detect the quality of the data according to the first evaluation result and the second evaluation result.
8. The data quality evaluation system of claim 7, wherein the data comprises recommended subject data, material data, and behavior data.
9. The data quality evaluation system of claim 8, wherein the data association relation evaluation unit comprises:
an integration module, configured to integrate the data reflecting the association relation in the recommended subject data, the material data, and the behavior data to obtain association relation data; and
an association relation data evaluation module, configured to perform quality evaluation on the association relation data and to output the second evaluation result.
10. The data quality evaluation system of claim 9, wherein the evaluation criteria of the recommended subject data and the material data include:
evaluating the total number of records, the number of recommended subjects after deduplication by subject ID, the number of null values in each field, the value distribution of enumerable fields, the value distribution of non-enumerable fields, the data volume, and the number of data items that conform to the data range/format in fields having enumerated values and a data range;
the evaluation criteria of the behavior data include:
evaluating the total number of records, the number of records after deduplication by the ID identifying a unique behavior, the classification by behavior category, the data volume, and the number of data items that conform to the data range/format in fields having enumerated values and a data range;
the evaluation criteria of the association relation data include:
the number of recommended subjects after deduplication by subject ID; the total number of records in the integrated data; the number of materials in the integrated data after deduplication by ID; and the data broken down by recommended subject category, material category, label, and feature value.
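The integration and association evaluation recited in claims 5 and 9, followed by the detection step of claims 1 and 7, might be sketched in Python as below. This is a minimal sketch under stated assumptions: the field names `subject_id` and `material_id`, the functions `evaluate_association` and `detect_quality`, and the pass rule are all hypothetical, since the claims do not fix a data schema or a decision threshold.

```python
def evaluate_association(subjects, materials, behaviors):
    """Integrate the three datasets on their IDs. Field names are assumptions."""
    subject_ids = {s["subject_id"] for s in subjects}
    material_ids = {m["material_id"] for m in materials}
    # Keep only behavior records whose subject and material were both provided.
    integrated = [
        b for b in behaviors
        if b["subject_id"] in subject_ids and b["material_id"] in material_ids
    ]
    return {
        "integrated_total": len(integrated),
        # Subjects and materials that never appear in behavior data are unusable.
        "usable_subjects": len({b["subject_id"] for b in integrated}),
        "usable_materials": len({b["material_id"] for b in integrated}),
    }

def detect_quality(first_result, second_result):
    """Hypothetical pass rule: enough rows, plus at least one usable subject and material."""
    return (
        not first_result["too_few"]
        and second_result["usable_subjects"] > 0
        and second_result["usable_materials"] > 0
    )

subjects = [{"subject_id": "u1"}, {"subject_id": "u2"}, {"subject_id": "u3"}]
materials = [{"material_id": "m1"}, {"material_id": "m2"}]
behaviors = [
    {"subject_id": "u1", "material_id": "m1", "action": "click"},
    {"subject_id": "u1", "material_id": "m2", "action": "view"},
    {"subject_id": "u2", "material_id": "m9", "action": "click"},  # unknown material
]
second = evaluate_association(subjects, materials, behaviors)
ok = detect_quality({"too_few": False}, second)
print(second, ok)
```

In this toy input three subjects are provided but only one is usable, because the others have no behavior record that links to a provided material.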
CN202110074905.3A 2021-01-20 2021-01-20 Data quality evaluation method and system Pending CN112817952A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110074905.3A CN112817952A (en) 2021-01-20 2021-01-20 Data quality evaluation method and system

Publications (1)

Publication Number Publication Date
CN112817952A true CN112817952A (en) 2021-05-18

Family

ID=75858569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110074905.3A Pending CN112817952A (en) 2021-01-20 2021-01-20 Data quality evaluation method and system

Country Status (1)

Country Link
CN (1) CN112817952A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114610702A (en) * 2022-03-15 2022-06-10 云粒智慧科技有限公司 Real-time quality control method, device, equipment and medium for data management process

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109347668A (en) * 2018-10-17 2019-02-15 网宿科技股份有限公司 A kind of training method and device of service quality assessment model
CN110019154A (en) * 2017-09-28 2019-07-16 阿里巴巴集团控股有限公司 Data processing, data quality accessment, recommended products determine method and relevant device
CN111339406A (en) * 2020-02-17 2020-06-26 北京百度网讯科技有限公司 Personalized recommendation method, device, equipment and storage medium
WO2020233432A1 (en) * 2019-05-20 2020-11-26 阿里巴巴集团控股有限公司 Method and device for information recommendation

Similar Documents

Publication Publication Date Title
CN110222791B (en) Sample labeling information auditing method and device
CN110874530B (en) Keyword extraction method, keyword extraction device, terminal equipment and storage medium
US11221904B2 (en) Log analysis system, log analysis method, and log analysis program
CN107909088B (en) Method, apparatus, device and computer storage medium for obtaining training samples
CN107784205B (en) User product auditing method, device, server and storage medium
CN109472017B (en) Method and device for obtaining relevant information of text court deeds of referee to be generated
US10825235B2 (en) Data plot processing
US20150106793A1 (en) Detecting Byte Ordering Type Errors in Software Code
JP2022043225A5 (en)
CN111273891A (en) Business decision method and device based on rule engine and terminal equipment
CN111242318A (en) Business model training method and device based on heterogeneous feature library
CN106598997B (en) Method and device for calculating text theme attribution degree
CN111159167B (en) Labeling quality detection device and method
CN112866800A (en) Video content similarity detection method, device, equipment and storage medium
CN110929110B (en) Electronic document detection method, device, equipment and storage medium
CN112817952A (en) Data quality evaluation method and system
CN109977328A (en) A kind of URL classification method and device
CN107688744B (en) Malicious file classification method and device based on image feature matching
CN108255891B (en) Method and device for judging webpage type
CN109409091B (en) Method, device and equipment for detecting Web page and computer storage medium
CN110598115A (en) Sensitive webpage identification method and system based on artificial intelligence multi-engine
CN104615948A (en) Method for automatically recognizing file completeness and restoring
CN108241674B (en) Method and device for extracting webpage release time
CN113656354A (en) Log classification method, system, computer device and readable storage medium
CN113297617A (en) Authority data acquisition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination