CN117762912A

CN117762912A - Data annotation quality evaluation and improvement system and method

Info

Publication number: CN117762912A
Application number: CN202311680509.0A
Authority: CN
Inventors: 骆靖元; 王乐; 曾智
Original assignee: Chengdu Huizhong Tianzhi Technology Co ltd
Current assignee: Chengdu Huizhong Tianzhi Technology Co ltd
Priority date: 2023-12-07
Filing date: 2023-12-07
Publication date: 2024-03-26

Abstract

The invention provides a data annotation quality evaluation and improvement system comprising a user data uploading unit, a data cleaning unit, a data annotation quality evaluation unit, a data annotation quality improvement unit and a data annotation quality management unit, which are in data interaction with one another; the data annotation quality evaluation unit comprehensively evaluates the accuracy, consistency and integrity of the annotation data using preset evaluation indexes, combined with the basic characteristics of the annotation data and an additional weight value for the annotators, to obtain a weighted evaluation score; the data annotation quality improvement unit automatically adjusts and corrects the annotation data according to the evaluation score and preset parameters, and automatically adjusts the annotation results through clustering and classification algorithms; the data annotation quality management unit performs sampling comparison analysis on the annotation results, grades and ranks the annotators according to the analysis results, establishes an annotation specification library, and manages and updates the annotation specifications in a unified manner.

Description

Data annotation quality evaluation and improvement system and method

Technical Field

The invention belongs to the field of data annotation, and particularly relates to a system and a method for evaluating and improving quality of data annotation.

Background

Currently, data annotation is an important element in the fields of machine learning and artificial intelligence, which involves manually marking or annotating an original dataset for use in training and evaluating models. The quality of the data annotation directly influences the training and application effects of the subsequent model. However, in practical applications, the quality of data labeling is often difficult to ensure due to subjective factors of labeling personnel, errors in the labeling process, inconsistent labeling specifications, and the like. Accordingly, there is a need for a system and method that can evaluate and improve the quality of data annotation to increase the accuracy and consistency of the data annotation.

At present, methods for evaluating and improving data annotation quality mainly include the following. Manual auditing: the traditional method is to evaluate the quality of the data annotation by manual review. Annotation personnel check the annotation data item by item for accuracy and consistency. However, this method is time-consuming and costly, and it is difficult to ensure that the evaluation is objective and consistent.

Expert evaluation: another approach is to have experts evaluate the quality of the data annotation. Experts evaluate the annotated data according to their own experience and knowledge and provide professional opinions and suggestions. However, expert resources are limited, and the assessment results may be subjective and vary between individuals.

Statistical analysis: a common method is to evaluate the quality of the data annotation by statistical analysis. For example, the accuracy, recall and F1 score of the annotation data may be calculated to measure the accuracy and consistency of the annotation. However, this method may ignore the characteristics of the annotation data and the historical performance of the annotators, and therefore cannot evaluate the quality of the data comprehensively.
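By way of non-limiting illustration, the statistical indexes named above can be computed as in the following minimal sketch. It assumes that a set of trusted reference labels is available for comparison; the function name `annotation_metrics` and its parameters are hypothetical and do not appear in the original disclosure.

```python
from typing import Dict, Sequence


def annotation_metrics(gold: Sequence[str], labeled: Sequence[str],
                       positive: str) -> Dict[str, float]:
    """Compare annotator labels with reference labels for one target class.

    `gold` holds the trusted reference labels (an assumption of this sketch),
    `labeled` holds the annotator's labels, `positive` is the class of interest.
    """
    if len(gold) != len(labeled):
        raise ValueError("gold and labeled must have the same length")
    pairs = list(zip(gold, labeled))
    tp = sum(g == positive and l == positive for g, l in pairs)
    fp = sum(g != positive and l == positive for g, l in pairs)
    fn = sum(g == positive and l != positive for g, l in pairs)
    correct = sum(g == l for g, l in pairs)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

As the paragraph above notes, such figures describe only agreement with the reference labels; they carry no information about the data characteristics or the annotator's history.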

Machine learning methods: in recent years, with the development of machine learning and artificial intelligence, some machine learning-based methods have been introduced into data annotation quality assessment. For example, the annotation data may be automatically evaluated using a classification algorithm, or analyzed for consistency using a clustering algorithm. These methods can improve the efficiency and consistency of the assessment, but still have certain limitations.

Existing methods for evaluating and improving data annotation quality thus have limitations such as subjectivity, high time cost and poor consistency. To solve these problems, a system and method for evaluating and improving the quality of data annotation are proposed.

The system comprises a data labeling quality evaluation module, a data labeling quality improvement module and a data labeling quality management module. The data labeling quality evaluation module adopts various evaluation indexes and algorithms to comprehensively evaluate the accuracy, consistency and integrity of labeling data. The data labeling quality improvement module provides automatic or semi-automatic improvement suggestions according to the evaluation result so as to improve the consistency and accuracy of data labeling. The data labeling quality management module is used for grading and ranking labeling personnel and managing and updating a labeling specification library.

The system is innovative in that it combines multiple evaluation indexes and algorithms and provides automatic improvement suggestions together with an annotator management mechanism. The system can effectively improve the quality and efficiency of data annotation and provide a reliable data basis for subsequent machine learning and artificial intelligence applications.

Accordingly, there is a need for a system and method for quality assessment and improvement of data annotation.

Disclosure of Invention

The invention provides a data annotation quality evaluation and improvement system and method, which solve the problem in the prior art that the quality of data annotation cannot be effectively ensured because of the subjective factors of annotators, errors in the annotation process, inconsistent annotation specifications and the like. Such problems can leave a model trained on erroneous, inconsistent or inaccurate annotation data, which affects the performance and application of the model.

The technical scheme of the invention is realized as follows: the data annotation quality evaluation and improvement system comprises a user data uploading unit, a data cleaning unit, a data annotation quality evaluation unit, a data annotation quality improvement unit and a data annotation quality management unit, which are in data interaction with one another;

the data annotation quality evaluation unit comprehensively evaluates the accuracy, consistency and integrity of the annotation data using preset evaluation indexes, combined with the basic characteristics of the annotation data and an additional weight value for the annotators, to obtain a weighted evaluation score;

the data annotation quality improvement unit automatically adjusts and corrects the annotation data according to the evaluation score and preset parameters, and automatically adjusts the annotation results through clustering and classification algorithms;

the data annotation quality management unit performs sampling comparison analysis on the annotation results, grades and ranks the annotators according to the analysis results, establishes an annotation specification library, and manages and updates the annotation specifications in a unified manner.

Compared with the prior art, the data annotation quality evaluation and improvement system has the following differences:

Data cleaning unit: the system includes a data cleaning unit that cleans the uploaded data before it is evaluated. Data cleaning removes noise, errors and inconsistencies in the data to ensure the quality and accuracy of the uploaded data. Processing the data in advance in this way reduces interference with the subsequent evaluation and improvement processes.

Data annotation quality evaluation unit: the data annotation quality evaluation unit uses preset evaluation indexes, combined with the basic characteristics of the annotation data and an additional weight value for the annotators, to comprehensively evaluate the accuracy, consistency and integrity of the annotation data. Introducing the annotator weight value allows the performance and influence of each annotator to be taken into account, so that the quality of the data annotation is evaluated more accurately.

Data annotation quality improvement unit: the data annotation quality improvement unit automatically adjusts and corrects the annotation data according to the evaluation score and preset parameters. Because the annotation results are adjusted automatically through clustering and classification algorithms, improvement suggestions can be provided quickly and accurately, improving the consistency and accuracy of the data annotation.

Data annotation quality management unit: the data annotation quality management unit performs sampling comparison analysis on the annotation results, and grades and ranks the annotators according to the analysis results. It also establishes an annotation specification library and manages and updates the annotation specifications in a unified manner. This optimizes the management and training of annotators and improves the quality and efficiency of the annotation team as a whole.

As a preferred embodiment, the user uploads the annotation data that requires quality evaluation and improvement through the user data uploading unit, which preprocesses the data, converts it into a format that the data cleaning unit can recognize, and passes it to the data cleaning unit.

As a preferred embodiment, the data cleaning unit first verifies whether the data sent by the user data uploading unit conforms to the format required for cleaning; if not, it feeds this back to the user data uploading unit, and if so, it processes the data.

As a preferred embodiment, when processing the data, the data cleaning unit checks whether the data contain missing values; if so, it searches the historical database for matching data and fills the missing values from the match, and it deletes the samples containing missing values when the historical database has no match; after missing-value processing is completed, noise data are identified by clustering and removed, and the data are then output to the data annotation quality evaluation unit after consistency verification and deduplication.

A method for data annotation quality evaluation and improvement, the method comprising the following steps: the user uploads the data requiring annotation quality evaluation and improvement through the user data uploading unit; the user data uploading unit verifies the format of the uploaded data to ensure that it conforms to the processing format of the data cleaning unit; after processing by the data cleaning unit, the data are comprehensively evaluated by the data annotation quality evaluation module; the annotation data are adjusted and corrected accordingly by the data annotation quality improvement module; and the data annotation quality management unit performs sampling comparison analysis on the annotation results and manages and updates the annotation specifications in a unified manner.

As a preferred embodiment, the data annotation quality management unit displays management data to the user through a visual interface with built-in table and chart formats, and the management data are presented as tables and charts.

After the technical scheme is adopted, the invention has the beneficial effects that:

Improved accuracy of data annotation: by comprehensively evaluating the accuracy, consistency, integrity and other aspects of the data annotation, the system helps identify and correct errors and inaccuracies in the annotation data, thereby improving annotation accuracy.

Improved consistency of data annotation: by evaluating and improving the consistency of the annotation data, the system reduces inconsistent annotation results and makes the results of different annotators more consistent and reliable.

Improved efficiency of data annotation: by providing automatic or semi-automatic improvement suggestions, the system helps annotators adjust and correct annotation data quickly, thereby improving annotation efficiency and reducing repeated labor and time cost.

Optimized annotator management: by grading and ranking annotators, the system motivates them to improve annotation quality and supports rewards and training opportunities, so that the management and training of annotators are optimized and the quality and efficiency of the annotation team as a whole are improved.

A reliable data basis: through effective evaluation and improvement of data annotation quality, the system provides accurate and consistent annotation data for subsequent model training and application, thereby improving the performance and application effect of the model.

Drawings

In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of the system of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Examples:

As shown in FIG. 1, the data annotation quality evaluation and improvement system comprises a user data uploading unit, a data cleaning unit, a data annotation quality evaluation unit, a data annotation quality improvement unit and a data annotation quality management unit, which are in data interaction with one another.

The working principle and workflow of the data annotation quality evaluation and improvement system are as follows.

User data uploading unit: the user uploads the data requiring annotation quality evaluation and improvement to the system through the user data uploading unit. The user may select a single file or upload multiple files in bulk.

Data cleaning unit: the system cleans the uploaded data before it is evaluated. The data cleaning unit removes noise, errors and inconsistencies in the data so as to ensure the quality and accuracy of the uploaded data.

Data annotation quality evaluation unit: after cleaning, the uploaded data enter the data annotation quality evaluation unit. The unit uses preset evaluation indexes, combined with the basic characteristics of the annotation data and an additional weight value for the annotators, to comprehensively evaluate the accuracy, consistency and integrity of the annotation data. The evaluation yields a weighted evaluation score.
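The disclosure does not give a formula for the evaluation weight score. Purely as an illustrative sketch, one possible reading is a weighted average of the three indexes that is then scaled by the annotator's additional weight. The index weights, the multiplicative use of `annotator_weight`, the clamping to the range 0 to 1, and all names below are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexWeights:
    """Hypothetical preset weights for the three evaluation indexes."""
    accuracy: float = 0.5
    consistency: float = 0.3
    integrity: float = 0.2


def evaluation_weight_score(accuracy: float, consistency: float,
                            integrity: float, annotator_weight: float = 0.0,
                            weights: IndexWeights = IndexWeights()) -> float:
    """Combine index values in [0, 1] into a single score in [0, 1].

    `annotator_weight` is treated here as a relative bonus or penalty
    (for example +0.1 for a historically reliable annotator).
    """
    total = weights.accuracy + weights.consistency + weights.integrity
    base = (weights.accuracy * accuracy
            + weights.consistency * consistency
            + weights.integrity * integrity) / total
    score = base * (1.0 + annotator_weight)
    # Clamp so that a large bonus cannot push the score above 1.
    return max(0.0, min(1.0, score))
```

Under these assumptions, index values of 0.9, 0.8 and 1.0 give a base score of about 0.89, which an annotator weight of 0.1 raises to about 0.98.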

Data annotation quality improvement unit: the data annotation quality improvement unit automatically adjusts and corrects the annotation data according to the evaluation score and preset parameters. The annotation results are adjusted automatically through clustering and classification algorithms to improve the consistency and accuracy of the data annotation.
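The clustering and classification algorithms are not specified in the disclosure. As a hedged sketch only, the following example substitutes a simple k-nearest-neighbor vote: an item whose evaluation score falls below a preset threshold takes the majority label of its nearest neighbors in feature space. The feature vectors, the threshold `score_threshold`, the value of `k` and the function name are all hypothetical.

```python
from collections import Counter
from math import dist
from typing import List, Sequence


def knn_relabel(features: Sequence[Sequence[float]], labels: Sequence[str],
                scores: Sequence[float], score_threshold: float = 0.6,
                k: int = 3) -> List[str]:
    """Return a corrected copy of `labels`.

    Only items scoring below `score_threshold` are candidates for correction.
    A candidate is relabeled when a strict majority of its k nearest
    neighbors (by Euclidean distance) share one label.
    """
    corrected = list(labels)
    for i, (point, score) in enumerate(zip(features, scores)):
        if score >= score_threshold:
            continue  # trusted annotation, leave unchanged
        others = [j for j in range(len(features)) if j != i]
        nearest = sorted(others, key=lambda j: dist(point, features[j]))[:k]
        if not nearest:
            continue
        majority, count = Counter(labels[j] for j in nearest).most_common(1)[0]
        if count > len(nearest) / 2:
            corrected[i] = majority
    return corrected
```

In a semi-automatic variant, the proposed label could instead be returned as a suggestion for the annotator to confirm rather than being applied directly.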

Data annotation quality management unit: the data annotation quality management unit performs sampling comparison analysis on the annotation results, and grades and ranks the annotators according to the analysis results. It also establishes an annotation specification library and manages and updates the annotation specifications in a unified manner. In this way the management and training of annotators are optimized, and the quality and efficiency of the annotation team as a whole are improved.
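As a non-limiting sketch of the sampling comparison analysis, the example below draws a random sample of annotation records, compares each sampled label with a trusted reference label, and grades and ranks the annotators by their agreement rate. The record layout, the grade thresholds and the availability of reference labels are assumptions of this example, not details taken from the disclosure.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Record = Tuple[str, str, str]  # (annotator, item_id, label), hypothetical layout


def grade_annotators(records: Sequence[Record], reference: Dict[str, str],
                     sample_size: int, seed: int = 0,
                     thresholds: Sequence[Tuple[float, str]] = ((0.9, "A"), (0.75, "B")),
                     ) -> List[Tuple[str, float, str]]:
    """Sample records, score each annotator, and return a ranked list.

    Each result entry is (annotator, agreement_rate, grade), sorted from the
    highest agreement rate to the lowest. Rates below every threshold get "C".
    """
    rng = random.Random(seed)
    sample = rng.sample(list(records), min(sample_size, len(records)))

    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for annotator, item_id, label in sample:
        totals[annotator] += 1
        hits[annotator] += int(reference[item_id] == label)

    ranking = []
    for annotator, total in totals.items():
        rate = hits[annotator] / total
        grade = next((g for limit, g in thresholds if rate >= limit), "C")
        ranking.append((annotator, rate, grade))
    ranking.sort(key=lambda entry: (-entry[1], entry[0]))
    return ranking
```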

Through the working principle and the working flow, the system realizes the functions of data uploading, data cleaning, data marking quality evaluation, data marking quality improvement and data marking quality management. After the user uploads the data, the system finally provides the improved data and the improved evaluation result through links such as data cleaning, evaluation and improvement so as to improve the quality and efficiency of the data annotation. The workflow can effectively improve the accuracy, consistency and management effect of data annotation, and provides a reliable data basis for subsequent machine learning and artificial intelligence application.

The user uploads the annotation data that requires quality evaluation and improvement through the user data uploading unit, which preprocesses the data, converts it into a data format that the data cleaning unit can recognize, and passes it to the data cleaning unit. The user data uploading unit is a component of the data annotation quality evaluation and improvement system; its main function is to deliver the data submitted by the user to the data cleaning unit, and the preprocessing it performs ensures that the data can be correctly identified and processed there. In the user data uploading unit, the user selects the annotation data files to upload, either as a single file or as a batch of multiple files. The uploaded data may be text, images, video and the like, depending on the design and application requirements of the system.

Once the user selects and uploads the annotation data, the system pre-processes the data. One important preprocessing step is data format conversion, among others. The system will convert the data uploaded by the user into a data format recognizable by the data cleansing unit. For example, for image data, the system may convert image files of different formats into a standard image format supported by the system; for text data, the system may convert text files of different encoding formats to a unified text format.
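The text conversion mentioned above can be sketched as follows. The example assumes that the source encoding of each uploaded file is known or declared by the user, and that UTF-8 with Unix line endings is the unified target format; both choices, and the function name `to_utf8`, are illustrative assumptions.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode uploaded text as UTF-8 and normalize line endings.

    Raises UnicodeDecodeError if `raw` is not valid in `source_encoding`,
    which the uploading unit could report back to the user.
    """
    text = raw.decode(source_encoding)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")
```

For example, a GBK-encoded file with Windows line endings would be passed as `to_utf8(data, "gbk")`.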

In the data cleaning unit, the system can correctly identify and process the data uploaded by the user, and perform subsequent operations such as data cleaning, evaluation, improvement and the like. The data cleansing unit may remove noise, errors or inconsistencies in the data to ensure quality and accuracy of the uploaded data. Through preprocessing and data format conversion, the annotation data uploaded by the user can smoothly enter the data cleaning unit, and a reliable data basis is provided for subsequent evaluation and improvement.

The function of the user data uploading unit is thus to preprocess the annotation data uploaded by the user and convert it into a data format that the data cleaning unit can recognize, so that the data can be correctly processed and evaluated by the system. This step is an important part of the data flow in the data annotation quality evaluation and improvement system and provides the necessary preparation for subsequent data processing and quality evaluation.

The data cleaning unit first verifies whether the data sent by the user data uploading unit conforms to the format required for cleaning; if not, it feeds this back to the user data uploading unit, and if so, it processes the data. The data cleaning unit is a component of the data annotation quality evaluation and improvement system, and its main function is to verify and process the data sent by the user data uploading unit.

First, the data cleansing unit verifies the data received from the user data uploading unit. The purpose of the verification is to ensure that the data is in a format that meets the cleaning requirements. For example, for image data, verification may include checking whether the format, resolution, color space, etc. of the image file meets system requirements; for text data, verification may include checking whether the encoding format, character set, etc. of the text file meets system requirements.

If the verification finds that the data does not meet the format of the cleaning requirement, the data cleaning unit sends feedback information to the user data uploading unit to prompt the user that the data format is incorrect and require the user to upload the data meeting the requirement again. Therefore, the data cleaning unit can be ensured to only process the data meeting the requirements, the abnormal or wrong data is prevented from being processed, and the accuracy and the efficiency of data processing are improved.
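The verify-then-feed-back behavior described above can be sketched as follows. The required field names and the message wording are hypothetical; the disclosure does not define a record schema.

```python
from typing import Any, List, Sequence, Tuple

# Hypothetical fields that every annotation record must carry.
REQUIRED_FIELDS = ("item_id", "content", "label")


def verify_format(records: Sequence[Any]) -> Tuple[bool, List[str]]:
    """Check uploaded records against the format required for cleaning.

    Returns (True, []) when every record is acceptable; otherwise returns
    (False, messages), where the messages can be fed back to the uploader.
    """
    problems: List[str] = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(
                f"record {index}: expected an object, got {type(record).__name__}")
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append(f"record {index}: missing fields {missing}")
    return (not problems, problems)
```

Only when the first element of the result is true would the data proceed to the cleaning operations; otherwise the messages are returned to the user data uploading unit.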

If the data is verified to be in a format that meets the cleaning requirements, the data cleaning unit processes the data. The specific operations depend on the design and application requirements of the system. Common data processing operations include removing noise from the data, handling missing values, deduplication, normalization, and the like. The aim of data cleaning is to improve the quality and accuracy of the data and to provide a reliable data basis for the subsequent evaluation and improvement of data annotation quality.

The data cleaning unit plays a key role in the data annotation quality evaluation and improvement system. By verifying whether the data uploaded by the user meets the format required for cleaning, it ensures that only conforming data is processed; by removing noise, handling missing values and so on, it improves the quality and accuracy of that data. Subsequent data processing and evaluation can therefore be based on high-quality data, which improves the accuracy and effect of data annotation quality evaluation and improvement.

When processing the data, the data cleaning unit checks whether the data contain missing values; if so, it searches the historical database for matching data and fills the missing values from the match, and it deletes the samples containing missing values when the historical database has no match. After missing-value processing is completed, noise data are identified by clustering and removed, and the data are then output to the data annotation quality evaluation unit after consistency verification and deduplication. This series of steps improves the quality and accuracy of the data.

The data cleansing unit checks whether there is a missing value in the data. If there is a missing value, the system will use the historical database to perform a data match to find the data associated with the missing value. Through data matching of the historical database, missing values can be filled in, so that the data is more complete. If there is no data in the history database that matches the missing values, the system will delete the samples that contain the missing values to ensure the integrity and accuracy of the data.
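The fill-or-delete rule described above can be sketched as follows. The example assumes that samples are dictionaries in which a missing value is represented by `None`, and that the historical database can be queried as a mapping from a record key to previously seen field values; these representations and the name `fill_missing` are assumptions of this example.

```python
from typing import Any, Dict, Hashable, List, Sequence


def fill_missing(samples: Sequence[Dict[str, Any]],
                 history: Dict[Hashable, Dict[str, Any]],
                 key: str = "item_id") -> List[Dict[str, Any]]:
    """Fill missing fields from `history`, dropping samples that cannot be filled.

    A sample is kept unchanged when it has no missing values, kept with
    filled values when `history` supplies every missing field, and
    deleted otherwise.
    """
    kept: List[Dict[str, Any]] = []
    for sample in samples:
        missing = [name for name, value in sample.items() if value is None]
        if not missing:
            kept.append(dict(sample))
            continue
        match = history.get(sample.get(key), {})
        if all(match.get(name) is not None for name in missing):
            filled = dict(sample)
            filled.update({name: match[name] for name in missing})
            kept.append(filled)
        # No usable match in the historical data: the sample is deleted.
    return kept
```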

The data cleaning unit then performs data denoising. Through a clustering algorithm, the system identifies noise in the data, that is, outliers or anomalous points. Identifying and deleting noise data improves the quality and accuracy of the data and prevents noise from affecting subsequent data processing and evaluation.
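The disclosure does not name the clustering algorithm used for denoising. As one hedged possibility, the sketch below applies the noise rule of density-based clustering in the style of DBSCAN: a point is noise when it is neither a core point nor within the neighborhood radius of a core point. The parameters `eps` and `min_pts`, and the use of numeric feature vectors, are assumptions of this example.

```python
from math import dist
from typing import List, Sequence, Tuple


def split_noise(points: Sequence[Sequence[float]], eps: float = 1.5,
                min_pts: int = 3) -> Tuple[List[int], List[int]]:
    """Return (kept_indices, noise_indices) using a DBSCAN-style rule.

    A core point has at least `min_pts` points (itself included) within
    distance `eps`. A non-core point that has no core point within `eps`
    is treated as noise.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    noise = [
        i for i in range(n)
        if i not in core and not any(j in core for j in neighbors[i])
    ]
    noise_set = set(noise)
    kept = [i for i in range(n) if i not in noise_set]
    return kept, noise
```

The pairwise distance computation is quadratic in the number of points, which is acceptable for a sketch but would need an index structure for large datasets.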

After missing-value processing and data denoising are completed, the data cleaning unit performs data consistency verification and data deduplication. Data consistency verification ensures that the logical relationships and constraints in the data are satisfied; for classification labels, for example, it ensures that the label values are consistent and correct. Data deduplication detects and deletes duplicate records so that duplicate data does not affect subsequent analysis and evaluation.
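A minimal sketch of these two checks for classification labels is given below. It assumes that each record carries `content` and `label` fields, that the set of permitted label values is known, and that two records are duplicates when both fields are identical; these definitions are illustrative only.

```python
from typing import Any, Dict, List, Sequence, Set, Tuple


def verify_and_deduplicate(records: Sequence[Dict[str, Any]],
                           allowed_labels: Set[str],
                           ) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Return (valid_unique_records, rejected_records).

    Records whose label is not in `allowed_labels` fail the consistency
    check and are rejected. Among the remaining records, only the first
    occurrence of each (content, label) pair is kept.
    """
    seen: Set[Tuple[Any, Any]] = set()
    valid: List[Dict[str, Any]] = []
    rejected: List[Dict[str, Any]] = []
    for record in records:
        if record["label"] not in allowed_labels:
            rejected.append(record)
            continue
        fingerprint = (record["content"], record["label"])
        if fingerprint in seen:
            continue  # exact duplicate of an earlier record
        seen.add(fingerprint)
        valid.append(record)
    return valid, rejected
```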

After data processing, consistency verification and deduplication, the data cleaning unit outputs the processed data to the data labeling quality evaluation unit. Thus, the quality and accuracy of the data are improved through the processing of the data cleaning unit, and a more reliable data basis is provided for the subsequent quality evaluation of the data labeling.

In summary, the data cleaning unit processes and optimizes the data through missing-value processing, data denoising, data consistency verification, data deduplication and the like so as to improve the quality and accuracy of the data. These steps remove noise, fill missing values and ensure the integrity and consistency of the data, providing high-quality data for the subsequent evaluation of data annotation quality.

A method for data annotation quality evaluation and improvement: the user uploads the data requiring annotation quality evaluation and improvement through the user data uploading unit; the user data uploading unit verifies the format of the uploaded data to ensure that it conforms to the processing format of the data cleaning unit; after processing by the data cleaning unit, the data are comprehensively evaluated by the data annotation quality evaluation module; the annotation data are adjusted and corrected accordingly by the data annotation quality improvement module; and the data annotation quality management unit performs sampling comparison analysis on the annotation results and manages and updates the annotation specifications in a unified manner. The method comprises the following steps:

uploading user data: the user uploads the data which needs to be subjected to marking quality evaluation and improvement through a user data uploading unit. This may be a single file or a batch upload of multiple files. The user data uploading unit performs format verification on the uploaded data to ensure that the data meets the processing format requirement of the data cleaning unit.

Data cleaning: after passing through the user data uploading unit, the uploaded data enter the data cleaning unit for processing. The data cleaning unit cleans and preprocesses the data, removing noise, errors and inconsistencies so as to ensure the quality and accuracy of the data.

Data annotation quality assessment: the data after the data cleaning treatment enters a data labeling quality evaluation module. The module carries out comprehensive evaluation on the uploaded data, adopts preset evaluation indexes and algorithms, and combines basic characteristics of the data to evaluate the accuracy, consistency, integrity and the like of the data. The evaluation result can obtain an evaluation weight score for subsequent data labeling quality improvement.

Data annotation quality improvement: according to the evaluation score and preset parameters, the data annotation quality improvement module adjusts and corrects the annotation data accordingly. The module can improve the annotation data in an automatic or semi-automatic manner, for example by providing correction suggestions or by automatically adjusting annotation results using clustering and classification algorithms, so as to improve the consistency and accuracy of the data annotation.

Data annotation quality management: the data annotation quality management unit performs sampling comparison analysis on the data whose annotation quality has been improved. Based on the annotation results, the analysis grades and ranks the annotators, and an annotation specification library is established so that the annotation specifications are managed and updated in a unified manner. In this way the management and training of annotators are optimized, and the quality and efficiency of the annotation team as a whole are improved.

The data marking quality evaluation and improvement method ensures the quality and accuracy of data through the steps of uploading user data, cleaning the data, evaluating the data marking quality, improving the data marking quality, managing the data marking quality and the like, and provides corresponding improvement measures and management mechanisms so as to improve the quality and efficiency of the data marking. The method provides a reliable data base for subsequent machine learning and artificial intelligence application, and improves the performance and application effect of the model.
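The order of the steps can be summarized in the following sketch, in which each unit is represented by a caller-supplied function. The function signatures and the shape of the returned dictionary are hypothetical; the sketch shows only how the stages hand data to one another, including the early return when format verification fails.

```python
from typing import Any, Callable, Dict, List, Tuple

Records = List[Dict[str, Any]]


def run_pipeline(uploaded: Records,
                 verify: Callable[[Records], Tuple[bool, List[str]]],
                 clean: Callable[[Records], Records],
                 evaluate: Callable[[Records], float],
                 improve: Callable[[Records, float], Records],
                 manage: Callable[[Records], Dict[str, Any]]) -> Dict[str, Any]:
    """Run upload verification, cleaning, evaluation, improvement, management."""
    ok, problems = verify(uploaded)
    if not ok:
        # Feedback to the user: the data must be uploaded again.
        return {"status": "rejected", "problems": problems}
    cleaned = clean(uploaded)
    score = evaluate(cleaned)
    improved = improve(cleaned, score)
    report = manage(improved)
    return {"status": "done", "score": score, "data": improved, "report": report}
```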

The data annotation quality management unit displays management data to the user through a visual interface with built-in table and chart formats, so that the data can be presented more intuitively.

In the data annotation quality management unit, management data can be displayed in table form. The table may contain multiple columns, each corresponding to a different data field or index, such as the annotator's name, grade, annotation quality score, and so on. Through the table, the user can view and compare the grades and annotation quality scores of different annotators and thereby understand their performance and quality.

The visual interface may also present management data in chart form. The chart may take many forms, such as bar charts, line charts and pie charts, to present the distribution, trend and proportions of the data more intuitively. For example, a bar chart may show the distribution of grades among annotators, a line chart may show the trend of annotation quality scores over time, and a pie chart may show the proportion of each grade. In this way, the user can understand the characteristics and trends of the management data more intuitively.
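As an illustrative sketch of the data preparation behind such a display, the example below renders management data as a plain-text table and computes the share of each grade, which is the input a pie chart or bar chart would need. The row fields (`annotator`, `grade`, `score`) and function names are hypothetical, and the actual drawing of charts is left to whatever charting component the interface uses.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence


def render_table(rows: Sequence[Dict[str, Any]], columns: Sequence[str]) -> str:
    """Format rows as a fixed-width text table with a header and a rule line."""
    widths = [
        max([len(col)] + [len(str(row[col])) for row in rows])
        for col in columns
    ]
    header = " | ".join(col.ljust(w) for col, w in zip(columns, widths))
    rule = "-+-".join("-" * w for w in widths)
    body: List[str] = [
        " | ".join(str(row[col]).ljust(w) for col, w in zip(columns, widths))
        for row in rows
    ]
    return "\n".join([header, rule, *body])


def grade_distribution(rows: Sequence[Dict[str, Any]]) -> Dict[str, float]:
    """Return the share of annotators in each grade, keyed by grade."""
    counts = Counter(row["grade"] for row in rows)
    total = sum(counts.values())
    return {grade: n / total for grade, n in sorted(counts.items())}
```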

By means of the display modes of the tables and the charts, the visual interface can provide more visual and easy-to-understand data display. The user can analyze, compare and decide the management data through the table and the chart, so that the data labeling quality is better managed and optimized. The visual display mode enables a user to acquire the information of the management data more quickly and accurately, and provides reference and support for subsequent management decision and labeling quality improvement.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims

1. A data annotation quality evaluation and improvement system, characterized by comprising a user data uploading unit, a data cleaning unit, a data annotation quality evaluation unit, a data annotation quality improvement unit and a data annotation quality management unit, which are in data interaction with one another.

2. The data annotation quality evaluation and improvement system as claimed in claim 1, wherein: the user uploads the annotation data that requires quality evaluation and improvement through the user data uploading unit, and the user data uploading unit preprocesses the data, converts it into a data format that the data cleaning unit can recognize, and passes it to the data cleaning unit.

3. The data annotation quality evaluation and improvement system as claimed in claim 1, wherein: the data cleaning unit first verifies whether the data sent by the user data uploading unit conforms to the format required for cleaning; if not, it feeds this back to the user data uploading unit, and if so, it processes the data.

4. The data annotation quality evaluation and improvement system as claimed in claim 3, wherein: when processing the data, the data cleaning unit checks whether the data contain missing values; if so, it searches the historical database for matching data and fills the missing values from the match, and it deletes the samples containing missing values when the historical database has no match; after missing-value processing is completed, noise data are identified by clustering and removed, and the data are then output to the data annotation quality evaluation unit after consistency verification and deduplication.

5. A method for evaluating and improving the quality of data annotation, characterized by comprising the following steps: the user uploads the data requiring annotation quality evaluation and improvement through a user data uploading unit; the user data uploading unit verifies the format of the uploaded data to ensure that it conforms to the processing format of a data cleaning unit; after processing by the data cleaning unit, the data are comprehensively evaluated by a data annotation quality evaluation module; the annotation data are adjusted and corrected accordingly by a data annotation quality improvement module; and a data annotation quality management unit performs sampling comparison analysis on the annotation results and manages and updates the annotation specifications in a unified manner.

6. The method for evaluating and improving the quality of data annotation according to claim 5, wherein the data annotation quality management unit displays management data to a user through a visual interface, the visual interface has built-in table and chart formats, and the management data are displayed through the table and the chart.