CN116821647A

CN116821647A - Optimization method, device and equipment for data annotation based on sample deviation evaluation

Info

Publication number: CN116821647A
Application number: CN202311078686.1A
Authority: CN
Inventors: 李常宝; 顾平莉; 王书龙; 艾中良; 袁媛; 李茜
Original assignee: CETC 15 Research Institute
Current assignee: CETC 15 Research Institute
Priority date: 2023-08-25
Filing date: 2023-08-25
Publication date: 2023-09-29
Anticipated expiration: 2043-08-25
Also published as: CN116821647B

Abstract

The embodiments of this specification disclose an optimization method, device and equipment for data annotation based on sample deviation evaluation. The method comprises: screening, based on the obtained distribution density maps of the data set and the sample set, sample distribution areas that do not conform to the data distribution density, and adding the unlabeled data in those areas to a sample set to be processed; adding, based on the accuracy rate change value of the distribution density maps of the data set and the sample set over a preset time interval, the unmodified or confirmed data among the data whose accuracy rate change value is greater than or equal to a preset threshold to the sample set to be processed; and, if the number of samples in the sample set to be processed is greater than or equal to the required scale of samples to be processed, or the first deviation coefficient is greater than or equal to a first preset value and the second deviation coefficient is less than or equal to a second preset value, outputting the sample set to be processed and its sample deviation index so as to optimize the data annotation.

Description

Optimization method, device and equipment for data annotation based on sample deviation evaluation

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, and a device for optimizing data annotation based on sample deviation evaluation.

Background

Sample data is a marked data set and can be used for training a specific model, and the quality of the sample data directly influences the recognition effect of the model.

In the prior art, sample data is generally labeled with automatic labeling technology: a data labeling model is trained on a labeled sample set produced manually by users and is then used to produce labeled data for the field. User experience is thereby built into the model, labeling shifts from manual work to automatic machine labeling, and labeling efficiency improves. Existing data labeling methods, however, focus only on the shift from manual to automatic labeling and pay no attention to the quality of the sample data, which ultimately degrades the recognition effect and accuracy of the model.

Based on this, in order to improve the effect of model recognition, an optimization method of data annotation based on sample deviation evaluation is required to perfect sample data in the sample data annotation process.

Disclosure of Invention

The embodiment of the specification provides a method, a device and equipment for optimizing data annotation based on sample deviation evaluation, which are used for solving the following technical problems: the existing data labeling method only focuses on manual labeling to automatic labeling of a machine, but does not focus on the quality of sample data, and finally affects the recognition effect and accuracy of the model.

In order to solve the above technical problems, the embodiments of the present specification are implemented as follows:

the embodiment of the specification provides an optimization method for data annotation based on sample deviation evaluation, which comprises the following steps:

dividing a data set and a sample set based on data attributes of the data set to obtain a distribution density map of the data set and the sample set;

screening sample distribution areas which do not accord with the data distribution density based on the data set and the distribution density diagram of the sample set, and adding unlabeled data in the sample distribution areas which do not accord with the data distribution density into a sample set to be processed;

adding unmodified or confirmed data in the data with the accuracy rate change value being more than or equal to a preset threshold value to the sample set to be processed based on the accuracy rate change value of the distribution density maps of the data set and the sample set through a preset time interval;

outputting the sample set to be processed and a sample deviation index of the sample set to be processed if the number of the sample set to be processed is larger than or equal to the required scale of the sample to be processed or the first deviation coefficient is larger than or equal to a first preset value and the second deviation coefficient is smaller than or equal to a second preset value;

And labeling the sample of the sample set to be processed based on the sample set to be processed and the sample deviation index of the sample set to be processed, and forming a supplementary sample so as to optimize data labeling.

The embodiment of the specification also provides an optimizing device for data annotation based on sample deviation evaluation, which comprises the following components:

the data visualization module is used for dividing the data set and the sample set based on the data attribute of the data set to obtain a distribution density map of the data set and the sample set;

the first discovery module of the sample to be processed screens a sample distribution area which does not accord with the data distribution density based on the data set and the distribution density diagram of the sample set, and adds unlabeled data in the sample distribution area which does not accord with the data distribution density into the sample set to be processed;

the second discovery module of the sample to be processed adds unmodified or confirmed data in the data with the accuracy rate change value being more than or equal to a preset threshold value into the sample set to be processed based on the data set and the accuracy rate change value of the distribution density map of the sample set at preset time intervals;

the sample analysis module to be processed outputs the sample collection to be processed and the sample deviation index of the sample collection to be processed if the number of the sample collection to be processed is larger than or equal to the required scale of the sample to be processed or the first deviation coefficient is larger than or equal to a first preset value and the second deviation coefficient is smaller than or equal to a second preset value;

And the sample post-processing module is used for marking the sample of the sample set to be processed based on the sample set to be processed and the sample deviation index of the sample set to be processed, and forming a supplementary sample so as to optimize data marking.

The embodiment of the specification also provides an electronic device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to:

In the optimization method for data annotation based on sample deviation evaluation provided by the embodiments of this specification, the data set distribution and the sample set distribution are mapped into the same one-dimensional or two-dimensional space, so that the distribution difference between the data set and the sample set can be found intuitively and visually. Combined with accuracy change analysis, this enables rapid discovery of deviated samples and evaluation of the capability of the current samples when applied to model training, followed by optimization and refinement of the samples. The method thus converts an 'intangible' data collection into a 'tangible' data distribution, accurately locates deviated sample areas, quickly finds low-quality sample data, and supports subsequent operations such as sample correction, thereby optimizing the data annotation.

Drawings

In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some of the embodiments described in the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a system architecture of an optimization method for data annotation based on sample deviation assessment according to an embodiment of the present disclosure;

FIG. 2 is a schematic flow chart of an optimization method for data annotation based on sample deviation evaluation according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram illustrating an optimization method for data annotation based on sample deviation estimation according to an embodiment of the present disclosure;

FIG. 4 is a flowchart of a core algorithm of an optimization method for data annotation based on sample deviation evaluation according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of an optimizing apparatus for labeling data based on sample deviation evaluation according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of another optimization apparatus for labeling data based on sample bias evaluation according to an embodiment of the present disclosure.

Detailed Description

In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.

Automatic labeling technology is essentially a data labeling technology: a labeling model is trained on samples labeled by users so that the machine can then label samples automatically. It does not, however, involve judging or correcting the quality of the samples, which is detrimental to the accuracy of subsequent model training.

Based on this, the embodiments of this specification provide an optimization method for data annotation based on sample deviation evaluation. The data set distribution and the sample set distribution are mapped into the same one-dimensional or two-dimensional space, so that the distribution difference between the data set and the sample set can be found intuitively and visually. Combined with accuracy change analysis, this enables rapid discovery of deviated samples and evaluation of the capability of the current samples when applied to model training, followed by optimization and refinement of the samples. The method thus converts an 'intangible' data set into a 'tangible' data distribution, accurately locates deviated sample areas, quickly finds low-quality sample data, and supports subsequent operations such as sample correction, thereby optimizing the data annotation.

Fig. 1 is a schematic system architecture diagram of an optimization method for data labeling based on sample deviation evaluation according to an embodiment of the present disclosure. As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

The terminal devices 101, 102, 103 interact with the server 105 via the network 104 to receive or send messages and the like. Various client applications may be installed on the terminal devices 101, 102, 103, for example a dedicated program that performs optimization of data annotation based on sample deviation evaluation.

The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be a variety of special purpose or general purpose electronic devices including, but not limited to, smartphones, tablets, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the above-listed electronic devices. Which may be implemented as multiple software or software modules (e.g., multiple software or software modules for providing distributed services) or as a single software or software module.

The server 105 may be a server providing various services, such as a back-end server providing services for client applications installed on the terminal devices 101, 102, 103. For example, the server may perform the optimization of data annotation based on sample deviation evaluation so that the optimized annotation result is displayed on the terminal devices 101, 102, 103.

The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When server 105 is software, it may be implemented as multiple software or software modules (e.g., multiple software or software modules for providing distributed services), or as a single software or software module.

Fig. 2 is a flowchart of an optimization method for data labeling based on sample deviation evaluation according to an embodiment of the present disclosure. From the program perspective, the execution subject of the flow may be a program installed on an application server or an application terminal. It is understood that the method may be performed by any apparatus, device, platform, cluster of devices having computing, processing capabilities. As shown in fig. 2, the optimization method includes:

Step S201: dividing the data set and the sample set based on the data attributes of the data set to obtain a distribution density map of the data set and the sample set.

In the embodiment of the present disclosure, the data set and the sample set are both a set of structured data or a set of data that can be converted into structured data, specifically, the data type may be image data, text data, video data, audio data, or the like. In particular, the dataset is an unlabeled original dataset and the sample set is a set of samples labeled based on the dataset. In the embodiment of the present disclosure, the source of the sample set is not limited, and may be a sample set marked by a user, a sample set marked by a machine, or other sources. In the embodiment of the present specification, the data in the data set may be a key, and the sample set data may be in the form of a key-value.

In this embodiment of the present disclosure, the dividing the data set and the sample set based on the data attribute of the data set to obtain a distribution density map of the data set and the sample set specifically includes:

based on the data attribute of the data set, carrying out two-dimensional meshing division on the data set and the sample set to obtain a two-dimensional data distribution state diagram of the data set and the sample set;

Or alternatively

And based on the data attribute of the data set, performing linear piecewise division on the data set and the sample set to obtain a one-dimensional data distribution state diagram of the data set and the sample set.

With two-dimensional grid division, a first data attribute and a second data attribute are selected from the data attributes of the data set; the range Max(first data attribute) - Min(first data attribute) is divided equally into m1 segments and the range Max(second data attribute) - Min(second data attribute) is divided equally into n1 segments to construct an m1 x n1 grid matrix; the records of the data set and the sample set are then placed into the grid matrix according to their values of the first data attribute and the second data attribute, forming the two-dimensional data distribution state diagram;

or alternatively

With linear piecewise division, a third data attribute is selected from the data attributes of the data set; the range Max(third data attribute) - Min(third data attribute) is divided equally into m2 segments to construct a linear interval of m2 segments; the records of the data set and the sample set are then placed into the linear interval according to their value of the third data attribute, forming the one-dimensional data distribution state diagram.
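The two division schemes can be illustrated with a minimal Python sketch. It assumes each record is a dictionary of numeric attribute values and that the Min/Max bounds are taken from the data set; the function names (`bin_index`, `grid_counts`, `segment_counts`) and the attribute names `size` and `time` are hypothetical and do not come from this specification.

```python
from collections import Counter


def bin_index(value, lo, hi, segments):
    """Map a value in [lo, hi] to an equal-width segment index 0..segments-1."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * segments)
    # Clamp so that the maximum value falls into the last segment.
    return max(0, min(idx, segments - 1))


def grid_counts(records, attr_a, attr_b, bounds_a, bounds_b, m1, n1):
    """Count records per cell of an m1 x n1 grid (two-dimensional division)."""
    counts = Counter()
    for rec in records:
        i = bin_index(rec[attr_a], bounds_a[0], bounds_a[1], m1)
        j = bin_index(rec[attr_b], bounds_b[0], bounds_b[1], n1)
        counts[(i, j)] += 1
    return counts


def segment_counts(records, attr, bounds, m2):
    """Count records per segment of an m2-segment line (linear division)."""
    counts = Counter()
    for rec in records:
        counts[bin_index(rec[attr], bounds[0], bounds[1], m2)] += 1
    return counts


# Illustrative data only: four records with two hypothetical attributes.
dataset = [
    {"size": 0, "time": 0},
    {"size": 5, "time": 10},
    {"size": 10, "time": 20},
    {"size": 10, "time": 0},
]
grid = grid_counts(dataset, "size", "time", (0, 10), (0, 20), 2, 2)
line = segment_counts(dataset, "size", (0, 10), 2)
```

Running the same functions over the sample set with the same bounds yields the per-cell sample counts, so both distributions share one grid.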

In an embodiment of the present disclosure, the data attributes of the data set include: data source information, data acquisition time, data definition information and data size. The attributes used for the two-dimensional grid division or the linear piecewise division are preferably ones that disperse the data in the data set and the sample set well, ideally distributing it relatively uniformly.

In the embodiment of the present disclosure, the first data attribute and the second data attribute are different, and the third data attribute may be the same as or may be different from the first data attribute or the second data attribute.

In a specific embodiment, the first data attribute, the second data attribute and the third data attribute may be obtained from the data attributes of the data set by manual or automatic filtering.

Step S203: screening sample distribution areas which do not accord with the data distribution density based on the distribution density map of the data set and the sample set, and adding unlabeled data in the sample distribution areas which do not accord with the data distribution density into a sample set to be processed.

In this embodiment of the present disclosure, the screening, based on the data set and the distribution density map of the sample set, a sample distribution area that does not conform to the data distribution density, and adding unlabeled data in the sample distribution area that does not conform to the data distribution density to a sample set to be processed specifically includes:

Calculating a first sample deviation distribution value of each data block based on the first record number of the data set and the first record number of the sample set in each data block in the two-dimensional data distribution state diagram, and marking the data block with the first sample deviation distribution value not larger than a deviation distribution coefficient as a first sample deviation block;

selecting unlabeled data meeting deviation preset conditions from the first sample deviation block and adding the unlabeled data to the sample set to be processed;

or alternatively

Calculating a second sample deviation distribution value of each data interval based on the second record number of the data set and the second record number of the sample set of each data interval in the one-dimensional data distribution state diagram, and marking the data interval with the second sample deviation distribution value not larger than the deviation distribution coefficient as a first sample deviation interval;

and adding unlabeled data meeting the deviation preset condition in the first sample deviation interval to the sample set to be processed.

In the embodiment of the present disclosure, a sample distribution area that does not conform to a data distribution density is screened based on a data set and a distribution density map of the sample set, and unlabeled data in the sample distribution area that does not conform to the data distribution density is added to a sample set to be processed, where the unlabeled data in the sample distribution area that does not conform to the data distribution density is data derived from the data set, that is, the data is local data in the data set.

In the embodiment of the present specification, the first sample deviation distribution value is a ratio of the first record number of the sample set to the first record number of the data set in each data block in the two-dimensional data distribution state diagram, that is, the first sample deviation distribution value=the first record number of the sample set in each data block in the two-dimensional data distribution state diagram/the first record number of the data set in each data block in the two-dimensional data distribution state diagram.

Deviation distribution coefficient = (count of sample set / count of data set) × first deviation coefficient, that is, $(\mathrm{count}(S)/\mathrm{count}(D)) \cdot r_1$, where $\mathrm{count}(S)$ is the count of the sample set, $\mathrm{count}(D)$ is the count of the data set, and $r_1$ is the first deviation coefficient, with an initial value of 0.5 that may be adjusted according to the specific service scenario.

Deviation preset condition = (count of sample set / count of data set) × first deviation coefficient × number of data set records − number of sample set records.

In a specific embodiment, in the two-dimensional data distribution state diagram, for each data block $d_{ij}$ in the m1 x n1 grid matrix, count the number $w_d$ of data set records and the number $w_s$ of sample set records falling in $d_{ij}$, and compute the sample deviation distribution value $p_{ij} = w_s / w_d$. If $p_{ij} \le (\mathrm{count}(S)/\mathrm{count}(D)) \cdot r_1$, the data block $d_{ij}$ is marked as a sample deviation block, and $(\mathrm{count}(S)/\mathrm{count}(D)) \cdot r_1 \cdot w_d - w_s$ unlabeled records are randomly selected from it and added to the sample set O to be processed.
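As a rough illustration of this screening step, the following Python sketch computes, per cell, the deviation distribution value and the number of unlabeled records to draw. Several points are assumptions rather than part of this specification: count(S) and count(D) are taken as the sums of the per-cell counts, the draw count is rounded up to an integer, cells containing no data set records are skipped, and the names `deviation_quotas` and `draw_unlabeled` are hypothetical. Exact fractions are used to avoid floating-point rounding at the threshold.

```python
import math
import random
from fractions import Fraction


def deviation_quotas(data_counts, sample_counts, r1=0.5):
    """Return {cell: number of unlabeled records to draw} for deviation cells."""
    total_d = sum(data_counts.values())
    total_s = sum(sample_counts.values())
    if total_d == 0:
        return {}
    # Deviation distribution coefficient: (count(S) / count(D)) * r1.
    threshold = Fraction(total_s, total_d) * Fraction(str(r1))
    quotas = {}
    for cell, w_d in data_counts.items():
        if w_d == 0:
            continue  # assumption: empty cells are ignored
        w_s = sample_counts.get(cell, 0)
        if Fraction(w_s, w_d) <= threshold:
            # Assumption: round the draw count up to a whole record.
            quotas[cell] = max(0, math.ceil(threshold * w_d - w_s))
    return quotas


def draw_unlabeled(unlabeled_by_cell, quotas, seed=0):
    """Randomly pick the requested number of unlabeled records per cell."""
    rng = random.Random(seed)
    pending = []
    for cell, k in quotas.items():
        pool = unlabeled_by_cell.get(cell, [])
        pending.extend(rng.sample(pool, min(k, len(pool))))
    return pending


# Illustrative counts: cell (0, 1) holds data but no samples.
quotas = deviation_quotas({(0, 0): 10, (0, 1): 10}, {(0, 0): 4})
picked = draw_unlabeled({(0, 1): ["a", "b", "c"]}, quotas)
```

Because the cells are addressed by arbitrary keys, the same sketch applies to the one-dimensional case by using segment indices instead of (i, j) pairs.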

In the embodiment of the present disclosure, the second sample deviation distribution value is the ratio of the second record number of the sample set to the second record number of the data set for each data interval in the one-dimensional data distribution state diagram, that is, the second sample deviation distribution value = the second record number of the sample set for each data interval in the one-dimensional data distribution state diagram / the second record number of the data set for each data interval in the one-dimensional data distribution state diagram.

In a specific embodiment, in the one-dimensional data distribution state diagram, for each data interval $d_i$ of the m2-segment linear distribution, count the number $w_d$ of data set records and the number $w_s$ of sample set records falling in $d_i$, and compute the sample deviation distribution value $p_i = w_s / w_d$. If $p_i \le (\mathrm{count}(S)/\mathrm{count}(D)) \cdot r_1$ (where $r_1$ is the first deviation coefficient, with an initial value of 0.5 that may be adjusted according to the specific service scenario), the data interval $d_i$ is marked as a sample deviation interval, and $(\mathrm{count}(S)/\mathrm{count}(D)) \cdot r_1 \cdot w_d - w_s$ unlabeled records are randomly selected from it and added to the sample set O to be processed.

In the embodiment of the present disclosure, a sample set to be processed is used for labeling a subsequent sample, where the sample set to be processed includes: unlabeled data obtained based on the distribution density maps of the data set and the sample set, and/or unmodified or validated data obtained based on the distribution density maps of the data set and the sample set.

Step S205: adding unmodified or confirmed data in the data with the accuracy rate change value being greater than or equal to a preset threshold value to the sample set to be processed based on the accuracy rate change value of the distribution density map of the data set and the sample set over a preset time interval.

In this embodiment of the present disclosure, the adding unmodified or confirmed data in the data with the accuracy rate variation value greater than or equal to a preset threshold to the sample set to be processed based on the accuracy rate variation value of the distribution density map of the data set and the sample set at the preset time interval specifically includes:

obtaining, based on the first record number of the sample set in each data block of the two-dimensional data distribution state diagram, the third record number of the sample set, namely the number of those records modified by the user after a first preset time interval;

Determining a first accuracy change value based on the sample set third record number and the sample set first record number;

marking the data block with the first accuracy rate change value being greater than or equal to a second deviation coefficient as a second sample deviation block, and adding unacknowledged or modified data in the second sample deviation block to the sample set to be processed;

or alternatively

obtaining, based on the second record number of the sample set in each data interval of the one-dimensional data distribution state diagram, the fourth record number of the sample set, namely the number of those records modified by the user after a second preset time interval;

determining a second accuracy rate change value based on the fourth record number of the sample set and the second record number of the sample set;

marking the data interval with the second accuracy rate change value being greater than or equal to a second deviation coefficient as a second sample deviation interval, and adding unacknowledged or modified data in the second sample deviation interval to the sample set to be processed.

In the embodiment of the present specification, the first accuracy rate change value is a ratio of the third record number of the sample set to the first record number of the sample set, that is, the first accuracy rate change value=the third record number of the sample set/the first record number of the sample set. The preset time interval defaults to 24 hours, and can be specifically adjusted according to the service scene.

In a specific embodiment, for each data block $d_{ij}$ in the m1 x n1 grid matrix, count the number $w_s$ of sample set records falling in $d_{ij}$; after time t, count the number $v_s$ of those $w_s$ samples that have been modified by the user, and compute the sample accuracy rate change value $q_{ij} = v_s / w_s$. If $q_{ij} \ge r_2$ (where $r_2$ is the second deviation coefficient, with an initial value of 0.2 that may be adjusted according to the specific service scenario), the data block $d_{ij}$ is marked as a sample deviation block, and the unmodified or confirmed sample data in $d_{ij}$ is added to the sample set O to be processed.

In the embodiment of the present specification, the second accuracy rate change value is a ratio of the fourth record number of the sample set to the second record number of the sample set, that is, the second accuracy rate change value=the fourth record number of the sample set/the second record number of the sample set.

In a specific embodiment, for each data interval $d_i$ of the m2-segment linear distribution, count the number $w_s$ of sample set records falling in $d_i$; after time t, count the number $v_s$ of those $w_s$ samples that have been modified by the user, and compute the sample accuracy rate change value $q_i = v_s / w_s$. If $q_i \ge r_2$ (where $r_2$ is the second deviation coefficient, with an initial value of 0.2 that may be adjusted according to the specific service scenario), the data interval $d_i$ is marked as a sample deviation interval, and the unmodified or confirmed sample data in $d_i$ is added to the sample set O to be processed.
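The accuracy rate change check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: per-cell counts are supplied as dictionaries keyed by block or interval identifier, cells with no sample records are skipped, and the function name `flag_accuracy_deviation` is hypothetical.

```python
def flag_accuracy_deviation(sample_counts, modified_counts, r2=0.2):
    """Return {cell: q} for cells whose accuracy rate change value q >= r2.

    sample_counts[cell]   -> w_s, sample records in the cell
    modified_counts[cell] -> v_s, of those, records modified by the user
                             after the preset time interval
    """
    flagged = {}
    for cell, w_s in sample_counts.items():
        if w_s == 0:
            continue  # assumption: q is undefined for empty cells, so skip
        q = modified_counts.get(cell, 0) / w_s
        if q >= r2:
            flagged[cell] = q
    return flagged


# Illustrative counts: interval 0 had 3 of 10 samples modified, interval 1 had 1 of 10.
flagged = flag_accuracy_deviation({0: 10, 1: 10}, {0: 3, 1: 1})
```

The records of each flagged cell that the description designates (the unmodified or confirmed sample data) would then be appended to the sample set O; that filtering is left out here because it depends on how modification and confirmation are recorded.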

In the embodiment of the present disclosure, unmodified or confirmed data in data whose accuracy rate change value exceeds a preset threshold value is added to a sample set to be processed based on the accuracy rate change value of the distribution density map of the data set and the sample set at preset time intervals, wherein the unmodified or confirmed data in the data whose accuracy rate change value exceeds the preset threshold value is data derived from the sample set.

Step S207: outputting the sample set to be processed and the sample deviation index of the sample set to be processed if the number of samples in the sample set to be processed is larger than or equal to the required scale of samples to be processed, or the first deviation coefficient is larger than or equal to a first preset value and the second deviation coefficient is smaller than or equal to a second preset value.

In the embodiment of the present specification, the first preset value is 1, and the second preset value is 0.1. The sample to be processed needs the scale to be set by the user according to the business scene.

In this embodiment of the present disclosure, if the number of the to-be-processed sample sets is greater than or equal to the to-be-processed sample requirement scale, or the first deviation coefficient is greater than or equal to a first preset value and the second deviation coefficient is less than or equal to a second preset value, outputting the to-be-processed sample sets and the sample deviation indexes of the to-be-processed sample sets, further including:

If the number of samples in the sample set to be processed is smaller than the required scale of samples to be processed, and the first deviation coefficient is smaller than the first preset value and/or the second deviation coefficient is larger than the second preset value, updating the first deviation coefficient according to a first preset gradient to obtain an updated first deviation coefficient, and/or updating the second deviation coefficient according to a second preset gradient to obtain an updated second deviation coefficient;

and continuing to screen unlabeled data in sample distribution data which does not accord with the data distribution density based on the updated first deviation coefficient and/or the updated second deviation coefficient, and adding the unlabeled data in the sample distribution data to the sample set to be processed.

In the embodiment of the present specification, the first deviation coefficient and/or the second deviation coefficient is updated when the following output condition is not satisfied: the number of samples in the sample set to be processed is greater than or equal to the required scale of the samples to be processed, or (the first deviation coefficient is greater than or equal to a first preset value and the second deviation coefficient is less than or equal to a second preset value). That is, when the number of samples in the sample set to be processed is smaller than the required scale of the samples to be processed and, in addition, (the first deviation coefficient is smaller than the first preset value and/or the second deviation coefficient is larger than the second preset value), the first deviation coefficient and/or the second deviation coefficient needs to be updated. Specifically, if the number of samples in the sample set to be processed is smaller than the required scale of the samples to be processed, it is further judged whether the first deviation coefficient is greater than or equal to the first preset value and whether the second deviation coefficient is less than or equal to the second preset value, so as to determine which coefficient is to be updated. Only a deviation coefficient that does not yet meet its preset value is updated:

If the first deviation coefficient is smaller than the first preset value and the second deviation coefficient is larger than the second preset value, both coefficients are updated to obtain an updated first deviation coefficient and an updated second deviation coefficient. If the first deviation coefficient is greater than or equal to the first preset value but the second deviation coefficient is larger than the second preset value, the first deviation coefficient is kept unchanged and the second deviation coefficient is updated to obtain the updated second deviation coefficient. If the first deviation coefficient is smaller than the first preset value but the second deviation coefficient is less than or equal to the second preset value, the second deviation coefficient is kept unchanged and the first deviation coefficient is updated to obtain the updated first deviation coefficient.
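The case analysis above can be sketched as a small decision function. This is a minimal illustration only: the function name `coefficients_to_update` and its parameter names are hypothetical and do not appear in the specification, and the preset values are supplied by the caller.

```python
from typing import Tuple


def coefficients_to_update(
    num_pending: int,
    required_scale: int,
    r1: float,
    r2: float,
    first_preset: float,
    second_preset: float,
) -> Tuple[bool, bool]:
    """Return (update_first, update_second) for the two deviation coefficients.

    All names here are illustrative assumptions, not taken from the specification.
    """
    # Output condition: enough pending samples, or both coefficients
    # have already reached their preset values. Nothing is updated.
    if num_pending >= required_scale or (r1 >= first_preset and r2 <= second_preset):
        return (False, False)
    # Otherwise update only the coefficient(s) that have not yet
    # reached the corresponding preset value.
    return (r1 < first_preset, r2 > second_preset)
```

Returning a pair of flags keeps the decision separate from the update itself, so the same check can be reused by whatever routine applies the preset gradients.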

In the embodiment of the present disclosure, when two-dimensional meshing division is adopted, the calculation formula of the sample deviation index is as follows:

$$h = \sum_{i,j} \left( \left| p_{ij} - \frac{\mathrm{count}(S)}{\mathrm{count}(D)} \right| + q_{ij} \right)$$

or alternatively

when linear piecewise division is adopted, the calculation formula of the sample deviation index is as follows:

$$h = \sum_{i} \left( \left| p_{i} - \frac{\mathrm{count}(S)}{\mathrm{count}(D)} \right| + q_{i} \right)$$

wherein:

$h$ is the sample deviation index;

$p_{ij}$ is the deviation distribution value of each data block $d_{ij}$ in the two-dimensional data distribution state diagram;

$\mathrm{count}(S)$ is the count of the sample set;

$\mathrm{count}(D)$ is the count of the data set;

$q_{ij}$ is the sample accuracy change value of each data block $d_{ij}$ in the two-dimensional data distribution state diagram;

$p_{i}$ is the deviation distribution value of each data interval $d_{i}$ in the one-dimensional data distribution state diagram;

$q_{i}$ is the sample accuracy change value of each data interval $d_{i}$ in the one-dimensional data distribution state diagram.
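As an illustration of the two formulas above, the following sketch computes the sample deviation index from already-computed per-interval (or per-block) values. The function names are hypothetical, and the values of p and q are assumed to be supplied by earlier steps; how they are derived is not shown here.

```python
from typing import Sequence


def sample_deviation_index(
    p: Sequence[float], q: Sequence[float], count_s: int, count_d: int
) -> float:
    """One-dimensional case: h = sum(|p_i - count(S)/count(D)| + q_i)."""
    if count_d <= 0:
        raise ValueError("count_d must be positive")
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    ratio = count_s / count_d  # overall share of samples in the data set
    return sum(abs(p_i - ratio) + q_i for p_i, q_i in zip(p, q))


def sample_deviation_index_2d(
    p_grid: Sequence[Sequence[float]],
    q_grid: Sequence[Sequence[float]],
    count_s: int,
    count_d: int,
) -> float:
    """Two-dimensional case: the same sum taken over every block d_ij."""
    flat_p = [value for row in p_grid for value in row]
    flat_q = [value for row in q_grid for value in row]
    return sample_deviation_index(flat_p, flat_q, count_s, count_d)
```

For example, with p = [0.1, 0.3], q = [0.0, 0.05], count(S) = 20 and count(D) = 100, the ratio is 0.2 and the index is 0.1 + (0.1 + 0.05) = 0.25.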

In this embodiment of the present disclosure, updating the first deviation coefficient according to a first preset gradient to obtain an updated first deviation coefficient, and/or updating the second deviation coefficient according to a second preset gradient to obtain an updated second deviation coefficient, specifically includes:

taking the sum of the first deviation coefficient and the first preset gradient as the updated first deviation coefficient;

and/or

taking the difference between the second deviation coefficient and the second preset gradient as the updated second deviation coefficient.

In a specific embodiment, the first preset gradient is 0.05 and the second preset gradient is 0.02. The updated first deviation coefficient = the first deviation coefficient + the first preset gradient, i.e., r₁ += 0.05. The updated second deviation coefficient = the second deviation coefficient - the second preset gradient, i.e., r₂ -= 0.02.
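The gradient update can be sketched as follows, using the gradient values 0.05 and 0.02 from the specific embodiment. The function name, constant names and the two boolean flags are illustrative assumptions; the flags stand for the outcome of the earlier check of which coefficient has not yet reached its preset value.

```python
from typing import Tuple

FIRST_PRESET_GRADIENT = 0.05   # value given in the specific embodiment
SECOND_PRESET_GRADIENT = 0.02  # value given in the specific embodiment


def update_coefficients(
    r1: float, r2: float, update_first: bool = True, update_second: bool = True
) -> Tuple[float, float]:
    """Apply one update step to the deviation coefficients r1 and r2."""
    if update_first:
        r1 = r1 + FIRST_PRESET_GRADIENT    # r1 += 0.05
    if update_second:
        r2 = r2 - SECOND_PRESET_GRADIENT   # r2 -= 0.02
    return r1, r2
```

Note that the first coefficient grows while the second shrinks, which matches the output condition that the first coefficient must reach a lower bound and the second an upper bound.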

In the present embodiment, the deviation index is used to indicate the degree of deviation, and the higher the deviation index, the higher the degree of deviation.

Step S209: labeling the samples in the sample set to be processed based on the sample set to be processed and the sample deviation index of the sample set to be processed, to form supplementary samples so as to optimize data labeling.

In the embodiment of the present specification, the sample set to be processed includes deviation samples found from the distribution density maps of the data set and the sample set, and deviation samples found from the accuracy change of the distribution density maps of the data set and the sample set.

In a specific embodiment of the present disclosure, the deviation samples in the sample set to be processed that were found from the distribution density maps of the data set and the sample set are labeled manually or by machine to form supplementary samples.

The deviation samples in the sample set to be processed that were found from the accuracy change of the distribution density maps of the data set and the sample set are labeled, modified or confirmed by manual verification to form supplementary samples.

In a specific embodiment, the sample set to be processed may be evaluated by means of its sample deviation index to determine the data quality of the sample set to be processed, and thereby to determine the labeling mode.

For further understanding of the optimization method for data annotation based on sample deviation assessment provided in the embodiments of the present specification, the following description will be given with reference to specific working diagrams.

Fig. 3 is a schematic working diagram of an optimization method for data labeling based on sample deviation evaluation according to an embodiment of the present disclosure. As shown in fig. 3, first, spatial division of the data is performed; specifically, two-dimensional meshing division or linear piecewise division is performed on the sample set and the data set. Second, the distribution state of the samples is determined; specifically, the distribution density is calculated to obtain the distribution density maps of the data set and the sample set. Third, sample deviation analysis is performed; specifically, sample deviation analysis based on distribution density and sample deviation analysis based on accuracy change. Fourth, result fusion calculation is performed on the sample deviation analysis result based on distribution density and the sample deviation analysis result based on accuracy change to obtain the deviation sample set and the deviation index, after which the subsequent processing of the deviation samples is performed.
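The first two steps of this workflow can be illustrated for the linear piecewise case with a short sketch. It assumes equal-width intervals over a single numeric attribute and uses the per-interval ratio of sample count to data count as one plausible quantity to compare against the overall ratio count(S)/count(D); the function names, the equal-width choice and this ratio are all assumptions made for illustration, not definitions taken from the specification.

```python
from typing import List, Sequence


def interval_counts(
    values: Sequence[float], low: float, high: float, num_intervals: int
) -> List[int]:
    """Count how many values fall into each equal-width interval of [low, high]."""
    if num_intervals <= 0 or high <= low:
        raise ValueError("need num_intervals > 0 and high > low")
    width = (high - low) / num_intervals
    counts = [0] * num_intervals
    for value in values:
        if low <= value <= high:
            # The upper bound itself is placed in the last interval.
            index = min(int((value - low) / width), num_intervals - 1)
            counts[index] += 1
    return counts


def interval_sample_ratios(
    data_values: Sequence[float],
    sample_values: Sequence[float],
    low: float,
    high: float,
    num_intervals: int,
) -> List[float]:
    """Per-interval share of labeled samples among all data in that interval."""
    data_counts = interval_counts(data_values, low, high, num_intervals)
    sample_counts = interval_counts(sample_values, low, high, num_intervals)
    # Intervals without data get a ratio of 0.0 to avoid division by zero.
    return [s / d if d else 0.0 for s, d in zip(sample_counts, data_counts)]
```

Intervals whose ratio differs strongly from the overall ratio are the natural candidates for the distribution-density-based deviation analysis in the third step.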

Fig. 4 is a core algorithm flowchart of an optimization method for data labeling based on sample deviation evaluation according to an embodiment of the present disclosure. As shown in fig. 4, first, a data set D, a sample set S and a sample requirement scale S to be processed, input by a user, are received; the data space of the data set and the sample set is divided in a one-dimensional or two-dimensional mode; and the first deviation coefficient r₁ and the second deviation coefficient r₂ are initialized. Then, sample deviation discovery based on distribution density is carried out in the one-dimensional or two-dimensional mode, and the discovered samples are incorporated into a sample set O to be processed; sample deviation discovery based on accuracy change is carried out in the one-dimensional or two-dimensional mode, and the discovered samples are likewise incorporated into the sample set O to be processed. If the number of samples in the sample set O to be processed is greater than or equal to the sample requirement scale S to be processed, or (the first deviation coefficient is greater than or equal to the first preset value and the second deviation coefficient is less than or equal to the second preset value), the sample set O to be processed and the sample deviation index h are output for post-processing of the deviation samples.
If the number of samples in the sample set to be processed is smaller than the required scale of the samples to be processed, it is further judged whether (the first deviation coefficient is smaller than the first preset value and/or the second deviation coefficient is larger than the second preset value); if so, the first deviation coefficient is updated according to the first preset gradient to obtain an updated first deviation coefficient, and/or the second deviation coefficient is updated according to the second preset gradient to obtain an updated second deviation coefficient. Based on the updated first deviation coefficient and/or the updated second deviation coefficient, screening of unlabeled data in the sample distribution data that does not conform to the data distribution density continues, and the unlabeled data is added to the sample set to be processed.
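The control flow of fig. 4 can be sketched as a loop. This is a simplified illustration under stated assumptions: the two discovery steps are passed in as hypothetical callables (`find_by_density`, `find_by_accuracy_change`) that return sets of data identifiers, the gradients default to the values of the specific embodiment, and `max_rounds` is an added safety bound that is not part of the described flow. Computation of the sample deviation index h is left out.

```python
from typing import Callable, Set, Tuple


def collect_pending_samples(
    find_by_density: Callable[[float, float], Set[int]],
    find_by_accuracy_change: Callable[[], Set[int]],
    required_scale: int,
    r1: float,
    r2: float,
    first_preset: float,
    second_preset: float,
    first_gradient: float = 0.05,
    second_gradient: float = 0.02,
    max_rounds: int = 1000,  # safety bound, an assumption of this sketch
) -> Tuple[Set[int], float, float]:
    """Build the sample set O to be processed; return (O, r1, r2)."""
    # Initial discovery based on distribution density and on accuracy change.
    pending = set(find_by_density(r1, r2)) | set(find_by_accuracy_change())
    for _ in range(max_rounds):
        # Output condition: enough samples, or both coefficients reached presets.
        if len(pending) >= required_scale or (
            r1 >= first_preset and r2 <= second_preset
        ):
            break
        # Update only the coefficient(s) that have not reached the preset value.
        if r1 < first_preset:
            r1 += first_gradient
        if r2 > second_preset:
            r2 -= second_gradient
        # Continue density-based screening with the updated coefficients.
        pending |= set(find_by_density(r1, r2))
    return pending, r1, r2
```

Passing the discovery steps as callables keeps the loop independent of whether one-dimensional or two-dimensional division is used.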

According to the optimization method for data annotation based on sample deviation evaluation provided by the embodiment of the specification, the data set and the sample set are divided in one-dimensional or two-dimensional space, with support for selecting the division attributes, and the distribution of the sample data over the data set is displayed visually. This converts an intangible data set into a tangible data distribution and supports visual display of the data state. The sample deviation discovery methods based on distribution density and on accuracy change automatically screen deviation samples while the system is running to form the sample set to be processed, and automatically calculate the sample deviation index, which minimizes user operations and enables multi-means, automatic analysis of deviation samples.

The foregoing details an optimization method for data annotation based on sample deviation evaluation, and accordingly, the present disclosure also provides an optimization apparatus for data annotation based on sample deviation evaluation, as shown in fig. 5. FIG. 5 is a schematic diagram of an optimizing apparatus for labeling data based on sample deviation evaluation according to an embodiment of the present disclosure, where the apparatus includes:

a data visualization module 501, configured to divide the data set and the sample set based on the data attributes of the data set to obtain the distribution density maps of the data set and the sample set;

a first to-be-processed sample discovery module 503, configured to screen, based on the distribution density maps of the data set and the sample set, sample distribution areas that do not conform to the data distribution density, and to add unlabeled data in those sample distribution areas to the sample set to be processed;

a second to-be-processed sample discovery module 505, configured to add, based on the accuracy change value of the distribution density maps of the data set and the sample set over a preset time interval, unmodified or confirmed data among the data whose accuracy change value is greater than or equal to a preset threshold to the sample set to be processed;

a sample analysis module 507, configured to output the sample set to be processed and the sample deviation index of the sample set to be processed if the number of samples in the sample set to be processed is greater than or equal to the required scale of the samples to be processed, or if the first deviation coefficient is greater than or equal to a first preset value and the second deviation coefficient is less than or equal to a second preset value;

a sample post-processing module 509, configured to label the samples in the sample set to be processed based on the sample set to be processed and the sample deviation index of the sample set to be processed, to form supplementary samples so as to optimize data labeling.

In order to further understand the optimization apparatus for data annotation based on sample deviation evaluation provided by the embodiment of the specification, the embodiment of the specification also provides a schematic diagram of another optimization apparatus for data annotation based on sample deviation evaluation. FIG. 6 is a schematic diagram of another optimization apparatus for data annotation based on sample deviation evaluation according to an embodiment of the present disclosure. The data visualization module visualizes the data set and the sample set and displays them to the user as a sample distribution state. Based on the data of the data visualization module, the sample set to be processed is obtained by the first deviation sample discovery module and the second deviation sample discovery module, and this deviation sample set is also presented to the user. The sample deviation index is then obtained by the deviation sample analysis module and presented to the user. Finally, the deviation sample post-processing module performs post-processing of the deviation samples based on the sample set to be processed and the sample deviation index.

The embodiments of the present specification further provide an electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus, electronic device and non-volatile computer storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant parts, refer to the description of the method embodiments.

The apparatus, the electronic device, the nonvolatile computer storage medium and the method provided in the embodiments of the present disclosure correspond to each other, and therefore, the apparatus, the electronic device, the nonvolatile computer storage medium also have similar beneficial technical effects as those of the corresponding method, and since the beneficial technical effects of the method have been described in detail above, the beneficial technical effects of the corresponding apparatus, the electronic device, the nonvolatile computer storage medium are not described here again.

In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, such programming is mostly implemented using "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by slightly logically programming the method flow in one of the hardware description languages described above and programming it into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, or the form of logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. The means for performing the various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing one or more embodiments of the present description.

It will be appreciated by those skilled in the art that the present description may be provided as a method, system, or computer program product. Accordingly, the present specification embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description embodiments may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article or apparatus that comprises the element.

The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant parts, refer to the description of the method embodiments.

The foregoing description is by way of example only and is not intended as limiting the application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims

1. An optimization method for data annotation based on sample deviation assessment, the optimization method comprising:

2. The optimization method according to claim 1, wherein the dividing the data set and the sample set based on the data attribute of the data set to obtain the distribution density map of the data set and the sample set specifically comprises:

Or alternatively

3. The optimization method according to claim 2, wherein the dividing the data set and the sample set based on the data attribute of the data set to obtain the distribution density map of the data set and the sample set specifically comprises:

or alternatively

4. The optimization method according to claim 3, wherein the screening of the sample distribution area not conforming to the data distribution density based on the data set and the distribution density map of the sample set, and adding unlabeled data in the sample distribution area not conforming to the data distribution density to the sample set to be processed specifically includes:

or alternatively

5. The optimization method according to claim 3, wherein the adding unmodified or confirmed data in the data with the accuracy rate variation value being greater than or equal to a preset threshold to the sample set to be processed based on the accuracy rate variation value of the distribution density map of the data set and the sample set at preset time intervals specifically comprises:

or alternatively

6. The optimizing method according to claim 3, wherein if the number of the sample sets to be processed is equal to or greater than a required scale of the samples to be processed, or a first deviation coefficient is equal to or greater than a first preset value and a second deviation coefficient is equal to or less than a second preset value, outputting the sample sets to be processed and a sample deviation index of the sample sets to be processed, further comprising:

7. The optimization method of claim 3, wherein, when two-dimensional meshing division is adopted, the calculation formula of the sample deviation index is as follows:

$$h = \sum_{i,j} \left( \left| p_{ij} - \frac{\mathrm{count}(S)}{\mathrm{count}(D)} \right| + q_{ij} \right)$$

or alternatively

when linear piecewise division is adopted, the calculation formula of the sample deviation index is as follows:

$$h = \sum_{i} \left( \left| p_{i} - \frac{\mathrm{count}(S)}{\mathrm{count}(D)} \right| + q_{i} \right)$$

wherein:

$h$ is the sample deviation index;

$\mathrm{count}(S)$ is the count of the sample set;

$\mathrm{count}(D)$ is the count of the data set;

8. The optimization method of claim 1, wherein the data attributes of the data set comprise: data source information, data acquisition time, data definition information and data size.

9. An optimization apparatus for data annotation based on sample deviation evaluation, the optimization apparatus comprising:

10. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein